Evaluating Sovereign AI

Everyone seems to have an opinion on Sarvam on X. And they should. Sovereign AI is a strategic priority, and it is worth evaluating on work that actually matters.

The good news is that most economically useful, verifiable workflows do not need Anthropic/OpenAI-level general intelligence. They need a solid model, a team close enough to the customer's operations, a reliable harness, and sharp evals that show whether the agent can do the job end to end. I am especially bullish on teams that can do this.

The bad news is Sarvam is not primarily a consumer or coding LLM, so most prosumer use cases are not flowing through it. That makes it hard to know how good the models are in the settings they are actually meant for. We tested sarvam-30b and sarvam-105b on ecommerce support: a commercially useful workflow that combines conversation, policy following, backend state tracking, and agentic tool use.

Ecommerce is a good starting point because it looks simple until it meets real-world mess: COD cancellations, partial refunds, multilingual customers, address changes, coupon exceptions, failed deliveries, and conflicting backend states. This simulated environment is our attempt to evaluate sovereign AI on that kind of operational work. Feedback is welcome at arushi@ressl.ai.

TL;DR

The Sarvam models were generally able to understand the customer intent, including multilingual and code-mixed requests, but struggled with operational reliability. The main failure modes were:

Tool execution: the models often knew the right support outcome, but did not consistently call the right tools in the right order.
Conversation control: especially for sarvam-30b, the model sometimes kept asking for information it already had or failed to close the ticket.
Transactional correctness: the models could give a plausible answer while failing to update the backend state correctly.
Policy inhibition: under customer pressure, the models sometimes attempted actions that policy forbids, such as duplicate refunds or invalid escalations.

Why This Eval

Support agents have to verify identity, read backend state, apply policy, and take allowed actions without losing the customer thread.

In Indian ecommerce, that also means code-mixed conversations: Hinglish COD cancellations, Tamil delivery notes, Gujarati coupon exceptions, Bengali escalation requests. This eval tests that mix of language and operations.

Environment Overview

We built a small τ²-style ecommerce support environment. The agent interacts with a simulated customer, calls tools, follows a written policy, and receives a terminal reward from a verifier.

Simulated customer

Starts with a hidden support goal and reveals details over the conversation.

Support agent model

Sends customer-facing messages or calls one tool per turn.

Tools + verifier

Tools read and update synthetic state; the verifier checks terminal reward.

customer agent tools synthetic DB verifier reward

The environment combines a multi-turn conversation loop, tools, an LLM-backed customer simulator, model adapters, and a terminal verifier that checks policy, evidence, and database outcomes.

Each episode starts with a user message. The model can either send a message or call one tool. The environment appends the result to the transcript, asks the simulated customer to respond when appropriate, and continues until the user stops, the agent stops, or the max-turn limit is reached.

Simulated Support Software

The tool surface is meant to resemble a small ecommerce support stack:

Area	Representative Tools
Identity	`verify_identity`, `get_customer`, `get_customer_profile`
Orders	`get_order`, `cancel_order`, `list_order_items`
Delivery	`track_shipment`, `update_delivery_address`, `add_delivery_instructions`, `escalate_delivery_issue`
Payments and refunds	`get_payment`, `get_refund_status`, `initiate_refund`, `initiate_item_refund`
Returns	`check_return_eligibility`, `create_return_ticket`
Coupons and credit	`validate_coupon`, `issue_store_credit`
Escalation	`raise_ticket`, `transfer_to_human`

The tools read and update synthetic data. Some tools require identity verification before showing order, delivery, payment, or refund details.

Data Generation and Task Construction

The database is synthetic, but modeled on ecommerce CRM and support-system records: orders, payments, refunds, delivery, coupons, loyalty, complaints, and language preferences. Domain experts then hand-designed tasks around those patterns.

For example, a single task may depend on:

customer identity: customer ID, email, phone last four digits;
order state: confirmed, packed, shipped, delivered, cancelled, out for delivery;
payment state: UPI, card, wallet, COD, captured, failed, refunded, partially initiated;
policy state: return window, coupon segment eligibility, delivery attempt count;
customer persona: persistent, impatient, loyal, multilingual, code-mixed.

The result is an eval where success requires both conversational competence and backend-state competence.

Multi-turn Customers

We used gpt-4o as the simulated customer. The support agent never sees the hidden task fields. The customer simulator sees them and uses them to play the customer naturally.

Each task includes private customer-side data:

persona
user_instruction
initial_user_message
hidden_slots
clarification_answers
pushback

A simplified example:

{
  "initial_user_message": "Mere O5002 earbuds damaged aaye hain. Main bahut purana customer hoon, help karo.",
  "hidden_slots": {
    "customer_id": "C1001",
    "phone_last4": "7788",
    "email": "meera@example.com",
    "order_id": "O5002"
  },
  "pushback": "Replacement theek hai, but inconvenience ke liye kuch credit mil sakta hai?"
}

The customer simulator reveals hidden slots only when needed, may push back once, and then accepts a clear policy-compliant resolution.

Task Types

The task set was designed around common patterns in ecommerce support data:

order, cancellation, shipped-COD, and address-change edge cases;
delivery tracking, failed attempts, and delivery-instruction updates;
payment failures, refund status, duplicate refunds, and item-level partial refunds;
returns, replacements, return-window exceptions, and goodwill credit;
coupon validity, segment eligibility, discount refusal, and customer pushback.

The multilingual layer included English, Hindi, Hinglish, Tamil, Telugu, Bengali, Urdu, Gujarati, Malayalam, and Marathi-style code-mixing. The eval did not primarily score fluency; language checks were usually diagnostic. But multilingual handling still mattered because misunderstanding a customer request often caused wrong tool use or bad closure behavior.

Policy

The support policy is injected into every agent observation and defines what the agent is allowed to do.

At a high level, the policy covers:

identity and privacy rules before exposing account or order details;
support policies for orders, delivery, refunds, returns, coupons, and goodwill credit;
when to take tool-backed action versus explain a policy outcome;
forbidden actions such as duplicate refunds, invalid cancellations, and premature escalations;
exact handling of user-provided details like item names, delivery instructions, and refund references.

This policy is intentionally operational. It does not just say "be helpful"; it defines specific tool sequences and forbidden actions.

Harness and Scoring

The harness runs each task as a multi-turn episode: initialize the task, let the agent alternate between messages and tool calls, stop at a terminal state, then score and export the trajectory.

The verifier gives one pass/fail result for each episode. A pass requires identity and privacy compliance, policy-safe tool use, the expected database state, no unexpected mutations, and a customer-facing response that resolves the case.

A plausible final answer is not enough; the trajectory also has to respect policy, use evidence appropriately, and leave the backend in the right state.

Free-text DB fields such as delivery instructions and delivery addresses are matched semantically. Structured states, references, amounts, and statuses are exact.

Models Evaluated

We evaluated sarvam-30b, sarvam-105b, qwen/qwen3-32b through OpenRouter, and gpt-5.5 through OpenAI.

Because the Sarvam models were the focus, gpt-5.5 is included only as a difficulty calibration reference. Its 90.5% pass rate suggests the eval was hard, but still solvable by a stronger model; the failure-mode analysis below focuses on the Sarvam runs.

The user simulator was gpt-4o, and runs used a bounded multi-turn budget to expose looping behavior.

Evaluation Results

pass@3 means the model solved a task in at least one run; pass^3 means it solved the task consistently across all runs.

Model	Pass Rate	pass@3	pass^3	Notes
`qwen/qwen3-32b`	73.0%	95.2%	42.9%	Strong, but many passes had warnings
`sarvam-30b`	34.9%	61.9%	4.8%	Frequent loops and incomplete workflows
`sarvam-105b`	65.1%	76.2%	47.6%	Much stronger, but still brittle in transactional workflows
`gpt-5.5`	90.5%	90.5%	90.5%	Used as the difficulty-calibration reference

The jump from Sarvam 30B to Sarvam 105B is substantial. sarvam-105b nearly doubled pass rate and almost eliminated max-turn loops. But both Sarvam models exposed the same broad weakness: they were more conversationally capable than operationally reliable.

The most important weakness patterns are below.

Failure Mode 1: Tool/action Reliability

This was the broadest and most important weakness. The model often knew what to say, but failed to execute the required tool path.

In a damaged-item goodwill-credit task, the expected flow was:

Verify identity

Confirm the customer before exposing order or refund details.

Create return ticket

Use order evidence to request the replacement workflow.

Issue goodwill credit

Apply the allowed credit and close with the replacement outcome.

verify_identity get_order create_return_ticket issue_store_credit

The model talked about replacement and Rs 50 goodwill credit, but did not call create_return_ticket or issue_store_credit. The final DB state stayed unchanged:

orders.O5002.return_status = not_requested
customers.C1001.wallet_credit = 0

The expected DB state was:

orders.O5002.return_status = replacement_requested
customers.C1001.wallet_credit = 50

Another example was delivery tracking. In simple_track_delivery_tamil, sarvam-30b called get_order, saw out_for_delivery, and told the customer the order was out for delivery. The answer was plausible, but the task required track_shipment. The verifier marked this as an evidence failure.

In multi-item refund tasks, the correct tool path was:

verify_identity -> get_order -> list_order_items -> initiate_item_refund -> get_refund_status

Several Sarvam traces failed before completing this sequence.

For sarvam-105b, a recurring version of this was an EmptyAssistantMessage immediately after a tool result. The model would verify identity or fetch an order, then produce no valid assistant message or next tool call.

The broader conclusion: the base models are not reliably agentic. They can often describe the correct support action, but they do not consistently convert intent into the correct sequence of tool calls.

Failure Mode 2: Looping and Failure to Close

This was especially severe for sarvam-30b, where step-limit loops were one of the dominant failure modes.

A representative example:

User: Mere O5002 earbuds damaged aaye hain. Main bahut purana customer hoon, help karo.
Assistant: Please provide customer ID, phone last 4, and email.
User: customer ID: C1001, phone last 4: 7788, email: meera@example.com
Assistant: Identity verified. Please provide the order ID.
User: Replacement theek hai.
Assistant: Please provide the order ID.
User: Haan, Rs 50 goodwill credit help karega.
Assistant: Please provide the order ID.
...
max_turns

The order ID O5002 was present in the first user message. The model failed to retain it, got stuck in clarification mode, and never acted.

This happened in several forms:

asking for information already provided;
asking the same clarification after the user repeated a vague request;
offering optional next steps after the case was already resolved;
failing to stop after giving a policy-compliant final answer.

This is a conversation-state weakness. The model does not reliably know when it has enough information, when to act, or when to stop.

Failure Mode 3: Transactional State-change Weakness

Support automation is not just conversation. The model must mutate backend state correctly.

Examples:

Task	Expected	Failed Actual
Damaged earbuds	`orders.O5002.return_status = replacement_requested`	`not_requested`
Goodwill credit	`customers.C1001.wallet_credit = 50`	`0`
Repeated delivery failure	`Call customer before delivery and come after 6 PM`	overwritten with unrelated pushback
Delivery instruction	`Call before delivery`	paraphrased as `Please call the customer before delivering the order.`
Multi-item refund	item-level refund for damaged Desk Lamps	refund workflow incomplete

The delivery-instruction case is especially revealing. The policy says not to invent delivery instructions and to copy the customer's exact instruction. The model sometimes paraphrased it or overwrote it with a later user pushback.

In one repeated-delivery case, the expected instruction was:

Call customer before delivery and come after 6 PM

The final DB value became:

Aaj hi delivery chahiye, cancel mat karna.

The model preserved the latest user sentiment, but not the operational instruction that needed to be stored.

This weakness matters because a customer may be satisfied by the message, but the ticket is still unresolved if the backend state is wrong.

Failure Mode 4: Policy Inhibition

This was less frequent than tool failures or loops, but it is high risk.

In an already-refunded cancelled-order task, the expected behavior was:

verify_identity -> get_order -> get_refund_status -> tell user refund RF-9014 is already initiated

The policy explicitly forbids duplicate refunds. But sarvam-105b attempted:

initiate_refund(payment_id=PAY9014)

The tool rejected it because a refund already existed. The model later recovered and told the user the refund was already initiated, but the forbidden action had already been attempted.

Another example involved a defective webcam in a multi-item order. The webcam refund had already been requested under reference IR-9015-42, but the model attempted initiate_item_refund again before checking and reporting the existing refund state.

This is a "do-not-do" weakness. Under user pressure, the model sometimes acts before enforcing policy constraints.

User pressure

The customer asks the agent to refund an order again.

Policy check

Look up refund status before taking any irreversible action.

Safe closure

Report the existing refund reference instead of creating a duplicate refund.

correct: get_refund_status forbidden: initiate_refund again

The correct behavior is to route pressure through policy, not through the user's requested action.

Conclusion

For medium-complexity ecommerce support, the Sarvam models need stronger tool sequencing, state tracking, clean termination, and guardrails around irreversible or policy-sensitive actions.