Evaluations and notes on AI agents across real-world use cases.

17th June, 2026 By Arushi Gandhi

Evaluating Sovereign AI

Testing Sarvam models on multilingual Indian ecommerce support tasks with tools, policy constraints, and backend state.
6th June, 2026 By Abhishek Eswaran

Customer Support Environment

The Tham Luang Cave
26th May, 2026 By Arushi Gandhi

SalesforceBench

Can agents actually work inside a simulated Salesforce org?
12th May, 2026 By Abhishek Eswaran

Editing is Hard

Can LLMs edit PPTX reliably?