Evaluations and notes on AI agents across real-world use cases.
-
Evaluating Sovereign AI
Testing Sarvam models on multilingual Indian ecommerce support tasks with tools, policy constraints, and backend state.
-
Customer Support Environment
The Tham Luang Cave
-
SalesforceBench
Can agents actually work inside a simulated Salesforce org?
-
Editing is Hard
Can LLMs edit PPTX files reliably?