Evaluations and notes on AI agents across real-world use cases.

  1. Evaluating Sovereign AI

    Testing Sarvam models on multilingual Indian e-commerce support tasks with tools, policy constraints, and backend state.

  2. Customer Support Environment

    The Tham Luang Cave

  3. SalesforceBench

    Can agents actually work inside a simulated Salesforce org?

  4. Editing is Hard

    Can LLMs edit PPTX files reliably?