Using Simulators to Evaluate Multi-Turn AI Agents
Building a multi-turn conversational AI is surprisingly easy right now. Evaluating it is incredibly hard. For single-turn tasks, a standard static dataset works fine: you just feed in a prompt and assert the output against a spreadsheet of expected answers. However, that approach completely falls apart in multi-turn chat because conversations are stateful and branch in unpredictable ways.
If your agent decides to ask a clarifying question instead of giving an immediate answer, a static dataset has no way to respond, and the test just breaks. Because of this, developers often default to manual human QA, which is painfully slow, or they rely almost entirely on online testing.
Testing in production by shipping it and monitoring live interactions is tempting, but it's incredibly risky. You don't want to discover that a minor prompt tweak broke your fallback routing just because it frustrated hundreds of real users first. The feedback loop is too slow, and burning real user goodwill is expensive.
My solution to this lately is building a "Simulation User"—an LLM specifically prompted to act as a human talking to my AI agent. This accelerates the evaluation loop dramatically, lets me test specific personas, and solves the headache of managing mock data for tool integrations.
Bootstrapping personas from real data
You can dictate a simulator's behavior, goals, and communication style just by tweaking its system prompt. But instead of guessing what your users will say, I always prefer to bootstrap these personas directly from actual historical chat logs. You can mine anonymized data to extract common intents, weird phrasings, and actual edge cases, then use those insights to automatically generate scenario prompts for the simulator.
I like setting up a few distinct personas to really stress-test the agent. For example, there's the "Happy Path" user who is clear and concise, contrasted with the "Chaotic" user who uses slang, gives partial info, and constantly changes the subject. I also throw in a "Frustrated" customer to specifically test the agent's empathy, de-escalation, and fallback routing. By combining historical data with defined personas, I can deterministically test the agent across thousands of highly realistic scenarios.
Grading the results with LLM-as-a-judge
Once your Simulator and Agent are chatting, you can automate the process to spin up 100 concurrent conversations in minutes. But this introduces a new bottleneck: manual QA. Every time I tweak an agent's system prompt, I risk breaking something else in a regression, and I absolutely do not want to read 100 simulated transcripts manually to see if they worked.
Instead, I take the completed conversation logs between the agent and the simulator and pipe them straight into an LLM-as-a-judge workflow for evaluation. I do this because it's the only practical way to scale complex, qualitative grading across hundreds of test runs without blocking the release cycle. I just hand the judge model the transcript and a strict rubric to evaluate task completion (did they reach the goal?), turn count (was it too slow or redundant?), and tone (did the agent stay polite and within guardrails?). Hooking this up to a CI pipeline means developers get instant, quantitative metrics every time they push code.
Solving the tool integration headache
Evaluating the back-and-forth chat is only half the battle. Real AI agents actually take action by executing database lookups and hitting APIs. Testing this without spamming production databases or writing brittle mock servers is a huge pain.
My simulation environment solves this by intercepting the agent's tool calls at runtime—like lookup_order_status(123). Instead of hitting a real database, I have the framework use a fast, cheap LLM to generate a plausible mock JSON response on the fly.