Using Simulators to Evaluate Multi-Turn AI Agents
Building a multi-turn conversational AI is surprisingly easy right now. Evaluating it is incredibly hard. For single-turn tasks, a standard static dataset works fine: you just feed in a prompt and ass