Next, let’s lock in our agent behavior with evals. Think of evals as regression tests for your agents: same prompts, same agents, run on a schedule, with a notification when behavior drifts. When we run `docs/improve-agent.md`, we’re looking for out-of-distribution improvements. Evals make sure in-distribution cases continue to pass. The two work together.
Cases
Cases live in `evals/cases.py`. Each case sends one input to an agent and (optionally) checks two things:
- judge: `AgentAsJudgeEval` scores the response against `criteria` (binary pass/fail) using an LLM.
- reliability: `ReliabilityEval` checks which tools fired against `expected_tool_calls`.
Results are written to `eval_db` (visible at os.agno.com).
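To make the reliability check concrete, here is a minimal sketch of comparing the tools that fired against `expected_tool_calls`. It does not use Agno's `ReliabilityEval`; the `Case` dataclass, the `missing_tool_calls` helper, and the tool names are all hypothetical stand-ins, and treating extra tool calls as acceptable is an assumption of this sketch rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Case:
    """Hypothetical stand-in for one eval case (not the real evals/cases.py schema)."""
    name: str
    input: str
    criteria: Optional[str] = None  # judge check, optional
    expected_tool_calls: List[str] = field(default_factory=list)  # reliability check, optional


def missing_tool_calls(expected: List[str], fired: List[str]) -> List[str]:
    """Return the expected tools that never fired, in their original order."""
    fired_set = set(fired)
    return [tool for tool in expected if tool not in fired_set]


case = Case(
    name="stock_price_lookup",
    input="What is NVDA trading at?",
    criteria="Returns a JSON object with ticker and price",
    expected_tool_calls=["get_stock_price"],  # hypothetical tool name
)

# The agent answered without calling any tool: one expected call is missing.
print(missing_tool_calls(case.expected_tool_calls, fired=[]))
# The agent called the expected tool plus an extra one: nothing is missing.
print(missing_tool_calls(case.expected_tool_calls, fired=["get_stock_price", "search_news"]))
```

An empty result means every expected tool fired. A non-empty result names the calls the agent skipped.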
A case looks like this:
evals/cases.py
Run the suite
Create a virtual environment
To run the eval suite, let’s create a local virtual environment.
Activate it
Eval Summary table.
Results are written to Postgres via `eval_db`. You can view the eval history on os.agno.com alongside your sessions and traces, so you can see when a case started failing and what changed.
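Spotting when a case started failing comes down to comparing two runs. The sketch below assumes you have each run's results as a plain mapping from case name to pass/fail. That shape, the `newly_failing` helper, and the case names are illustrative assumptions, not the actual `eval_db` schema.

```python
from typing import Dict, List


def newly_failing(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return cases that passed in the previous run but fail in the current one."""
    return sorted(
        name
        for name, passed in current.items()
        if not passed and previous.get(name, False)
    )


# Hypothetical results from two scheduled runs.
previous_run = {"stock_price_lookup": True, "news_summary": True, "greeting": False}
current_run = {"stock_price_lookup": False, "news_summary": True, "greeting": False}

print(newly_failing(previous_run, current_run))  # ['stock_price_lookup']
```

Cases that were already failing (`greeting` here) are left out on purpose, so a notification built on this only fires for drift, not for known failures.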
Diagnose failures with Claude Code
Open Claude Code and paste:
When to run evals
| Trigger | Frequency |
|---|---|
| Before deploying a change to an agent | Every time |
| As part of CI | Every PR |
| Against production | On a weekly cron |
| After bumping a model version | Every time |
What good cases look like
- Specific. “Returns a JSON object with `ticker` and `price`” beats “Returns the right answer”.
- Stable. Avoid prompts whose correct answer changes daily. Use phrasing like “describes a real, recent…” instead of locking in a specific result.
- Scoped to one behavior. One case per behavior makes failures easy to read.
- Anchored to tools. `expected_tool_calls` catches the failure mode where the agent confidently makes things up instead of calling a tool.
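One way to test whether a criterion is specific enough is to ask whether it could be checked mechanically. The criterion from the first bullet can, as this sketch shows. It is an illustration only: `has_ticker_and_price` is a hypothetical helper and not part of the eval suite, where the LLM judge does the scoring.

```python
import json


def has_ticker_and_price(response_text: str) -> bool:
    """Check that the response is a JSON object containing both keys."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False  # not JSON at all
    return isinstance(payload, dict) and {"ticker", "price"} <= payload.keys()


print(has_ticker_and_price('{"ticker": "NVDA", "price": 123.45}'))  # True
print(has_ticker_and_price("NVDA is trading higher today."))        # False
```

A criterion like “Returns the right answer” has no mechanical equivalent, which suggests the judge has little to anchor on either.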