Next, let’s lock in our agent behavior with evals. Think of evals as regression tests for your agents: the same prompts against the same agents, run on a schedule, with a notification when behavior drifts. When we run docs/improve-agent.md, we’re looking for out-of-distribution improvements. Evals make sure in-distribution cases continue to pass. The two work together.

Cases

Cases live in evals/cases.py. Each case sends one input to an agent and can run up to two optional checks:
  • judge: AgentAsJudgeEval uses an LLM to score the response against criteria (binary pass/fail).
  • reliability: ReliabilityEval checks which tools fired against expected_tool_calls.
Results are stored in your database via eval_db (visible at os.agno.com). A case looks like this:
evals/cases.py
CASES: tuple[Case, ...] = (
    Case(
        name="web_search_recent_anthropic_research",
        agent=web_search,
        input="What did Anthropic publish about agents recently?",
        criteria=(
            "Answers the question by citing at least one real Anthropic URL "
            "(anthropic.com domain). The response is grounded in fetched content."
        ),
        expected_tool_calls=(_WEB_SEARCH_TOOL,),
    ),
    # add more cases here
)
A case can use either check or both. If both are set, the agent runs once and the same response is fed into both checks.
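The run-once behavior can be sketched as follows. This is a minimal illustration, not the project's runner: Reply, the Case dataclass, run_case, and the stubbed judge and reliability callables are all hypothetical stand-ins that only mirror the fields shown in the case above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Reply:
    """Hypothetical stand-in for an agent response."""
    content: str
    tool_calls: tuple[str, ...] = ()


@dataclass(frozen=True)
class Case:
    """Hypothetical mirror of the fields used in evals/cases.py."""
    name: str
    agent: Callable[[str], Reply]
    input: str
    criteria: Optional[str] = None                       # set to enable the judge check
    expected_tool_calls: Optional[tuple[str, ...]] = None  # set to enable the reliability check


def run_case(case, judge, reliability):
    """Run the agent once, then apply whichever checks the case enables."""
    reply = case.agent(case.input)  # the only agent call
    verdicts = {}
    if case.criteria is not None:
        verdicts["judge"] = judge(reply, case.criteria)
    if case.expected_tool_calls is not None:
        verdicts["reliability"] = reliability(reply, case.expected_tool_calls)
    return verdicts


# Usage with a stubbed agent and stubbed checks (no LLM involved).
calls = []


def fake_agent(prompt: str) -> Reply:
    calls.append(prompt)
    return Reply(content="See https://www.anthropic.com/research",
                 tool_calls=("web_search",))


case = Case(
    name="demo",
    agent=fake_agent,
    input="What did Anthropic publish about agents recently?",
    criteria="Cites an anthropic.com URL.",
    expected_tool_calls=("web_search",),
)
verdicts = run_case(
    case,
    judge=lambda reply, criteria: "anthropic.com" in reply.content,
    reliability=lambda reply, expected: set(expected) <= set(reply.tool_calls),
)
```

Leaving criteria or expected_tool_calls unset simply drops that key from the verdicts, which is one way to model "either check or both".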

Run the suite

1. Create a virtual environment

To run the eval suite, let’s create a local virtual environment:
./scripts/venv_setup.sh
Activate it:
source .venv/bin/activate

2. Run the eval suite

python -m evals                # full suite
Other options:
python -m evals -v             # stream the agent run with full panels
python -m evals --case <name>  # single case while iterating
Each case prints the response, the judge verdict, and the reliability verdict. The run ends with an Eval Summary table. Results are written to Postgres via eval_db. You can view the eval history on os.agno.com alongside your sessions and traces, so you can see when a case started failing and what changed.
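For readers curious how the two options could be wired up, here is a rough sketch using argparse. It assumes nothing about the project's actual evals/__main__.py: build_parser and select_cases are hypothetical helpers, and only the -v and --case flags come from the commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="python -m evals")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="stream the agent run with full panels")
    parser.add_argument("--case", metavar="NAME", default=None,
                        help="run a single case by name")
    return parser


def select_cases(case_names, only=None):
    """Return every case name, or just the one requested with --case."""
    if only is None:
        return list(case_names)
    if only not in case_names:
        raise SystemExit(f"unknown case: {only}")
    return [only]


# Usage: simulate `python -m evals --case web_search_recent_anthropic_research`
all_names = ("web_search_recent_anthropic_research",)
args = build_parser().parse_args(["--case", "web_search_recent_anthropic_research"])
selected = select_cases(all_names, only=args.case)
```

Failing fast on an unknown case name is a design choice in this sketch; it avoids a silent run of zero cases while iterating.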

Diagnose failures with Claude Code

Open Claude Code and paste:
Run docs/eval-and-improve.md
Claude runs the full suite, triages every failure (bad criteria, real regression, flaky LLM judge), and proposes in-scope fixes. It edits the agent or the case, re-runs, and shows you the diff.

When to run evals

| Trigger | Frequency |
| --- | --- |
| Before deploying a change to an agent | Every time |
| As part of CI | Every PR |
| Against production | On a weekly cron |
| After bumping a model version | Every time |
The weekly production cron is the most valuable one. Wire it into your platform’s scheduler. See scheduling for the cron API.
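A scheduled run is most useful when it flags only what changed since the previous run. The sketch below assumes each run can be reduced to a mapping of case name to pass/fail; that shape, the newly_failing helper, and the ticker_lookup case name are hypothetical and do not reflect how eval_db stores results.

```python
def newly_failing(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that fail now but passed (or did not exist) in the previous run."""
    return sorted(
        name
        for name, passed in current.items()
        if not passed and previous.get(name, True)
    )


# Hypothetical results from two consecutive weekly runs.
last_week = {"web_search_recent_anthropic_research": True, "ticker_lookup": False}
this_week = {"web_search_recent_anthropic_research": False, "ticker_lookup": False}

drifted = newly_failing(last_week, this_week)
# A scheduler or CI job could use a non-zero status to trigger a notification.
exit_status = 1 if drifted else 0
```

Cases that were already failing last week are left out on purpose, so the alert points at the drift rather than at known breakage.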

What good cases look like

  • Specific. “Returns a JSON object with ticker and price” beats “Returns the right answer”.
  • Stable. Avoid prompts whose correct answer changes daily. Use phrasing like “describes a real, recent…” instead of locking in a specific result.
  • Scoped to one behavior. One case per behavior makes failures easy to read.
  • Anchored to tools. expected_tool_calls catches the failure mode where the agent confidently makes things up instead of calling a tool.
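To make the last point concrete, here is a small sketch of the kind of comparison a reliability check performs. It is not ReliabilityEval itself: missing_tool_calls is a hypothetical helper, and "web_search" is an assumed tool name standing in for whatever _WEB_SEARCH_TOOL holds.

```python
def missing_tool_calls(expected, fired):
    """Return the expected tool names that never fired, in expected order."""
    fired_set = set(fired)
    return [name for name in expected if name not in fired_set]


expected = ("web_search",)

# An agent that answered from memory: the text may sound plausible,
# but no tool was called, so the case fails on tools alone.
fabricated_run = missing_tool_calls(expected, fired=())

# An agent that searched before answering.
grounded_run = missing_tool_calls(expected, fired=("web_search",))
```

An empty list means every expected tool fired; anything else names exactly which call was skipped, which keeps the failure easy to read.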
