Evaluation Strategy for Google ADK Agents

The official ADK evaluate guide makes the case bluntly: if you ship a Google ADK agent without an automated eval suite, every prompt change becomes a roll of the dice.

This page describes a four-tier evaluation cadence and maps each tier to TraptureIQ's Eval modules.


What ADK gives you

ADK ships an eval framework with two evaluation modes:

  1. Final-response evaluation — does the agent's last message match the expected output?
  2. Trajectory evaluation — did the agent take the expected decision path? Which tools did it call? Which sub-agents did it delegate to?

Test cases live in .test.json files alongside your agent code. The framework also supports custom metrics and environment simulation for tools that have side effects.


The four eval tiers

Tier       | When it runs                 | What it catches                                | TraptureIQ module
-----------|------------------------------|------------------------------------------------|------------------------------
Smoke      | On every prompt change       | "Did I break the basics?"                      | Custom Eval
Trajectory | Before each release          | Did the agent follow the right reasoning path? | Custom Eval — trajectory mode
Security   | Weekly + before each release | Prompt injection, jailbreaks, RAI violations   | Security Eval
Load       | Before traffic spikes        | Latency, throughput, p95 under load            | Load Test
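
The cadence in the table can be written down as data, so a CI job or scheduler can look up which tiers to run for a given event. This is a minimal sketch: `Trigger`, `TIER_SCHEDULE`, and `tiers_for` are hypothetical names invented for illustration, not part of ADK or TraptureIQ.

```python
from enum import Enum


class Trigger(Enum):
    """Events that can start an eval run (hypothetical names)."""
    PROMPT_CHANGE = "prompt_change"
    RELEASE = "release"
    WEEKLY = "weekly"
    TRAFFIC_SPIKE = "traffic_spike"


# Tier -> triggers, transcribed from the table above.
TIER_SCHEDULE = {
    "smoke": {Trigger.PROMPT_CHANGE},
    "trajectory": {Trigger.RELEASE},
    "security": {Trigger.WEEKLY, Trigger.RELEASE},
    "load": {Trigger.TRAFFIC_SPIKE},
}


def tiers_for(trigger: Trigger) -> list[str]:
    """Return the tiers that should run for a trigger, in alphabetical order."""
    return sorted(
        tier for tier, triggers in TIER_SCHEDULE.items() if trigger in triggers
    )


release_tiers = tiers_for(Trigger.RELEASE)
```

Keeping the schedule in one mapping means a change to the cadence is a one-line diff rather than an edit to several pipeline files.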

Checklist

1. Maintain a smoke test set of 10-20 cases

Every Google ADK agent needs at least 10 hand-picked "this must work" cases:

  • 3-5 happy-path examples
  • 3-5 edge cases (empty input, very long input, unicode, code blocks)
  • 2-3 known regressions from prior incidents
  • 2-3 cross-language or accessibility cases if applicable

Run these on every prompt change. Failures block deploy.
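
One way to make "failures block deploy" concrete is a gate that returns a non-zero exit code when any smoke case fails. The sketch below assumes the agent can be wrapped as a plain function from prompt to reply. `SmokeCase`, `run_smoke`, `gate`, and `echo_agent` are hypothetical names, and the substring check is a deliberately simple stand-in for whatever pass criterion your eval tool applies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SmokeCase:
    name: str
    prompt: str
    must_contain: str  # substring the reply has to include (case-insensitive)


def run_smoke(cases: list[SmokeCase], agent: Callable[[str], str]) -> list[str]:
    """Run every case and return the names of the failing ones."""
    failures = []
    for case in cases:
        try:
            reply = agent(case.prompt)
        except Exception:
            # A crash counts as a failure, not as a skipped case.
            failures.append(case.name)
            continue
        if case.must_contain.lower() not in reply.lower():
            failures.append(case.name)
    return failures


def gate(cases: list[SmokeCase], agent: Callable[[str], str]) -> int:
    """Exit code for CI: 0 lets the deploy proceed, 1 blocks it."""
    return 1 if run_smoke(cases, agent) else 0


def echo_agent(prompt: str) -> str:
    """Stand-in for a real agent call; it rejects empty input."""
    if not prompt:
        raise ValueError("empty input")
    return f"You said: {prompt}"


CASES = [
    SmokeCase("happy_path", "hello", "hello"),
    SmokeCase("empty_input", "", "please provide"),
]

failing = run_smoke(CASES, echo_agent)
exit_code = gate(CASES, echo_agent)
```

In a pipeline, the script would end with `sys.exit(gate(CASES, agent))` so that a single failing case stops the deploy step.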

2. Add trajectory eval for multi-tool agents

If your agent uses more than one tool, final-response evaluation isn't enough — the same answer might come from different (and sometimes wrong) reasoning paths. Use trajectory evaluation to assert:

  • Which tools were called
  • In what order
  • With what arguments

A failing trajectory eval on a passing final-response test is a strong signal of fragility: the agent reached the right answer by a path it may not take again.
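
The three assertions above can be sketched as a comparison between an expected and an actual list of tool calls. This is an illustration only, not ADK's own trajectory metric. `ToolCall`, `trajectory_report`, and the tool names `lookup_order` and `issue_refund` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def trajectory_report(
    expected: list[ToolCall], actual: list[ToolCall]
) -> dict[str, bool]:
    """Compare two trajectories at three levels of strictness."""
    expected_names = [call.name for call in expected]
    actual_names = [call.name for call in actual]
    return {
        # Same tools, same number of times each, in any order.
        "same_tools": sorted(expected_names) == sorted(actual_names),
        # Same tools in the same sequence.
        "same_order": expected_names == actual_names,
        # Same sequence and identical arguments on every call.
        "same_args": expected == actual,
    }


expected = [
    ToolCall("lookup_order", {"order_id": 42}),
    ToolCall("issue_refund", {"order_id": 42}),
]

# The agent refunded before looking the order up.
reordered = [expected[1], expected[0]]

# The agent used the right sequence but refunded the wrong order.
wrong_args = [
    ToolCall("lookup_order", {"order_id": 42}),
    ToolCall("issue_refund", {"order_id": 7}),
]

reordered_report = trajectory_report(expected, reordered)
wrong_args_report = trajectory_report(expected, wrong_args)
```

Reporting the three checks separately shows whether a failure is a wrong tool, a wrong sequence, or a wrong argument, which usually points to different prompt fixes.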

3. Run security eval weekly

Security Eval tests against a curated suite of:

  • Known prompt injection patterns
  • Jailbreak attempts
  • PII exfiltration probes
  • RAI category violations

Schedule it weekly and before each release. Compare results week-over-week to spot drift.
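
A week-over-week comparison can be as simple as a per-category diff of pass rates. The sketch below assumes each run is exported as a mapping from category to pass rate in the range 0.0 to 1.0. That export format, the category keys, and the `drift` function are assumptions made for illustration, not a documented Security Eval output.

```python
def drift(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return categories whose pass rate fell by more than `tolerance`.

    Values in the result are the change in pass rate (negative numbers).
    A category missing from `current` is treated as a pass rate of 0.0.
    """
    regressions = {}
    for category, previous_rate in previous.items():
        delta = current.get(category, 0.0) - previous_rate
        if delta < -tolerance:
            regressions[category] = round(delta, 4)
    return regressions


last_week = {
    "prompt_injection": 1.0,
    "jailbreak": 1.0,
    "pii_exfiltration": 1.0,
    "rai": 1.0,
}
this_week = {
    "prompt_injection": 1.0,
    "jailbreak": 0.96,
    "pii_exfiltration": 1.0,
    "rai": 1.0,
}

regressions = drift(last_week, this_week)
```

Treating a missing category as a failure is a conservative choice: a probe suite that silently stops running should look like a regression, not like a clean week.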

4. Load-test before campaign launches

Any agent that may see traffic spikes (marketing launch, batch job, integration go-live) needs a load test. Measure:

  • p95 latency at projected peak QPS
  • Error rate as concurrency rises
  • Token usage per request. An unexpected jump here usually means context caching isn't taking effect.
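
The three measurements can be computed from raw per-request samples. This is a minimal sketch that uses the nearest-rank definition of a percentile. `Sample`, `percentile`, and `summarize` are hypothetical names, and the sample values are made up for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    ok: bool
    total_tokens: int


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples: list[Sample]) -> dict[str, float]:
    """Aggregate one load-test run into the three headline numbers."""
    count = len(samples)
    return {
        "p95_latency_ms": percentile([s.latency_ms for s in samples], 95),
        "error_rate": sum(not s.ok for s in samples) / count,
        "mean_tokens": sum(s.total_tokens for s in samples) / count,
    }


# Twenty synthetic requests: latencies 100..290 ms, the slowest one failed.
samples = [
    Sample(latency_ms=100 + 10 * i, ok=(i != 19), total_tokens=500)
    for i in range(20)
]

summary = summarize(samples)
```

Running `summarize` once per concurrency level and comparing the results shows how the error rate moves as concurrency rises.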

5. Add custom metrics for domain quality

ADK's eval framework lets you define custom metrics — for example "did the response cite at least one source?" or "did the JSON response validate against the schema?" Use them when your domain has measurable correctness criteria beyond exact-string match.
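
The two example metrics might look like the plain scoring functions below, each returning 1.0 for a pass and 0.0 for a fail. How such functions are registered with ADK is not shown here. The function names, the URL-based definition of "cites a source", and the field-type check (a lightweight stand-in for full JSON Schema validation) are all assumptions.

```python
import json
import re

# Assumption: a "source" is any http(s) URL in the response text.
URL_PATTERN = re.compile(r"https?://\S+")


def cites_source(response: str) -> float:
    """Score 1.0 if the response contains at least one URL, else 0.0."""
    return 1.0 if URL_PATTERN.search(response) else 0.0


def json_has_fields(response: str, required: dict[str, type]) -> float:
    """Score 1.0 if the response is a JSON object whose required fields
    are present with the expected Python types, else 0.0."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    fields_ok = all(
        isinstance(payload.get(key), expected_type)
        for key, expected_type in required.items()
    )
    return 1.0 if fields_ok else 0.0


REFUND_FIELDS = {"order_id": int, "status": str}

cited = cites_source("Refunds take 5 days, see https://example.com/policy")
valid = json_has_fields('{"order_id": 42, "status": "refunded"}', REFUND_FIELDS)
missing = json_has_fields('{"order_id": 42}', REFUND_FIELDS)
```

Returning a float rather than a boolean keeps these checks averageable across an eval set, so a threshold such as "at least 0.9" can be applied to the mean.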

6. Track eval scores over time

Every release should record:

  • Smoke pass rate (should be 100 %)
  • Trajectory pass rate (should be ≥ 95 %)
  • Security pass rate (should be 100 %)
  • p95 latency

A 5-percentage-point drop in trajectory pass rate between releases is louder than any "users seem to like it" anecdote.
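
A per-release record plus a check against the targets above could be sketched as follows. `ReleaseScores`, `check_release`, and the version strings are hypothetical. The thresholds are the ones listed above, and the 5-point drop is checked against the previous release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReleaseScores:
    version: str
    smoke_pass_rate: float       # percent
    trajectory_pass_rate: float  # percent
    security_pass_rate: float    # percent
    p95_latency_ms: float


def check_release(
    current: ReleaseScores,
    previous: Optional[ReleaseScores] = None,
    max_trajectory_drop: float = 5.0,
) -> list[str]:
    """Return human-readable problems; an empty list means all targets are met."""
    problems = []
    if current.smoke_pass_rate < 100.0:
        problems.append("smoke pass rate below 100%")
    if current.trajectory_pass_rate < 95.0:
        problems.append("trajectory pass rate below 95%")
    if current.security_pass_rate < 100.0:
        problems.append("security pass rate below 100%")
    if previous is not None:
        drop = previous.trajectory_pass_rate - current.trajectory_pass_rate
        if drop >= max_trajectory_drop:
            problems.append(
                f"trajectory pass rate dropped {drop:.1f} points since {previous.version}"
            )
    return problems


previous = ReleaseScores("1.4.0", 100.0, 99.0, 100.0, 820.0)
current = ReleaseScores("1.5.0", 100.0, 94.0, 100.0, 790.0)

problems = check_release(current, previous)
```

p95 latency is recorded but not gated in this sketch, because the text above sets no target for it.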


Anti-patterns

  • Vibes-based testing — Manual chat in dev is fine for prototyping. It is not a substitute for an automated eval suite.
  • Eval only at deploy, never on prompt change — Most regressions come from prompt edits. Run smoke on every prompt change, not just every release.
  • Same eval set for security and quality — Security needs adversarial cases the quality set shouldn't contain. Keep them separate.
  • Skipping load tests "because it works in staging" — Cold-start latency under burst traffic is a different failure mode than steady-state.

Where to configure

