Evaluation Strategy for Google ADK Agents

The official ADK evaluate guide makes the case bluntly: if you ship a Google ADK agent without an automated eval suite, every prompt change becomes a roll of the dice.

This page describes a four-tier evaluation cadence and maps each tier to TraptureIQ's Eval modules.

What ADK gives you

ADK ships an eval framework with two evaluation modes:

Final-response evaluation — does the agent's last message match the expected output?
Trajectory evaluation — did the agent take the expected decision path? Which tools did it call? Which sub-agents did it delegate to?

Test cases live in .test.json files alongside your agent code. The framework also supports custom metrics and environment simulation for tools that have side effects.

The four eval tiers

Tier	When it runs	What it catches	TraptureIQ module
Smoke	On every prompt change	"Did I break the basics?"	Custom Eval
Trajectory	Before each release	Did the agent follow the right reasoning path?	Custom Eval — trajectory mode
Security	Weekly + before each release	Prompt injection, jailbreaks, RAI violations	Security Eval
Load	Before traffic spikes	Latency, throughput, p95 under load	Load Test

Checklist

1. Maintain a smoke test set of 10-20 cases

Every Google ADK agent needs at least 10 hand-picked "this must work" cases:

3-5 happy-path examples
3-5 edge cases (empty input, very long input, unicode, code blocks)
2-3 known regressions from prior incidents
2-3 cross-language or accessibility cases if applicable

Run these on every prompt change. Failures block deploy.
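The composition rules and the deploy gate above can be sketched in a few lines of plain Python. This is a minimal illustration, not part of ADK or TraptureIQ: the category labels, the `MIN_PER_CATEGORY` table, and both function names are hypothetical, and each case is assumed to be a dict with a `category` key.

```python
from collections import Counter

# Hypothetical category labels; the minimums mirror the list above.
MIN_PER_CATEGORY = {"happy_path": 3, "edge_case": 3, "regression": 2}


def smoke_set_problems(cases: list[dict]) -> list[str]:
    """Return human-readable problems with a smoke set; an empty list means it is fine."""
    problems = []
    if len(cases) < 10:
        problems.append(f"only {len(cases)} cases, need at least 10")
    counts = Counter(case["category"] for case in cases)
    for category, minimum in MIN_PER_CATEGORY.items():
        if counts[category] < minimum:
            problems.append(f"{category}: {counts[category]} < {minimum}")
    return problems


def deploy_allowed(results: dict[str, bool]) -> bool:
    """Smoke gate: every case must pass, and an empty run never passes."""
    return bool(results) and all(results.values())
```

Treating an empty result set as a failure is a deliberate choice here: a run that silently executed zero cases should not unblock a deploy.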

2. Add trajectory eval for multi-tool agents

If your agent uses more than one tool, final-response evaluation isn't enough — the same answer might come from different (and sometimes wrong) reasoning paths. Use trajectory evaluation to assert:

Which tools were called
In what order
With what arguments

A failing trajectory eval on a passing final-response test is a strong signal of fragility.
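The three assertions above (which tools, in what order, with what arguments) amount to a position-by-position comparison of two call lists. The sketch below shows that comparison in plain Python; `ToolCall` and `trajectory_mismatches` are hypothetical names for illustration, and ADK's built-in trajectory scoring may match calls differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation: the tool's name and the arguments it received."""
    name: str
    args: dict = field(default_factory=dict)


def trajectory_mismatches(expected: list[ToolCall], actual: list[ToolCall]) -> list[str]:
    """Compare tool calls position by position: name, order, and arguments."""
    mismatches = []
    if len(expected) != len(actual):
        mismatches.append(f"expected {len(expected)} tool calls, got {len(actual)}")
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want.name != got.name:
            mismatches.append(f"step {i}: expected tool {want.name!r}, got {got.name!r}")
        elif want.args != got.args:
            mismatches.append(
                f"step {i}: {want.name!r} called with {got.args!r}, expected {want.args!r}"
            )
    return mismatches
```

An empty list means the trajectory matched exactly. A stricter or looser policy (for example, ignoring order for independent tools) would need a different comparison.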

3. Run security eval weekly

Security Eval tests against a curated suite of:

Known prompt injection patterns
Jailbreak attempts
PII exfiltration probes
RAI category violations

Schedule it weekly, and also run it before each release and after any prompt change. Compare results week-over-week to spot drift.
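Week-over-week comparison can be as simple as diffing per-category pass rates. The following sketch assumes each weekly run is exported as a mapping from category name to pass rate in the range 0 to 1; that export format, the category names, and the `security_drift` function are assumptions for illustration, not a documented Security Eval output.

```python
def security_drift(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return categories whose pass rate fell by more than `tolerance`, with the size of the drop."""
    drops = {}
    for category, prev_rate in previous.items():
        # A category missing from the current run counts as fully failing.
        cur_rate = current.get(category, 0.0)
        drop = prev_rate - cur_rate
        if drop > tolerance:
            drops[category] = round(drop, 4)
    return drops
```

For example, if `jailbreak` went from 1.0 last week to 0.96 this week while the other categories held steady, the function reports only `jailbreak` with a drop of 0.04.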

4. Load-test before campaign launches

Any agent that may see traffic spikes (marketing launch, batch job, integration go-live) needs a load test. Measure:

p95 latency at projected peak QPS
Error rate as concurrency rises
Token usage per request (an unexpected jump here usually means context caching isn't taking effect)
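The first two measurements above can be computed from raw request samples with the standard library alone. This sketch uses the nearest-rank definition of p95 and assumes each sample is a `(latency_ms, ok)` pair collected at one concurrency level; the sample shape and function names are hypothetical, not a Load Test export format.

```python
def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest sample with at least 95% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize_load_run(samples: list[tuple[float, bool]]) -> dict[str, float]:
    """Summarize (latency_ms, ok) samples from one concurrency level."""
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    return {"p95_ms": p95(latencies), "error_rate": errors / len(samples)}
```

Running `summarize_load_run` once per concurrency level and comparing the results shows how error rate moves as concurrency rises. Note that this version includes failed requests in the latency figure; whether to exclude them is a policy decision.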

5. Add custom metrics for domain quality

ADK's eval framework lets you define custom metrics — for example "did the response cite at least one source?" or "did the JSON response validate against the schema?" Use them when your domain has measurable correctness criteria beyond exact-string match.
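The two example metrics can be written as plain scoring functions that return 1.0 or 0.0. This is only a sketch of the scoring logic: how a function like this is registered with ADK's eval framework is not shown, the function names are hypothetical, and the key-and-type check is a simplified stand-in for full JSON Schema validation.

```python
import json
import re


def cites_source(response: str) -> float:
    """1.0 if the response contains a URL or a bracketed citation like [1], else 0.0."""
    return 1.0 if re.search(r"https?://\S+|\[\d+\]", response) else 0.0


def json_has_required_keys(response: str, required: dict[str, type]) -> float:
    """1.0 if the response parses as a JSON object with each required key of the expected type."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    ok = all(isinstance(payload.get(key), expected) for key, expected in required.items())
    return 1.0 if ok else 0.0
```

Returning a float rather than a bool keeps the door open for partial credit later, for instance the fraction of required keys present.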

6. Track eval scores over time

Every release should record:

Smoke pass rate (should be 100 %)
Trajectory pass rate (should be ≥ 95 %)
Security pass rate (should be 100 %)
p95 latency

A 5-percentage-point drop in trajectory rate is louder than any "users seem to like it" anecdote.
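The thresholds above, plus the 5-point drop check, fit in one small release gate. The sketch below is a minimal illustration in plain Python: `ReleaseScores` and `release_findings` are hypothetical names, and pass rates are assumed to be stored as fractions between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleaseScores:
    smoke_pass_rate: float       # target: 1.0
    trajectory_pass_rate: float  # target: >= 0.95
    security_pass_rate: float    # target: 1.0
    p95_latency_ms: float        # recorded for trend tracking, not gated here


def release_findings(current: ReleaseScores, previous: Optional[ReleaseScores] = None) -> list[str]:
    """Return threshold violations for this release; an empty list means the gate passes."""
    findings = []
    if current.smoke_pass_rate < 1.0:
        findings.append("smoke pass rate below 100%")
    if current.trajectory_pass_rate < 0.95:
        findings.append("trajectory pass rate below 95%")
    if current.security_pass_rate < 1.0:
        findings.append("security pass rate below 100%")
    if previous is not None:
        # Round before comparing so float noise does not hide an exact 5-point drop.
        drop = round(previous.trajectory_pass_rate - current.trajectory_pass_rate, 6)
        if drop >= 0.05:
            findings.append(f"trajectory pass rate dropped {drop * 100:.1f} points")
    return findings
```

Appending each release's `ReleaseScores` to a stored history gives the time series this step calls for; the gate only needs the latest two entries.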

Anti-patterns

Vibes-based testing — Manual chat in dev is fine for prototyping. It is not a substitute for an automated eval suite.
Eval only at deploy, never on prompt change — Most regressions come from prompt edits. Run smoke on every prompt change, not just every release.
Same eval set for security and quality — Security needs adversarial cases the quality set shouldn't contain. Keep them separate.
Skipping load tests "because it works in staging" — Cold-start latency under burst traffic is a different failure mode than steady-state.

Where to configure

Custom & trajectory evals → Custom Eval
Adversarial security suite → Security Eval
Concurrency / latency under load → Load Test
All eval results in one view → Eval Dashboard
