AI Agent Observability Strategy for Google ADK

When a Google ADK agent fails in production, you have three minutes to answer "what happened?" before the user moves on or the on-call escalates. Without proper observability, those three minutes turn into three days of print() archaeology.

This page maps the four observability primitives in the official ADK docs — logging, traces, metrics, and callbacks — to TraptureIQ modules.


The four ADK observability primitives

Primitive | What it tells you | ADK reference
Logging | What the agent did, in chronological order | adk docs / logging
Traces | The decision path: which sub-agents, tools, and LLM calls fired, with timing | adk docs / traces
Metrics | Aggregates: error rate, latency p95, token usage | adk docs / metrics
Callbacks | Hook into lifecycle events for custom instrumentation | adk docs / callbacks

ADK emits structured OpenTelemetry GenAI signals by default. TraptureIQ consumes these and presents them in dedicated modules.


Checklist

1. Set logging levels intentionally

In ADK, use:

  • DEBUG — full prompts and responses (development only — bloats logs and risks logging PII)
  • INFO — lifecycle events (recommended default for production)
  • WARNING — only when something is recoverably wrong

View structured logs in TraptureIQ's Logs module. Use the filter sidebar to scope by agent, session, severity, and timestamp.
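The level choice above can be made explicit in code with Python's standard logging module. This is a minimal sketch, assuming ADK's loggers live under the "google_adk" namespace; the environment names and the configure_adk_logging helper are hypothetical and not part of ADK or TraptureIQ.

```python
import logging

# Assumption: ADK's loggers are children of the "google_adk" logger.
ADK_LOGGER_NAME = "google_adk"

# Hypothetical mapping from deployment environment to log level.
LEVEL_BY_ENV = {
    "development": logging.DEBUG,  # full prompts and responses; may contain PII
    "production": logging.INFO,    # lifecycle events only
}


def configure_adk_logging(env: str) -> logging.Logger:
    """Set the ADK logger level for an environment, falling back to INFO."""
    level = LEVEL_BY_ENV.get(env, logging.INFO)
    logger = logging.getLogger(ADK_LOGGER_NAME)
    logger.setLevel(level)
    return logger
```

Falling back to INFO for unknown environments means a misconfigured deployment cannot start logging full prompts by accident.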

2. Use traces, not logs, for "which path did it take?"

Logs are a flat stream of events. Traces are a tree of spans that shows causality: which sub-agent invoked which tool, and which LLM call that tool triggered. Use Traces for any incident involving multi-step reasoning or sub-agents.
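To make the "tree of spans" idea concrete, here is a toy tracer built only on the standard library. It is an illustration of parent/child span nesting, not the OpenTelemetry API and not how ADK records traces; the ToyTracer class and the span names are hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class ToyTracer:
    """Records nested spans; a stand-in for a real tracer, for illustration only."""

    def __init__(self) -> None:
        self.root: Optional[Span] = None
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        current = Span(name=name, start=time.perf_counter())
        if self._stack:
            # The parent/child link is what encodes causality.
            self._stack[-1].children.append(current)
        else:
            self.root = current
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, parents before children."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.1f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = ToyTracer()
with tracer.span("root_agent"):
    with tracer.span("billing_sub_agent"):
        with tracer.span("tool:lookup_invoice"):
            pass
        with tracer.span("llm_call"):
            pass

print("\n".join(render(tracer.root)))
```

The printed indentation mirrors the call hierarchy, which is the information a flat log stream cannot give you without manual correlation.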

3. Wire lifecycle callbacks for cross-cutting concerns

ADK's callbacks let you hook the before/after of model, tool, and agent invocations. Use them — not the agent's main code — for:

  • PII redaction before sending to Gemini: before_model_callback
  • Caching expensive tool results: before_tool_callback
  • Custom metrics emission: after_model_callback
  • Authorization gates: before_agent_callback

Code that lives in callbacks stays out of your agent's prompt and decision-making logic.
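A sketch of two of the concerns listed above, PII redaction and tool-result caching, written as plain functions. The parameter names follow the callback signatures as described in the ADK callback docs to the best of our understanding; the objects are duck-typed here so the example runs without ADK installed. The email pattern, the in-memory cache, and the helper names are illustrative assumptions, not production-grade PII detection or an ADK feature.

```python
import json
import re
from types import SimpleNamespace
from typing import Any, Dict, Optional

# Illustrative pattern only; real PII detection needs far more than emails.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def before_model_callback(callback_context: Any, llm_request: Any) -> Optional[Any]:
    """Redact text parts in place; returning None lets the model call proceed."""
    for content in llm_request.contents or []:
        for part in content.parts or []:
            if getattr(part, "text", None):
                part.text = redact_pii(part.text)
    return None


# Hypothetical process-local cache keyed by tool name and arguments.
_tool_cache: Dict[str, Dict[str, Any]] = {}


def _cache_key(tool: Any, args: Dict[str, Any]) -> str:
    return f"{tool.name}:{json.dumps(args, sort_keys=True)}"


def before_tool_callback(tool: Any, args: Dict[str, Any], tool_context: Any) -> Optional[Dict[str, Any]]:
    """Return a cached result to skip the tool, or None to run it."""
    return _tool_cache.get(_cache_key(tool, args))


def after_tool_callback(tool: Any, args: Dict[str, Any], tool_context: Any,
                        tool_response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Store the tool result; returning None keeps the original response."""
    _tool_cache[_cache_key(tool, args)] = tool_response
    return None


# Stand-in objects that mimic the shape of a request and a tool.
request = SimpleNamespace(
    contents=[SimpleNamespace(parts=[SimpleNamespace(text="Contact jane.doe@example.com now")])]
)
before_model_callback(None, request)
print(request.contents[0].parts[0].text)
```

In a real agent these functions would be passed to the agent's callback parameters of the same names. A process-local dictionary is the simplest possible cache; anything shared across workers would need an external store and an expiry policy.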

4. Build a per-agent observability dashboard

For every production agent, pin these in your team's view:

  • Error rate: should stay under 1% in steady state
  • Latency p95: Gemini Pro typically 2-6 s; Flash 0.5-2 s
  • Token usage trend: a steady rise usually means context compaction is not enabled or the prompt has grown
  • Session abandonment: percentage of sessions that end mid-conversation

The Analytics Dashboard and Agent Intelligence modules give you all four out of the box.
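For reference, the four numbers can be computed from raw records as follows. This is a sketch over a hypothetical SessionRecord shape, not TraptureIQ's schema or its actual aggregation method; p95 uses the nearest-rank definition, and token usage is reduced to a simple mean.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SessionRecord:
    """Hypothetical per-session record; field names are illustrative."""
    latency_ms: float
    errored: bool
    abandoned: bool      # True if the session ended mid-conversation
    total_tokens: int


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling avoids float error
    return ordered[rank - 1]


def summarize(records: List[SessionRecord]) -> Dict[str, float]:
    """Compute the four dashboard metrics for a batch of records."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    return {
        "error_rate": sum(r.errored for r in records) / n,
        "latency_p95_ms": p95([r.latency_ms for r in records]),
        "mean_tokens": sum(r.total_tokens for r in records) / n,
        "abandonment_rate": sum(r.abandoned for r in records) / n,
    }


# Twenty synthetic sessions: latencies 100..2000 ms, one error, four abandoned.
records = [
    SessionRecord(latency_ms=100.0 * i, errored=(i == 20), abandoned=(i % 5 == 0), total_tokens=1000)
    for i in range(1, 21)
]
summary = summarize(records)
print(summary)
```

Comparing mean_tokens across successive time windows gives the token usage trend described above.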

5. Drill from analytics → sessions → traces

The investigation flow:

  1. Spot an anomaly in the Analytics dashboard (latency spike, error spike)
  2. Filter Sessions by the same time window and agent → find affected sessions
  3. Open a Trace for one bad session → see exactly which span failed
  4. Open Logs scoped to that session → read the structured payload

This four-click path is the difference between a five-minute incident review and a two-hour one.
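The same drill-down can be expressed over exported data. The sketch below assumes a hypothetical export format (dictionaries with agent, started_at, errored, and a nested trace) and assumes that a failing child span also marks its parent as failed; neither is a documented TraptureIQ or ADK format.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional

Span = Dict[str, Any]
Session = Dict[str, Any]


def affected_sessions(sessions: List[Session], agent: str,
                      start: datetime, end: datetime) -> List[Session]:
    """Step 2: errored sessions for one agent inside the anomaly window."""
    return [
        s for s in sessions
        if s["agent"] == agent and s["errored"] and start <= s["started_at"] <= end
    ]


def deepest_failed_span(span: Span) -> Optional[Span]:
    """Step 3: follow the failing path down to the span where the error began.

    Assumes errors propagate upward, so a failed child implies a failed parent.
    """
    if span["status"] != "error":
        return None
    for child in span.get("children", []):
        failed = deepest_failed_span(child)
        if failed is not None:
            return failed
    return span


spike_start = datetime(2025, 1, 1, 12, 0)
spike_end = spike_start + timedelta(minutes=15)
ok_trace = {"name": "agent", "status": "ok", "children": []}

sessions = [
    {"id": "s-1", "agent": "billing_agent", "errored": True,
     "started_at": spike_start + timedelta(minutes=5),
     "trace": {"name": "billing_agent", "status": "error", "children": [
         {"name": "llm_call", "status": "ok", "children": []},
         {"name": "tool:lookup_invoice", "status": "error", "children": []},
     ]}},
    {"id": "s-2", "agent": "billing_agent", "errored": True,
     "started_at": spike_start - timedelta(hours=1), "trace": ok_trace},
    {"id": "s-3", "agent": "search_agent", "errored": False,
     "started_at": spike_start + timedelta(minutes=2), "trace": ok_trace},
]

bad = affected_sessions(sessions, "billing_agent", spike_start, spike_end)
failed = deepest_failed_span(bad[0]["trace"])
print(bad[0]["id"], failed["name"])
```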


Anti-patterns

  • Logging at DEBUG in production — Floods storage, slows ingest, and risks leaking PII into logs. Use INFO.
  • print() instead of structured logging — Loses the ability to filter by agent, session, severity.
  • Relying on traces alone with no metrics — Traces tell you about one request; metrics tell you about your fleet.
  • No callbacks — Means PII filtering, auth, and metrics live in the agent's main code, polluting prompts and complicating debugging.
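As an alternative to print(), a small JSON formatter on top of the standard logging module keeps agent, session, and severity as filterable fields. This is a minimal sketch: the field names, the logger name, and the use of extra= to carry context are assumptions for illustration, not the format ADK or TraptureIQ requires.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical context fields supplied through `extra=`.
            "agent": getattr(record, "agent", None),
            "session_id": getattr(record, "session_id", None),
        }
        return json.dumps(payload)


# Write to an in-memory stream here; a real service would use stdout or a file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("support_agent_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info("tool call finished", extra={"agent": "billing_agent", "session_id": "s-123"})
logger.debug("full prompt text")  # dropped: below the INFO threshold

print(stream.getvalue().strip())
```

Because every line is a self-describing JSON object, a log backend can filter on agent, session_id, or severity without parsing free text.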
