Evaluation & Testing — Measure and Improve Your Agents

Access: Users with Eval permission enabled by their Admin

The Evaluation module helps you systematically test your AI agents to ensure they are accurate, safe, and performant. Instead of manually checking every response, you can create structured test suites that automatically score your agents across multiple dimensions.
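The idea of a structured, auto-scored test suite can be sketched in a few lines of plain Python. Everything here is illustrative, not the TraptureIQ API: `run_suite`, the case fields, and the substring-match scoring rule are stand-ins for whatever grading the platform applies.

```python
# Minimal sketch of a structured test suite: each case pairs an input
# with an expected answer, and a scoring loop grades the agent.
# `agent` is any callable that takes a question and returns a response.

def run_suite(agent, cases):
    """Return the fraction of cases whose response contains the expected answer."""
    passed = 0
    for case in cases:
        response = agent(case["question"])
        if case["expected"].lower() in response.lower():
            passed += 1
    return passed / len(cases)

# Toy agent that always gives the same answer, for demonstration only.
toy_agent = lambda q: "Paris is the capital of France."

cases = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is the capital of Spain?", "expected": "Madrid"},
]

score = run_suite(toy_agent, cases)
print(score)  # 0.5 — one of two cases matched
```

A real run would replace the substring check with per-metric scoring (groundedness, coherence, and so on), but the shape — cases in, scores out — is the same.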

[Screenshot: Eval dashboard overview]

Why Evaluate Your Agents?

Building an AI agent is the easy part. Ensuring it provides the correct answer 99% of the time, doesn't leak sensitive data, and handles high traffic — that's the hard part.

Without evaluation:

  • You discover bugs only when users report them
  • Security vulnerabilities remain hidden until exploited
  • Performance issues only surface during peak traffic

With evaluation:

  • You catch quality regressions before users see them
  • You proactively find and fix security weaknesses
  • You know your infrastructure limits before going to production

Demo Video

[Embedded video: evaluation module demo]

Three Types of Evaluation

TraptureIQ provides three complementary evaluation types, each testing a different aspect of your agent:

| Type | What It Tests | When to Use It | Analogy |
|---|---|---|---|
| Custom Evaluation | Response quality — accuracy, coherence, fluency, groundedness, hallucinations | Before deploying a new agent or after changing its system prompt | Like a school exam — you provide questions with expected answers and grade the agent |
| Security Evaluation | Security robustness — jailbreak resistance, data leak prevention, responsible AI compliance | Before deploying to production, and periodically thereafter | Like a penetration test — you attack your own agent to find weaknesses |
| Load Testing | Performance under pressure — latency, throughput, error rates under concurrent load | Before launching to a large user base or handling expected traffic spikes | Like a stress test — you simulate many users hitting your agent at the same time |

Core Metric Types

Across all evaluation types, metrics fall into four categories:

| Category | Metrics | Goal |
|---|---|---|
| Accuracy | Groundedness, Correctness, Response Match | Ensure the agent isn't hallucinating facts or giving wrong answers |
| Logic | Coherence, Tool Usage, Reasoning Quality | Ensure multi-step reasoning is sound and tools are used correctly |
| Safety | Toxic Content, Jailbreak Resistance, PII Leak | Prevent the agent from being manipulated by users or generating harmful content |
| Performance | Latency, Token Count, Throughput, Error Rate | Optimize for speed and infrastructure cost |
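As a rough illustration of the Performance category, metrics like average latency, error rate, and tokens per successful call reduce to simple arithmetic over a log of recorded agent calls. The record fields below (`latency_ms`, `ok`, `tokens`) are assumed names for this sketch, not fields the platform exposes.

```python
# Sketch: deriving performance-category metrics from recorded agent calls.
records = [
    {"latency_ms": 120, "ok": True,  "tokens": 85},
    {"latency_ms": 340, "ok": True,  "tokens": 210},
    {"latency_ms": 95,  "ok": False, "tokens": 0},   # failed call
    {"latency_ms": 150, "ok": True,  "tokens": 130},
]

n = len(records)
successes = [r for r in records if r["ok"]]

avg_latency = sum(r["latency_ms"] for r in records) / n    # mean latency over all calls
error_rate = (n - len(successes)) / n                      # fraction of failed calls
avg_tokens = sum(r["tokens"] for r in successes) / len(successes)  # tokens per success

print(avg_latency, error_rate)  # 176.25 0.25
```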

Getting Started with Evaluations

Recommended order for a new agent:

  1. Start with Custom Eval — Create a small test set (5–10 questions with expected answers) to verify basic quality. Go to Custom Eval guide

  2. Run Security Eval — Use the pre-built Jailbreak and DLP templates to check if your agent has obvious vulnerabilities. Go to Security Eval guide

  3. Run a Load Test — Simulate 5–10 concurrent users to verify your agent handles moderate traffic. Go to Load Test guide

  4. Iterate — Review the results, improve your agent's prompt or configuration, and re-run the evaluations to measure improvement.
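Step 3 above — simulating a handful of concurrent users — can be sketched with nothing more than a thread pool from the standard library. The `stub_agent` below is a stand-in for a real HTTP call to your deployed agent; the function names and stats are illustrative, not what the Load Test module produces.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stub_agent(prompt):
    """Stand-in for a real agent request; sleeps to mimic network latency."""
    time.sleep(0.05)
    return "ok"

def load_test(agent, prompt, concurrency=5, requests=20):
    """Fire `requests` calls across `concurrency` workers and return basic stats."""
    latencies = []

    def timed_call(_):
        start = time.perf_counter()
        agent(prompt)
        latencies.append(time.perf_counter() - start)

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, range(requests)))
    elapsed = time.perf_counter() - t0

    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "throughput_rps": requests / elapsed,
    }

stats = load_test(stub_agent, "ping")
print(stats)
```

Swapping the stub for a real request function and raising `concurrency` reproduces the shape of a load test: per-request latencies plus overall throughput.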


Understanding Eval Results

All evaluation types share a common results structure:

  • Run History — A list of all previous evaluation runs, so you can track improvement over time
  • Overall Score — A summary metric indicating how well the agent performed
  • Per-Case Breakdown — Individual results for each test case, showing exactly where the agent succeeded or failed
  • Status Badges — Visual indicators showing the current state of each run:
| Status | What It Means |
|---|---|
| Pending | Evaluation is queued but hasn't started yet |
| Running | Evaluation is in progress — test cases are being sent to the agent |
| Completed | All test cases have been processed and scored |
| Failed | The evaluation encountered an error (e.g., agent unreachable) |

Tips for Beginners

  • Start small — Create eval sets with 5–10 test cases first. You can always add more later.
  • Cover edge cases — Include unusual inputs, boundary conditions, and adversarial prompts in your test cases.
  • Run evaluations regularly — Every time you change an agent's system prompt or configuration, re-run your evaluations to check for regressions.
  • Compare runs — Use the run history to compare scores over time. Are your changes improving or degrading quality?
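Comparing runs can be as simple as diffing per-metric scores between the two most recent entries in your run history. The run records below are illustrative; the principle is to flag any metric whose score dropped.

```python
# Sketch: compare the two most recent runs metric-by-metric to spot regressions.
runs = [
    {"run": 1, "scores": {"groundedness": 0.82, "coherence": 0.90}},
    {"run": 2, "scores": {"groundedness": 0.88, "coherence": 0.86}},
]

previous, latest = runs[-2]["scores"], runs[-1]["scores"]
deltas = {metric: round(latest[metric] - previous[metric], 2) for metric in latest}
regressions = [metric for metric, delta in deltas.items() if delta < 0]

print(deltas)       # {'groundedness': 0.06, 'coherence': -0.04}
print(regressions)  # ['coherence'] — coherence got worse and needs attention
```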