Custom Evaluation — Test Agent Response Quality

Access: Users with Eval permission enabled by their Admin

Custom Evaluation lets you build structured test sets ("golden datasets") and measure how well your agents respond. You provide questions and optionally expected answers, select quality metrics, and the system scores every response automatically.

What is Custom Evaluation?

Think of it as creating a final exam for your AI agent. You write the questions, define what a good answer looks like, and the system grades the agent's responses across multiple dimensions (accuracy, coherence, safety, etc.).

Why it matters:

  • Catch quality regressions before your users do
  • Measure improvement over time as you refine your agent's prompt
  • Create repeatable, objective quality benchmarks
  • Validate that your agent handles domain-specific questions correctly

How to Create an Eval Set

Step 1: Navigate to Custom Eval

Go to Eval in the sidebar → click the Custom Eval tab.

Step 2: Create a New Eval Set

  1. Click New Eval Set.
  2. Enter a name for your eval set (e.g., "Support Bot Quality Test", "Sales FAQ Validation").
  3. Select the target agent — the agent whose responses will be evaluated.

Step 3: Add Test Cases

Each test case consists of:

  • Prompt (required) — The question or input to send to the agent
  • Expected Answer (optional) — The "golden" answer that represents the ideal response. Some metrics require this; others don't.

Example test cases:

| Prompt | Expected Answer |
| --- | --- |
| "What are your business hours?" | "Our business hours are Monday to Friday, 9 AM to 5 PM EST." |
| "How do I reset my password?" | "Go to Settings > Security > Reset Password, then follow the email instructions." |
| "Tell me a joke" | (leave blank — this tests open-ended response quality) |

You can add and delete test cases within the eval set at any time.
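If it helps to think about the structure in data terms, an eval set's test cases boil down to prompt/expected-answer pairs. The sketch below uses hypothetical field names; the product's actual schema may differ:

```python
# Hypothetical representation of an eval set's test cases.
# Field names ("prompt", "expected_answer") are illustrative only.
test_cases = [
    {
        "prompt": "What are your business hours?",
        "expected_answer": "Our business hours are Monday to Friday, 9 AM to 5 PM EST.",
    },
    {
        "prompt": "How do I reset my password?",
        "expected_answer": "Go to Settings > Security > Reset Password, then follow the email instructions.",
    },
    {
        "prompt": "Tell me a joke",
        "expected_answer": None,  # open-ended: metrics that need a golden answer skip this case
    },
]

# Metrics that require an expected answer can only score cases that have one.
scorable = [c for c in test_cases if c["expected_answer"] is not None]
print(len(test_cases), len(scorable))
```

Note that leaving the expected answer blank is valid: it simply limits that case to metrics that judge the response on its own.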

Step 4: Select Evaluation Metrics

Choose one or more metrics to score the agent's responses. Each metric measures a different aspect of quality:

| Metric | What It Measures | Requires Expected Answer? | Best For |
| --- | --- | --- | --- |
| response_match_score | ROUGE-1 text similarity between the agent's response and the expected answer. Fast, no LLM needed. | Yes | When exact phrasing matters (FAQs, factual queries) |
| final_response_match_v2 | Semantic similarity — checks if the meaning is the same, even if the wording differs. Uses an LLM. | Yes | When the meaning matters more than exact words |
| response_evaluation_score | Overall response quality rating. Uses an LLM to judge coherence, helpfulness, and completeness. | No | General quality assessment for any type of response |
| coherence | Is the response logically structured and easy to follow? | No | Complex, multi-part answers |
| fluency | Is the response grammatically correct and natural-sounding? | No | Customer-facing agents where tone matters |
| groundedness | Is the response grounded in source material, not making up facts? | No | Agents that retrieve and summarize external data |
| hallucinations_v1 | Does the response contain unsupported claims or fabrications? | No | Factual agents where accuracy is critical |
| safety_v1 | Is the response free from harmful, hateful, or dangerous content? | No | All agents — safety should always be tested |
| tool_trajectory_avg_score | Did the agent call the correct tools in the correct order? | Yes (tool expectations) | Agents that use tools (search, database, calculators) |
| rubric_based_final_response_quality_v1 | Score the response against a custom rubric you define. | Yes (rubric) | When you have specific, domain-specific quality criteria |
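To build intuition for what response_match_score measures, here is a simplified ROUGE-1 recall sketch: the fraction of the expected answer's words that also appear in the agent's response. This is an illustration, not the product's exact implementation (real ROUGE scorers handle tokenization, stemming, and precision/recall/F1 more carefully):

```python
def rouge1_recall(response: str, expected: str) -> float:
    """Fraction of the expected answer's unigrams found in the response.
    Simplified sketch: lowercase whitespace tokens, no stemming."""
    resp_tokens = set(response.lower().split())
    exp_tokens = expected.lower().split()
    if not exp_tokens:
        return 0.0
    hits = sum(1 for tok in exp_tokens if tok in resp_tokens)
    return hits / len(exp_tokens)

score = rouge1_recall(
    "We are open Monday to Friday, 9 AM to 5 PM EST.",
    "Our business hours are Monday to Friday, 9 AM to 5 PM EST.",
)
print(round(score, 2))  # high overlap, but not 1.0: the phrasing differs
```

This also shows why the semantic metric exists: a correct answer worded differently can score low on pure text overlap.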

Step 5: Build a Rubric (Optional)

For rubric-based metrics, use the Rubric Builder to define your own pass/fail rules. For example:

  • "The response must mention the refund policy"
  • "The response must not include any specific dollar amounts"
  • "The response must recommend contacting support for complex issues"

Step 6: Save the Eval Set

Click Save. Your eval set is stored and ready to run.


How to Run an Eval Set

  1. Open your eval set from the Custom Eval page.
  2. Click Run Eval.
  3. What happens: The system sends every test case prompt to the agent, collects the responses, and scores each one against your selected metrics.
  4. How long it takes: Depends on the number of test cases and the agent's response time. A 10-case eval set typically takes 1–3 minutes.
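The run mechanics above can be sketched as a simple loop: send each prompt, score the response with every selected metric, then average per metric. The agent call and metric functions here are stand-in placeholders, not the real system:

```python
from statistics import mean

def run_eval(test_cases, call_agent, metrics):
    """Send each test-case prompt to the agent and score every response.
    `call_agent` and the metric functions are hypothetical placeholders."""
    results = []
    for case in test_cases:
        response = call_agent(case["prompt"])
        scores = {name: fn(response, case.get("expected_answer"))
                  for name, fn in metrics.items()}
        results.append({"prompt": case["prompt"], "response": response, "scores": scores})
    # Overall score: average each metric across all test cases.
    overall = {name: mean(r["scores"][name] for r in results) for name in metrics}
    return results, overall

# Toy usage with a stubbed agent and a trivial "is the response non-empty" metric.
cases = [{"prompt": "hi", "expected_answer": None},
         {"prompt": "hello", "expected_answer": None}]
results, overall = run_eval(
    cases,
    call_agent=lambda p: p.upper(),
    metrics={"nonempty": lambda r, e: 1.0 if r else 0.0},
)
print(overall["nonempty"])
```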

Viewing Results

Results appear in the Run History panel:

Overall Score:

  • A summary score (typically 0–1 or 0–100 depending on the metric) showing how well the agent performed across all test cases.
  • Displayed with a progress bar for visual clarity.

Per-Case Breakdown: Click any run to see individual results:

| Column | What It Shows |
| --- | --- |
| Test Case | The prompt that was sent |
| Agent Response | What the agent actually replied |
| Expected Answer | Your golden answer (if provided) |
| Score | The metric score for this specific test case |
| Pass/Fail | Whether the score meets the threshold |

Understanding Your Scores

| Score Range | Quality Level | What to Do |
| --- | --- | --- |
| 0.9–1.0 (or 90–100) | Excellent | Your agent handles these cases well. Consider adding harder test cases. |
| 0.7–0.9 (or 70–90) | Good | Solid performance with room for improvement. Review the lower-scoring cases. |
| 0.5–0.7 (or 50–70) | Needs Improvement | Significant quality gaps. Analyze which cases failed and why. |
| Below 0.5 (or below 50) | Poor | The agent is struggling with these cases. Review the system prompt and agent configuration. |
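The bands above translate directly into a small triage helper (thresholds taken from the table; divide 0–100 scores by 100 first):

```python
def quality_level(score: float) -> str:
    """Map a 0-1 metric score to the quality bands in the table above."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Needs Improvement"
    return "Poor"

print(quality_level(0.85), quality_level(0.42))  # Good Poor
```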

Common Use Cases

| Scenario | How to Set It Up |
| --- | --- |
| "I want to test if my FAQ bot gives correct answers" | Create test cases with FAQ questions and expected answers. Use response_match_score or final_response_match_v2. |
| "I want to check if responses are coherent and well-written" | Create test cases with various prompts (no expected answers needed). Use coherence + fluency. |
| "I want to ensure my agent doesn't hallucinate" | Create test cases with factual questions. Use groundedness + hallucinations_v1. |
| "I want to verify my agent uses tools correctly" | Create test cases that require tool use. Use tool_trajectory_avg_score. |
| "I changed my agent's prompt and want to check for regressions" | Re-run your existing eval set and compare the new scores against previous runs in the history. |

Tips for Beginners

  • Start with 5–10 test cases — Cover the most common and most important use cases for your agent.
  • Include both easy and hard cases — Easy cases verify basic functionality; hard cases reveal the agent's limits.
  • Use multiple metrics — Don't rely on a single metric. Use a combination (e.g., response_match_score + coherence + safety_v1) for a complete picture.
  • Re-run after changes — Every time you update your agent's system prompt, re-run your eval set to check for regressions.
  • Grow your eval set over time — When you find a bug in production, add it as a new test case to prevent regressions.