
Load Test — Performance Under Pressure

Access: Users with Eval permission enabled by their Admin

Load Testing simulates multiple users sending messages to your agent simultaneously, helping you understand how your agent's infrastructure performs under concurrent traffic.


What is Load Testing?

Think of it as a stress test for your agent's infrastructure. In production, multiple users may chat with your agent at the same time. Load testing answers:

  • Can my agent handle 10/50/100 simultaneous users?
  • How does response time change under load?
  • At what point does the infrastructure start failing?
  • Are there rate limit issues I need to address?

Why it matters:

  • An agent that responds in 2 seconds with one user might take 30 seconds with 50 concurrent users
  • LLM APIs have rate limits that can cause failures under high load
  • Users abandon conversations if response times exceed 5–10 seconds
  • It's far better to discover these limits in testing than in production
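
Conceptually, a load test is a small concurrency harness: N workers firing requests at once while timing each one. A minimal sketch of the idea (the `call_agent` stub stands in for a real request to your agent; it is an assumption here, simulating ~50 ms of work so the sketch runs anywhere):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_agent(prompt):
    """Stand-in for a real request to your agent (an assumption here);
    it just sleeps ~50 ms so the sketch is runnable."""
    time.sleep(0.05)
    return f"echo: {prompt}"

def run_load_test(prompts, concurrent_users, total_requests):
    """Send total_requests messages using concurrent_users workers,
    recording success/failure and per-request latency."""
    def one_request(i):
        start = time.perf_counter()
        try:
            call_agent(prompts[i % len(prompts)])  # round-robin through prompts
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        return list(pool.map(one_request, range(total_requests)))

results = run_load_test(["What are your business hours?"],
                        concurrent_users=5, total_requests=20)
success_rate = sum(ok for ok, _ in results) / len(results)
print(f"{success_rate:.0%} success across {len(results)} requests")
```

This is the same shape of experiment the Load Test tab runs for you, with the timing bookkeeping handled automatically.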


How to Configure a Load Test

Step 1: Navigate to Load Test

Go to Eval in the sidebar → click the Load Test tab.

Step 2: Set Up the Test

  1. Select the agent to test from the dropdown.
  2. Enter test prompts — one or more messages that will be sent by virtual users. Use prompts that represent typical usage:
    • "What are your business hours?"
    • "Help me debug this Python error: IndexError..."
    • "Summarize the Q3 earnings report"
  3. Set concurrent users — the number of virtual users sending messages simultaneously (e.g., 5, 10, 25, 50).
  4. Set total requests — the total number of messages to send across all virtual users (e.g., 50, 100, 200).
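
The four steps above amount to a small set of parameters. A hedged sketch of the equivalent configuration as a plain data structure (the field names are illustrative, not the product's actual API):

```python
from dataclasses import dataclass

@dataclass
class LoadTestConfig:
    """Illustrative mirror of the Load Test form; names are assumptions."""
    agent: str                 # which agent to test
    prompts: list              # messages the virtual users will send
    concurrent_users: int = 5  # virtual users sending simultaneously
    total_requests: int = 50   # total messages across all users

config = LoadTestConfig(
    agent="support-bot",
    prompts=["What are your business hours?",
             "Summarize the Q3 earnings report"],
    concurrent_users=10,
    total_requests=100,
)
# Each virtual user ends up sending total_requests / concurrent_users messages.
print(config.total_requests // config.concurrent_users, "requests per user")
```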

[Screenshot: Load Test Configuration]

Step 3: Run the Test

Click Start Test.

What happens: The system creates the specified number of virtual users, each sending messages from your prompt list simultaneously. It records timing, success/failure, and response details for every request.

How long it takes: Depends on your concurrent users, total requests, and agent response time. A typical test with 10 users and 50 requests takes 1–5 minutes.
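
A rough lower bound follows from treating the test as waves of concurrent requests: duration ≈ ceil(total_requests / concurrent_users) × average response time. A quick sketch of that arithmetic:

```python
import math

def estimated_duration_s(total_requests, concurrent_users, avg_response_s):
    """Rough lower bound: requests complete in waves of `concurrent_users`,
    each wave taking about one average response time (ignores overhead)."""
    waves = math.ceil(total_requests / concurrent_users)
    return waves * avg_response_s

# 50 requests / 10 users = 5 waves; at ~3 s per response that's 15 s minimum
print(estimated_duration_s(50, 10, 3.0))  # 15.0
```

Real runs take longer than this estimate because latency itself tends to rise under load.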


Understanding the Results

Summary Metrics

| Metric | What It Means | What to Look For |
| --- | --- | --- |
| Total Requests | Number of messages sent | Should match your configuration |
| Successful Requests | Messages that got a valid response | Ideally equals total requests (no failures) |
| Failed Requests | Messages that errored or timed out | Any failures need investigation |
| Success Rate | Percentage of successful requests | Target: 99%+ for production readiness |
| Requests/Second | Throughput: how many requests your agent handles per second | Higher is better |
| Avg Latency | Average response time across all requests | Target: under 5 seconds for good UX |
| Min Latency | Fastest response | Your agent's best-case performance |
| Max Latency | Slowest response | Your agent's worst-case performance; high values indicate bottlenecks |
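
All of these metrics derive from the raw per-request results. A small sketch of the bookkeeping (function and key names are illustrative):

```python
def summarize(latencies_s, elapsed_s, failures=0):
    """Derive the summary metrics above from the latencies of successful
    requests, the wall-clock duration of the test, and a failure count."""
    total = len(latencies_s) + failures
    return {
        "total_requests": total,
        "successful_requests": len(latencies_s),
        "failed_requests": failures,
        "success_rate": len(latencies_s) / total,
        "requests_per_second": total / elapsed_s,
        "avg_latency_s": sum(latencies_s) / len(latencies_s),
        "min_latency_s": min(latencies_s),
        "max_latency_s": max(latencies_s),
    }

metrics = summarize([1.2, 0.8, 2.5, 1.0], elapsed_s=10.0, failures=1)
print(metrics["success_rate"], metrics["requests_per_second"])  # 0.8 0.5
```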

Latency Distribution Chart

A bar chart showing how response times are distributed across buckets:

0-1s:   ████████████ (30 requests)
1-2s:   ████████ (20 requests)
2-5s:   ████ (10 requests)
5-10s:  ██ (5 requests)
10s+:   █ (2 requests)
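
The bucketing behind a chart like this is simple to sketch (bucket edges match the chart above; the function name is illustrative):

```python
def latency_histogram(latencies_s, edges=(1, 2, 5, 10)):
    """Count latencies into the buckets shown above:
    0-1s, 1-2s, 2-5s, 5-10s, and 10s+ for everything past the last edge."""
    labels = ["0-1s", "1-2s", "2-5s", "5-10s", "10s+"]
    counts = dict.fromkeys(labels, 0)
    for t in latencies_s:
        for edge, label in zip(edges, labels):
            if t < edge:
                counts[label] += 1
                break
        else:  # no edge matched: latency is 10 s or more
            counts["10s+"] += 1
    return counts

hist = latency_histogram([0.4, 0.9, 1.5, 3.0, 7.0, 12.0])
print(hist)  # {'0-1s': 2, '1-2s': 1, '2-5s': 1, '5-10s': 1, '10s+': 1}
```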

What to look for:

  • A tight distribution (most requests in the 0–2s range) means consistent performance
  • A long tail (many requests in 5s+ range) means performance degrades under load
  • A bimodal distribution (two peaks) may indicate intermittent infrastructure issues

[Screenshot: Load Test Results]

Per-Request Table

An expandable table showing individual request results:

| Column | What It Shows |
| --- | --- |
| Request # | Sequential request number |
| Status | HTTP status code (200 = success, 429 = rate limited, 500 = server error) |
| Latency | Response time for this specific request |
| Response | The agent's response (expandable) |
| Error | Error details if the request failed |

Common Results and What They Mean

| Result Pattern | Diagnosis | Fix |
| --- | --- | --- |
| All requests succeed, latency stays low | Your agent handles this load well | Try increasing concurrent users to find the limit |
| Success rate drops as load increases | Your infrastructure is at capacity | Scale up your agent's deployment (more instances, more CPU/memory) |
| Many 429 errors (rate limited) | Your LLM API quota is too low | Request a quota increase from your LLM provider, or reduce concurrent users |
| Latency increases linearly with users | Your agent processes requests sequentially | Enable horizontal scaling (multiple agent instances) |
| Some requests time out | Agent infrastructure is overwhelmed | Add more compute resources or implement request queuing |
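
For the 429 case, a common client-side mitigation is retrying with exponential backoff, waiting 1x, 2x, 4x... a base delay between attempts. A minimal sketch (the `fake_agent_call` stub simulates an endpoint that rate-limits the first two attempts; it is an assumption, not a real API):

```python
import time

def fake_agent_call(state={"attempts": 0}):
    """Stub that returns 429 for the first two calls, then 200.
    (Mutable default arg is used deliberately to keep call state.)"""
    state["attempts"] += 1
    return 429 if state["attempts"] <= 2 else 200

def call_with_backoff(call, max_retries=5, base_delay_s=0.01):
    """Retry on HTTP 429 with exponentially growing delays;
    give up and return 429 after max_retries attempts."""
    for attempt in range(max_retries):
        status = call()
        if status != 429:
            return status
        time.sleep(base_delay_s * (2 ** attempt))
    return 429

status = call_with_backoff(fake_agent_call)
print(status)  # 200
```

Backoff smooths out bursts, but it cannot raise your provider's quota; sustained 429s still call for a quota increase or fewer concurrent users.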

Run History

Previous load test runs are saved and accessible from the history list. This lets you:

  • Compare performance across runs — Did scaling up your infrastructure help?
  • Track improvements over time — Is latency improving as you optimize?
  • Establish baselines — Know your agent's performance at different load levels

Tips for Beginners

  • Start small — Begin with 5 concurrent users and 20 total requests. Increase gradually.
  • Use realistic prompts — Test with the kinds of messages your real users will send. Short questions and long documents behave very differently.
  • Watch for 429 errors — These mean your LLM API is rate-limiting you. This is the most common issue and is solved by requesting a higher quota.
  • Run before launch — Always load test before opening your agent to a large user base.
  • Test at 2x expected load — If you expect 20 concurrent users, test with 40 to have a safety margin.

Tip: If you see many rate-limit errors (429s), reduce the number of concurrent users. Your agent's underlying LLM API has its own quota limits that TraptureIQ cannot control.