Stochastic Testing Guide

LLMs are non-deterministic. Your tests should account for that.


The Problem: "It Worked When I Tried It"

You run a test. It passes. You run it again. It fails. Welcome to LLM testing.

Run 1: ✅ Pass (score: 95)
Run 2: ✅ Pass (score: 92)
Run 3: ❌ Fail (score: 68)
Run 4: ✅ Pass (score: 88)
Run 5: ✅ Pass (score: 91)

Was that a bug? Random variation? How do you know if your agent "works"?

Traditional unit testing assumes determinism. LLM testing requires statistical thinking.
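
For contrast, here is the single-run pattern this guide moves away from (EvaluateAsync and agent are hypothetical stand-ins, shown only for illustration):

// ❌ Flaky by construction: a single sample of a stochastic process
[Fact]
public async Task Agent_AnswersWeatherQuestion()
{
    var score = await EvaluateAsync(agent, "What's the weather in Seattle?");
    score.Should().BeGreaterThan(70.0);  // passes or fails depending on the run
}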


The Solution: Stochastic Testing

AgentEval's StochasticRunner handles LLM non-determinism properly:

var stochasticRunner = new StochasticRunner(harness, testOptions);

var result = await stochasticRunner.RunStochasticTestAsync(
    agent, 
    testCase, 
    new StochasticOptions(
        Runs: 10,                    // Run the test 10 times
        SuccessRateThreshold: 0.8    // Expect 80%+ success
    ));

// Assert on statistical behavior
result.Statistics.SuccessRate.Should().BeGreaterThan(0.85);
result.Statistics.MeanScore.Should().BeGreaterThan(90.0);

Instead of asking "did it pass?", ask "how often does it pass?"


Quick Start

Basic Stochastic Test

using AgentEval.Comparison;
using AgentEval.Core;

[Fact]
public async Task Agent_ShouldHaveHighSuccessRate()
{
    // Arrange
    var harness = new MAFTestHarness(chatClient, tools, testOptions);
    var stochasticRunner = new StochasticRunner(harness, testOptions);
    // (construction of agent, chatClient, and tools omitted for brevity)
    
    var testCase = new TestCase
    {
        Input = "What's the weather in Seattle?",
        ExpectedOutput = "Contains temperature and conditions"
    };
    
    var options = new StochasticOptions(
        Runs: 10,
        SuccessRateThreshold: 0.8
    );
    
    // Act
    var result = await stochasticRunner.RunStochasticTestAsync(
        agent, testCase, options);
    
    // Assert
    result.Passed.Should().BeTrue();
    result.Statistics.SuccessRate.Should().BeGreaterThan(0.8);
}

Accessing Full Statistics

var result = await stochasticRunner.RunStochasticTestAsync(
    agent, testCase, options);

var stats = result.Statistics;

// Central tendency
Console.WriteLine($"Mean Score: {stats.MeanScore:F1}");
Console.WriteLine($"Median Score: {stats.MedianScore:F1}");

// Variability
Console.WriteLine($"Std Dev: {stats.StandardDeviation:F2}");
Console.WriteLine($"Min: {stats.MinScore:F1}");
Console.WriteLine($"Max: {stats.MaxScore:F1}");

// Percentiles
Console.WriteLine($"25th Percentile: {stats.Percentile25:F1}");
Console.WriteLine($"75th Percentile: {stats.Percentile75:F1}");
Console.WriteLine($"95th Percentile: {stats.Percentile95:F1}");

// Success tracking
Console.WriteLine($"Success Rate: {stats.SuccessRate:P0}");
Console.WriteLine($"Runs: {stats.TotalRuns}");
Console.WriteLine($"Successes: {stats.SuccessCount}");

StochasticOptions Reference

var options = new StochasticOptions(
    Runs: 10,                         // Number of test iterations
    SuccessRateThreshold: 0.8,        // Minimum success rate (0.0-1.0)
    ScoreThreshold: 70.0,             // Minimum score to count as "success"
    ParallelExecution: false,         // Run tests in parallel?
    ContinueOnFailure: true,          // Continue after first failure?
    WarmupRuns: 1                     // Warm-up runs (not counted)
);

Parameters Explained

| Parameter | Default | Description |
|---|---|---|
| Runs | 10 | Total test iterations |
| SuccessRateThreshold | 0.8 | Pass if success rate ≥ this |
| ScoreThreshold | 70.0 | Score ≥ this counts as a success |
| ParallelExecution | false | Run iterations concurrently |
| ContinueOnFailure | true | Run all iterations even if some fail |
| WarmupRuns | 0 | Initial runs to discard (cache warming) |
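
ScoreThreshold decides whether a single run counts as a success; SuccessRateThreshold decides whether the aggregate test passes. A back-of-envelope sketch of those semantics (illustrative only, not the runner's actual implementation; assumes using System.Linq):

// Hypothetical illustration of the pass/fail semantics described above
double[] scores = { 95, 92, 68, 88, 91 };    // one score per run
double scoreThreshold = 70.0;                // per-run success cutoff
double successRateThreshold = 0.8;           // aggregate pass cutoff

int successes = scores.Count(s => s >= scoreThreshold);   // 4 of 5 runs
double successRate = (double)successes / scores.Length;   // 0.8
bool passed = successRate >= successRateThreshold;        // true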

Statistical Assertions

Success Rate Assertions

// Basic success rate check
result.Statistics.SuccessRate.Should().BeGreaterThan(0.8);

// At least N successes
result.Statistics.SuccessCount.Should().BeGreaterOrEqualTo(8);

// Success rate within a tolerance (0.9 ± 0.05)
result.Statistics.SuccessRate.Should().BeApproximately(0.9, 0.05);

Score Distribution Assertions

// Mean score threshold
result.Statistics.MeanScore.Should().BeGreaterThan(85.0);

// Low variability (consistent behavior)
result.Statistics.StandardDeviation.Should().BeLessThan(10.0);

// No catastrophic failures
result.Statistics.MinScore.Should().BeGreaterThan(50.0);

// High ceiling
result.Statistics.MaxScore.Should().BeGreaterThan(95.0);

Percentile Assertions

// 95th percentile (top of the score distribution)
result.Statistics.Percentile95.Should().BeGreaterThan(75.0);

// Median (typical case)
result.Statistics.MedianScore.Should().BeGreaterThan(88.0);

// Interquartile range (middle 50%)
var iqr = result.Statistics.Percentile75 - result.Statistics.Percentile25;
iqr.Should().BeLessThan(15.0);  // Tight distribution

Visual Output

Console Table

result.PrintTable("Weather Agent Stochastic Results");

Output:

┌────────────────────────────────────────────────────────┐
│          Weather Agent Stochastic Results              │
├────────────────────────────────────────────────────────┤
│ Metric              │ Value                            │
├─────────────────────┼──────────────────────────────────┤
│ Total Runs          │ 10                               │
│ Success Count       │ 9                                │
│ Success Rate        │ 90.0%                            │
│ Mean Score          │ 88.3                             │
│ Median Score        │ 90.5                             │
│ Std Deviation       │ 7.2                              │
│ Min Score           │ 68.0                             │
│ Max Score           │ 97.0                             │
│ 95th Percentile     │ 96.1                             │
│ Status              │ ✅ PASSED                        │
└─────────────────────┴──────────────────────────────────┘

Individual Run Details

foreach (var run in result.IndividualRuns)
{
    var status = run.Passed ? "✅" : "❌";
    Console.WriteLine($"Run {run.RunNumber}: {status} Score={run.Score:F1}");
}

Advanced Patterns

Stochastic Testing Across Test Cases

var testCases = new[]
{
    new TestCase { Input = "Weather in Seattle", /* ... */ },
    new TestCase { Input = "Weather in Tokyo", /* ... */ },
    new TestCase { Input = "Weather in London", /* ... */ },
};

var results = new List<StochasticResult>();

foreach (var testCase in testCases)
{
    var result = await stochasticRunner.RunStochasticTestAsync(
        agent, testCase, options);
    results.Add(result);
}

// Overall statistics
var overallSuccessRate = results.Average(r => r.Statistics.SuccessRate);
overallSuccessRate.Should().BeGreaterThan(0.85);

Comparing Stochastic Results

// Before optimization
var beforeResult = await stochasticRunner.RunStochasticTestAsync(
    oldAgent, testCase, options);

// After optimization
var afterResult = await stochasticRunner.RunStochasticTestAsync(
    newAgent, testCase, options);

// Assert improvement
afterResult.Statistics.SuccessRate.Should()
    .BeGreaterThan(beforeResult.Statistics.SuccessRate);

afterResult.Statistics.MeanScore.Should()
    .BeGreaterThan(beforeResult.Statistics.MeanScore);

Regression Detection

// Load baseline from saved results
var baseline = LoadBaseline("weather-agent-baseline.json");

var current = await stochasticRunner.RunStochasticTestAsync(
    agent, testCase, options);

// Detect regression (>5% drop in success rate)
var successRateDrop = baseline.SuccessRate - current.Statistics.SuccessRate;
successRateDrop.Should().BeLessThan(0.05, 
    because: "we should not regress more than 5%");

// Detect score degradation
var meanDrop = baseline.MeanScore - current.Statistics.MeanScore;
meanDrop.Should().BeLessThan(5.0,
    because: "mean score should not drop more than 5 points");

CI/CD Integration

Setting Appropriate Thresholds

// Development: lenient thresholds
var devOptions = new StochasticOptions(
    Runs: 5,
    SuccessRateThreshold: 0.6
);

// Staging: stricter thresholds
var stagingOptions = new StochasticOptions(
    Runs: 10,
    SuccessRateThreshold: 0.8
);

// Production: strictest thresholds
var prodOptions = new StochasticOptions(
    Runs: 20,
    SuccessRateThreshold: 0.95
);
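
One way to wire these profiles together is to select options from an environment variable (the ENVIRONMENT variable name is an assumption; use whatever your pipeline sets):

// Hypothetical environment switch; defaults to the strictest profile
var env = Environment.GetEnvironmentVariable("ENVIRONMENT") ?? "production";

var options = env switch
{
    "development" => devOptions,
    "staging"     => stagingOptions,
    _             => prodOptions,
};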

GitHub Actions Example

jobs:
  stochastic-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '9.0.x'
      
      - name: Run Stochastic Tests
        env:
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          STOCHASTIC_RUNS: 10
          SUCCESS_THRESHOLD: 0.8
        run: |
          dotnet test --filter "Category=Stochastic" \
            --logger "trx;LogFileName=stochastic-results.trx"
      
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: stochastic-results
          path: '**/stochastic-results.trx'

Cost Control in CI

// Fewer runs in CI to control costs; read the values the workflow
// sets via STOCHASTIC_RUNS and SUCCESS_THRESHOLD above
var ciRuns = int.Parse(
    Environment.GetEnvironmentVariable("STOCHASTIC_RUNS") ?? "5");
var ciThreshold = double.Parse(
    Environment.GetEnvironmentVariable("SUCCESS_THRESHOLD") ?? "0.8",
    System.Globalization.CultureInfo.InvariantCulture);

var options = new StochasticOptions(
    Runs: ciRuns,
    SuccessRateThreshold: ciThreshold
);

Combining with Trace Replay

Run stochastic analysis without repeated API costs: record a batch of real executions once, then replay them as many times as you like.

// RECORD: Capture 10 real executions
var traces = new List<AgentTrace>();
for (int i = 0; i < 10; i++)
{
    var recorder = new TraceRecordingAgent(realAgent);
    await recorder.ExecuteAsync(testCase.Input);
    traces.Add(recorder.GetTrace());
}
TraceSerializer.SaveMany(traces, "stochastic-traces.json");

// REPLAY: Run stochastic analysis without API calls
var savedTraces = TraceSerializer.LoadMany("stochastic-traces.json");
var replayResults = new List<TestResult>();

foreach (var trace in savedTraces)
{
    var replayer = new TraceReplayingAgent(trace);
    var response = await replayer.ReplayNextAsync();
    // Evaluate the replayed response
    replayResults.Add(EvaluateResponse(response));
}

// Calculate statistics
var stats = StatisticsCalculator.Calculate(replayResults);
Console.WriteLine($"Success Rate: {stats.SuccessRate:P0}");

When to Use Stochastic Testing

✅ Use Stochastic Testing For:

| Scenario | Why |
|---|---|
| Critical user-facing features | Know actual reliability, not lucky-run rate |
| LLM/model upgrades | Detect regressions with statistical confidence |
| Prompt changes | Measure impact across multiple runs |
| A/B testing agents | Compare with proper statistics |
| SLA validation | "95% of requests succeed" needs measurement |

❌ Don't Use Stochastic Testing For:

| Scenario | Why |
|---|---|
| Deterministic code | No benefit (same result every time) |
| Trace replay tests | Already deterministic |
| Quick feedback loops | Too slow (run unit tests instead) |
| Cost-sensitive CI | API costs multiply by run count |

Best Practices

1. Choose the Right Number of Runs

5 runs    → Quick feedback, high variance
10 runs   → Good balance (default)
20 runs   → Reliable statistics
50+ runs  → Research/benchmarking
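
The intuition behind these numbers: the uncertainty in a measured success rate shrinks with roughly the square root of the run count. A back-of-envelope check using the normal approximation:

// Approximate 95% confidence half-width for a success rate p over n runs:
// 1.96 * sqrt(p * (1 - p) / n)
static double HalfWidth(double p, int n) => 1.96 * Math.Sqrt(p * (1 - p) / n);

Console.WriteLine(HalfWidth(0.8, 5));   // ~0.35 (almost uninformative)
Console.WriteLine(HalfWidth(0.8, 10));  // ~0.25
Console.WriteLine(HalfWidth(0.8, 20));  // ~0.18
Console.WriteLine(HalfWidth(0.8, 50));  // ~0.11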

2. Set Realistic Thresholds

Don't set the threshold to a 100% success rate; LLMs have inherent variability.

// ❌ Unrealistic
new StochasticOptions(Runs: 10, SuccessRateThreshold: 1.0)

// ✅ Realistic
new StochasticOptions(Runs: 10, SuccessRateThreshold: 0.8)

3. Track Trends Over Time

// Store results over time
var historicalResults = LoadHistoricalResults();
var current = await RunStochasticTest();

// Plot trend
var trend = new SuccessRateTrend(historicalResults.Append(current));
if (trend.IsDecreasing)
{
    Console.WriteLine("⚠️ Success rate trending downward");
}
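
SuccessRateTrend above is a placeholder. A minimal sketch that flags a downward trend from the sign of a least-squares slope over the success-rate series (illustrative, not a library type; feed it e.g. results.Select(r => r.Statistics.SuccessRate)):

using System.Collections.Generic;
using System.Linq;

// Hypothetical trend helper: fits a line to success rates in chronological
// order and reports whether the least-squares slope is negative
public class SuccessRateTrend
{
    public bool IsDecreasing { get; }

    public SuccessRateTrend(IEnumerable<double> successRates)
    {
        var y = successRates.ToArray();
        double xMean = (y.Length - 1) / 2.0;
        double yMean = y.Average();
        double num = 0, den = 0;
        for (int i = 0; i < y.Length; i++)
        {
            num += (i - xMean) * (y[i] - yMean);
            den += (i - xMean) * (i - xMean);
        }
        IsDecreasing = den > 0 && num / den < 0;  // negative slope => decreasing
    }
}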

4. Use Warm-up Runs for Cold Starts

var options = new StochasticOptions(
    Runs: 10,
    WarmupRuns: 2,  // First 2 runs not counted
    SuccessRateThreshold: 0.8
);

Troubleshooting

High Variance in Results

Symptom: Standard deviation > 15

Causes:

  • Prompt is ambiguous
  • Edge cases in test input
  • Model temperature too high

Solutions:

// Lower temperature if possible
var chatClient = new AzureOpenAIChatClient(
    endpoint, credential, deployment,
    new ChatCompletionOptions { Temperature = 0.3f });

// More specific prompts
var testCase = new TestCase
{
    Input = "What is the current temperature in Seattle, WA in Fahrenheit?"
    // Not: "What's the weather like?"
};

Inconsistent Tool Calls

Symptom: Sometimes calls tool, sometimes doesn't

Solution: Use tool-specific assertions in stochastic context

var toolCallRate = result.IndividualRuns
    .Count(r => r.ToolUsage?.ToolCalls.Any(t => t.Name == "WeatherAPI") == true)
    / (double)result.Statistics.TotalRuns;

toolCallRate.Should().BeGreaterThan(0.9, 
    because: "WeatherAPI should be called 90%+ of the time");

Summary

| Concept | Traditional Testing | Stochastic Testing |
|---|---|---|
| Question | "Did it pass?" | "How often does it pass?" |
| Result | Boolean | Statistics |
| Threshold | Pass/Fail | Success Rate |
| Confidence | Low (1 sample) | High (N samples) |
| Cost | 1 API call | N API calls |

Stochastic testing transforms LLM testing from "hope it works" to "know how often it works."


Next Steps


Stop guessing. Start measuring.

Get Started →{ .md-button .md-button--primary }