ADR-008: Calibrated Judge for Multi-Model LLM Evaluation
Status: Accepted
Created: January 12, 2026
Author: AgentEval Team
Supersedes: None
Context
LLM-as-judge evaluations are inherently non-deterministic. A single LLM judge may:
- Give inconsistent scores across runs (variance)
- Have systematic biases toward certain response styles
- Hallucinate or make evaluation errors
Single-judge evaluations limit reliability for enterprise use cases where audit trails and reproducibility matter.
Decision
We implement a CalibratedJudge system that wraps multiple LLM judges to provide higher-confidence evaluations through voting, statistical analysis, and graceful degradation.
Core Design Decisions
1. Factory Pattern for Metric Instantiation
Each judge needs its own metric instance with its own IChatClient:
```csharp
// ✅ Factory pattern - each judge gets its own metric with its own client
var result = await judge.EvaluateAsync(context,
    judgeName => new FaithfulnessMetric(judges[judgeName]));

// ❌ Shared metric - would reuse the same client for all judges (wrong)
var result = await judge.EvaluateAsync(metric, context);
```
Rationale: Metrics like FaithfulnessMetric are stateful - they hold an IChatClient reference. To evaluate with multiple judges (GPT-4o, Claude, Gemini), each must have its own metric instance.
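A minimal setup sketch, assuming a dictionary of named IChatClient instances; the constructor shape, client variables, and option values shown here are illustrative, not the exact API:
```csharp
// Hypothetical setup: each judge name maps to its own IChatClient.
var judges = new Dictionary<string, IChatClient>
{
    ["gpt-4o"] = gpt4oClient,
    ["claude"] = claudeClient,
    ["gemini"] = geminiClient,
};

var judge = new CalibratedJudge(judges, new CalibratedJudgeOptions
{
    Strategy = VotingStrategy.Median,
    MinimumJudgesRequired = 2,
});

// The factory runs once per judge name, so every FaithfulnessMetric
// holds the IChatClient that belongs to that judge.
var result = await judge.EvaluateAsync(context,
    judgeName => new FaithfulnessMetric(judges[judgeName]));
```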
2. Voting Strategies
We support four aggregation strategies via the VotingStrategy enum:

| Strategy | Aggregation | Use Case |
|---|---|---|
| Median | Middle value | Default - robust to outliers |
| Mean | Average | When all judges are equally reliable |
| Unanimous | Require consensus | High-stakes decisions |
| Weighted | Weighted average | When judges have known reliability scores |
Rationale: Different use cases need different aggregation. Median is default because it's robust against a single biased judge.
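A sketch of how these strategies might reduce per-judge scores to a single value; the helper names and the Unanimous handling are illustrative, not the actual implementation:
```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class VotingSketch
{
    // Reduce per-judge scores to a single score according to the strategy.
    public static double Aggregate(
        IReadOnlyList<double> scores,
        VotingStrategy strategy,
        IReadOnlyList<double>? weights = null) => strategy switch
    {
        VotingStrategy.Mean => scores.Average(),
        VotingStrategy.Median => Median(scores),
        VotingStrategy.Weighted when weights is { Count: > 0 } =>
            scores.Zip(weights, (s, w) => s * w).Sum() / weights.Sum(),
        // Unanimous still yields a score; consensus itself is reported via HasConsensus.
        VotingStrategy.Unanimous => Median(scores),
        _ => scores.Average(),
    };

    private static double Median(IReadOnlyList<double> scores)
    {
        var sorted = scores.OrderBy(s => s).ToList();
        int mid = sorted.Count / 2;
        return sorted.Count % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```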
3. Agreement Calculation
Agreement is calculated as the inverse of the coefficient of variation:
Agreement = 100 - (StdDev / Mean × 100)
This produces a 0-100% score where:
- 100% = All judges gave identical scores
- 0% = Maximum disagreement (StdDev equals Mean)
Rationale: Simple, intuitive metric that maps naturally to a percentage.
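A small sketch of the formula in code (a fragment in the same style as the voting sketch above); whether the implementation uses population or sample standard deviation, and whether it clamps negative values to 0, are assumptions here:
```csharp
// Agreement = 100 - (StdDev / Mean × 100), never reported below 0.
static double Agreement(IReadOnlyList<double> scores)
{
    double mean = scores.Average();
    if (mean == 0) return 0; // avoid division by zero when all scores are 0

    double variance = scores.Sum(s => (s - mean) * (s - mean)) / scores.Count; // population variance
    double stdDev = Math.Sqrt(variance);
    return Math.Max(0, 100 - stdDev / mean * 100);
}
```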
4. Confidence Intervals
We use a t-distribution approximation for small samples:
```csharp
var marginOfError = tValue * (stdDev / Math.Sqrt(n));
var lower = mean - marginOfError;
var upper = mean + marginOfError;
```
Rationale: With 2-5 judges (small n), the t-distribution is more appropriate than the z-distribution.
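A fuller sketch of the interval computation using standard two-tailed t critical values; the helper itself and the fixed 95% confidence level are assumptions, not the actual code:
```csharp
// Sketch: 95% confidence interval for the mean judge score.
// Two-tailed t critical values for df = 1..5 are standard table values;
// with more than six judges this simply reuses the df = 5 value.
static (double Lower, double Upper) ConfidenceInterval95(IReadOnlyList<double> scores)
{
    double[] tCritical = { 12.706, 4.303, 3.182, 2.776, 2.571 }; // df = 1..5
    int n = scores.Count;
    if (n < 2) throw new ArgumentException("Need at least two scores.", nameof(scores));

    double mean = scores.Average();
    double stdDev = Math.Sqrt(scores.Sum(s => (s - mean) * (s - mean)) / (n - 1)); // sample std dev
    double tValue = tCritical[Math.Min(n - 2, tCritical.Length - 1)];
    double marginOfError = tValue * (stdDev / Math.Sqrt(n));
    return (mean - marginOfError, mean + marginOfError);
}
```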
5. Graceful Degradation
If one judge fails (timeout, error), evaluation continues with remaining judges:
```csharp
if (judgeScores.Count < options.MinimumJudgesRequired)
    throw new InvalidOperationException("Not enough judges succeeded");
```
Rationale: Enterprise systems need resilience. A single judge timeout shouldn't fail the entire evaluation.
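A sketch of the per-judge failure handling; judgeTasks, logger, and the message wording are illustrative, not the actual implementation:
```csharp
// judgeTasks: per-judge evaluation tasks already started (Dictionary<string, Task<double>>).
var judgeScores = new Dictionary<string, double>();

foreach (var (name, task) in judgeTasks)
{
    try
    {
        judgeScores[name] = await task;
    }
    catch (Exception ex) // timeout, rate limit, provider error
    {
        logger.LogWarning(ex, "Judge {Judge} failed; continuing with remaining judges", name);
    }
}

if (judgeScores.Count < options.MinimumJudgesRequired)
    throw new InvalidOperationException(
        $"Only {judgeScores.Count} of {judgeTasks.Count} judges succeeded; " +
        $"at least {options.MinimumJudgesRequired} are required.");
```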
6. Parallel Execution with Limits
Judges run in parallel with configurable concurrency:
```csharp
var semaphore = new SemaphoreSlim(options.MaxParallelJudges);
var tasks = judges.Select(async judge => { ... });
var results = await Task.WhenAll(tasks);
```
Rationale: Parallel execution reduces latency; semaphore prevents overwhelming rate limits.
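The body elided above ({ ... }) might look roughly like this; the metric factory call and the IMetric.EvaluateAsync signature are assumptions based on the interface below:
```csharp
var tasks = judges.Select(async judge =>
{
    await semaphore.WaitAsync(cancellationToken);   // respect MaxParallelJudges
    try
    {
        var metric = metricFactory(judge.Key);      // fresh metric, fresh IChatClient
        var score = await metric.EvaluateAsync(context, cancellationToken);
        return (JudgeName: judge.Key, Score: score);
    }
    finally
    {
        semaphore.Release();                        // free the slot for the next judge
    }
});
```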
Interface Design
ICalibratedJudge
```csharp
public interface ICalibratedJudge
{
    IReadOnlyList<string> JudgeNames { get; }
    CalibratedJudgeOptions Options { get; }

    Task<CalibratedResult> EvaluateAsync(
        EvaluationContext context,
        Func<string, IMetric> metricFactory,
        CancellationToken cancellationToken = default);

    Task<CalibratedResult> EvaluateAsync<TMetric>(
        TMetric metric,
        EvaluationContext context,
        CancellationToken cancellationToken = default) where TMetric : IMetric;
}
```
CalibratedResult
```csharp
public record CalibratedResult
{
    public required double Score { get; init; }
    public required double Agreement { get; init; }
    public required IReadOnlyDictionary<string, double> JudgeScores { get; init; }
    public double? ConfidenceLower { get; init; }
    public double? ConfidenceUpper { get; init; }
    public double StandardDeviation { get; init; }
    public VotingStrategy Strategy { get; init; }
    public bool HasConsensus { get; init; }
}
```
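Given a CalibratedResult from either overload, a consumer might read it like this (the output formatting is illustrative):
```csharp
// result: a CalibratedResult returned by EvaluateAsync.
Console.WriteLine($"Score: {result.Score:F2} via {result.Strategy} " +
                  $"(agreement {result.Agreement:F0}%, consensus: {result.HasConsensus})");

if (result.ConfidenceLower is double lo && result.ConfidenceUpper is double hi)
    Console.WriteLine($"Confidence interval: [{lo:F2}, {hi:F2}]");

// Audit trail: every judge's individual score is preserved.
foreach (var (judgeName, score) in result.JudgeScores)
    Console.WriteLine($"  {judgeName}: {score:F2}");
```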
Alternatives Considered
A. Single Judge with Multiple Runs
Run the same judge N times and aggregate.
Rejected: Same biases would be amplified. Doesn't address systematic model biases.
B. Judge Chain (Sequential)
Run judges sequentially, with later judges seeing earlier scores.
Rejected: Creates dependencies and ordering effects. Earlier judges influence later ones.
C. Ensemble Weighting Based on Past Performance
Automatically learn judge weights from historical accuracy.
Deferred: Good idea for v2, but adds complexity. Manual weights via the JudgeWeights option suffice for now.
Consequences
Positive
- Higher reliability: Multi-judge voting reduces variance
- Audit trail: JudgeScores dictionary shows exactly how each judge voted
- Confidence quantification: confidence intervals provide uncertainty bounds
- Enterprise-ready: Graceful degradation handles partial failures
Negative
- Increased cost: 3× API calls for 3 judges
- Increased latency: Even with parallelism, overall time increases
- Complexity: Factory pattern is more complex than simple metric evaluation
Mitigations
- Cost: Use CalibratedJudge only for high-stakes evaluations; use a single judge for CI
- Latency: Parallel execution minimizes overhead
- Complexity: Provide both the factory pattern and the simplified EvaluateAsync<TMetric> overload
File Locations
| File | Purpose |
|---|---|
| src/AgentEval/Calibration/CalibratedJudge.cs | Main implementation |
| src/AgentEval/Calibration/ICalibratedJudge.cs | Interface |
| src/AgentEval/Calibration/CalibratedResult.cs | Result record |
| src/AgentEval/Calibration/VotingStrategy.cs | Enum |
| src/AgentEval/Calibration/CalibratedJudgeOptions.cs | Options |
| tests/AgentEval.Tests/Calibration/CalibratedJudgeTests.cs | Unit tests |
Related ADRs
- ADR-001: Metric Naming Prefixes - llm_ prefix for judge-based metrics
- ADR-005: Model Comparison & Stochastic Testing - Related multi-run testing
- ADR-006: Service-Based Architecture & DI - DI patterns