ADR-005: Model Comparison and Stochastic Testing Architecture
Status: Accepted
Date: January 8, 2026
Decision Makers: AgentEval Core Team
Context
AgentEval needs to support two closely related features:
- Stochastic Pass Criteria — Run tests multiple times to handle LLM non-determinism and report statistical results (pass rate, confidence intervals, latency percentiles)
- Model Comparison — Run the same test suite across multiple LLM models (e.g., GPT-4o, Claude 3.5, GPT-4o-mini) and recommend the best model based on accuracy, latency, and cost
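As a rough illustration of the statistics named above, the sketch below computes a pass rate, a Wilson score interval, and nearest-rank latency percentiles from a set of runs. The types `RunOutcome`, `RunSummary`, and `StochasticStats` are hypothetical names chosen for this example. The choice of the Wilson interval and nearest-rank percentiles is also an assumption, since the ADR does not say which estimators AgentEval uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record: the outcome of one stochastic run.
public sealed record RunOutcome(bool Passed, double LatencyMs);

// Hypothetical summary; AgentEval's real StochasticResult may look different.
public sealed record RunSummary(
    int Runs, double PassRate, double CiLower, double CiUpper,
    double P50LatencyMs, double P95LatencyMs);

public static class StochasticStats
{
    // Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%).
    public static (double Lower, double Upper) WilsonInterval(int passes, int runs, double z = 1.96)
    {
        if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));
        double p = (double)passes / runs;
        double z2 = z * z;
        double denom = 1 + z2 / runs;
        double center = (p + z2 / (2.0 * runs)) / denom;
        double half = z * Math.Sqrt(p * (1 - p) / runs + z2 / (4.0 * runs * runs)) / denom;
        return (Math.Max(0, center - half), Math.Min(1, center + half));
    }

    // Nearest-rank percentile over samples that are already sorted ascending.
    public static double Percentile(IReadOnlyList<double> sortedAscending, double q)
    {
        if (sortedAscending.Count == 0) throw new ArgumentException("No samples.");
        int rank = (int)Math.Ceiling(q / 100.0 * sortedAscending.Count);
        return sortedAscending[Math.Clamp(rank, 1, sortedAscending.Count) - 1];
    }

    // Aggregates raw outcomes into pass rate, confidence interval, and latency percentiles.
    public static RunSummary Summarize(IReadOnlyCollection<RunOutcome> outcomes)
    {
        int passes = outcomes.Count(o => o.Passed);
        var (lower, upper) = WilsonInterval(passes, outcomes.Count);
        var latencies = outcomes.Select(o => o.LatencyMs).OrderBy(x => x).ToList();
        return new RunSummary(
            outcomes.Count, (double)passes / outcomes.Count, lower, upper,
            Percentile(latencies, 50), Percentile(latencies, 95));
    }
}
```

The Wilson interval is used here because it behaves better than the plain normal approximation when the run count is small or the pass rate is near 0 or 1, which is common for stochastic test runs.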
These features are tightly coupled because:
- Model comparison requires running tests multiple times per model (stochastic)
- Both features share infrastructure for parallel execution and result aggregation
- Statistical significance testing between models requires stochastic run data
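To make the last point concrete, here is one way pass counts from two models could be compared. It is a sketch only: the ADR does not name a specific test, so the two-proportion z-test and the `ModelSignificance` helper are assumptions made for illustration.

```csharp
using System;

public static class ModelSignificance
{
    // Two-sided two-proportion z-test on pass counts collected from stochastic runs.
    // Returns the z statistic and an approximate p-value.
    public static (double Z, double PValue) TwoProportionZTest(
        int passesA, int runsA, int passesB, int runsB)
    {
        if (runsA <= 0 || runsB <= 0)
            throw new ArgumentOutOfRangeException(nameof(runsA), "Run counts must be positive.");

        double pA = (double)passesA / runsA;
        double pB = (double)passesB / runsB;
        double pooled = (double)(passesA + passesB) / (runsA + runsB);
        double se = Math.Sqrt(pooled * (1 - pooled) * (1.0 / runsA + 1.0 / runsB));

        // Both models passed (or failed) every run: no evidence of a difference.
        if (se == 0) return (0, 1);

        double z = (pA - pB) / se;
        return (z, 2 * (1 - NormalCdf(Math.Abs(z))));
    }

    // Standard normal CDF using the Abramowitz-Stegun 7.1.26 approximation of erf.
    private static double NormalCdf(double x)
    {
        double t = 1 / (1 + 0.3275911 * Math.Abs(x) / Math.Sqrt(2));
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                    + t * (-1.453152027 + t * 1.061405429))));
        double erf = 1 - poly * Math.Exp(-x * x / 2);
        return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
    }
}
```

The normal approximation behind this test is unreliable for small run counts, which is one reason a single run per model cannot support a significance claim.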
The Model Swapping Challenge
Current agent creation in Microsoft Agent Framework (MAF) tightly binds the model:
var chatClient = azureClient.GetChatClient("gpt-4o").AsIChatClient();
var agent = new ChatClientAgent(chatClient, options);
The model deployment cannot be changed after the agent is created. We need a pattern to create agents with different models while preserving the same behavior (instructions, tools, configuration).
Decision
Primary Decision: Agent Factory Pattern
We will implement the Agent Factory Pattern: each IAgentFactory is bound to one model configuration and creates fresh agent instances for that model.
public interface IAgentFactory
{
    string ModelId { get; }
    string ModelName { get; }
    ITestableAgent CreateAgent();
    ModelConfiguration? Configuration { get; }
}
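A factory implementation could look roughly like the sketch below. The ADR does not show the members of `ITestableAgent` or `ModelConfiguration`, so the shapes used here (`InvokeAsync`, `Temperature`, `MaxOutputTokens`) are placeholders, and `DelegateAgentFactory` and `EchoAgent` are hypothetical names rather than AgentEval types.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Assumed shapes: placeholders for types the ADR names but does not define.
public interface ITestableAgent { Task<string> InvokeAsync(string prompt); }
public sealed record ModelConfiguration(double Temperature = 0.0, int? MaxOutputTokens = null);

public interface IAgentFactory
{
    string ModelId { get; }
    string ModelName { get; }
    ITestableAgent CreateAgent();
    ModelConfiguration? Configuration { get; }
}

// Hypothetical factory: binds one model id to a shared "build agent" delegate,
// so instructions and tools stay identical while only the model changes.
public sealed class DelegateAgentFactory : IAgentFactory
{
    private readonly Func<string, ModelConfiguration?, ITestableAgent> _build;

    public DelegateAgentFactory(
        string modelId, string modelName,
        Func<string, ModelConfiguration?, ITestableAgent> build,
        ModelConfiguration? configuration = null)
    {
        ModelId = modelId;
        ModelName = modelName;
        Configuration = configuration;
        _build = build;
    }

    public string ModelId { get; }
    public string ModelName { get; }
    public ModelConfiguration? Configuration { get; }

    // Each call returns a new instance so runs do not share conversation state.
    public ITestableAgent CreateAgent() => _build(ModelId, Configuration);
}

// Stand-in agent for illustration; a real adapter would wrap a chat client.
public sealed class EchoAgent : ITestableAgent
{
    private readonly string _modelId;
    public EchoAgent(string modelId) => _modelId = modelId;
    public Task<string> InvokeAsync(string prompt) => Task.FromResult($"[{_modelId}] {prompt}");
}
```

With this shape, comparing two models means constructing two factories that share the same delegate and differ only in `ModelId`.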
Secondary Decision: Additive Interface for Model Identification
To avoid breaking existing ITestableAgent implementations, we introduce a separate optional interface:
public interface IModelIdentifiable
{
    string? ModelId { get; }
    string? ModelName { get; }
}
Adapters that know their model can implement this interface. Existing implementations continue to work unchanged.
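The sketch below shows how reporting code might consume the optional interface without forcing it on every adapter. `ITestableAgent.InvokeAsync`, `ModelLabels`, `LegacyAgent`, and `IdentifiedAgent` are hypothetical and exist only for this example.

```csharp
#nullable enable
using System.Threading.Tasks;

// Assumed shape: the ADR does not show ITestableAgent's members.
public interface ITestableAgent { Task<string> InvokeAsync(string prompt); }

public interface IModelIdentifiable
{
    string? ModelId { get; }
    string? ModelName { get; }
}

public static class ModelLabels
{
    // Returns the reported model id, or the fallback when the agent does not
    // implement IModelIdentifiable or reports null.
    public static string Describe(ITestableAgent agent, string fallback = "unknown") =>
        (agent as IModelIdentifiable)?.ModelId ?? fallback;
}

// Existing adapter: compiles unchanged because it never mentions the new interface.
public sealed class LegacyAgent : ITestableAgent
{
    public Task<string> InvokeAsync(string prompt) => Task.FromResult("legacy reply");
}

// Newer adapter: opts in to reporting its model.
public sealed class IdentifiedAgent : ITestableAgent, IModelIdentifiable
{
    public string? ModelId => "gpt-4o";
    public string? ModelName => "GPT-4o";
    public Task<string> InvokeAsync(string prompt) => Task.FromResult("identified reply");
}
```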
Tertiary Decision: Unified Stochastic/Comparison Architecture
Stochastic testing and model comparison share the same result aggregation infrastructure:
Test Cases → Stochastic Runner → Statistical Aggregation → Results
                     ↑
          (per model via factory)
                     ↑
        Model Comparer orchestrates
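The layering in the diagram can be sketched as two nested loops: a stochastic layer that repeats one test case, and a comparison layer that calls it once per model. Everything below is a simplified assumption, including the `TestCase` shape, the sequential execution, and the use of mean pass rate as the only score. The stub types exist only so the sketch runs without an LLM.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Assumed, trimmed-down shapes for illustration only.
public interface ITestableAgent { Task<string> InvokeAsync(string prompt); }
public interface IAgentFactory
{
    string ModelId { get; }
    string ModelName { get; }
    ITestableAgent CreateAgent();
}
public sealed record TestCase(string Name, string Prompt, Func<string, bool> Passes);
public sealed record StochasticResult(string TestName, int Runs, int Passes)
{
    public double PassRate => Runs == 0 ? 0 : (double)Passes / Runs;
}

public static class ComparisonPipeline
{
    // Stochastic layer: run one test case `runs` times, each on a fresh agent.
    public static async Task<StochasticResult> RunStochasticAsync(
        IAgentFactory factory, TestCase testCase, int runs)
    {
        int passes = 0;
        for (int i = 0; i < runs; i++)
        {
            var agent = factory.CreateAgent();
            var output = await agent.InvokeAsync(testCase.Prompt);
            if (testCase.Passes(output)) passes++;
        }
        return new StochasticResult(testCase.Name, runs, passes);
    }

    // Comparison layer: reuse the same runner once per model and
    // return the mean pass rate keyed by model id.
    public static async Task<IReadOnlyDictionary<string, double>> CompareAsync(
        IEnumerable<IAgentFactory> factories, IReadOnlyList<TestCase> testCases, int runs)
    {
        var scores = new Dictionary<string, double>();
        foreach (var factory in factories)
        {
            var results = new List<StochasticResult>();
            foreach (var testCase in testCases)
                results.Add(await RunStochasticAsync(factory, testCase, runs));
            scores[factory.ModelId] = results.Count == 0 ? 0 : results.Average(r => r.PassRate);
        }
        return scores;
    }
}

// Deterministic stand-ins so the sketch runs without any LLM.
public sealed class StubAgent : ITestableAgent
{
    private readonly string _reply;
    public StubAgent(string reply) => _reply = reply;
    public Task<string> InvokeAsync(string prompt) => Task.FromResult(_reply);
}

public sealed class StubFactory : IAgentFactory
{
    private readonly string _reply;
    public StubFactory(string modelId, string reply)
    {
        ModelId = modelId;
        ModelName = modelId;
        _reply = reply;
    }
    public string ModelId { get; }
    public string ModelName { get; }
    public ITestableAgent CreateAgent() => new StubAgent(_reply);
}
```

Because the comparison layer only calls the stochastic layer, the stochastic runner remains usable on its own for single-model testing.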
Alternatives Considered
Alternative 1: IChatClient Injection/Swapping
Approach: Inject different IChatClient instances into the same agent structure at runtime.
Pros:
- Single agent instance, swap the underlying client
- More memory efficient
Cons:
- Requires MAF-specific knowledge of agent internals
- Not all agent frameworks support client swapping
- Breaks encapsulation
- Would require changes to ChatClientAgent or reflection
Decision: Rejected — Too tightly coupled to MAF internals.
Alternative 2: Reflection-Based Model Swapping
Approach: Use reflection to modify the deployment name in the existing chat client.
Pros:
- No new interfaces needed
- Works with existing agents
Cons:
- Fragile — depends on internal implementation details
- Will break when MAF updates
- Not type-safe
- Hard to test
Decision: Rejected — Too fragile and unmaintainable.
Alternative 3: Configuration-Based Model Selection
Approach: Pass model configuration via environment variables or config files that the agent reads.
Pros:
- Simple to implement
- No code changes for users
Cons:
- Less type-safe
- Harder to test programmatically
- Requires re-initialization per model
- Can't compare models in a single process run
Decision: Rejected — Not suitable for programmatic comparison.
Alternative 4: Extend ITestableAgent with ModelId
Approach: Add string? ModelId property directly to ITestableAgent.
Pros:
- Single interface
- All agents must report their model
Cons:
- Breaking change for all existing ITestableAgent implementations
- Many agents don't know their model (generic adapters)
- Violates Interface Segregation Principle
Decision: Rejected — Breaking change, ISP violation.
Consequences
Positive
- Framework Agnostic — Factory pattern works with any agent framework, not just MAF
- Non-Breaking — Existing ITestableAgent implementations continue to work
- Testable — Easy to create mock factories for unit testing
- Clear Separation — Agent logic is separated from model configuration
- Extensible — New providers (Anthropic, Google) just need new factory implementations
- Composable — Stochastic runner works standalone or within model comparer
Negative
- Factory per Provider — Each cloud provider needs its own factory implementation
- Learning Curve — Users must understand the factory pattern to use model comparison
- More Types — Additional interfaces and classes to maintain
Neutral
- Optional Feature — Simple single-model testing doesn't require factories
- MAF-Specific Factory — We provide AzureOpenAIAgentFactory for the common case
Implementation Notes
File Structure
src/AgentEval/
├── Comparison/                      # New folder
│   ├── IAgentFactory.cs
│   ├── IStochasticRunner.cs
│   ├── IModelComparer.cs
│   └── ... (implementations)
├── Core/
│   └── IModelIdentifiable.cs        # New interface
└── MAF/
    └── AzureOpenAIAgentFactory.cs   # MAF-specific factory
Key Interfaces
// Factory creates agents bound to a specific model
public interface IAgentFactory
{
    string ModelId { get; }
    string ModelName { get; }
    ITestableAgent CreateAgent();
    ModelConfiguration? Configuration { get; }
}

// Optional: agents can report their model
public interface IModelIdentifiable
{
    string? ModelId { get; }
    string? ModelName { get; }
}

// Stochastic runner executes one test case multiple times and aggregates the runs
public interface IStochasticRunner
{
    Task<StochasticResult> RunAsync(
        ITestableAgent agent,
        TestCase testCase,
        StochasticOptions? options = null);
}

// Model comparer orchestrates cross-model comparison; AddModel returns the comparer for chaining
public interface IModelComparer
{
    IModelComparer AddModel(IAgentFactory factory);
    Task<ModelComparisonResult> CompareAsync(
        IEnumerable<TestCase> testCases,
        ModelComparisonOptions? options = null);
}
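One possible IStochasticRunner implementation, written to illustrate the shared parallel-execution infrastructure mentioned in the Context, is sketched below. The members of `ITestableAgent`, `TestCase`, `StochasticOptions`, and `StochasticResult` are assumptions, since the ADR only names these types. `ParallelStochasticRunner` and `AlternatingAgent` are hypothetical. The sketch also assumes the agent is safe to invoke concurrently, which may not hold for agents that keep conversation state.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Assumed shapes for types the ADR names but does not define.
public interface ITestableAgent { Task<string> InvokeAsync(string prompt); }
public sealed record TestCase(string Name, string Prompt, Func<string, bool> Passes);
public sealed record StochasticOptions(int Runs = 10, int MaxParallelism = 4);
public sealed record StochasticResult(int Runs, int Passes)
{
    public double PassRate => (double)Passes / Runs;
}

public interface IStochasticRunner
{
    Task<StochasticResult> RunAsync(
        ITestableAgent agent, TestCase testCase, StochasticOptions? options = null);
}

public sealed class ParallelStochasticRunner : IStochasticRunner
{
    public async Task<StochasticResult> RunAsync(
        ITestableAgent agent, TestCase testCase, StochasticOptions? options = null)
    {
        options ??= new StochasticOptions();
        if (options.Runs <= 0 || options.MaxParallelism <= 0)
            throw new ArgumentOutOfRangeException(nameof(options));

        // The semaphore caps how many invocations are in flight at once.
        using var gate = new SemaphoreSlim(options.MaxParallelism);

        async Task<bool> RunOnceAsync()
        {
            await gate.WaitAsync();
            try { return testCase.Passes(await agent.InvokeAsync(testCase.Prompt)); }
            finally { gate.Release(); }
        }

        bool[] outcomes = await Task.WhenAll(
            Enumerable.Range(0, options.Runs).Select(_ => RunOnceAsync()));
        return new StochasticResult(options.Runs, outcomes.Count(passed => passed));
    }
}

// Deterministic stand-in: replies "ok" on odd-numbered calls and "fail" on even ones.
public sealed class AlternatingAgent : ITestableAgent
{
    private int _calls;
    public Task<string> InvokeAsync(string prompt) =>
        Task.FromResult(Interlocked.Increment(ref _calls) % 2 == 1 ? "ok" : "fail");
}
```

Bounding concurrency matters in practice because hosted model endpoints typically enforce rate limits, so an unbounded `Task.WhenAll` over many runs could trigger throttling and skew latency measurements.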
Validation
This decision will be validated by:
- Unit tests — Factory pattern produces correct agents
- Integration tests — Stochastic runner aggregates results correctly
- Sample code — Sample14 (Stochastic) and Sample15 (Model Comparison) demonstrate usage
- User feedback — Monitor GitHub issues for usability concerns
Related Documents
- Model Comparison Guide
- Stochastic Testing Guide
- Sample14: Stochastic Testing
- Sample15: Model Comparison
ADR maintained by AgentEval team. Status changes require team review.