AgentEval Architecture
Understanding the component structure and design patterns of AgentEval
Overview
AgentEval is designed with a layered architecture that separates concerns and enables extensibility. The framework follows SOLID principles, with interface segregation being particularly important for the metric hierarchy.
Component Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                                  AgentEval                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │                               Core Layer                               │ │
│ ├────────────────────────────────────────────────────────────────────────┤ │
│ │                                                                        │ │
│ │  Interfaces:                                                           │ │
│ │  ┌─────────────┐ ┌───────────────┐ ┌──────────────────┐ ┌──────────┐   │ │
│ │  │   IMetric   │ │IEvaluableAgent│ │IEvaluationHarness│ │IEvaluator│   │ │
│ │  └─────────────┘ └───────────────┘ └──────────────────┘ └──────────┘   │ │
│ │  ┌─────────────────┐                                                   │ │
│ │  │IExporterRegistry│                                                   │ │
│ │  └─────────────────┘                                                   │ │
│ │                                                                        │ │
│ │  Utilities:                                                            │ │
│ │  ┌──────────────┐ ┌───────────────┐ ┌─────────────┐ ┌─────────────┐    │ │
│ │  │MetricRegistry│ │ScoreNormalizer│ │LlmJsonParser│ │ RetryPolicy │    │ │
│ │  └──────────────┘ └───────────────┘ └─────────────┘ └─────────────┘    │ │
│ │                                                                        │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │                             Metrics Layer                              │ │
│ ├────────────────────────────────────────────────────────────────────────┤ │
│ │                                                                        │ │
│ │  RAG Metrics:           Agentic Metrics:    Embedding Metrics:         │ │
│ │  ┌───────────────────┐  ┌────────────────┐  ┌───────────────────┐      │ │
│ │  │ Faithfulness      │  │ ToolSelection  │  │ AnswerSimilarity  │      │ │
│ │  │ Relevance         │  │ ToolArguments  │  │ ContextSimilarity │      │ │
│ │  │ ContextPrecision  │  │ ToolSuccess    │  │ QuerySimilarity   │      │ │
│ │  │ ContextRecall     │  │ TaskCompletion │  └───────────────────┘      │ │
│ │  │ AnswerCorrectness │  │ ToolEfficiency │                             │ │
│ │  └───────────────────┘  └────────────────┘                             │ │
│ │                                                                        │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │                            Assertions Layer                            │ │
│ ├────────────────────────────────────────────────────────────────────────┤ │
│ │                                                                        │ │
│ │ ┌─────────────────────┐ ┌───────────────────────┐ ┌────────────────────┐ │
│ │ │ ToolUsageAssertions │ │ PerformanceAssertions │ │ ResponseAssertions │ │
│ │ │ .HaveCalledTool()   │ │ .HaveDurationUnder()  │ │ .Contain()         │ │
│ │ │ .BeforeTool()       │ │ .HaveTTFTUnder()      │ │ .MatchPattern()    │ │
│ │ │ .WithArguments()    │ │ .HaveCostUnder()      │ │ .HaveLength()      │ │
│ │ └─────────────────────┘ └───────────────────────┘ └────────────────────┘ │
│ │                                                                        │ │
│ │  ┌────────────────────────────────────────────────────────────────────┐  │
│ │  │                         WorkflowAssertions                         │  │
│ │  │  .HaveStepCount()       .ForExecutor()        .HaveGraphStructure()│  │
│ │  │  .HaveExecutedInOrder() .HaveCompletedWithin() .HaveTraversedEdge()│  │
│ │  │  .HaveNoErrors()        .HaveNonEmptyOutput()  .HaveExecutionPath()│  │
│ │  └────────────────────────────────────────────────────────────────────┘  │
│ │                                                                        │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │                       Workflow Evaluation Layer                        │ │
│ ├────────────────────────────────────────────────────────────────────────┤ │
│ │                                                                        │ │
│ │  ┌─────────────────────────────┐  ┌─────────────────────────────┐      │ │
│ │  │ WorkflowEvaluationHarness   │  │ MAFWorkflowAdapter          │      │ │
│ │  │ .RunWorkflowTestAsync()     │  │ .FromMAFWorkflow()          │      │ │
│ │  │ .WithTimeout()              │  │ .ExtractGraph()             │      │ │
│ │  │ .WithAssertions()           │  │ .TrackPerformance()         │      │ │
│ │  └─────────────────────────────┘  └─────────────────────────────┘      │ │
│ │                                                                        │ │
│ │  ┌─────────────────────────────┐  ┌─────────────────────────────┐      │ │
│ │  │ MAFWorkflowEventBridge      │  │ WorkflowTraceRecorder       │      │ │
│ │  │ .ProcessEventsAsync()       │  │ .RecordStep()               │      │ │
│ │  │ .HandleTimeout()            │  │ .ToAgentTrace()             │      │ │
│ │  │ .StreamEvents()             │  │ .Serialize()                │      │ │
│ │  └─────────────────────────────┘  └─────────────────────────────┘      │ │
│ │                                                                        │ │
│ │  ┌─────────────────────────────┐  ┌─────────────────────────────┐      │ │
│ │  │ WorkflowBuilder             │  │ WorkflowAssemblyBinder      │      │ │
│ │  │ .BindAsExecutor()           │  │ .BuildFromAssembly()        │      │ │
│ │  │ .UseEventStreaming()        │  │ .DiscoverAgents()           │      │ │
│ │  │ .WithTimeout()              │  │ .ValidateBinding()          │      │ │
│ │  └─────────────────────────────┘  └─────────────────────────────┘      │ │
│ │                                                                        │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │                            Benchmarks Layer                            │ │
│ ├────────────────────────────────────────────────────────────────────────┤ │
│ │                                                                        │ │
│ │  ┌─────────────────────────┐ ┌─────────────────────────────────────┐   │ │
│ │  │  PerformanceBenchmark   │ │           AgenticBenchmark          │   │ │
│ │  │ • Latency               │ │ • ToolAccuracy                      │   │ │
│ │  │ • Throughput            │ │ • TaskCompletion                    │   │ │
│ │  │ • Cost                  │ │ • MultiStepReasoning                │   │ │
│ │  └─────────────────────────┘ └─────────────────────────────────────┘   │ │
│ │                                                                        │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │                           Integration Layer                            │ │
│ ├────────────────────────────────────────────────────────────────────────┤ │
│ │                                                                        │ │
│ │ ┌──────────────────────┐ ┌─────────────────────────┐ ┌─────────────────┐ │
│ │ │ MAFEvaluationHarness │ │MicrosoftEvaluatorAdapter│ │ChatClientAdapter│ │
│ │ │    (MAF support)     │ │ (MS.Extensions.AI.Eval) │ │    (Generic)    │ │
│ │ └──────────────────────┘ └─────────────────────────┘ └─────────────────┘ │
│ │                                                                        │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │                       Production Infrastructure                        │ │
│ ├────────────────────────────────────────────────────────────────────────┤ │
│ │                                                                        │ │
│ │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐               │ │
│ │  │IResultExporter│  │IDatasetLoader │  │    Tracing/   │               │ │
│ │  │ JUnit/MD/JSON │  │JSONL/YAML/CSV │  │ Record+Replay │               │ │
│ │  └───────────────┘  └───────────────┘  └───────────────┘               │ │
│ │                                                                        │ │
│ │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │  │    RedTeam/   │ │ ResponsibleAI │ │  Calibration  │ │  Comparison   │ │
│ │  │  Attack+Eval  │ │Safety Metrics │ │  Multi-Judge  │ │  Stochastic   │ │
│ │  │ IAttackType-  │ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │  │   Registry    │                                                    │ │
│ │  └───────────────┘                                                    │ │
│ │                                                                        │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Metric Hierarchy
AgentEval uses interface segregation to organize metrics by their requirements:
IMetric (base interface)
│
├── Properties:
│ ├── Name: string
│ └── Description: string
│
├── Methods:
│ └── EvaluateAsync(EvaluationContext, CancellationToken) -> MetricResult
│
├── IRAGMetric : IMetric
│ ├── RequiresContext: bool
│ ├── RequiresGroundTruth: bool
│ │
│ └── Implementations:
│ ├── FaithfulnessMetric - Is response supported by context?
│ ├── RelevanceMetric - Is response relevant to query?
│ ├── ContextPrecisionMetric - Was context useful for the answer?
│ ├── ContextRecallMetric - Does context cover ground truth?
│ └── AnswerCorrectnessMetric - Is response factually correct?
│
├── IAgenticMetric : IMetric
│ ├── RequiresToolUsage: bool
│ │
│ └── Implementations:
│ ├── ToolSelectionMetric - Were correct tools called?
│ ├── ToolArgumentsMetric - Were tool arguments correct?
│ ├── ToolSuccessMetric - Did tool calls succeed?
│ ├── ToolEfficiencyMetric - Were tools used efficiently?
│ └── TaskCompletionMetric - Was the task completed?
│
└── IEmbeddingMetric : IMetric (implicit)
├── RequiresEmbeddings: bool
│
└── Implementations:
├── AnswerSimilarityMetric - Response vs ground truth similarity
├── ResponseContextSimilarityMetric - Response vs context similarity
└── QueryContextSimilarityMetric - Query vs context similarity
Data Flow
Single Agent Evaluation
┌─────────────┐    ┌────────────────────┐    ┌─────────────┐    ┌────────────┐
│  Test Case  │───▶│ IEvaluationHarness │───▶│ Agent Under │───▶│  Response  │
│   (Input)   │    │                    │    │    Test     │    │  (Output)  │
└─────────────┘    └────────────────────┘    └─────────────┘    └────────────┘
                             │                                        │
                             ▼                                        ▼
                    ┌────────────────┐                       ┌────────────────┐
                    │ Tool Tracking  │                       │   Evaluation   │
                    │   (timeline,   │                       │    Context     │
                    │   arguments)   │                       │                │
                    └────────────────┘                       └────────────────┘
                             │                                        │
                             └───────────────────┬────────────────────┘
                                                 ▼
                                       ┌──────────────────┐
                                       │  Metric Runner   │
                                       │  (evaluates all  │
                                       │    configured    │
                                       │     metrics)     │
                                       └──────────────────┘
                                                 │
                                                 ▼
                                       ┌──────────────────┐
                                       │   Test Result    │
                                       │ • Score          │
                                       │ • Passed/Failed  │
                                       │ • ToolUsage      │
                                       │ • Performance    │
                                       │ • FailureReport  │
                                       └──────────────────┘
                                                 │
                                                 ▼
                                       ┌──────────────────┐
                                       │ Result Exporter  │
                                       │ • JUnit XML      │
                                       │ • Markdown       │
                                       │ • JSON           │
                                       └──────────────────┘
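In a test, this flow is driven from the harness. A sketch under assumed API names — the harness constructor and fluent configuration shapes here are illustrative, not the exact signatures:

```csharp
// Sketch: exercising the single-agent data flow above.
// Constructor/overload shapes are assumptions; types follow the diagram.
var harness = new EvaluationHarness(agentUnderTest)
    .WithMetrics(
        new FaithfulnessMetric(judgeClient),
        new RelevanceMetric(judgeClient));

var result = await harness.RunTestAsync(new TestCase
{
    Input = "What is the refund policy?",
    Context = retrievedContext,   // consumed by RAG metrics via EvaluationContext
    GroundTruth = expectedAnswer
});

Console.WriteLine($"Passed: {result.Passed}, Score: {result.Score}");
```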
Workflow Evaluation
┌──────────────────┐    ┌───────────────────────────┐    ┌─────────────────┐
│ WorkflowTestCase │───▶│ WorkflowEvaluationHarness │───▶│   MAFWorkflow   │
│  (Agents+Graph)  │    │                           │    │  (Multi-Agent)  │
└──────────────────┘    └───────────────────────────┘    └─────────────────┘
                                      │                           │
                                      │                           ▼
                                      │         ┌───────────────────┐
                                      │         │ WorkflowExecution │
                                      │         │ • Agent 1         │
                                      │         │ • Agent 2         │
                                      │         │ • Agent N         │
                                      │         │ • Event Stream    │
                                      │         │ • Graph Traversal │
                                      │         └───────────────────┘
                                      │                           │
                                      ▼                           ▼
                         ┌──────────────────────────┐ ┌─────────────────────────┐
                         │  MAFWorkflowEventBridge  │ │ WorkflowExecutionResult │
                         │ • Event Processing       │ │ • Per-Executor Data     │
                         │ • Timeout Handling       │ │ • Graph Definition      │
                         │ • Tool Aggregation       │ │ • Tool Usage            │
                         │ • Performance Tracking   │ │ • Performance           │
                         └──────────────────────────┘ └─────────────────────────┘
                                      │                            │
                                      └─────────────┬──────────────┘
                                                    ▼
                                        ┌────────────────────────┐
                                        │  Workflow Assertions   │
                                        │ • Structure validation │
                                        │ • Per-executor checks  │
                                        │ • Graph verification   │
                                        │ • Tool chain analysis  │
                                        │ • Performance bounds   │
                                        └────────────────────────┘
                                                    │
                                                    ▼
                                        ┌────────────────────────┐
                                        │   WorkflowTestResult   │
                                        │ • Overall Pass/Fail    │
                                        │ • Per-Executor Results │
                                        │ • Graph Visualization  │
                                        │ • Tool Usage Report    │
                                        │ • Performance Summary  │
                                        └────────────────────────┘
Key Models
EvaluationContext
The central data structure passed to all metrics:
public class EvaluationContext
{
    // Identification
    public string EvaluationId { get; init; }
    public DateTimeOffset StartedAt { get; init; }

    // Core data
    public required string Input { get; init; }            // User query
    public required string Output { get; init; }           // Agent response

    // RAG-specific
    public string? Context { get; init; }                  // Retrieved context
    public string? GroundTruth { get; init; }              // Expected answer

    // Agentic-specific
    public ToolUsageReport? ToolUsage { get; init; }       // Tool calls made
    public IReadOnlyList<string>? ExpectedTools { get; init; }

    // Performance
    public PerformanceMetrics? Performance { get; init; }
    public ToolCallTimeline? Timeline { get; init; }       // Execution trace

    // Extensibility
    public IDictionary<string, object?> Properties { get; }
}
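Harnesses normally build the context for you, but constructing one by hand is useful when unit-testing a metric in isolation. A sketch with illustrative values (the `FaithfulnessMetric` judge-client constructor follows the registry example later in this page):

```csharp
// Sketch: drive a single metric directly against a hand-built context.
var context = new EvaluationContext
{
    Input = "What is the capital of France?",
    Output = "The capital of France is Paris.",
    Context = "France is a country in Western Europe. Its capital is Paris.",
    GroundTruth = "Paris"
};

var result = await new FaithfulnessMetric(judgeClient)
    .EvaluateAsync(context, CancellationToken.None);

Console.WriteLine($"{result.MetricName}: {result.Score} ({result.Explanation})");
```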
MetricResult
The result of evaluating a single metric:
public class MetricResult
{
    public required string MetricName { get; init; }
    public required double Score { get; init; }            // 0-100 scale
    public bool Passed { get; init; }
    public string? Explanation { get; init; }
    public IDictionary<string, object>? Details { get; init; }

    // Factory methods
    public static MetricResult Pass(string name, double score, string? explanation = null);
    public static MetricResult Fail(string name, string explanation, double score = 0);
}
ToolUsageReport
Tracks all tool calls made during an agent run:
public class ToolUsageReport
{
    public IReadOnlyList<ToolCallRecord> Calls { get; }
    public int Count { get; }
    public int SuccessCount { get; }
    public int FailureCount { get; }
    public TimeSpan TotalDuration { get; }

    // Fluent assertions
    public ToolUsageAssertions Should();
}
PerformanceMetrics
Captures timing and cost information:
public class PerformanceMetrics
{
    public TimeSpan TotalDuration { get; set; }
    public TimeSpan? TimeToFirstToken { get; set; }
    public TokenUsage? Tokens { get; set; }
    public decimal? EstimatedCost { get; set; }

    // Fluent assertions
    public PerformanceAssertions Should();
}
WorkflowExecutionResult
Result of workflow evaluation with multi-agent data:
public class WorkflowExecutionResult
{
    public required string WorkflowId { get; init; }
    public required DateTimeOffset StartedAt { get; init; }
    public required TimeSpan Duration { get; init; }

    // Graph structure
    public WorkflowGraphDefinition? GraphDefinition { get; init; }

    // Per-executor results
    public IReadOnlyDictionary<string, ExecutorResult> ExecutorResults { get; init; }

    // Aggregated data
    public ToolUsageReport? ToolUsage { get; init; }       // All tool calls
    public PerformanceMetrics? Performance { get; init; }  // Total cost/timing
    public string? FinalOutput { get; init; }              // Workflow output

    // Assertions
    public WorkflowResultAssertions Should();
}
ExecutorResult
Individual agent performance within a workflow:
public class ExecutorResult
{
    public required string ExecutorId { get; init; }
    public required string AgentName { get; init; }
    public string? Input { get; init; }
    public string? Output { get; init; }
    public DateTimeOffset? StartedAt { get; init; }
    public TimeSpan? Duration { get; init; }
    public ToolUsageReport? ToolUsage { get; init; }
    public PerformanceMetrics? Performance { get; init; }
    public bool HasError { get; init; }
    public string? ErrorMessage { get; init; }
}
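Together with `WorkflowExecutionResult`, these per-executor records enable per-agent failure triage after a workflow run; a sketch:

```csharp
// Sketch: surface failing executors from a completed workflow run.
foreach (var (executorId, executor) in workflowResult.ExecutorResults)
{
    if (executor.HasError)
        Console.WriteLine(
            $"{executorId} ({executor.AgentName}) failed: {executor.ErrorMessage}");
    else
        Console.WriteLine(
            $"{executorId} completed in {executor.Duration} " +
            $"with {executor.ToolUsage?.Count ?? 0} tool call(s)");
}
```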
WorkflowGraphDefinition
Represents the workflow structure and execution path:
public class WorkflowGraphDefinition
{
    public IReadOnlyList<WorkflowNode> Nodes { get; init; }
    public IReadOnlyList<WorkflowEdge> Edges { get; init; }
    public string? EntryPoint { get; init; }
    public string? ExitPoint { get; init; }
    public IReadOnlyList<string>? ExecutionPath { get; init; }

    // Validation helpers
    public bool HasNode(string nodeId);
    public bool HasEdge(string source, string target);
    public IEnumerable<string> GetExecutionOrder();
}
Design Patterns
1. Interface Segregation (ISP)
Each metric interface declares only the inputs its category actually depends on:
// RAG metrics need context
public interface IRAGMetric : IMetric
{
    bool RequiresContext { get; }
    bool RequiresGroundTruth { get; }
}

// Agentic metrics need tool usage
public interface IAgenticMetric : IMetric
{
    bool RequiresToolUsage { get; }
}
2. Adapter Pattern
Enables integration with different frameworks:
// Adapt any IChatClient to IEvaluableAgent
public class ChatClientAgentAdapter : IEvaluableAgent
{
    private readonly IChatClient _chatClient;

    public async Task<AgentResponse> InvokeAsync(string input, CancellationToken ct)
    {
        var response = await _chatClient.GetResponseAsync(
            new[] { new ChatMessage(ChatRole.User, input) }, cancellationToken: ct);
        return new AgentResponse { Text = response.Text };
    }
}

// Wrap Microsoft's evaluators for AgentEval
public class MicrosoftEvaluatorAdapter : IMetric
{
    private readonly IEvaluator _msEvaluator;

    public async Task<MetricResult> EvaluateAsync(EvaluationContext context, CancellationToken ct)
    {
        var msResult = await _msEvaluator.EvaluateAsync(...);
        return new MetricResult
        {
            Score = ScoreNormalizer.From1To5(msResult.Score),
            ...
        };
    }
}
3. Fluent API
Intuitive assertion chaining:
result.ToolUsage!
    .Should()
    .HaveCalledTool("SearchTool")
    .BeforeTool("AnalyzeTool")
    .WithArguments(args => args.ContainsKey("query"))
    .And()
    .HaveNoErrors()
    .And()
    .HaveToolCountBetween(1, 5);

result.Performance!
    .Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(5))
    .HaveTimeToFirstTokenUnder(TimeSpan.FromSeconds(1))
    .HaveEstimatedCostUnder(0.10m);
4. Registry Pattern
Centralized metric management:
var registry = new MetricRegistry();
registry.Register(new FaithfulnessMetric(chatClient));
registry.Register(new ToolSelectionMetric(expectedTools));

// Run all registered metrics
foreach (var metric in registry.GetAll())
{
    var result = await metric.EvaluateAsync(context);
}
The registry pattern extends to exporters and attack types:
// Exporter registry (auto-populated via DI)
var exporters = serviceProvider.GetRequiredService<IExporterRegistry>();
var jsonExporter = exporters.GetRequired("Json");
var allFormats = exporters.GetRegisteredFormats(); // Json, Junit, Markdown, Csv, Trx, ...
// Attack type registry (pre-populated with 9 built-in + DI-registered)
var attacks = serviceProvider.GetRequiredService<IAttackTypeRegistry>();
var promptInjection = attacks.GetRequired("PromptInjection");
var llm01 = attacks.GetByOwaspId("LLM01"); // All attacks for OWASP LLM01
Package Structure
The codebase is organized into 6 internal projects (single NuGet package):
src/
├── AgentEval.Abstractions/ # Public contracts
│ ├── Core/ # IMetric, IEvaluableAgent, IEvaluationHarness, etc.
│ ├── Models/ # TestCase, TestResult, ToolCallRecord, PerformanceMetrics
│ ├── Embeddings/ # IAgentEvalEmbeddings
│ ├── Snapshots/ # ISnapshotComparer, ISnapshotStore
│ └── DependencyInjection/ # AgentEvalServiceOptions
│
├── AgentEval.Core/ # Implementations
│ ├── Assertions/ # ToolUsageAssertions, PerformanceAssertions, ResponseAssertions
│ ├── Metrics/ # RAG/, Agentic/, Retrieval/, Safety/, Embedding/
│ ├── Comparison/ # StochasticRunner, ModelComparer, StatisticsCalculator
│ ├── Tracing/ # TraceRecordingAgent, TraceReplayingAgent, ChatTraceRecorder
│ ├── Calibration/ # CalibratedJudge, VotingStrategy
│ ├── Benchmarks/ # PerformanceBenchmark, AgenticBenchmark
│ ├── Adapters/ # MicrosoftEvaluatorAdapter, ChatClientAgentAdapter
│ ├── Testing/ # FakeChatClient
│ └── DependencyInjection/ # AddAgentEval()
│
├── AgentEval.DataLoaders/ # Data loading and export
│ ├── DataLoaders/ # JSON, JSONL, YAML, CSV loaders
│ ├── Exporters/ # JUnit XML, Markdown, JSON, CSV, TRX exporters
│ ├── Output/ # TableFormatter, AgentEvalTestBase, TimeTravelTrace
│ └── DependencyInjection/ # AddAgentEvalDataLoaders()
│
├── AgentEval.MAF/ # Microsoft Agent Framework
│ ├── MAFAgentAdapter.cs # Wraps AIAgent → IStreamableAgent
│ ├── MAFEvaluationHarness.cs # MAF-specific evaluation harness
│ ├── MAFWorkflowAdapter.cs # Workflow integration
│ └── WorkflowEvaluationHarness.cs
│
├── AgentEval.RedTeam/ # Security testing
│ ├── RedTeamRunner.cs # Orchestrator
│ ├── AttackPipeline.cs # Attack execution
│ ├── Attacks/ # 9 built-in attack types
│ ├── Evaluators/ # Probe evaluators
│ ├── ResponsibleAI/ # Toxicity, Bias, Misinformation metrics
│ └── DependencyInjection/ # AddAgentEvalRedTeam()
│
└── AgentEval/ # Umbrella (packaging only)
└── AddAgentEvalAll() # Registers all services from all sub-projects
Metrics Taxonomy
AgentEval organizes metrics into a clear taxonomy to aid discovery and selection. See ADR-007 for the formal decision.
Categorization by Computation Method
| Prefix | Method | Cost | Use Case |
|---|---|---|---|
| `llm_` | LLM-as-judge | API cost | High-accuracy quality assessment |
| `code_` | Code logic | Free | CI/CD, high-volume testing |
| `embed_` | Embedding similarity | Low API cost | Cost-effective semantic checks |
Categorization by Evaluation Domain
| Domain | Interface | Examples |
|---|---|---|
| RAG | `IRAGMetric` | Faithfulness, Relevance, Context Precision |
| Agentic | `IAgenticMetric` | Tool Selection, Tool Success, Task Completion |
| Conversation | Special | ConversationCompleteness |
| Safety | `ISafetyMetric` | Toxicity, Groundedness |
Category Flags (ADR-007)
Metrics can declare multiple categories via MetricCategory flags:
public override MetricCategory Categories =>
    MetricCategory.RAG |
    MetricCategory.RequiresContext |
    MetricCategory.LLMBased;
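Because the categories are combinable flags, callers can filter a metric set with ordinary bitwise tests. A sketch using the registry's `GetAll()` (shown in the Registry Pattern section) and standard LINQ — the assumption here is that the metric base class exposes the `Categories` property declared above:

```csharp
// Sketch: select only non-LLM RAG metrics, e.g. for a cheap CI pass.
var ciMetrics = registry.GetAll()
    .Where(m => m.Categories.HasFlag(MetricCategory.RAG)
             && !m.Categories.HasFlag(MetricCategory.LLMBased));
```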
For complete metric documentation, see:
- Metrics Reference - Complete catalog
- Evaluation Guide - How to choose metrics
Calibration Layer
AgentEval provides judge calibration for reliable LLM-as-judge evaluations. See ADR-008 for design decisions.
CalibratedJudge Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ CalibratedJudge │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Input: │
│ ┌─────────────────┐ ┌─────────────────────────────────────────────────┐ │
│ │EvaluationContext│───▶│ Factory Pattern: Func<string, IMetric> │ │
│ └─────────────────┘ │ Each judge gets its own metric with its client │ │
│ └─────────────────────────────────────────────────┘ │
│ │ │
│ Parallel Execution: ▼ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Judge 1 │ │ Judge 2 │ │ Judge 3 │ │
│ │ (GPT-4o) │ │ (Claude) │ │ (Gemini) │ │
│ │ Score: 85 │ │ Score: 88 │ │ Score: 82 │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ ▼ │
│ Aggregation: ┌─────────────────────────────────┐ │
│ │ VotingStrategy │ │
│ │ • Median (default, robust) │ │
│ │ • Mean (equal weight) │ │
│ │ • Unanimous (require consensus) │ │
│ │ • Weighted (custom weights) │ │
│ └─────────────────────────────────┘ │
│ │ │
│ Output: ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CalibratedResult │ │
│ │ • Score: 85.0 (median) │ │
│ │ • Agreement: 96.2% │ │
│ │ • JudgeScores: {GPT-4o: 85, Claude: 88, Gemini: 82} │ │
│ │ • ConfidenceInterval: [81.5, 88.5] │ │
│ │ • StandardDeviation: 3.0 │ │
│ │ • HasConsensus: true │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Classes
| Class | Purpose |
|---|---|
| `CalibratedJudge` | Coordinates multiple judges with parallel execution |
| `CalibratedResult` | Result with score, agreement, CI, per-judge scores |
| `VotingStrategy` | Aggregation method enum |
| `CalibratedJudgeOptions` | Configuration for timeout, parallelism, consensus |
| `ICalibratedJudge` | Interface for testability |
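Driving the calibration flow above looks roughly like the following sketch. The constructor and option shapes are assumptions based on the classes listed; `ClientFor` is a hypothetical helper that maps a judge name to its chat client:

```csharp
// Sketch: three judges score the same context; scores aggregate by median.
var judge = new CalibratedJudge(
    metricFactory: judgeName => new FaithfulnessMetric(ClientFor(judgeName)),
    judges: new[] { "GPT-4o", "Claude", "Gemini" },
    options: new CalibratedJudgeOptions { Strategy = VotingStrategy.Median });

var calibrated = await judge.EvaluateAsync(context, CancellationToken.None);

if (!calibrated.HasConsensus)
    Console.WriteLine($"Judges disagree (σ = {calibrated.StandardDeviation:F1})");
```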
Model Comparison Markdown Export
AgentEval provides rich Markdown export for model comparison results:
// Full report with all sections
var markdown = result.ToMarkdown();
// Compact table with medals
var table = result.ToRankingsTable();
// GitHub PR comment with collapsible details
var comment = result.ToGitHubComment();
// Save to file
await result.SaveToMarkdownAsync("comparison.md");
Export Options
// Full report (default)
result.ToMarkdown(MarkdownExportOptions.Default);
// Minimal (rankings only)
result.ToMarkdown(MarkdownExportOptions.Minimal);
// Custom
result.ToMarkdown(new MarkdownExportOptions
{
    IncludeStatistics = true,
    IncludeScoringWeights = false,
    HeaderEmoji = "🔬"
});
Behavioral Policy Assertions
Safety-critical assertions for enterprise compliance:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Behavioral Policy Assertions │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ NeverCallTool("DeleteDatabase", because: "admin only") │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Scans all tool calls for forbidden tool name │ │
│ │ Throws BehavioralPolicyViolationException with audit details │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ NeverPassArgumentMatching(@"\d{3}-\d{2}-\d{4}", because: "SSN is PII") │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Scans all tool arguments with regex pattern │ │
│ │ Auto-redacts matched values in exception (e.g., "1***9") │ │
│ │ Throws BehavioralPolicyViolationException with RedactedValue │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ MustConfirmBefore("TransferFunds", because: "requires consent") │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Checks that confirmation tool was called before action │ │
│ │ Default confirmation tools: "get_confirmation", "confirm" │ │
│ │ Throws if action was called without prior confirmation │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
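Wired into a test, the three policies above read as a fluent chain off the same `Should()` entry point used by the other tool-usage assertions; a sketch:

```csharp
// Sketch: enforce the behavioral policies from the diagram on a recorded run.
// A violation throws BehavioralPolicyViolationException with audit details.
result.ToolUsage!
    .Should()
    .NeverCallTool("DeleteDatabase", because: "admin only")
    .NeverPassArgumentMatching(@"\d{3}-\d{2}-\d{4}", because: "SSN is PII")
    .MustConfirmBefore("TransferFunds", because: "requires consent");
```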
BehavioralPolicyViolationException
Structured exception for audit trails:
catch (BehavioralPolicyViolationException ex)
{
    // Structured properties for logging/audit
    Console.WriteLine($"Policy: {ex.PolicyName}");      // "NeverCallTool(DeleteDB)"
    Console.WriteLine($"Type: {ex.ViolationType}");     // "ForbiddenTool"
    Console.WriteLine($"Action: {ex.ViolatingAction}"); // "Called DeleteDB 1 time(s)"
    Console.WriteLine($"Because: {ex.Because}");        // Developer's reason

    // For PII detection
    Console.WriteLine($"Pattern: {ex.MatchedPattern}"); // @"\d{3}-\d{2}-\d{4}"
    Console.WriteLine($"Value: {ex.RedactedValue}");    // "1***9" (auto-redacted)

    // Actionable suggestions
    foreach (var s in ex.Suggestions ?? [])
        Console.WriteLine($"  → {s}");
}
Internal Project Structure
AgentEval ships as a single NuGet package (AgentEval) but is internally organized into 6 projects for maintainability and compile-time dependency enforcement (see ADR-016).
Dependency Graph
AgentEval (NuGet package — umbrella)
├── AgentEval.Abstractions → M.E.AI.Abstractions
├── AgentEval.Core → Abstractions + M.E.AI + M.E.AI.Eval.Quality + S.N.Tensors
├── AgentEval.DataLoaders → Abstractions + Core + YamlDotNet
├── AgentEval.MAF → Abstractions + Core + M.Agents.AI + M.Agents.AI.Workflows
└── AgentEval.RedTeam → Abstractions + Core + M.E.AI + M.E.DI + PdfSharp-MigraDoc
Project Responsibilities
| Project | Files | Purpose |
|---|---|---|
| `AgentEval.Abstractions` | ~48 | Public contracts: IMetric, IEvaluableAgent, models, enums |
| `AgentEval.Core` | ~63 | Implementations: metrics, assertions, comparison, tracing, testing |
| `AgentEval.DataLoaders` | ~23 | Data loaders (JSON/YAML/CSV/JSONL), exporters, output formatting |
| `AgentEval.MAF` | 7 | Microsoft Agent Framework adapters and harnesses |
| `AgentEval.RedTeam` | 61 | Security scanning, attack types, evaluators, compliance reports |
| `AgentEval` (umbrella) | 1 | Packaging + AddAgentEvalAll() DI convenience method |
All projects use RootNamespace=AgentEval so consumers see no namespace changes.
See Also
- Extensibility Guide - Creating custom metrics and plugins
- Embedding Metrics - Semantic similarity evaluation
- Benchmarks Guide - Running standard benchmarks
- Metrics Reference - Complete metric catalog
- Evaluation Guide - Metric selection guidance