AgentEval Architecture

Understanding the component structure and design patterns of AgentEval


Overview

AgentEval is designed with a layered architecture that separates concerns and enables extensibility. The framework follows SOLID principles, with interface segregation being particularly important for the metric hierarchy.


Component Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                              AgentEval                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                           Core Layer                                    │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │                                                                         │ │
│  │  Interfaces:                                                            │ │
│  │  ┌─────────────┐  ┌───────────────┐  ┌──────────────────┐  ┌──────────┐ │ │
│  │  │   IMetric   │  │IEvaluableAgent│  │IEvaluationHarness│  │IEvaluator│ │ │
│  │  └─────────────┘  └───────────────┘  └──────────────────┘  └──────────┘ │ │
│  │  ┌─────────────────┐                                                   │ │
│  │  │IExporterRegistry│                                                   │ │
│  │  └─────────────────┘                                                   │ │
│  │                                                                         │ │
│  │  Utilities:                                                             │ │
│  │  ┌──────────────┐  ┌───────────────┐  ┌─────────────┐  ┌─────────────┐  │ │
│  │  │MetricRegistry│  │ScoreNormalizer│  │LlmJsonParser│  │ RetryPolicy │  │ │
│  │  └──────────────┘  └───────────────┘  └─────────────┘  └─────────────┘  │ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                          Metrics Layer                                  │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │                                                                         │ │
│  │  RAG Metrics:           Agentic Metrics:     Embedding Metrics:         │ │
│  │  ┌───────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │ │
│  │  │  Faithfulness     │  │  ToolSelection  │  │AnswerSimilarity │        │ │
│  │  │  Relevance        │  │  ToolArguments  │  │ContextSimilarity│        │ │
│  │  │  ContextPrecision │  │  ToolSuccess    │  │ QuerySimilarity │        │ │
│  │  │  ContextRecall    │  │  TaskCompletion │  └─────────────────┘        │ │
│  │  │  AnswerCorrectness│  │  ToolEfficiency │                             │ │
│  │  └───────────────────┘  └─────────────────┘                             │ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                        Assertions Layer                                 │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │                                                                         │ │
│  │  ┌─────────────────────┐  ┌──────────────────────┐  ┌──────────────────┐│ │
│  │  │ ToolUsageAssertions │  │PerformanceAssertions │  │ResponseAssertions││ │
│  │  │  .HaveCalledTool()  │  │  .HaveDurationUnder()│  │  .Contain()      ││ │
│  │  │  .BeforeTool()      │  │  .HaveTTFTUnder()    │  │  .MatchPattern() ││ │
│  │  │  .WithArguments()   │  │  .HaveCostUnder()    │  │  .HaveLength()   ││ │
│  │  └─────────────────────┘  └──────────────────────┘  └──────────────────┘│ │
│  │                                                                         │ │
│  │  ┌─────────────────────────────────────────────────────────────────────┐│ │
│  │  │                         WorkflowAssertions                          ││ │
│  │  │  .HaveStepCount()        .ForExecutor()        .HaveGraphStructure()││ │
│  │  │  .HaveExecutedInOrder()  .HaveCompletedWithin() .HaveTraversedEdge()││ │
│  │  │  .HaveNoErrors()      .HaveNonEmptyOutput()     .HaveExecutionPath()││ │
│  │  └─────────────────────────────────────────────────────────────────────┘│ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                        Workflow Evaluation Layer                        │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │                                                                         │ │
│  │  ┌───────────────────────────────┐  ┌───────────────────────────────┐   │ │
│  │  │ WorkflowEvaluationHarness     │  │ MAFWorkflowAdapter            │   │ │
│  │  │  .RunWorkflowTestAsync()      │  │  .FromMAFWorkflow()           │   │ │
│  │  │  .WithTimeout()               │  │  .ExtractGraph()              │   │ │
│  │  │  .WithAssertions()            │  │  .TrackPerformance()          │   │ │
│  │  └───────────────────────────────┘  └───────────────────────────────┘   │ │
│  │                                                                         │ │
│  │  ┌───────────────────────────────┐  ┌───────────────────────────────┐   │ │
│  │  │ MAFWorkflowEventBridge        │  │ WorkflowTraceRecorder         │   │ │
│  │  │  .ProcessEventsAsync()        │  │  .RecordStep()                │   │ │
│  │  │  .HandleTimeout()             │  │  .ToAgentTrace()              │   │ │
│  │  │  .StreamEvents()              │  │  .Serialize()                 │   │ │
│  │  └───────────────────────────────┘  └───────────────────────────────┘   │ │
│  │                                                                         │ │
│  │  ┌───────────────────────────────┐  ┌───────────────────────────────┐   │ │
│  │  │ WorkflowBuilder               │  │ WorkflowAssemblyBinder        │   │ │
│  │  │  .BindAsExecutor()            │  │  .BuildFromAssembly()         │   │ │
│  │  │  .UseEventStreaming()         │  │  .DiscoverAgents()            │   │ │
│  │  │  .WithTimeout()               │  │  .ValidateBinding()           │   │ │
│  │  └───────────────────────────────┘  └───────────────────────────────┘   │ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                        Benchmarks Layer                                 │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │                                                                         │ │
│  │  ┌─────────────────────────┐  ┌─────────────────────────────────────┐   │ │
│  │  │   PerformanceBenchmark  │  │        AgenticBenchmark             │   │ │
│  │  │   • Latency             │  │   • ToolAccuracy                    │   │ │
│  │  │   • Throughput          │  │   • TaskCompletion                  │   │ │
│  │  │   • Cost                │  │   • MultiStepReasoning              │   │ │
│  │  └─────────────────────────┘  └─────────────────────────────────────┘   │ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                       Integration Layer                                 │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │                                                                         │ │
│  │ ┌──────────────────────┐ ┌─────────────────────────┐ ┌─────────────────┐│ │
│  │ │ MAFEvaluationHarness │ │MicrosoftEvaluatorAdapter│ │ChatClientAdapter││ │
│  │ │    (MAF support)     │ │ (MS.Extensions.AI.Eval) │ │    (Generic)    ││ │
│  │ └──────────────────────┘ └─────────────────────────┘ └─────────────────┘│ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                    Production Infrastructure                            │ │
│  ├────────────────────────────────────────────────────────────────────────┤ │
│  │                                                                         │ │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐              │ │
│  │  │IResultExporter│  │IDatasetLoader │  │   Tracing/    │              │ │
│  │  │ JUnit/MD/JSON │  │JSONL/YAML/CSV │  │ Record+Replay │              │ │
│  │  └───────────────┘  └───────────────┘  └───────────────┘              │ │
│  │                                                                         │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │ │
│  │  │  RedTeam/    │  │ResponsibleAI │  │ Calibration  │  │  Comparison  │ │ │
│  │  │ Attack+Eval  │  │Safety Metrics│  │ Multi-Judge  │  │  Stochastic  │ │ │
│  │  │IAttackType-  │  └──────────────┘  └──────────────┘  └──────────────┘ │ │
│  │  │  Registry    │                                                       │ │
│  │  └──────────────┘                                                       │ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Metric Hierarchy

AgentEval uses interface segregation to organize metrics by their requirements:

IMetric (base interface)
│
├── Properties:
│   ├── Name: string
│   └── Description: string
│
├── Methods:
│   └── EvaluateAsync(EvaluationContext, CancellationToken) -> MetricResult
│
├── IRAGMetric : IMetric
│   ├── RequiresContext: bool
│   ├── RequiresGroundTruth: bool
│   │
│   └── Implementations:
│       ├── FaithfulnessMetric      - Is response supported by context?
│       ├── RelevanceMetric         - Is response relevant to query?
│       ├── ContextPrecisionMetric  - Was context useful for the answer?
│       ├── ContextRecallMetric     - Does context cover ground truth?
│       └── AnswerCorrectnessMetric - Is response factually correct?
│
├── IAgenticMetric : IMetric
│   ├── RequiresToolUsage: bool
│   │
│   └── Implementations:
│       ├── ToolSelectionMetric   - Were correct tools called?
│       ├── ToolArgumentsMetric   - Were tool arguments correct?
│       ├── ToolSuccessMetric     - Did tool calls succeed?
│       ├── ToolEfficiencyMetric  - Were tools used efficiently?
│       └── TaskCompletionMetric  - Was the task completed?
│
└── IEmbeddingMetric : IMetric (implicit)
    ├── RequiresEmbeddings: bool
    │
    └── Implementations:
        ├── AnswerSimilarityMetric         - Response vs ground truth similarity
        ├── ResponseContextSimilarityMetric - Response vs context similarity
        └── QueryContextSimilarityMetric    - Query vs context similarity

Data Flow

Single Agent Evaluation

┌─────────────┐    ┌────────────────────┐    ┌─────────────┐    ┌──────────────┐
│  Test Case  │───▶│ IEvaluationHarness │───▶│ Agent Under │───▶│   Response   │
│   (Input)   │    │                    │    │    Test     │    │   (Output)   │
└─────────────┘    └────────────────────┘    └─────────────┘    └──────────────┘
                             │                                         │
                             ▼                                         ▼
                      ┌──────────────┐                          ┌──────────────┐
                      │Tool Tracking │                          │  Evaluation  │
                      │ (timeline,   │                          │   Context    │
                      │  arguments)  │                          │              │
                      └──────────────┘                          └──────────────┘
                             │                                         │
                             └────────────────┬────────────────────────┘
                                              │
                                              ▼
                                    ┌──────────────────┐
                                    │  Metric Runner   │
                                    │  (evaluates all  │
                                    │   configured     │
                                    │   metrics)       │
                                    └──────────────────┘
                                              │
                                              ▼
                                    ┌──────────────────┐
                                    │   Test Result    │
                                    │  • Score         │
                                    │  • Passed/Failed │
                                    │  • ToolUsage     │
                                    │  • Performance   │
                                    │  • FailureReport │
                                    └──────────────────┘
                                              │
                                              ▼
                                    ┌──────────────────┐
                                    │  Result Exporter │
                                    │  • JUnit XML     │
                                    │  • Markdown      │
                                    │  • JSON          │
                                    └──────────────────┘

Workflow Evaluation

┌─────────────────┐    ┌───────────────────────────┐    ┌─────────────────┐
│ WorkflowTestCase│───▶│ WorkflowEvaluationHarness │───▶│   MAFWorkflow   │
│ (Agents+Graph)  │    │                           │    │  (Multi-Agent)  │
└─────────────────┘    └───────────────────────────┘    └─────────────────┘
                                     │                             │
                                     │                             ▼
                                     │                  ┌─────────────────┐
                                     │                  │WorkflowExecution│
                                     │                  │ • Agent 1       │
                                     │                  │ • Agent 2       │
                                     │                  │ • Agent N       │
                                     │                  │ • Event Stream  │
                                     │                  │• Graph Traversal│
                                     │                  └─────────────────┘
                                     │                             │
                                     ▼                             ▼
                         ┌───────────────────────┐   ┌───────────────────────┐
                         │MAFWorkflowEventBridge │   │WorkflowExecutionResult│
                         │ • Event Processing    │   │ • Per-Executor Data   │
                         │ • Timeout Handling    │   │ • Graph Definition    │
                         │ • Tool Aggregation    │   │ • Tool Usage          │
                         │ • Performance Tracking│   │ • Performance         │
                         └───────────────────────┘   └───────────────────────┘
                                     │                             │
                                     └─────────┬───────────────────┘
                                               │
                                               ▼
                                   ┌───────────────────────┐
                                   │  Workflow Assertions  │
                                   │ • Structure validation│
                                   │ • Per-executor checks │
                                   │ • Graph verification  │
                                   │ • Tool chain analysis │
                                   │ • Performance bounds  │
                                   └───────────────────────┘
                                               │
                                               ▼
                                   ┌───────────────────────┐
                                   │  WorkflowTestResult   │
                                   │ • Overall Pass/Fail   │
                                   │ • Per-Executor Results│
                                   │ • Graph Visualization │
                                   │ • Tool Usage Report   │
                                   │ • Performance Summary │
                                   └───────────────────────┘

Key Models

EvaluationContext

The central data structure passed to all metrics:

public class EvaluationContext
{
    // Identification
    public string EvaluationId { get; init; }
    public DateTimeOffset StartedAt { get; init; }

    // Core data
    public required string Input { get; init; }      // User query
    public required string Output { get; init; }     // Agent response
    
    // RAG-specific
    public string? Context { get; init; }            // Retrieved context
    public string? GroundTruth { get; init; }        // Expected answer
    
    // Agentic-specific
    public ToolUsageReport? ToolUsage { get; init; } // Tool calls made
    public IReadOnlyList<string>? ExpectedTools { get; init; }
    
    // Performance
    public PerformanceMetrics? Performance { get; init; }
    public ToolCallTimeline? Timeline { get; init; } // Execution trace
    
    // Extensibility
    public IDictionary<string, object?> Properties { get; }
}

MetricResult

The result of evaluating a single metric:

public class MetricResult
{
    public required string MetricName { get; init; }
    public required double Score { get; init; }       // 0-100 scale
    public bool Passed { get; init; }
    public string? Explanation { get; init; }
    public IDictionary<string, object>? Details { get; init; }
    
    // Factory methods
    public static MetricResult Pass(string name, double score, string? explanation = null);
    public static MetricResult Fail(string name, string explanation, double score = 0);
}
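Scores sit on a 0-100 scale, and a harness typically derives Passed by comparing a metric's score against a threshold. A minimal sketch of that thresholding — the Grade helper and the 70.0 default are illustrative, not AgentEval's actual defaults:

```csharp
using System;

// Sketch: derive a pass/fail verdict from a 0-100 metric score.
// The 70.0 threshold is illustrative, not AgentEval's default.
(double Score, bool Passed, string Explanation) Grade(double score, double threshold = 70.0)
{
    bool passed = score >= threshold;
    string explanation = passed
        ? $"Score {score} met threshold {threshold}"
        : $"Score {score} below threshold {threshold}";
    return (score, passed, explanation);
}

Console.WriteLine($"{Grade(85.0).Passed} / {Grade(42.0).Passed}"); // True / False
```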

ToolUsageReport

Tracks all tool calls made during an agent run:

public class ToolUsageReport
{
    public IReadOnlyList<ToolCallRecord> Calls { get; }
    public int Count { get; }
    public int SuccessCount { get; }
    public int FailureCount { get; }
    public TimeSpan TotalDuration { get; }
    
    // Fluent assertions
    public ToolUsageAssertions Should();
}
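The counts and durations above are simple aggregates over the underlying call records. A self-contained sketch of that aggregation, with tuples standing in for ToolCallRecord (field names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: the aggregation a ToolUsageReport performs over raw call records.
// Tuples stand in for ToolCallRecord; names and data are illustrative.
var calls = new List<(string Tool, bool Success, TimeSpan Duration)>
{
    ("SearchTool",  true,  TimeSpan.FromMilliseconds(120)),
    ("AnalyzeTool", true,  TimeSpan.FromMilliseconds(340)),
    ("SearchTool",  false, TimeSpan.FromMilliseconds(95)),
};

int count        = calls.Count;
int successCount = calls.Count(c => c.Success);
int failureCount = count - successCount;
TimeSpan total   = TimeSpan.FromTicks(calls.Sum(c => c.Duration.Ticks));

Console.WriteLine($"{count} calls, {successCount} ok, {failureCount} failed, {total.TotalMilliseconds} ms");
// 3 calls, 2 ok, 1 failed, 555 ms
```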

PerformanceMetrics

Captures timing and cost information:

public class PerformanceMetrics
{
    public TimeSpan TotalDuration { get; set; }
    public TimeSpan? TimeToFirstToken { get; set; }
    public TokenUsage? Tokens { get; set; }
    public decimal? EstimatedCost { get; set; }
    
    // Fluent assertions
    public PerformanceAssertions Should();
}

WorkflowExecutionResult

Result of workflow evaluation with multi-agent data:

public class WorkflowExecutionResult
{
    public required string WorkflowId { get; init; }
    public required DateTimeOffset StartedAt { get; init; }
    public required TimeSpan Duration { get; init; }
    
    // Graph structure
    public WorkflowGraphDefinition? GraphDefinition { get; init; }
    
    // Per-executor results
    public IReadOnlyDictionary<string, ExecutorResult> ExecutorResults { get; init; }
    
    // Aggregated data
    public ToolUsageReport? ToolUsage { get; init; }        // All tool calls
    public PerformanceMetrics? Performance { get; init; }   // Total cost/timing
    public string? FinalOutput { get; init; }               // Workflow output
    
    // Assertions
    public WorkflowResultAssertions Should();
}

ExecutorResult

Individual agent performance within a workflow:

public class ExecutorResult
{
    public required string ExecutorId { get; init; }
    public required string AgentName { get; init; }
    public string? Input { get; init; }
    public string? Output { get; init; }
    public DateTimeOffset? StartedAt { get; init; }
    public TimeSpan? Duration { get; init; }
    public ToolUsageReport? ToolUsage { get; init; }
    public PerformanceMetrics? Performance { get; init; }
    public bool HasError { get; init; }
    public string? ErrorMessage { get; init; }
}

WorkflowGraphDefinition

Represents the workflow structure and execution path:

public class WorkflowGraphDefinition
{
    public IReadOnlyList<WorkflowNode> Nodes { get; init; }
    public IReadOnlyList<WorkflowEdge> Edges { get; init; }
    public string? EntryPoint { get; init; }
    public string? ExitPoint { get; init; }
    public IReadOnlyList<string>? ExecutionPath { get; init; }
    
    // Validation helpers
    public bool HasNode(string nodeId);
    public bool HasEdge(string source, string target);
    public IEnumerable<string> GetExecutionOrder();
}
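The validation helpers enable checks such as verifying that a recorded execution path only traverses declared edges. A sketch of that check, with tuples standing in for WorkflowEdge and all node names hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: the kind of check HasEdge/ExecutionPath enable — verifying that a
// recorded path only traverses declared edges. Node names are hypothetical.
var edges = new List<(string Source, string Target)>
{
    ("planner", "researcher"),
    ("researcher", "writer"),
    ("writer", "reviewer"),
};

bool HasEdge(string source, string target) =>
    edges.Any(e => e.Source == source && e.Target == target);

bool PathIsValid(IReadOnlyList<string> path)
{
    for (int i = 0; i < path.Count - 1; i++)
        if (!HasEdge(path[i], path[i + 1])) return false;
    return true;
}

Console.WriteLine(PathIsValid(new[] { "planner", "researcher", "writer" })); // True
Console.WriteLine(PathIsValid(new[] { "planner", "writer" }));               // False
```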

Design Patterns

1. Interface Segregation (ISP)

Each metric declares only the inputs it needs:

// RAG metrics need context
public interface IRAGMetric : IMetric
{
    bool RequiresContext { get; }
    bool RequiresGroundTruth { get; }
}

// Agentic metrics need tool usage
public interface IAgenticMetric : IMetric
{
    bool RequiresToolUsage { get; }
}
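These flags let a runner skip metrics whose required inputs are absent instead of failing them. A sketch of that dispatch, modeling metrics as tuples rather than the real interfaces — the skip-rather-than-fail behavior is an assumption about the harness:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: requirement-gated metric dispatch. Tuples stand in for the
// IRAGMetric interface; scores and names are illustrative.
var metrics = new List<(string Name, bool RequiresContext, Func<string?, double> Evaluate)>
{
    ("faithfulness", true,  ctx => ctx!.Length > 0 ? 90.0 : 0.0),
    ("relevance",    false, _   => 75.0),
};

string? retrievedContext = null; // no context available for this test case

var results = new List<(string Name, double? Score)>();
foreach (var metric in metrics)
{
    if (metric.RequiresContext && retrievedContext is null)
    {
        results.Add((metric.Name, null)); // skipped, not failed
        continue;
    }
    results.Add((metric.Name, metric.Evaluate(retrievedContext)));
}

Console.WriteLine(string.Join(", ", results.Select(r => $"{r.Name}={r.Score?.ToString() ?? "skipped"}")));
// faithfulness=skipped, relevance=75
```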

2. Adapter Pattern

Enables integration with different frameworks:

// Adapt any IChatClient to IEvaluableAgent
public class ChatClientAgentAdapter : IEvaluableAgent
{
    private readonly IChatClient _chatClient;
    
    public async Task<AgentResponse> InvokeAsync(string input, CancellationToken ct)
    {
        var response = await _chatClient.GetResponseAsync(
            new[] { new ChatMessage(ChatRole.User, input) }, ct);
        return new AgentResponse { Text = response.Message.Text };
    }
}

// Wrap Microsoft's evaluators for AgentEval
public class MicrosoftEvaluatorAdapter : IMetric
{
    private readonly IEvaluator _msEvaluator;
    
    public async Task<MetricResult> EvaluateAsync(EvaluationContext context, CancellationToken ct)
    {
        var msResult = await _msEvaluator.EvaluateAsync(...);
        return new MetricResult
        {
            Score = ScoreNormalizer.From1To5(msResult.Score),
            ...
        };
    }
}
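ScoreNormalizer.From1To5 suggests a linear rescaling from Microsoft's 1-5 judge scale onto AgentEval's 0-100 scale. A plausible sketch — the exact formula and clamping rules are assumptions, not the library's implementation:

```csharp
using System;

// Sketch: linear rescaling of a 1-5 judge score onto 0-100.
// The real ScoreNormalizer formula may differ (e.g. its clamping rules).
double From1To5(double score)
{
    var clamped = Math.Clamp(score, 1.0, 5.0);
    return (clamped - 1.0) / 4.0 * 100.0;
}

Console.WriteLine(From1To5(1)); // 0
Console.WriteLine(From1To5(3)); // 50
Console.WriteLine(From1To5(5)); // 100
```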

3. Fluent API

Intuitive assertion chaining:

result.ToolUsage!
    .Should()
    .HaveCalledTool("SearchTool")
        .BeforeTool("AnalyzeTool")
        .WithArguments(args => args.ContainsKey("query"))
    .And()
    .HaveNoErrors()
    .And()
    .HaveToolCountBetween(1, 5);

result.Performance!
    .Should()
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(5))
    .HaveTimeToFirstTokenUnder(TimeSpan.FromSeconds(1))
    .HaveEstimatedCostUnder(0.10m);

4. Registry Pattern

Centralized metric management:

var registry = new MetricRegistry();
registry.Register(new FaithfulnessMetric(chatClient));
registry.Register(new ToolSelectionMetric(expectedTools));

// Run all registered metrics
foreach (var metric in registry.GetAll())
{
    var result = await metric.EvaluateAsync(context);
}
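A minimal sketch of what a name-keyed registry like this might look like internally, with Func<string, double> as a stand-in for IMetric; the case-insensitive keys and duplicate handling are assumptions:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the registry pattern: name-keyed storage with duplicate detection.
// Func<string, double> stands in for IMetric; metric names are illustrative.
var registry = new Dictionary<string, Func<string, double>>(StringComparer.OrdinalIgnoreCase);

void Register(string name, Func<string, double> metric)
{
    if (!registry.TryAdd(name, metric))
        throw new InvalidOperationException($"Metric '{name}' is already registered.");
}

Register("length", output => Math.Min(100, output.Length));
Register("nonempty", output => output.Length > 0 ? 100 : 0);

// Run all registered metrics, as the harness loop above does.
foreach (var (name, metric) in registry)
    Console.WriteLine($"{name}: {metric("hello")}");
```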

The registry pattern extends to exporters and attack types:

// Exporter registry (auto-populated via DI)
var exporters = serviceProvider.GetRequiredService<IExporterRegistry>();
var jsonExporter = exporters.GetRequired("Json");
var allFormats = exporters.GetRegisteredFormats(); // Json, Junit, Markdown, Csv, Trx, ...

// Attack type registry (pre-populated with 9 built-in + DI-registered)
var attacks = serviceProvider.GetRequiredService<IAttackTypeRegistry>();
var promptInjection = attacks.GetRequired("PromptInjection");
var llm01 = attacks.GetByOwaspId("LLM01"); // All attacks for OWASP LLM01

Package Structure

The codebase is organized into 6 internal projects (single NuGet package):

src/
├── AgentEval.Abstractions/       # Public contracts
│   ├── Core/                     # IMetric, IEvaluableAgent, IEvaluationHarness, etc.
│   ├── Models/                   # TestCase, TestResult, ToolCallRecord, PerformanceMetrics
│   ├── Embeddings/               # IAgentEvalEmbeddings
│   ├── Snapshots/                # ISnapshotComparer, ISnapshotStore
│   └── DependencyInjection/      # AgentEvalServiceOptions
│
├── AgentEval.Core/               # Implementations
│   ├── Assertions/               # ToolUsageAssertions, PerformanceAssertions, ResponseAssertions
│   ├── Metrics/                  # RAG/, Agentic/, Retrieval/, Safety/, Embedding
│   ├── Comparison/               # StochasticRunner, ModelComparer, StatisticsCalculator
│   ├── Tracing/                  # TraceRecordingAgent, TraceReplayingAgent, ChatTraceRecorder
│   ├── Calibration/              # CalibratedJudge, VotingStrategy
│   ├── Benchmarks/               # PerformanceBenchmark, AgenticBenchmark
│   ├── Adapters/                 # MicrosoftEvaluatorAdapter, ChatClientAgentAdapter
│   ├── Testing/                  # FakeChatClient
│   └── DependencyInjection/      # AddAgentEval()
│
├── AgentEval.DataLoaders/        # Data loading and export
│   ├── DataLoaders/              # JSON, JSONL, YAML, CSV loaders
│   ├── Exporters/                # JUnit XML, Markdown, JSON, CSV, TRX exporters
│   ├── Output/                   # TableFormatter, AgentEvalTestBase, TimeTravelTrace
│   └── DependencyInjection/      # AddAgentEvalDataLoaders()
│
├── AgentEval.MAF/                # Microsoft Agent Framework
│   ├── MAFAgentAdapter.cs        # Wraps AIAgent → IStreamableAgent
│   ├── MAFEvaluationHarness.cs   # MAF-specific evaluation harness
│   ├── MAFWorkflowAdapter.cs     # Workflow integration
│   └── WorkflowEvaluationHarness.cs
│
├── AgentEval.RedTeam/            # Security testing
│   ├── RedTeamRunner.cs          # Orchestrator
│   ├── AttackPipeline.cs         # Attack execution
│   ├── Attacks/                  # 9 built-in attack types
│   ├── Evaluators/               # Probe evaluators
│   ├── ResponsibleAI/            # Toxicity, Bias, Misinformation metrics
│   └── DependencyInjection/      # AddAgentEvalRedTeam()
│
└── AgentEval/                    # Umbrella (packaging only)
    └── AddAgentEvalAll()         # Registers all services from all sub-projects

Metrics Taxonomy

AgentEval organizes metrics into a clear taxonomy to aid discovery and selection. See ADR-007 for the formal decision.

Categorization by Computation Method

Prefix   Method                Cost          Use Case
llm_     LLM-as-judge          API cost      High-accuracy quality assessment
code_    Code logic            Free          CI/CD, high-volume testing
embed_   Embedding similarity  Low API cost  Cost-effective semantic checks

Categorization by Evaluation Domain

Domain        Interface       Examples
RAG           IRAGMetric      Faithfulness, Relevance, Context Precision
Agentic       IAgenticMetric  Tool Selection, Tool Success, Task Completion
Conversation  Special         ConversationCompleteness
Safety        ISafetyMetric   Toxicity, Groundedness

Category Flags (ADR-007)

Metrics can declare multiple categories via MetricCategory flags:

public override MetricCategory Categories => 
    MetricCategory.RAG | 
    MetricCategory.RequiresContext | 
    MetricCategory.LLMBased;
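Flag-based categories make filtering simple bitwise arithmetic, e.g. selecting every LLM-based metric to estimate judge API cost. A sketch using raw bit flags in place of the MetricCategory enum (flag and metric names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of flag-based categorization. MetricCategory in AgentEval is a
// [Flags] enum; raw int bit flags stand in for it here.
const int RAG             = 1 << 0;
const int RequiresContext = 1 << 1;
const int LLMBased        = 1 << 2;
const int Agentic         = 1 << 3;

var catalog = new Dictionary<string, int>
{
    ["faithfulness"]  = RAG | RequiresContext | LLMBased,
    ["toolSelection"] = Agentic,
};

// Filter: all metrics that are LLM-based (e.g. to estimate judge API cost).
var llmMetrics = catalog.Where(m => (m.Value & LLMBased) != 0)
                        .Select(m => m.Key)
                        .ToList();

Console.WriteLine(string.Join(", ", llmMetrics)); // faithfulness
```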


Calibration Layer

AgentEval provides judge calibration for reliable LLM-as-judge evaluations. See ADR-008 for design decisions.

CalibratedJudge Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           CalibratedJudge                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Input:                                                                      │
│  ┌─────────────────┐    ┌─────────────────────────────────────────────────┐ │
│  │EvaluationContext│───▶│ Factory Pattern: Func<string, IMetric>          │ │
│  └─────────────────┘    │ Each judge gets its own metric with its client  │ │
│                         └─────────────────────────────────────────────────┘ │
│                                              │                               │
│  Parallel Execution:                         ▼                               │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐                  │
│  │  Judge 1      │   │  Judge 2      │   │  Judge 3      │                  │
│  │  (GPT-4o)     │   │  (Claude)     │   │  (Gemini)     │                  │
│  │  Score: 85    │   │  Score: 88    │   │  Score: 82    │                  │
│  └───────────────┘   └───────────────┘   └───────────────┘                  │
│         │                   │                   │                            │
│         └───────────────────┼───────────────────┘                            │
│                             ▼                                                │
│  Aggregation:    ┌─────────────────────────────────┐                        │
│                  │ VotingStrategy                  │                        │
│                  │ • Median (default, robust)      │                        │
│                  │ • Mean (equal weight)           │                        │
│                  │ • Unanimous (require consensus) │                        │
│                  │ • Weighted (custom weights)     │                        │
│                  └─────────────────────────────────┘                        │
│                             │                                                │
│  Output:                    ▼                                                │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ CalibratedResult                                                     │    │
│  │ • Score: 85.0 (median)                                               │    │
│  │ • Agreement: 96.2%                                                   │    │
│  │ • JudgeScores: {GPT-4o: 85, Claude: 88, Gemini: 82}                 │    │
│  │ • ConfidenceInterval: [81.5, 88.5]                                   │    │
│  │ • StandardDeviation: 3.0                                             │    │
│  │ • HasConsensus: true                                                 │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Classes

Class                   Purpose
CalibratedJudge         Coordinates multiple judges with parallel execution
CalibratedResult        Result with score, agreement, CI, per-judge scores
VotingStrategy          Aggregation method enum
CalibratedJudgeOptions  Configuration for timeout, parallelism, consensus
ICalibratedJudge        Interface for testability
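Using the example scores from the diagram (85, 88, 82), the median voting and spread statistics can be sketched as below; the sample standard deviation reproduces the diagram's 3.0, while the consensus threshold is an assumption:

```csharp
using System;
using System.Linq;

// Sketch: median voting over per-judge scores plus a simple spread-based
// consensus check. Scores match the diagram; the 5.0 threshold is illustrative.
double[] scores = { 85, 88, 82 }; // GPT-4o, Claude, Gemini

double Median(double[] xs)
{
    var sorted = xs.OrderBy(x => x).ToArray();
    int mid = sorted.Length / 2;
    return sorted.Length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
}

double mean   = scores.Average();
double stdDev = Math.Sqrt(scores.Sum(s => (s - mean) * (s - mean)) / (scores.Length - 1)); // sample std dev
double median = Median(scores);
bool hasConsensus = stdDev <= 5.0; // illustrative consensus threshold

Console.WriteLine($"median={median} mean={mean} stddev={stdDev} consensus={hasConsensus}");
// median=85 mean=85 stddev=3 consensus=True
```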

Model Comparison Markdown Export

AgentEval provides rich Markdown export for model comparison results:

// Full report with all sections
var markdown = result.ToMarkdown();

// Compact table with medals
var table = result.ToRankingsTable();

// GitHub PR comment with collapsible details
var comment = result.ToGitHubComment();

// Save to file
await result.SaveToMarkdownAsync("comparison.md");

Export Options

// Full report (default)
result.ToMarkdown(MarkdownExportOptions.Default);

// Minimal (rankings only)
result.ToMarkdown(MarkdownExportOptions.Minimal);

// Custom
result.ToMarkdown(new MarkdownExportOptions
{
    IncludeStatistics = true,
    IncludeScoringWeights = false,
    HeaderEmoji = "🔬"
});

Behavioral Policy Assertions

Safety-critical assertions for enterprise compliance:

┌─────────────────────────────────────────────────────────────────────────────┐
│                      Behavioral Policy Assertions                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  NeverCallTool("DeleteDatabase", because: "admin only")                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ Scans all tool calls for forbidden tool name                        │    │
│  │ Throws BehavioralPolicyViolationException with audit details        │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  NeverPassArgumentMatching(@"\d{3}-\d{2}-\d{4}", because: "SSN is PII")    │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ Scans all tool arguments with regex pattern                         │    │
│  │ Auto-redacts matched values in exception (e.g., "1***9")            │    │
│  │ Throws BehavioralPolicyViolationException with RedactedValue        │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  MustConfirmBefore("TransferFunds", because: "requires consent")            │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ Checks that confirmation tool was called before action              │    │
│  │ Default confirmation tools: "get_confirmation", "confirm"           │    │
│  │ Throws if action was called without prior confirmation              │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

BehavioralPolicyViolationException

Structured exception for audit trails:

catch (BehavioralPolicyViolationException ex)
{
    // Structured properties for logging/audit
    Console.WriteLine($"Policy: {ex.PolicyName}");       // "NeverCallTool(DeleteDB)"
    Console.WriteLine($"Type: {ex.ViolationType}");      // "ForbiddenTool"
    Console.WriteLine($"Action: {ex.ViolatingAction}");  // "Called DeleteDB 1 time(s)"
    Console.WriteLine($"Because: {ex.Because}");         // Developer's reason
    
    // For PII detection
    Console.WriteLine($"Pattern: {ex.MatchedPattern}");  // @"\d{3}-\d{2}-\d{4}"
    Console.WriteLine($"Value: {ex.RedactedValue}");     // "1***9" (auto-redacted)
    
    // Actionable suggestions
    foreach (var s in ex.Suggestions ?? [])
        Console.WriteLine($"  → {s}");
}
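The auto-redaction shown above keeps only the first and last characters of the matched value ("1***9"). A sketch of that detection-plus-redaction step — the exact redaction rules are assumptions:

```csharp
using System;
using System.Text.RegularExpressions;

// Sketch: PII detection plus auto-redaction, matching the "1***9" style above.
// Keep first and last characters, mask the middle; exact rules are assumptions.
string Redact(string value) =>
    value.Length <= 2 ? "***" : $"{value[0]}***{value[^1]}";

var ssnPattern = new Regex(@"\d{3}-\d{2}-\d{4}");
string argument = "lookup customer 123-45-6789"; // illustrative tool argument

Match match = ssnPattern.Match(argument);
if (match.Success)
    Console.WriteLine($"Violation: matched value redacted to {Redact(match.Value)}");
// Violation: matched value redacted to 1***9
```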

Internal Project Structure

AgentEval ships as a single NuGet package (AgentEval) but is internally organized into 6 projects for maintainability and compile-time dependency enforcement (see ADR-016).

Dependency Graph

AgentEval (NuGet package — umbrella)
├── AgentEval.Abstractions     → M.E.AI.Abstractions
├── AgentEval.Core             → Abstractions + M.E.AI + M.E.AI.Eval.Quality + S.N.Tensors
├── AgentEval.DataLoaders      → Abstractions + Core + YamlDotNet
├── AgentEval.MAF              → Abstractions + Core + M.Agents.AI + M.Agents.AI.Workflows
└── AgentEval.RedTeam          → Abstractions + Core + M.E.AI + M.E.DI + PdfSharp-MigraDoc

Project Responsibilities

Project                 Files  Purpose
AgentEval.Abstractions  ~48    Public contracts: IMetric, IEvaluableAgent, models, enums
AgentEval.Core          ~63    Implementations: metrics, assertions, comparison, tracing, testing
AgentEval.DataLoaders   ~23    Data loaders (JSON/YAML/CSV/JSONL), exporters, output formatting
AgentEval.MAF           7      Microsoft Agent Framework adapters and harnesses
AgentEval.RedTeam       61     Security scanning, attack types, evaluators, compliance reports
AgentEval (umbrella)    1      Packaging + AddAgentEvalAll() DI convenience method

All projects use RootNamespace=AgentEval so consumers see no namespace changes.

