Evaluation Guide

How to choose the right metrics for your AI evaluation needs


Quick Start Decision Tree

What are you evaluating?
│
├─► RAG System (retrieval + generation)
│   │
│   ├─► Is retrieval finding relevant documents?
│   │   └─► Use: code_recall_at_k, code_mrr (FREE!)
│   │
│   ├─► Is the response grounded in context?
│   │   └─► Use: llm_faithfulness, embed_response_context
│   │
│   ├─► Is the retrieved context good?
│   │   └─► Use: llm_context_precision, llm_context_recall
│   │
│   └─► Is the answer correct?
│       └─► Use: llm_answer_correctness, embed_answer_similarity
│
├─► AI Agent (tool-using)
│   │
│   ├─► Are the right tools being selected?
│   │   └─► Use: code_tool_selection
│   │
│   ├─► Are tools called correctly?
│   │   └─► Use: code_tool_arguments, code_tool_success
│   │
│   ├─► Is the agent efficient?
│   │   └─► Use: code_tool_efficiency
│   │
│   └─► Does it complete tasks?
│       └─► Use: llm_task_completion
│
├─► Multi-Agent Workflow (orchestrated system)
│   │
│   ├─► Is the workflow structure correct?
│   │   └─► Use: code_workflow_structure_validity (FREE!)
│   │
│   ├─► Do agents execute in the right order?
│   │   └─► Use: code_workflow_execution_order (FREE!)
│   │
│   ├─► Are tools coordinated across agents?
│   │   └─► Use: code_workflow_tool_chain_validity (FREE!)
│   │
│   ├─► Do individual agents perform well?
│   │   └─► Use: Per-executor assertions + agent metrics
│   │
│   └─► Is the final workflow output high quality?
│       └─► Use: llm_workflow_output_quality
│
└─► General LLM Quality
    │
    └─► Is the response relevant?
        └─► Use: llm_relevance

Evaluation Strategies by Use Case

1. CI/CD Pipeline Testing

Goal: Fast, free tests that run on every commit.

Recommended Metrics:

  • code_tool_selection - Verify correct tools
  • code_tool_arguments - Validate parameters
  • code_tool_success - Check execution success
  • code_tool_efficiency - Monitor performance

Why: Code-based metrics are free, fast, and deterministic.

[Fact]
public async Task TravelAgent_BookFlight_SelectsCorrectTools()
{
    var metric = new ToolSelectionMetric(["FlightSearchTool", "BookingTool"]);
    var result = await metric.EvaluateAsync(context);
    
    result.Score.Should().BeGreaterThan(80);
}

2. RAG Quality Assessment

Goal: Ensure retrieval and generation quality.

Recommended Metrics:

Phase | Metric | Purpose | Cost
------|--------|---------|-----
Retrieval | code_recall_at_k | Are relevant docs found? | Free
Retrieval | code_mrr | Is the relevant doc ranked first? | Free
Retrieval | llm_context_precision | Is retrieved content relevant? | LLM
Retrieval | llm_context_recall | Is all needed info retrieved? | LLM
Generation | llm_faithfulness | Is response grounded in context? | LLM
Generation | llm_answer_correctness | Is the answer factually correct? | LLM

Cost-Optimized Strategy:

  1. CI/CD (Free): Use code_recall_at_k and code_mrr for retrieval testing
  2. Volume Testing ($): Use embed_response_context and embed_answer_similarity
  3. Production Sampling ($$): Use llm_faithfulness and llm_answer_correctness
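
Under the hood, embedding metrics like those in step 2 typically score the cosine similarity between a response vector and a reference vector. A minimal sketch of that comparison (the toy vectors here stand in for real model embeddings, which your embedding provider would produce):

```csharp
using System;

// Toy vectors standing in for real embeddings (illustrative values only)
double[] responseVec = { 0.12, 0.85, 0.51 };
double[] contextVec  = { 0.10, 0.80, 0.59 };

Console.WriteLine($"Similarity: {CosineSimilarity(responseVec, contextVec):F3}");

// Cosine similarity: dot product divided by the product of magnitudes, in [-1, 1]
static double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot  += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}
```

Because the comparison itself is pure arithmetic, the only cost is the embedding API call, which is why this tier sits between free code metrics and LLM judges.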

Example: Retrieval Testing (FREE)

// Test retrieval quality without any API calls
var recallMetric = new RecallAtKMetric(k: 5);
var mrrMetric = new MRRMetric();

var context = new EvaluationContext
{
    RelevantDocumentIds = ["doc1", "doc2", "doc3"],
    RetrievedDocumentIds = ["doc1", "doc4", "doc2", "doc5", "doc6"]
};

var recall = await recallMetric.EvaluateAsync(context); // 67% (2/3 found)
var mrr = await mrrMetric.EvaluateAsync(context);       // 100% (first relevant at rank 1)
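
Those scores are easy to verify by hand: both metrics are simple set and rank arithmetic. A self-contained sketch, independent of the framework's metric classes:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var relevant  = new HashSet<string> { "doc1", "doc2", "doc3" };
var retrieved = new List<string> { "doc1", "doc4", "doc2", "doc5", "doc6" };

Console.WriteLine(RecallAtK(retrieved, relevant, k: 5)); // ~0.667 (2 of 3 relevant found)
Console.WriteLine(Mrr(retrieved, relevant));             // 1.0 (first relevant doc at rank 1)

// Recall@K: fraction of relevant documents present in the top-K results
static double RecallAtK(IList<string> retrieved, ISet<string> relevant, int k) =>
    relevant.Count == 0
        ? 0
        : retrieved.Take(k).Count(relevant.Contains) / (double)relevant.Count;

// MRR: reciprocal rank of the first relevant document (0 if none retrieved)
static double Mrr(IList<string> retrieved, ISet<string> relevant)
{
    for (var i = 0; i < retrieved.Count; i++)
        if (relevant.Contains(retrieved[i]))
            return 1.0 / (i + 1);
    return 0;
}
```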

3. Agent Task Completion

Goal: Verify agents complete end-to-end tasks.

Recommended Approach:

// 1. Fast tool validation (code-based)
var toolMetric = new ToolSelectionMetric(expectedTools);
var toolResult = await toolMetric.EvaluateAsync(context);

// 2. Deep task evaluation (LLM-based, sample)
if (IsProductionSample())
{
    var taskMetric = new TaskCompletionMetric(chatClient);
    var taskResult = await taskMetric.EvaluateAsync(context);
}

4. Stochastic Evaluation

Goal: Account for LLM non-determinism.

Approach: Run the same evaluation multiple times and analyze the statistics.

var runner = new StochasticRunner(harness, statisticsCalculator: null, options);
var result = await runner.RunStochasticTestAsync(
    agent, testCase,
    new StochasticOptions(Runs: 10, SuccessRateThreshold: 0.8));

// Analyze: min, max, mean, std dev
result.Statistics.Mean.Should().BeGreaterThan(75);
result.Statistics.StandardDeviation.Should().BeLessThan(15);
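
The statistics asserted above are ordinary sample statistics; given ten scores from repeated runs, they can be computed directly (the score values here are illustrative):

```csharp
using System;
using System.Linq;

// Scores from 10 repeated runs of the same test case (illustrative values)
double[] scores = { 82, 78, 90, 85, 76, 88, 81, 79, 84, 87 };

var mean = scores.Average();

// Sample standard deviation: divide the squared deviations by n - 1
var stdDev = Math.Sqrt(scores.Sum(s => (s - mean) * (s - mean)) / (scores.Length - 1));

// Success rate against a pass threshold, as used by SuccessRateThreshold above
var successRate = scores.Count(s => s >= 75) / (double)scores.Length;

Console.WriteLine($"Mean: {mean:F1}, StdDev: {stdDev:F1}, Min: {scores.Min()}, Max: {scores.Max()}");
Console.WriteLine($"Success rate: {successRate:P0}");
```

A low mean flags a quality problem; a high standard deviation flags inconsistency, which matters even when the mean looks healthy.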

5. Model Comparison

Goal: Compare different models on same tasks.

Approach:

var stochasticRunner = new StochasticRunner(harness);
var comparer = new ModelComparer(stochasticRunner);
var results = await comparer.CompareModelsAsync(
    factories: [gpt4Factory, gpt35Factory, claudeFactory],
    testCases: testSuite,
    metrics: [faithfulness, relevance]);

results.PrintComparisonTable();
// Shows: Model | Mean Score | Cost | Latency

6. Multi-Agent Workflow Evaluation

Goal: Evaluate complex multi-agent systems and orchestrated workflows.

Key Challenges:

  • Multiple agents with different capabilities
  • Sequential or parallel execution coordination
  • Tool sharing and state management across agents
  • End-to-end workflow performance
  • Error propagation and recovery

Recommended Approach:

Phase 1: Structural Validation (FREE)

// Fast workflow structure validation
var structureMetric = new WorkflowStructureValidityMetric();
var orderMetric = new WorkflowExecutionOrderMetric(
    expectedOrder: ["Planner", "Researcher", "Writer", "Editor"]
);

// Verify workflow topology and execution sequence
var result = await harness.RunWorkflowTestAsync(workflowAdapter, testCase);
result.ExecutionResult!.Should()
    .HaveStepCount(4, because: "content pipeline has 4 stages")
    .HaveExecutedInOrder("Planner", "Researcher", "Writer", "Editor")
    .HaveNoErrors();

Phase 2: Per-Agent Performance

// Individual agent validation within workflow context
result.ExecutionResult!
    .ForExecutor("Researcher")
        .HaveCompletedWithin(TimeSpan.FromMinutes(3))
        .HaveCalledTool("ResearchTool")
        .HaveEstimatedCostUnder(0.20m)
        .And()
    .ForExecutor("Writer")
        .HaveOutputLongerThan(500, because: "content should be substantial")
        .HaveNonEmptyOutput()
        .And();

Phase 3: Tool Chain Validation

// Multi-agent tool coordination
result.ExecutionResult!.Should()
    .HaveCalledTool("GetInfoAbout", because: "TripPlanner must research")
        .InExecutor("TripPlanner")
        .WithoutError()
        .And()
    .HaveCalledTool("SearchFlights")
        .BeforeTool("BookFlight", because: "must search before booking")
        .InExecutor("FlightReservation")
        .And()
    .HaveToolCallPattern("Search", "Book")  // Pattern across workflow
    .HaveNoToolErrors();

Phase 4: End-to-End Quality (LLM-BASED)

// Overall workflow output assessment
var workflowQuality = new WorkflowOutputQualityMetric(chatClient, 
    criteria: "Evaluate if the multi-agent workflow produced coherent, complete output");

var qualityResult = await workflowQuality.EvaluateAsync(workflowContext);
qualityResult.Score.Should().BeGreaterThan(80);

Cost-Optimized Workflow Strategy:

  1. Structure Validation (FREE): Always validate graph topology and execution order
  2. Performance Bounds (FREE): Check timing, costs, basic success metrics
  3. Tool Coordination (FREE): Validate multi-agent tool usage patterns
  4. Quality Sampling ($$): Use LLM evaluation on subset of workflow outputs

Example: Content Creation Pipeline

// Sample 09 
var testCase = new WorkflowTestCase
{
    Name = "Content Creation Pipeline",
    Input = "Create an article about sustainable technology",
    Agents = ["Planner", "Researcher", "Writer", "Editor"],
    WorkflowTimeout = TimeSpan.FromMinutes(10)
};

var result = await harness.RunWorkflowTestAsync(workflowAdapter, testCase);

// Comprehensive workflow validation
result.ExecutionResult!.Should()
    // Structure (FREE)
    .HaveStepCount(4)
    .HaveExecutedInOrder("Planner", "Researcher", "Writer", "Editor") 
    .HaveCompletedWithin(TimeSpan.FromMinutes(10))
    
    // Per-agent validation (FREE)  
    .ForExecutor("Writer")
        .HaveOutputLongerThan(200)
        .HaveEstimatedCostUnder(0.15m)
        .And()
        
    // Graph validation (FREE)
    .HaveGraphStructure()
        .HaveEntryPoint("Planner")
        .HaveExecutionPath("Planner", "Researcher", "Writer", "Editor")
        .And()
        
    // Final validation
    .HaveNoErrors();

// Optional: Deep quality assessment (LLM-based, for samples)
if (IsProductionSample())
{
    var qualityMetric = new WorkflowOutputQualityMetric(chatClient);
    var quality = await qualityMetric.EvaluateAsync(workflowContext);
    quality.Score.Should().BeGreaterThan(75);
}

Workflow-Specific Metrics:

  • code_workflow_structure_validity - Graph topology validation (FREE)
  • code_workflow_execution_order - Sequence verification (FREE)
  • code_workflow_executor_success - Per-agent success rate (FREE)
  • code_workflow_tool_chain_validity - Multi-agent tool patterns (FREE)
  • llm_workflow_output_quality - End-to-end quality assessment (LLM)

7. Snapshot / Regression Testing

Use when: You want to detect regressions by comparing current agent responses against saved baselines.

Recommended Approach:

var store = new SnapshotStore("./snapshots");
var comparer = new SnapshotComparer(new SnapshotOptions
{
    IgnoreFields = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "timestamp", "requestId", "duration"
    },
    UseSemanticComparison = true,
    SemanticThreshold = 0.85
});

// Capture or compare
var result = await harness.RunEvaluationAsync(adapter, testCase);
var responseJson = JsonSerializer.Serialize(new { response = result.ActualOutput });

if (!store.Exists("my-baseline"))
{
    await store.SaveAsync("my-baseline", new { response = result.ActualOutput });
}
else
{
    var baseline = await store.LoadAsync<JsonElement>("my-baseline");
    var comparison = comparer.Compare(baseline.GetRawText(), responseJson);
    Assert.True(comparison.IsMatch);
}

Cost-Optimized Snapshot Strategy:

  1. Scrub volatile data (FREE): Timestamps, IDs, request IDs stripped automatically
  2. Field-level diff (FREE): JSON-aware comparison pinpoints exact changes
  3. Semantic matching ($): Use Jaccard similarity for natural language fields
  4. CI-gated updates: Only update baselines with UPDATE_SNAPSHOTS=true
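
The Jaccard similarity behind step 3 is a token-overlap ratio; a minimal sketch (the library's implementation may tokenize and normalize differently):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var baseline = "The flight was booked successfully for Tuesday morning";
var current  = "The flight was successfully booked for Tuesday morning";

var similarity = Jaccard(baseline, current);
Console.WriteLine($"Jaccard: {similarity:F2}, match: {similarity >= 0.85}");

// Jaccard similarity: |intersection| / |union| of the two token sets.
// Word order is ignored, which is why reordered sentences still match.
static double Jaccard(string a, string b)
{
    var setA = Tokenize(a);
    var setB = Tokenize(b);
    if (setA.Count == 0 && setB.Count == 0) return 1.0;
    return setA.Intersect(setB).Count() / (double)setA.Union(setB).Count();
}

static HashSet<string> Tokenize(string text) =>
    text.ToLowerInvariant()
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .ToHashSet();
```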

See Snapshots for full documentation.


Metric Selection by Data Availability

What data do you have?

I have... | Recommended Metrics
----------|--------------------
Query + Response only | llm_relevance
Query + Response + Context | llm_faithfulness, llm_context_precision, embed_response_context
Query + Response + Ground Truth | llm_answer_correctness, embed_answer_similarity
Query + Response + Context + Ground Truth | All RAG metrics
Query + Response + Tool Calls | All agentic metrics
Retrieved + Relevant Document IDs | code_recall_at_k, code_mrr (FREE!)

Cost vs. Accuracy Trade-offs

Accuracy ▲
         │
    100% │    ★ llm_answer_correctness
         │    ★ llm_faithfulness
     90% │
         │    ● llm_context_precision
     80% │    ● llm_relevance
         │
     70% │        ▲ embed_answer_similarity
         │        ▲ embed_response_context
     60% │
         │            ■ code_recall_at_k  ■ code_mrr
     50% │            ■ code_tool_selection
         │            ■ code_tool_success
         │
         └───────────────────────────────────► Cost
              Free    $0.01   $0.05   $0.10
         
Legend: ★ LLM metrics  ▲ Embedding metrics  ■ Code metrics (FREE!)

Guidance:

  • Use code metrics for CI/CD (free, fast, deterministic)
  • Use IR metrics (code_recall_at_k, code_mrr) for retrieval testing (free!)
  • Use embedding metrics for volume testing (cheap, good accuracy)
  • Use LLM metrics for production sampling (expensive, highest accuracy)

Minimal Suite (CI/CD)

var metrics = new IMetric[]
{
    new ToolSelectionMetric(expectedTools),
    new ToolSuccessMetric()
};

Standard Suite (Development)

var metrics = new IMetric[]
{
    new FaithfulnessMetric(chatClient),
    new RelevanceMetric(chatClient),
    new ToolSelectionMetric(expectedTools),
    new ToolSuccessMetric()
};

Comprehensive Suite (Release)

var metrics = new IMetric[]
{
    // Information Retrieval (FREE)
    new RecallAtKMetric(k: 10),
    new MRRMetric(),
    
    // RAG Quality
    new FaithfulnessMetric(chatClient),
    new ContextPrecisionMetric(chatClient),
    new ContextRecallMetric(chatClient),
    new AnswerCorrectnessMetric(chatClient),
    
    // Agentic Quality
    new ToolSelectionMetric(expectedTools),
    new ToolArgumentsMetric(schema),
    new ToolSuccessMetric(),
    new ToolEfficiencyMetric(),
    new TaskCompletionMetric(chatClient)
};

Common Patterns

Pattern 1: Sampling Expensive Metrics

Run expensive LLM metrics on a sample:

var sampleRate = 0.1; // 10% of traffic

if (Random.Shared.NextDouble() < sampleRate)
{
    await faithfulnessMetric.EvaluateAsync(context);
}
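
Random sampling evaluates a different subset of requests on every run. When you want the same requests sampled consistently, for example to compare before/after scores on identical traffic, a deterministic hash of a stable identifier is a common alternative. A sketch, where `requestId` is whatever stable ID your pipeline already carries:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Deterministic 10% sample: the same requestId always falls in (or out of) the sample
static bool InSample(string requestId, double sampleRate)
{
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes(requestId));
    // Map the first 4 hash bytes to a value in [0, 1) and compare to the rate
    var bucket = BitConverter.ToUInt32(hash, 0) / (double)uint.MaxValue;
    return bucket < sampleRate;
}

Console.WriteLine(InSample("req-12345", 0.10)); // stable across runs for the same ID
```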

Pattern 2: Tiered Evaluation

Start cheap, escalate if concerns:

// Fast embedding check
var embedScore = await embedMetric.EvaluateAsync(context);

// Only call expensive LLM if embedding score is borderline
if (embedScore.Score < 80)
{
    var llmScore = await llmMetric.EvaluateAsync(context);
    return llmScore;
}

return embedScore;

Pattern 3: Composite Scoring

Combine multiple metrics into one score:

var scores = await Task.WhenAll(
    faithfulness.EvaluateAsync(context),
    relevance.EvaluateAsync(context),
    toolSuccess.EvaluateAsync(context));

var compositeScore = scores.Average(s => s.Score);
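
If some dimensions matter more than others, a weighted composite is a small extension of the average above (the weights here are illustrative, not recommendations):

```csharp
using System;
using System.Linq;

// (score, weight) pairs — weights are illustrative and should reflect your priorities
var weighted = new[]
{
    (Score: 88.0,  Weight: 0.5), // e.g. faithfulness matters most
    (Score: 92.0,  Weight: 0.3), // e.g. relevance
    (Score: 100.0, Weight: 0.2), // e.g. tool success
};

// Weighted mean: sum of score*weight, normalized by total weight
var composite = weighted.Sum(m => m.Score * m.Weight) / weighted.Sum(m => m.Weight);
Console.WriteLine($"Composite: {composite:F1}"); // 91.6
```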

Anti-Patterns to Avoid

❌ Don't | ✅ Do Instead
---------|-------------
Run LLM metrics on every request | Sample 1-10% for production
Use only code metrics for quality | Combine with LLM metrics for accuracy
Ignore stochasticity | Run multiple times, analyze statistics
Test with same data as training | Use held-out test sets
Skip ground truth when available | Use llm_answer_correctness
Mock evaluation LLM responses | Always use real LLM for evaluation metrics

Evaluation Always Real Principle

When building demos, samples, or tests, there's an important distinction between what should use real LLM calls versus what can be mocked.

Core Principle

"Evaluation Always Real, Structure Optionally Mock"

What This Means

Category | Mock OK? | Why
---------|----------|----
Agent responses | ✅ Yes | Structure demos can show flows without real AI
Tool call results | ✅ Yes | Validates tool handling logic
Conversation flows | ✅ Yes | Tests multi-turn patterns
Evaluation metrics | ❌ No | Defeats the purpose of showing AI assessment
LLM-as-a-Judge | ❌ No | Hardcoded scores aren't real evaluation
Consensus voting | ❌ No | Multiple judges should have real variance

Acceptable vs Unacceptable Patterns

❌ Silent Mocking (Bad)

// WRONG - User thinks they're seeing real evaluation
private IChatClient CreateEvaluatorClient()
{
    return new FakeChatClient("""{"score": 92, "explanation": "Mock"}""");
}

✅ Explicit User Choice (Good)

// CORRECT - User explicitly chooses mock mode
Console.WriteLine("Select mode:");
Console.WriteLine("[1] MOCK MODE - Demo structure only");
Console.WriteLine("[2] REAL MODE - Full AI evaluation");

if (userChoice == "1")
    return CreateMockClient();  // User understands the trade-off

✅ Graceful Skip (Good)

// CORRECT - Skip with explanation when not configured
if (!AIConfig.IsConfigured)
{
    Console.WriteLine("⚠️ LLM-as-a-Judge requires Azure OpenAI credentials.");
    Console.WriteLine("   Configure AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY");
    return null; // Caller handles gracefully
}
return CreateRealClient();

When Testing Metrics Themselves

For unit testing your metric implementations, FakeChatClient is appropriate:

// This is FINE - testing the metric code, not demonstrating evaluation
[Fact]
public async Task FaithfulnessMetric_ParsesLLMResponse_Correctly()
{
    var fakeClient = new FakeChatClient("""{"score": 85, "explanation": "Test"}""");
    var metric = new FaithfulnessMetric(fakeClient);
    
    var result = await metric.EvaluateAsync(context);
    
    Assert.Equal(85, result.Score);
}

The distinction: testing metric code versus demonstrating evaluation capabilities.



Last updated: January 2026