Walkthrough: Evaluating Your First AI Agent
This walkthrough guides you through evaluating an AI agent with AgentEval, from setup to assertions.
What You'll Learn
- Setting up an evaluation harness
- Wrapping an agent for evaluation
- Running an evaluation and capturing results
- Asserting on tool usage
- Asserting on performance
- Exporting results for CI/CD
Prerequisites
- .NET 8.0+ SDK
- An AI agent (we'll use a mock for this tutorial)
- AgentEval installed (dotnet add package AgentEval --prerelease)
Step 1: Create an Evaluation Harness
The evaluation harness runs your agent and captures all the data needed for assertions.
using AgentEval.MAF;
// Create an evaluation harness with optional verbose logging
var harness = new MAFEvaluationHarness(verbose: true);
Step 2: Wrap Your Agent
AgentEval uses adapters to wrap different agent types. For Microsoft Agent Framework agents:
using AgentEval.MAF;
using Azure.AI.OpenAI;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
// First, create your MAF agent
var azureClient = new AzureOpenAIClient(
new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!),
new Azure.AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY")!));
var chatClient = azureClient
.GetChatClient("gpt-4o")
.AsIChatClient();
var myAgent = new ChatClientAgent(
chatClient,
new ChatClientAgentOptions
{
Name = "TravelPlannerAgent",
ChatOptions = new ChatOptions
{
Instructions = "You are a travel planning assistant.",
Tools = [
AIFunctionFactory.Create(SearchFlights),
AIFunctionFactory.Create(SearchHotels),
AIFunctionFactory.Create(GetWeather)
]
}
});
// Then wrap it for evaluation
var adapter = new MAFAgentAdapter(myAgent);
For any IChatClient:
using AgentEval.Adapters;
// Wrap an IChatClient
var adapter = new ChatClientAgentAdapter(chatClient, "MyAgent");
Step 3: Define a Test Case
Test cases describe what to evaluate and how to judge the results:
using AgentEval.Models;
var testCase = new TestCase
{
Name = "Travel Planning Evaluation",
Input = "Plan a trip to Paris for next weekend",
// Optional: Expected tools the agent should use
ExpectedTools = new[] { "SearchFlights", "SearchHotels", "GetWeather" },
// Optional: Criteria for AI-powered evaluation
EvaluationCriteria = new[]
{
"Should include flight options",
"Should include hotel recommendations",
"Should consider weather"
},
// Minimum score to pass (0-100)
PassingScore = 70
};
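Conceptually, the pass/fail decision comes down to comparing an aggregate score against PassingScore. The sketch below (in Python, purely illustrative, and not AgentEval's actual scoring code) shows one plausible way per-criterion verdicts could be aggregated into a 0-100 score; the equal weighting is an assumption.

```python
# Illustrative sketch only: one plausible way to aggregate per-criterion
# verdicts into a 0-100 score and a pass/fail flag. This is NOT
# AgentEval's actual scoring algorithm.
def aggregate_score(criteria_results: dict[str, bool], passing_score: float) -> tuple[float, bool]:
    """criteria_results maps each criterion to whether the judge said it was met."""
    if not criteria_results:
        return 0.0, False
    met = sum(1 for ok in criteria_results.values() if ok)
    score = 100.0 * met / len(criteria_results)  # equal weighting assumed
    return score, score >= passing_score

score, passed = aggregate_score(
    {
        "Should include flight options": True,
        "Should include hotel recommendations": True,
        "Should consider weather": False,
    },
    passing_score=70,
)
print(score, passed)  # 2 of 3 criteria met: ~66.7, which fails a 70 threshold
```

With a 70-point threshold, meeting two of three criteria is not enough to pass, which is why the criteria list and PassingScore should be tuned together.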
Step 4: Run the Evaluation
Execute the evaluation and capture results:
using AgentEval.Core;
// Run the evaluation - tool tracking and performance metrics are captured automatically
var result = await harness.RunEvaluationAsync(adapter, testCase);
// Check if the evaluation passed
Console.WriteLine($"Passed: {result.Passed}");
Console.WriteLine($"Score: {result.Score}");
Console.WriteLine($"Output: {result.ActualOutput}");
Step 5: Assert on Tool Usage
Use fluent assertions to verify the agent used tools correctly:
using AgentEval.Assertions;
// Assert specific tools were called
result.ToolUsage!
.Should()
.HaveCalledTool("SearchFlights")
.HaveCalledTool("SearchHotels")
.HaveCalledTool("GetWeather");
// Assert tool ordering
result.ToolUsage!
.Should()
.HaveCalledTool("GetWeather")
.BeforeTool("SearchFlights"); // Weather checked before flight search
// Assert tool arguments
result.ToolUsage!
.Should()
.HaveCalledTool("SearchFlights")
.WithArgument("destination", "Paris");
// Assert no errors occurred
result.ToolUsage!
.Should()
.HaveNoErrors();
// Assert call count limits
result.ToolUsage!
.Should()
.HaveTotalCallsLessThan(10); // Efficiency check
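Under the hood, assertions like these can be checked against the ordered list of tool calls the harness records. The Python sketch below is an illustration of that idea, not AgentEval's implementation: called/ordering/argument/count checks all reduce to simple operations on a recorded call log.

```python
# Illustration of what tool-usage assertions inspect, given an ordered
# call log recorded during the run. Not AgentEval's implementation.
calls = [  # (tool_name, arguments, error) in invocation order
    ("GetWeather", {"city": "Paris"}, None),
    ("SearchFlights", {"destination": "Paris", "date": "2024-06-01"}, None),
    ("SearchHotels", {"city": "Paris", "checkIn": "2024-06-01"}, None),
]
names = [name for name, _, _ in calls]

def called(tool): return tool in names
def called_before(first, second): return names.index(first) < names.index(second)
def with_argument(tool, key, value):
    return any(n == tool and a.get(key) == value for n, a, _ in calls)

assert called("SearchFlights") and called("SearchHotels") and called("GetWeather")
assert called_before("GetWeather", "SearchFlights")   # weather checked first
assert with_argument("SearchFlights", "destination", "Paris")
assert all(err is None for _, _, err in calls)        # the HaveNoErrors idea
assert len(calls) < 10                                # the HaveTotalCallsLessThan(10) idea
```

The fluent C# API wraps the same checks in readable, chainable form and produces descriptive failure messages instead of bare assertion errors.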
Step 6: Assert on Performance
Verify the agent meets performance requirements:
using AgentEval.Assertions;
result.Performance!
.Should()
.HaveTotalDurationUnder(TimeSpan.FromSeconds(30))
.HaveTimeToFirstTokenUnder(TimeSpan.FromSeconds(2))
.HaveTokenCountUnder(4000)
.HaveEstimatedCostUnder(0.10m); // Max $0.10 per request
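Cost assertions are typically derived from token counts and per-token prices. The sketch below shows the arithmetic such a check implies; the per-million-token prices are placeholder assumptions for illustration, not real model pricing.

```python
# Back-of-the-envelope cost estimate from token usage. The prices below
# are PLACEHOLDERS for illustration, not actual gpt-4o pricing.
PRICE_PER_M_INPUT = 2.50    # assumed $ per 1M input tokens
PRICE_PER_M_OUTPUT = 10.00  # assumed $ per 1M output tokens

def estimated_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

cost = estimated_cost(input_tokens=3000, output_tokens=800)
print(f"${cost:.4f}")        # $0.0155 under these assumed prices
assert cost < 0.10           # the HaveEstimatedCostUnder(0.10m) idea
```

Because cost scales linearly with tokens, a cost ceiling doubles as a soft cap on how verbose the agent's reasoning and responses can be.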
Step 7: Assert on Response Content
Verify the response contains expected information:
using AgentEval.Assertions;
result.Response
.Should()
.Contain("Paris")
.ContainAny("flight", "airline")
.ContainAny("hotel", "accommodation")
.NotContain("error")
.HaveLengthBetween(100, 5000);
Step 8: Export Results for CI/CD
Export results in formats your CI/CD system understands. Use ResultExporterFactory to create exporters:
using AgentEval.Exporters;
// Build an EvaluationReport from your test results
var report = new EvaluationReport
{
Name = "My Evaluation",
TotalTests = 3, PassedTests = 2, FailedTests = 1,
OverallScore = 78.3,
StartTime = DateTimeOffset.UtcNow.AddSeconds(-5),
EndTime = DateTimeOffset.UtcNow,
Agent = new AgentInfo { Name = "MyAgent", Model = "gpt-4o" },
TestResults = testResults // List<TestResultSummary>
};
// JUnit XML for GitHub Actions, Azure DevOps, Jenkins
var junitExporter = ResultExporterFactory.Create(ExportFormat.Junit);
await using var junitStream = File.Create("results.xml");
await junitExporter.ExportAsync(report, junitStream);
// Markdown for PR comments
var mdExporter = ResultExporterFactory.Create(ExportFormat.Markdown);
await using var mdStream = File.Create("results.md");
await mdExporter.ExportAsync(report, mdStream);
// JSON for custom dashboards
var jsonExporter = ResultExporterFactory.Create(ExportFormat.Json);
await using var jsonStream = File.Create("results.json");
await jsonExporter.ExportAsync(report, jsonStream);
// TRX for Visual Studio / Azure DevOps
var trxExporter = ResultExporterFactory.Create(ExportFormat.Trx);
await using var trxStream = File.Create("results.trx");
await trxExporter.ExportAsync(report, trxStream);
// CSV for Excel / Power BI analysis
var csvExporter = ResultExporterFactory.Create(ExportFormat.Csv);
await using var csvStream = File.Create("results.csv");
await csvExporter.ExportAsync(report, csvStream);
// Or create from file extension:
var exporter = ResultExporterFactory.CreateFromExtension(".json");
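On the CI side, a pipeline step can then publish the exported JUnit file as a test report. The GitHub Actions excerpt below is a sketch to adapt to your own pipeline; the project path is hypothetical, and the third-party reporter action is shown as one option, not something AgentEval ships.

```yaml
# Hypothetical GitHub Actions excerpt: run the evaluation project, then
# surface results.xml as a test report. Adjust paths and actions to your setup.
- name: Run agent evaluations
  run: dotnet run --project tests/AgentEvals   # hypothetical project path

- name: Publish evaluation results
  uses: dorny/test-reporter@v1   # third-party action, shown as an example
  if: always()                   # report even when evaluations fail
  with:
    name: Agent Evaluations
    path: results.xml
    reporter: java-junit         # JUnit XML format
```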
Step 9: Snapshot Testing for Regression Detection
Save agent responses as baselines and compare future responses to detect regressions:
using AgentEval.Snapshots;
using System.Text.Json;
var store = new SnapshotStore("./snapshots");
var comparer = new SnapshotComparer(new SnapshotOptions
{
UseSemanticComparison = true,
SemanticThreshold = 0.85
});
// First run: capture baseline
if (!store.Exists("travel-agent-test"))
{
await store.SaveAsync("travel-agent-test", new { response = result.ActualOutput });
}
// Subsequent runs: compare against baseline
var baseline = await store.LoadAsync<JsonElement>("travel-agent-test");
var comparison = comparer.Compare(
baseline.GetRawText(),
JsonSerializer.Serialize(new { response = result.ActualOutput }));
Assert.True(comparison.IsMatch,
$"Regression detected: {string.Join(", ", comparison.Differences.Select(d => d.Message))}");
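When UseSemanticComparison is enabled, the comparer judges meaning rather than exact text, typically by embedding both responses and comparing the vectors against SemanticThreshold. The Python sketch below illustrates the cosine-similarity math on toy vectors; the vectors are stand-ins, since a real comparer would obtain them from an embedding model.

```python
import math

# Toy illustration of threshold-based semantic comparison: embed both texts
# (here, pre-baked stand-in vectors) and match if cosine similarity >= threshold.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

baseline_vec = [0.9, 0.1, 0.4]   # stand-in embedding of the baseline response
current_vec  = [0.8, 0.2, 0.5]   # stand-in embedding of the new response
similarity = cosine_similarity(baseline_vec, current_vec)
is_match = similarity >= 0.85    # mirrors SemanticThreshold = 0.85
```

A threshold near 1.0 approaches exact-match behavior, while lower values tolerate rewording; 0.85 is a common middle ground for responses that should say the same thing in different words.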
See Snapshots for complete documentation.
Complete Example
Here's the full test in one file:
using AgentEval.MAF;
using AgentEval.Models;
using AgentEval.Core;
using AgentEval.Assertions;
using AgentEval.Exporters;
using Azure.AI.OpenAI;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using System.ComponentModel;
// ═══════════════════════════════════════════════════════════════
// 1. Create your MAF agent with tools
// ═══════════════════════════════════════════════════════════════
var azureClient = new AzureOpenAIClient(
new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!),
new Azure.AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY")!));
var chatClient = azureClient
.GetChatClient("gpt-4o")
.AsIChatClient();
var myAgent = new ChatClientAgent(
chatClient,
new ChatClientAgentOptions
{
Name = "TravelPlannerAgent",
ChatOptions = new ChatOptions
{
Instructions = "You are a travel planning assistant. Use tools to search for flights, hotels, and weather.",
Tools = [
AIFunctionFactory.Create(SearchFlights),
AIFunctionFactory.Create(SearchHotels),
AIFunctionFactory.Create(GetWeather)
]
}
});
// ═══════════════════════════════════════════════════════════════
// 2. Setup evaluation harness and adapter
// ═══════════════════════════════════════════════════════════════
var harness = new MAFEvaluationHarness(verbose: true);
var adapter = new MAFAgentAdapter(myAgent);
// ═══════════════════════════════════════════════════════════════
// 3. Define evaluation case
// ═══════════════════════════════════════════════════════════════
var testCase = new TestCase
{
Name = "Travel Planning Evaluation",
Input = "Plan a trip to Paris for next weekend",
ExpectedTools = new[] { "SearchFlights", "SearchHotels", "GetWeather" },
PassingScore = 70
};
// ═══════════════════════════════════════════════════════════════
// 4. Run evaluation
// ═══════════════════════════════════════════════════════════════
var result = await harness.RunEvaluationAsync(adapter, testCase);
// ═══════════════════════════════════════════════════════════════
// 5. Assert
// ═══════════════════════════════════════════════════════════════
result.ToolUsage!
.Should()
.HaveCalledTool("SearchFlights")
.HaveCalledTool("SearchHotels")
.HaveNoErrors();
result.Performance!
.Should()
.HaveTotalDurationUnder(TimeSpan.FromSeconds(30))
.HaveEstimatedCostUnder(0.10m);
// ═══════════════════════════════════════════════════════════════
// 6. Export results
// ═══════════════════════════════════════════════════════════════
var report = new EvaluationReport
{
Name = "Travel Planning",
TotalTests = 1, PassedTests = result.Passed ? 1 : 0, FailedTests = result.Passed ? 0 : 1,
OverallScore = result.Score,
StartTime = DateTimeOffset.UtcNow.AddSeconds(-5),
EndTime = DateTimeOffset.UtcNow,
TestResults = new List<TestResultSummary>
{
new() { Name = testCase.Name, Score = result.Score, Passed = result.Passed }
}
};
var exporter = ResultExporterFactory.Create(ExportFormat.Junit);
await using var exportStream = File.Create("results.xml");
await exporter.ExportAsync(report, exportStream);
Console.WriteLine($"✅ Evaluation {(result.Passed ? "PASSED" : "FAILED")}");
Console.WriteLine($" Output: {result.ActualOutput}");
// ═══════════════════════════════════════════════════════════════
// Tool definitions
// ═══════════════════════════════════════════════════════════════
[Description("Search for available flights")]
static string SearchFlights(
[Description("Destination city")] string destination,
[Description("Departure date")] string date)
{
return $"Found 3 flights to {destination} on {date}: AA123, UA456, DL789";
}
[Description("Search for hotels")]
static string SearchHotels(
[Description("City name")] string city,
[Description("Check-in date")] string checkIn)
{
return $"Found hotels in {city}: Hilton ($200/night), Marriott ($180/night)";
}
[Description("Get weather forecast")]
static string GetWeather(
[Description("City name")] string city)
{
return $"Weather in {city}: Sunny, 72°F";
}
Using with xUnit/NUnit/MSTest
AgentEval integrates naturally with standard .NET test frameworks:
using AgentEval.MAF;
using AgentEval.Models;
using AgentEval.Assertions;
using Azure.AI.OpenAI;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Xunit;
public class TravelAgentTests
{
private readonly AIAgent _agent;
public TravelAgentTests()
{
// Setup agent once per test class
var azureClient = new AzureOpenAIClient(
new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!),
new Azure.AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY")!));
var chatClient = azureClient
.GetChatClient("gpt-4o")
.AsIChatClient();
_agent = new ChatClientAgent(
chatClient,
new ChatClientAgentOptions
{
Name = "TravelPlannerAgent",
ChatOptions = new ChatOptions
{
Instructions = "You are a travel planning assistant.",
Tools = [AIFunctionFactory.Create(SearchFlights)]
}
});
}
[Fact]
public async Task Agent_ShouldPlanTrip_WithCorrectTools()
{
// Arrange
var harness = new MAFEvaluationHarness();
var adapter = new MAFAgentAdapter(_agent);
var testCase = new TestCase
{
Name = "Travel Planning Evaluation",
Input = "Plan a trip to Paris",
ExpectedTools = new[] { "SearchFlights" }
};
// Act
var result = await harness.RunEvaluationAsync(adapter, testCase);
// Assert
result.ToolUsage!
.Should()
.HaveCalledTool("SearchFlights")
.HaveNoErrors();
Assert.True(result.Passed);
}
[System.ComponentModel.Description("Search for flights")]
private static string SearchFlights(
[System.ComponentModel.Description("Destination")] string destination)
{
return $"Found flights to {destination}";
}
}
All Samples Guide
AgentEval ships with example projects covering a wide range of evaluation scenarios. Here's what each sample demonstrates:
Foundation Samples (1-4): Getting Started
Mock Mode Available (No Azure OpenAI Required)
- Sample01: Hello World - Basic agent evaluation setup
- Sample02: Tool Usage Assertions - Validate tool calls with fluent syntax
- Sample03: Performance Assertions - Check latency, cost, and token usage
- Sample04: RAG Metrics - Evaluate retrieval-augmented generation quality
Core Evaluation Samples (5-12): Essential Patterns
- Sample05: RAG Quality Metrics - Faithfulness, relevance, context precision
- Sample06: Performance Profiling - Latency percentiles, tokens, tool accuracy via MAFEvaluationHarness
- Sample07: Snapshot Testing - Detect regressions against golden responses
- Sample08: Multi-Turn Conversations - Conversation flow evaluation
- Sample09: Sequential Workflows - Multi-agent pipeline evaluation
- Sample10: Tool-Enabled Workflows - Complex multi-agent tool chains
- Sample11: Datasets and Export - Rich output formats and visual reports
- Sample12: Embedding Metrics - Semantic similarity evaluation
Advanced Evaluation Samples (13-18): Production Patterns
- Sample13: Trace Record & Replay - API-free evaluation for CI/CD
- Sample14: Stochastic Evaluation - Handle non-deterministic behavior
- Sample15: Model Comparison - Compare models side-by-side
- Sample16: Statistical Analysis - Advanced statistical evaluation
- Sample17: Custom Metrics - Building domain-specific evaluators
- Sample18: Calibrated Judges - Multi-model consensus evaluation
Security & Compliance Samples (19-24): Enterprise Features
- Sample19: Streaming vs Async - Compare streaming and non-streaming performance
- Sample20: Quick Red Team Scan - One-line security assessment
- Sample21: Advanced Red Team Pipeline - Comprehensive security evaluation
- Sample22: Responsible AI - Toxicity, bias, misinformation metrics
- Sample23: Benchmark System - Performance, agentic, standard, and cost benchmarks
- Sample24: Calibrated Evaluator - Multi-model harness evaluation with criteria consensus
Running Samples
# Clone and run
git clone https://github.com/AgentEvalHQ/AgentEval
cd AgentEval/samples/AgentEval.Samples
# Mock mode (no API keys required) - Samples 1-4
dotnet run -- 1 # Hello World
dotnet run -- 2 # Tool assertions
dotnet run -- 3 # Performance assertions
dotnet run -- 4 # RAG metrics
# Azure OpenAI required - Samples 5-24
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-api-key"
dotnet run -- 9 # Sequential Workflows
dotnet run -- 10 # Tool-Enabled Workflows
dotnet run -- 20 # Quick Red Team Scan
dotnet run -- 21 # Advanced Red Team Pipeline
dotnet run -- 24 # Calibrated Evaluator (multi-model consensus)
Workflow Evaluation Deep Dive
Samples 09 & 10 demonstrate multi-agent workflow evaluation—one of AgentEval's most powerful features.
Sequential Pipeline Workflow (Sample 09)
Evaluate a content creation pipeline with multiple agents:
// 1. Define workflow agents
var plannerAgent = CreateAgent("ContentPlanner", "Create content outlines");
var researcherAgent = CreateAgent("Researcher", "Research topics and gather facts");
var writerAgent = CreateAgent("Writer", "Write engaging content");
var editorAgent = CreateAgent("Editor", "Edit and polish content");
// 2. Build MAF workflow
var workflow = new WorkflowBuilder()
.BindAsExecutor("Planner", plannerAgent, emitEvents: true)
.BindAsExecutor("Researcher", researcherAgent, emitEvents: true)
.BindAsExecutor("Writer", writerAgent, emitEvents: true)
.BindAsExecutor("Editor", editorAgent, emitEvents: true)
.Build();
// 3. Create workflow adapter for evaluation
var workflowAdapter = MAFWorkflowAdapter.FromMAFWorkflow(workflow, "ContentPipeline");
// 4. Define test case
var testCase = new WorkflowTestCase
{
Name = "Content Creation Pipeline",
Input = "Create an article about sustainable technology",
Agents = new[] { "Planner", "Researcher", "Writer", "Editor" },
TimeoutPerAgent = TimeSpan.FromMinutes(2),
WorkflowTimeout = TimeSpan.FromMinutes(10)
};
// 5. Run workflow evaluation
var harness = new WorkflowEvaluationHarness();
var result = await harness.RunWorkflowTestAsync(workflowAdapter, testCase);
// 6. Assert on workflow structure and execution
result.ExecutionResult!.Should()
.HaveStepCount(4, because: "pipeline has 4 agents")
.HaveExecutedInOrder("Planner", "Researcher", "Writer", "Editor")
.HaveCompletedWithin(TimeSpan.FromMinutes(10))
.HaveNoErrors();
// 7. Assert on individual agent performance
result.ExecutionResult!
.ForExecutor("Writer")
.HaveOutputLongerThan(200, because: "articles should be substantial")
.HaveEstimatedCostUnder(0.15m)
.And()
.ForExecutor("Editor")
.HaveOutputNotContaining("DRAFT")
.And();
// 8. Validate workflow graph structure
result.ExecutionResult!
.HaveGraphStructure()
.HaveEntryPoint("Planner")
.HaveExecutionPath("Planner", "Researcher", "Writer", "Editor")
.HaveTraversedEdge("Planner", "Researcher")
.HaveTraversedEdge("Writer", "Editor");
Tool-Enabled Workflow (Sample 10)
Evaluate workflows where agents use tools:
// 1. Create agents with tools
var tripPlannerAgent = new ChatClientAgent(chatClient, new()
{
Name = "TripPlanner",
Tools = [AIFunctionFactory.Create(GetInfoAbout)]
});
var flightReservationAgent = new ChatClientAgent(chatClient, new()
{
Name = "FlightReservation",
Tools = [AIFunctionFactory.Create(SearchFlights), AIFunctionFactory.Create(BookFlight)]
});
// 2. Build workflow with tool-enabled agents
var workflow = new WorkflowBuilder()
.BindAsExecutor("TripPlanner", tripPlannerAgent, emitEvents: true)
.BindAsExecutor("FlightReservation", flightReservationAgent, emitEvents: true)
.Build();
var workflowAdapter = MAFWorkflowAdapter.FromMAFWorkflow(workflow, "TravelBooking");
// 3. Run evaluation
var result = await harness.RunWorkflowTestAsync(workflowAdapter, testCase);
// 4. Assert on tool usage across workflow
result.ExecutionResult!.Should()
.HaveCalledTool("GetInfoAbout", because: "TripPlanner must research cities")
.AtLeast(2.Times())
.WithoutError()
.InExecutor("TripPlanner")
.And()
.HaveCalledTool("SearchFlights")
.BeforeTool("BookFlight", because: "can't book without search")
.InExecutor("FlightReservation")
.WithArgument("from", "Seattle")
.And()
.HaveNoToolErrors();
// 5. Export workflow visualization
await result.ExportWorkflowVisualizationAsync("workflow-execution.mmd");
Workflow Evaluation Benefits
- Structure Validation: Verify workflow topology and execution order
- Per-Agent Analysis: Individual agent performance within workflow context
- Tool Chain Validation: Tool usage patterns across multiple agents
- Performance Monitoring: Total workflow cost/timing with per-agent breakdown
- Error Propagation: Track how errors flow through the workflow pipeline
- Visualization Export: Mermaid diagrams of workflow execution
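To give a sense of the visualization output, an exported Mermaid diagram for the Sample 09 pipeline might look roughly like the fragment below (illustrative; the exact shape of AgentEval's export may differ):

```mermaid
graph LR
    Planner --> Researcher
    Researcher --> Writer
    Writer --> Editor
```

Committing these diagrams alongside test results makes it easy to spot when a workflow's topology or execution path changes between runs.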
See Workflows Documentation for the complete workflow evaluation guide.
Next Steps
- Architecture - Understand the framework design
- LLM-as-a-Judge - Multi-model consensus, calibrated evaluation, IEvaluator API
- Workflows - Complete workflow evaluation guide
- Benchmarks - Performance evaluation at scale
- Conversations - Multi-turn evaluation
- Snapshots - Regression evaluation