Benchmarks Guide

Running industry-standard AI benchmarks and creating custom benchmark suites with AgentEval


Implementation Status

Feature                    Status                Description
Custom benchmark suites    ✅ Available          Create your own domain-specific benchmarks
Performance benchmarks     ✅ Available          Latency, throughput, cost measurement
Manual BFCL-style tests    ✅ Available          Create test cases following BFCL patterns
Manual GAIA-style tests    ✅ Available          Create task completion test cases
BFCL dataset loader        🚧 Planned Q1 2026    Auto-download from HuggingFace
GAIA dataset loader        🚧 Planned Q1 2026    Auto-download from HuggingFace
ToolBench loader           🚧 Planned Q2 2026    Auto-download from GitHub
HumanEval                  ❌ Out of Scope       Requires code execution sandbox
WebArena                   ❌ Out of Scope       Requires browser simulation

📖 See ADR-009: Benchmark Strategy for rationale.


Quick Reference: What You Can Do Today

✅ Create Custom Benchmarks

Write your own benchmark test suites for your domain.

✅ Run Performance Benchmarks

Measure latency, throughput, and cost across your agents.

✅ Manual Benchmark-Style Tests

Create tests following BFCL/GAIA patterns and compare against published leaderboards.

🚧 Coming Soon: Dataset Loaders

One-line download and run of BFCL, GAIA, ToolBench benchmarks.


Prerequisites

All benchmark examples assume you have created a Microsoft Agent Framework (MAF) agent:

using Azure.AI.OpenAI;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

var azureClient = new AzureOpenAIClient(
    new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!),
    new Azure.AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY")!));

var chatClient = azureClient
    .GetChatClient("gpt-4o")
    .AsIChatClient();

var agent = new ChatClientAgent(
    chatClient,
    new ChatClientAgentOptions
    {
        Name = "BenchmarkAgent",
        Instructions = "You are a helpful assistant with access to various tools.",
        Tools = [/* Your tools here */]
    });

// For AI-powered evaluation, create an evaluator client
var evaluator = azureClient.GetChatClient("gpt-4o").AsIChatClient();
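
The agent above leaves Tools empty. As an illustration only, here is a hedged sketch of one way to supply a tool by wrapping an ordinary method with AIFunctionFactory from Microsoft.Extensions.AI; the get_weather tool is hypothetical and exists purely so the tool-calling examples later in this guide have something concrete to resolve against.

using System.ComponentModel;
using Microsoft.Extensions.AI;

// Hypothetical tool used for illustration; returns a canned answer so benchmark
// runs do not depend on a real weather API.
[Description("Gets the current weather for a location.")]
static string GetWeather([Description("City name, e.g. 'NYC'")] string location)
    => $"The weather in {location} is sunny, 72°F.";

// Wrap the method as a tool and pass the list as Tools in ChatClientAgentOptions above.
List<AITool> tools = [AIFunctionFactory.Create(GetWeather, name: "get_weather")];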

Supported Benchmarks

Benchmark   Focus                    Status                  What It Tests
BFCL        Function Calling         ✅ Manual / 🚧 Loader   Tool selection, arguments, multi-turn
GAIA        General AI Assistants    ✅ Manual / 🚧 Loader   Multi-step reasoning, tool use
ToolBench   API Tool Use             ✅ Manual / 🚧 Loader   Complex API workflows
MINT        Multi-turn Interaction   ✅ Manual               Conversation handling
HumanEval   Code Generation          ❌ Out of Scope         Requires code execution
WebArena    Web Browsing             ❌ Out of Scope         Requires browser simulation

Berkeley Function Calling Leaderboard (BFCL)

BFCL is the industry standard for evaluating function/tool calling accuracy.

What You Can Do Today ✅

Create BFCL-style test cases manually:

var bfclTests = new List<ToolAccuracyTestCase>
{
    // Simple function call (BFCL simple_python category)
    new ToolAccuracyTestCase
    {
        Name = "simple_math",
        Prompt = "Calculate 15 factorial",
        ExpectedTools = new[]
        {
            new ExpectedTool { Name = "factorial", RequiredParameters = new[] { "n" } }
        }
    },
    
    // Parallel calls (BFCL parallel category)
    new ToolAccuracyTestCase
    {
        Name = "parallel_weather",
        Prompt = "What's the weather in NYC and LA?",
        ExpectedTools = new[]
        {
            new ExpectedTool { Name = "get_weather", RequiredParameters = new[] { "location" } },
            new ExpectedTool { Name = "get_weather", RequiredParameters = new[] { "location" } }
        }
    },
    
    // Sequential, dependent calls (closest to BFCL's multi-turn category)
    new ToolAccuracyTestCase
    {
        Name = "sequential_booking",
        Prompt = "Find flights to Paris and book the cheapest one",
        ExpectedTools = new[]
        {
            new ExpectedTool { Name = "search_flights", RequiredParameters = new[] { "destination" } },
            new ExpectedTool { Name = "book_flight", RequiredParameters = new[] { "flight_id" } }
        }
    }
};

// Run and measure
var benchmark = new AgenticBenchmark(agent);
var results = await benchmark.RunToolAccuracyBenchmarkAsync(bfclTests);

Console.WriteLine($"BFCL-Style Results");
Console.WriteLine($"  Accuracy: {results.OverallAccuracy:P1}");
Console.WriteLine($"  Passed: {results.PassedTests}/{results.TotalTests}");

Coming Soon: Auto-Download 🚧

// PLANNED - Not yet implemented
var loader = new BfclDatasetLoader();
var dataset = await loader.LoadAsync(BfclCategory.SimplePython);
var benchmark = new BfclBenchmark(dataset);
var result = await benchmark.RunAsync(agent, options);

Compare Against Published Leaderboard

BFCL publishes scores for major models. Run your tests and compare:

Model        Simple   Parallel   Multiple   Multi-turn   Overall
GPT-4o       94.5%    91.2%      88.5%      82.1%        89.1%
Claude-3.5   92.1%    88.7%      85.2%      79.8%        86.5%
Gemini-1.5   91.8%    87.5%      84.1%      78.2%        85.4%
Your Agent   ?%       ?%         ?%         ?%           ?%
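
To fill in the last row, run your BFCL-style tests per category and print your agent's accuracy next to the published numbers. A minimal sketch, assuming simpleTests, parallelTests, and multipleTests are category-grouped lists of ToolAccuracyTestCase that you have built following the patterns above (the published figures are the GPT-4o row from the table):

// Hypothetical per-category comparison; simpleTests, parallelTests, and
// multipleTests are assumed to be lists you defined following the earlier patterns.
var categories = new Dictionary<string, (List<ToolAccuracyTestCase> Tests, double Gpt4o)>
{
    ["Simple"]   = (simpleTests,   0.945),
    ["Parallel"] = (parallelTests, 0.912),
    ["Multiple"] = (multipleTests, 0.885)
};

var benchmark = new AgenticBenchmark(agent);
foreach (var (category, (tests, gpt4o)) in categories)
{
    var result = await benchmark.RunToolAccuracyBenchmarkAsync(tests);
    Console.WriteLine($"{category,-10} Your Agent: {result.OverallAccuracy:P1}   GPT-4o: {gpt4o:P1}");
}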

GAIA (General AI Assistants)

GAIA tests multi-step reasoning and real-world task completion.

What You Can Do Today ✅

var gaiaCases = new List<TaskCompletionTestCase>
{
    // Level 1: Simple
    new TaskCompletionTestCase
    {
        Name = "gaia_level1_001",
        Prompt = "What is the population of Tokyo?",
        CompletionCriteria = new[]
        {
            "Provides a specific number",
            "Number is approximately correct (within 5%)",
            "Cites a source or explains where the data comes from"
        },
        PassingScore = 70
    },
    
    // Level 2: Moderate
    new TaskCompletionTestCase
    {
        Name = "gaia_level2_015",
        Prompt = "Find the contact email for customer support at Microsoft",
        CompletionCriteria = new[]
        {
            "Provides an email address",
            "Email appears to be a valid Microsoft domain",
            "Explains how to reach support"
        },
        PassingScore = 70
    },
    
    // Level 3: Complex
    new TaskCompletionTestCase
    {
        Name = "gaia_level3_042",
        Prompt = "Research the top 3 electric vehicle manufacturers by market cap and compare their 2024 revenue",
        CompletionCriteria = new[]
        {
            "Identifies 3 companies correctly",
            "Provides market cap figures",
            "Provides revenue figures",
            "Makes a coherent comparison"
        },
        PassingScore = 80
    }
};

var benchmark = new AgenticBenchmark(agent, evaluator);
var results = await benchmark.RunTaskCompletionBenchmarkAsync(gaiaCases);

Console.WriteLine($"GAIA-Style Results");
Console.WriteLine($"  Average Score: {results.AverageScore:F1}/100");
Console.WriteLine($"  Pass Rate: {(double)results.PassedTests / results.TotalTests:P1}");

Performance Benchmarks

Latency Benchmark ✅

var benchmark = new PerformanceBenchmark(agent);

var latencyResults = await benchmark.RunLatencyBenchmarkAsync(
    testCases,
    iterations: 10,
    warmupIterations: 2);

Console.WriteLine($"Latency Benchmark");
Console.WriteLine($"  P50: {latencyResults.P50Latency.TotalMilliseconds:F0}ms");
Console.WriteLine($"  P90: {latencyResults.P90Latency.TotalMilliseconds:F0}ms");
Console.WriteLine($"  P99: {latencyResults.P99Latency.TotalMilliseconds:F0}ms");
Console.WriteLine($"  Avg: {latencyResults.AverageLatency.TotalMilliseconds:F0}ms");

Throughput Benchmark ✅

var throughputResults = await benchmark.RunThroughputBenchmarkAsync(
    testCases,
    durationSeconds: 60,
    maxConcurrency: 10);

Console.WriteLine($"Throughput Benchmark");
Console.WriteLine($"  Requests/sec: {throughputResults.RequestsPerSecond:F1}");
Console.WriteLine($"  Success Rate: {throughputResults.SuccessRate:P1}");

Cost Benchmark ✅

var costResults = await benchmark.RunCostBenchmarkAsync(testCases);

Console.WriteLine($"Cost Benchmark");
Console.WriteLine($"  Total Cost: ${costResults.TotalCost:F4}");
Console.WriteLine($"  Avg per Request: ${costResults.AverageCostPerRequest:F6}");
Console.WriteLine($"  Total Tokens: {costResults.TotalTokens}");

Creating Custom Benchmark Suites

Define Your Domain Benchmark ✅

public class CustomerSupportBenchmark
{
    private readonly ITestableAgent _agent;
    private readonly IEvaluator _evaluator;
    
    public CustomerSupportBenchmark(ITestableAgent agent, IEvaluator evaluator)
    {
        _agent = agent;
        _evaluator = evaluator;
    }
    
    public async Task<BenchmarkReport> RunFullSuiteAsync()
    {
        var report = new BenchmarkReport
        {
            BenchmarkName = "CustomerSupport-v1",
            AgentName = _agent.Name,
            RunDate = DateTimeOffset.UtcNow
        };
        
        var benchmark = new AgenticBenchmark(_agent, _evaluator);
        var perfBenchmark = new PerformanceBenchmark(_agent);
        
        // Category 1: Tool accuracy
        var toolResults = await benchmark.RunToolAccuracyBenchmarkAsync(GetToolTestCases());
        report.CategoryScores["ToolAccuracy"] = toolResults.OverallAccuracy * 100;
        
        // Category 2: Task completion
        var taskResults = await benchmark.RunTaskCompletionBenchmarkAsync(GetTaskTestCases());
        report.CategoryScores["TaskCompletion"] = taskResults.AverageScore;
        
        // Category 3: Latency
        var latencyResults = await perfBenchmark.RunLatencyBenchmarkAsync(GetToolTestCases());
        report.CategoryScores["P90LatencyMs"] = latencyResults.P90Latency.TotalMilliseconds;
        
        // Overall score (weighted)
        report.OverallScore = 
            report.CategoryScores["ToolAccuracy"] * 0.4 +
            report.CategoryScores["TaskCompletion"] * 0.6;
        
        return report;
    }
    
    private List<ToolAccuracyTestCase> GetToolTestCases() => new()
    {
        new ToolAccuracyTestCase
        {
            Name = "password_reset",
            Prompt = "I forgot my password, help me reset it",
            ExpectedTools = new[] { new ExpectedTool { Name = "reset_password" } }
        },
        new ToolAccuracyTestCase
        {
            Name = "order_lookup",
            Prompt = "Where is my order #12345?",
            ExpectedTools = new[] { new ExpectedTool { Name = "track_order" } }
        }
    };
    
    private List<TaskCompletionTestCase> GetTaskTestCases() => new()
    {
        new TaskCompletionTestCase
        {
            Name = "refund_request",
            Prompt = "I want a refund for my last order",
            CompletionCriteria = new[]
            {
                "Acknowledges the refund request",
                "Retrieves order details",
                "Explains refund policy or timeline"
            }
        }
    };
}

Run and Report

var benchmark = new CustomerSupportBenchmark(agent, evaluator);
var report = await benchmark.RunFullSuiteAsync();

Console.WriteLine($"\n{'=',-60}");
Console.WriteLine($"BENCHMARK REPORT: {report.BenchmarkName}");
Console.WriteLine($"{'=',-60}");
Console.WriteLine($"Agent: {report.AgentName}");
Console.WriteLine($"Date: {report.RunDate:yyyy-MM-dd HH:mm}");
Console.WriteLine();
Console.WriteLine("Category Scores:");
foreach (var (category, score) in report.CategoryScores)
{
    // Only draw a bar for 0-100 scores; raw metrics (e.g. latency in ms) would overflow it
    var bar = score <= 100 ? new string('█', (int)(score / 5)) : string.Empty;
    Console.WriteLine($"  {category,-20} {score,6:F1}  {bar}");
}
Console.WriteLine();
Console.WriteLine($"OVERALL SCORE: {report.OverallScore:F1}/100");

Output:

============================================================
BENCHMARK REPORT: CustomerSupport-v1
============================================================
Agent: CustomerSupportAgent
Date: 2026-01-13 14:30

Category Scores:
  ToolAccuracy         92.5  ██████████████████
  TaskCompletion       88.2  █████████████████
  P90LatencyMs        1250.0

OVERALL SCORE: 89.9/100

Benchmark Roadmap

Phase 1: Core Infrastructure (Current)

  • ✅ Manual BFCL/GAIA-style test cases
  • ✅ Performance benchmarks (latency, throughput, cost)
  • ✅ Custom benchmark suites
  • ✅ Result aggregation and reporting

Phase 2: Dataset Loaders (Q1 2026)

  • 🚧 BfclDatasetLoader - Download from HuggingFace
  • 🚧 GaiaDatasetLoader - Download from HuggingFace
  • 🚧 Local caching with version management
  • 🚧 Subset selection for quick validation

Phase 3: Additional Benchmarks (Q2 2026)

  • 🚧 ToolBenchDatasetLoader - Download from GitHub
  • 🚧 CLI integration: agenteval benchmark bfcl --model gpt-4o
  • 🚧 Leaderboard submission format export
  • 🚧 Regression tracking across versions

Phase 4: Future Considerations

  • ❓ HumanEval (requires secure code execution sandbox)
  • ❓ SWE-bench (requires git/code editing capabilities)
  • ❓ WebArena (requires browser automation)

📖 Want to contribute to benchmark implementation? See our Contributing Guide.


Regression Detection

Track benchmark scores over time:

// Load previous baseline
var baseline = await LoadBaselineAsync("baseline-v1.0.json");

// Run current benchmark
var current = await benchmark.RunFullSuiteAsync();

// Compare
var regressions = new List<(string Category, double Delta)>();

foreach (var (category, score) in current.CategoryScores)
{
    if (baseline.CategoryScores.TryGetValue(category, out var baselineScore))
    {
        var delta = score - baselineScore;
        if (delta < -5.0)  // A drop of more than 5 points against the baseline
        {
            regressions.Add((category, delta));
        }
    }
}

if (regressions.Any())
{
    Console.WriteLine("⚠️ REGRESSIONS DETECTED:");
    foreach (var (category, delta) in regressions)
    {
        Console.WriteLine($"  {category}: {delta:+0.0;-0.0}%");
    }
    Environment.ExitCode = 1;  // Fail CI/CD
}
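
The comparison above calls LoadBaselineAsync, which is not an AgentEval API. A minimal sketch using System.Text.Json, assuming BenchmarkReport round-trips cleanly through JSON:

using System.Text.Json;

// Persist a report so a later run can compare against it
static async Task SaveBaselineAsync(BenchmarkReport report, string path) =>
    await File.WriteAllTextAsync(
        path, JsonSerializer.Serialize(report, new JsonSerializerOptions { WriteIndented = true }));

// Reload a previously saved baseline report
static async Task<BenchmarkReport> LoadBaselineAsync(string path) =>
    JsonSerializer.Deserialize<BenchmarkReport>(await File.ReadAllTextAsync(path))
        ?? throw new InvalidOperationException($"Could not parse baseline file: {path}");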

CI/CD Integration

GitHub Actions

name: Benchmark

on:
  push:
    branches: [main]
  pull_request:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '9.0.x'
      
      - name: Run Benchmarks
        env:
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
        run: |
          dotnet run --project benchmarks/AgentBenchmarks.csproj \
            --output results.json \
            --baseline baselines/main.json \
            --fail-on-regression
      
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: results.json
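
The workflow expects the benchmark project to accept --output, --baseline, and --fail-on-regression and to signal regressions through its exit code. A hypothetical Program.cs sketch showing one way to wire that up; agent, evaluator, and CustomerSupportBenchmark are assumed to be set up as shown earlier in this guide:

using System.Linq;
using System.Text.Json;

var outputPath = GetArg("--output") ?? "results.json";
var baselinePath = GetArg("--baseline");
var failOnRegression = args.Contains("--fail-on-regression");

// agent and evaluator are created as in the Prerequisites section
var suite = new CustomerSupportBenchmark(agent, evaluator);
var report = await suite.RunFullSuiteAsync();

// Persist results so the workflow can upload them as an artifact
await File.WriteAllTextAsync(outputPath,
    JsonSerializer.Serialize(report, new JsonSerializerOptions { WriteIndented = true }));

// Compare against the committed baseline and fail the job on a >5-point drop
if (failOnRegression && baselinePath is not null && File.Exists(baselinePath))
{
    var baseline = JsonSerializer.Deserialize<BenchmarkReport>(
        await File.ReadAllTextAsync(baselinePath))!;

    var regressed = report.CategoryScores.Any(kvp =>
        baseline.CategoryScores.TryGetValue(kvp.Key, out var prev) && kvp.Value - prev < -5.0);

    if (regressed)
    {
        Console.WriteLine("⚠️ Regression detected against baseline.");
        Environment.ExitCode = 1;   // non-zero exit fails the CI job
    }
}

string? GetArg(string name)
{
    var i = Array.IndexOf(args, name);
    return i >= 0 && i + 1 < args.Length ? args[i + 1] : null;
}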

Why HumanEval is Out of Scope

HumanEval requires executing LLM-generated code, which presents several challenges:

  1. Security risks: Untrusted code could be malicious
  2. Sandboxing complexity: Requires Docker, process isolation, resource limits
  3. Language dependencies: Python runtime, package management
  4. Maintenance burden: Keeping sandbox secure is ongoing work

📖 If you need HumanEval support, see ADR-009: Benchmark Strategy for the full rationale and future considerations.


See Also