AgentEval


Your AI agent works great... until it doesn't.
AgentEval catches the failures before your users do.

The .NET Evaluation Toolkit for AI Agents

AgentEval is the comprehensive .NET toolkit for AI agent evaluation—tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison—built for Microsoft Agent Framework (MAF). What RAGAS and DeepEval do for Python, AgentEval does for .NET.

For years, developers building AI agents have imagined writing evaluations like this. Today, they can.


The Code You've Been Dreaming Of

Assert on Tool Chains Like Requirements

```csharp
result.ToolUsage!.Should()
    .HaveCalledTool("AuthenticateUser", because: "security first")
        .BeforeTool("FetchUserData")
        .WithArgument("method", "OAuth2")
    .And()
    .HaveCalledTool("SendNotification")
        .AtLeastTimes(1)
    .And()
    .HaveNoErrors();
```

No more parsing logs with regex. No more wondering, "did it actually call that function?"

Performance SLAs as Executable Evaluations

```csharp
result.Performance!.Should()
    .HaveFirstTokenUnder(TimeSpan.FromMilliseconds(500),
        because: "streaming responsiveness matters")
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(5))
    .HaveEstimatedCostUnder(0.05m,
        because: "stay within budget");
```

Know before production if your agent is too slow or too expensive.

Stochastic Evaluation: Because LLMs Aren't Deterministic

```csharp
var result = await stochasticRunner.RunStochasticTestAsync(
    agent, testCase,
    new StochasticOptions(Runs: 10, SuccessRateThreshold: 0.85));

result.Statistics.SuccessRate.Should().BeGreaterThan(0.85);
result.Statistics.StandardDeviation.Should().BeLessThan(10);
```

Run the same evaluation 10 times. Know your actual success rate, not your lucky-run rate.

Compare Models, Get a Winner

```csharp
var result = await comparer.CompareModelsAsync(
    factories: new[] { gpt4o, gpt4oMini, claude },
    testCases: testSuite,
    metrics: new[] { new ToolSuccessMetric(), new RelevanceMetric(eval) },
    options: new ComparisonOptions(RunsPerModel: 5));

Console.WriteLine(result.ToMarkdown());
```

Output:

```
| Rank | Model         | Tool Accuracy | Relevance | Cost/1K Req |
|------|---------------|---------------|-----------|-------------|
| 🥇   | GPT-4o        | 94.2%         | 91.5      | $0.0150     |
| 🥈   | GPT-4o Mini   | 87.5%         | 84.2      | $0.0003     |

**Recommendation:** GPT-4o - Highest accuracy
**Best Value:** GPT-4o Mini - 87.5% accuracy at 50x lower cost
```

Record Once, Replay Forever (No API Costs)

```csharp
// RECORD once (live API call)
var recorder = new TraceRecordingAgent(realAgent);
await recorder.ExecuteAsync("Book a flight to Paris");
TraceSerializer.Save(recorder.GetTrace(), "booking-trace.json");

// REPLAY forever (no API call, instant, free)
// (load the saved "booking-trace.json" back into `trace` first)
var replayer = new TraceReplayingAgent(trace);
var response = await replayer.ReplayNextAsync();  // Identical every time
```

Save API costs. Run evaluations in CI. Get consistent results.


Red Team Security Evaluation

Is your AI agent secure? AgentEval's Red Team module evaluates against 192 attack probes covering 6 OWASP LLM Top 10 vulnerabilities (60% coverage) with MITRE ATLAS technique mapping.

```csharp
// One-line security scan
var result = await agent.QuickRedTeamScanAsync();

Console.WriteLine($"Security Score: {result.OverallScore}%");
Console.WriteLine($"Verdict: {result.Verdict}");

// Use with fluent assertions
result.Should()
    .HavePassed()
    .And()
    .HaveMinimumScore(80);
```

Attack types included: Prompt Injection, Jailbreaks, PII Leakage, System Prompt Extraction, Indirect Injection, Excessive Agency, Insecure Output Handling, API Abuse, Encoding Evasion.

```csharp
// Advanced: Full pipeline control
var result = await AttackPipeline
    .Create()
    .WithAttack(Attack.PromptInjection)
    .WithAttack(Attack.Jailbreak)
    .WithAttack(Attack.PIILeakage)
    .WithIntensity(Intensity.Comprehensive)
    .ScanAsync(agent);

// Export compliance reports
await result.ExportAsync("security-report.pdf", ExportFormat.Pdf);
```

Red Team Evaluation →


Why AgentEval?

| Challenge | How AgentEval Solves It |
|-----------|-------------------------|
| "What tools did my agent call?" | Full tool timeline with arguments, results, timing |
| "Evaluations fail randomly!" | Stochastic evaluation - assert on pass rate, not single run |
| "Which model should I use?" | Model comparison with cost/quality recommendations |
| "Is my agent compliant?" | Behavioral policies - guardrails as code |
| "Is my agent secure?" | Red team evaluation - 192 OWASP LLM 2025 security probes |
| "Is content safe/unbiased?" | ResponsibleAI metrics - toxicity, bias, misinformation |
| "Is my RAG hallucinating?" | Faithfulness metrics - grounding verification |
| "How do I debug CI failures?" | Trace replay - capture and reproduce executions |

Feature Highlights

  • 🎯 Fluent Assertions

    Tool chains, performance, responses - all with Should() syntax

  • ⚡ Performance Metrics

    TTFT, latency, token counts, and cost estimation with built-in pricing for 8+ models

  • 🔬 Stochastic Evaluation

    Run N times, get statistics, assert on pass rates

  • 🤖 Model Comparison

    Compare models side-by-side with recommendations

  • 🎬 Trace Record/Replay

    Deterministic evaluations without API calls

  • 🛡️ Behavioral Policies

    NeverCallTool, MustConfirmBefore, PII detection (see the sketch after this list)

  • 🔴 Red Team Security

    192 probes, 9 attack types, 60% OWASP LLM 2025 coverage, MITRE ATLAS mapping

  • 🛡️ Responsible AI

    Toxicity detection, bias measurement, misinformation risk

  • 📊 RAG Metrics

    Faithfulness, Relevance, Context Precision/Recall

  • 🔄 Multi-Turn Evaluation

    Full conversation flow evaluation
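
To make the Behavioral Policies item above concrete, here is a minimal sketch of "guardrails as code". The `Policy` factory, `policyEvaluator`, and `HaveNoViolations()` below are assumed names based on the rule names in the list, not confirmed AgentEval API:

```csharp
// Hypothetical sketch only: Policy, policyEvaluator, and HaveNoViolations()
// are assumed names. The shape is the point: declare rules up front,
// evaluate an agent result against them, then assert.
var policies = new Policy[]
{
    Policy.NeverCallTool("DeleteAccount"),      // hard guardrail
    Policy.MustConfirmBefore("IssueRefund"),    // require an explicit confirmation step
    Policy.NoPiiInResponses()                   // PII detection on agent output
};

var report = await policyEvaluator.EvaluateAsync(result, policies);

report.Should().HaveNoViolations();
```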


Who Is AgentEval For?

🏢 .NET Teams Building AI Agents

If you're building production AI agents in .NET and need to verify tool usage, enforce SLAs, handle non-determinism, or compare models—AgentEval is for you.

🚀 Microsoft Agent Framework (MAF) Developers

Native integration with MAF concepts: AIAgent, IChatClient, automatic tool call tracking, and performance metrics with token usage and cost estimation.
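
As a rough sketch of what that wiring could look like (`AgentRunner`, `BuildTravelAgent`, and the tool name below are illustrative assumptions, not confirmed AgentEval API; the assertions reuse the fluent syntax shown earlier):

```csharp
// Illustrative only: AgentRunner and BuildTravelAgent are assumed names.
// `agent` is whatever MAF AIAgent you already construct in your application.
AIAgent agent = BuildTravelAgent();

var runner = new AgentRunner(agent);   // hypothetical evaluation wrapper
var result = await runner.RunAsync("Find me a flight to Paris");

// Tool calls, token usage, and timings are captured during the run,
// so the fluent assertions from the earlier examples apply directly.
result.ToolUsage!.Should().HaveCalledTool("SearchFlights");
result.Performance!.Should().HaveTotalDurationUnder(TimeSpan.FromSeconds(5));
```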

📊 ML Engineers Evaluating LLM Quality

Rigorous evaluation capabilities: RAG metrics (Faithfulness, Relevance, Context Precision), embedding-based similarity, and calibrated judge patterns for consistent evaluation.
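
A hedged sketch of how those RAG metrics might be invoked: `RelevanceMetric(eval)` appears in the model-comparison example above, and the other names below follow that pattern as assumptions rather than confirmed API:

```csharp
// Sketch only: FaithfulnessMetric, ContextPrecisionMetric, IEvaluationMetric,
// and EvaluateAsync are assumed names following the RelevanceMetric(eval)
// pattern shown earlier. question, answer, retrievedContext are your test inputs.
var metrics = new IEvaluationMetric[]
{
    new FaithfulnessMetric(eval),       // is the answer grounded in the retrieved context?
    new RelevanceMetric(eval),          // does the answer address the question?
    new ContextPrecisionMetric(eval)    // how much of the retrieved context was useful?
};

foreach (var metric in metrics)
{
    var score = await metric.EvaluateAsync(question, answer, retrievedContext);
    score.Should().BeGreaterThan(0.8);
}
```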


CLI Tool & Samples

CLI for CI/CD:

```bash
dotnet tool install -g AgentEval.Cli
agenteval eval --dataset tests.yaml --format junit -o results.xml
```

Detailed samples included—from Hello World to Red Team Security. View Samples →


Documentation

| Getting Started | Features | Advanced |
|-----------------|----------|----------|
| Installation | Assertions | Stochastic Evaluation |
| Quick Start | Red Team Security | Model Comparison |
| | Responsible AI | |
| Walkthrough | Metrics Reference | Trace Record/Replay |
| CLI Tool | Benchmarks | Architecture |
| | | Workflows |

The .NET Advantage

| Feature | AgentEval | Python Alternatives |
|---------|-----------|---------------------|
| Language | Native C#/.NET | Python only |
| Type Safety | Compile-time errors | Runtime exceptions |
| IDE Support | Full IntelliSense | Variable |
| MAF Integration | First-class | None |
| Fluent Assertions | Should().HaveCalledTool() | N/A |
| Trace Replay | Built-in | Manual |

Test Coverage

AgentEval maintains a comprehensive test suite running across multiple target frameworks, ensuring reliability.



Community


Forever Open Source

AgentEval is MIT licensed and will remain open source forever.

  • No license changes - MIT today, MIT forever
  • No "open core" - All features are open source
  • Community first - Built for the .NET AI community

Stop guessing if your AI agent works. Start proving it.

Get Started →