AgentEval


Your AI agent works great... until it doesn't.
AgentEval catches the failures before your users do.

The .NET Evaluation Toolkit for AI Agents

AgentEval is the comprehensive .NET toolkit for AI agent evaluation—tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison—built for Microsoft Agent Framework (MAF). What RAGAS and DeepEval do for Python, AgentEval does for .NET.

For years, developers building AI agents have imagined writing evaluations like this. Today, they can.


The Code You've Been Dreaming Of

Assert on Tool Chains Like Requirements

```csharp
result.ToolUsage!.Should()
    .HaveCalledTool("AuthenticateUser", because: "security first")
        .BeforeTool("FetchUserData")
        .WithArgument("method", "OAuth2")
    .And()
    .HaveCalledTool("SendNotification")
        .AtLeastTimes(1)
    .And()
    .HaveNoErrors();
```

No more parsing logs with regex. No more wondering, "did it actually call that function?"

Performance SLAs as Executable Evaluations

```csharp
result.Performance!.Should()
    .HaveFirstTokenUnder(TimeSpan.FromMilliseconds(500),
        because: "streaming responsiveness matters")
    .HaveTotalDurationUnder(TimeSpan.FromSeconds(5))
    .HaveEstimatedCostUnder(0.05m,
        because: "stay within budget");
```

Know before production if your agent is too slow or too expensive.

Stochastic Evaluation: Because LLMs Aren't Deterministic

```csharp
var result = await stochasticRunner.RunStochasticTestAsync(
    agent, testCase,
    new StochasticOptions(Runs: 10, SuccessRateThreshold: 0.85));

result.Statistics.SuccessRate.Should().BeGreaterThan(0.85);
result.Statistics.StandardDeviation.Should().BeLessThan(10);
```

Run the same evaluation 10 times. Know your actual success rate, not your lucky-run rate.

Compare Models, Get a Winner

```csharp
var result = await comparer.CompareModelsAsync(
    factories: new[] { gpt4o, gpt4oMini, claude },
    testCases: testSuite,
    metrics: new[] { new ToolSuccessMetric(), new RelevanceMetric(eval) },
    options: new ComparisonOptions(RunsPerModel: 5));

Console.WriteLine(result.ToMarkdown());
```

Output:

```
| Rank | Model         | Tool Accuracy | Relevance | Cost/1K Req |
|------|---------------|---------------|-----------|-------------|
| 🥇   | GPT-4o        | 94.2%         | 91.5      | $0.0150     |
| 🥈   | GPT-4o Mini   | 87.5%         | 84.2      | $0.0003     |

**Recommendation:** GPT-4o - Highest accuracy
**Best Value:** GPT-4o Mini - 87.5% accuracy at 50x lower cost
```

Record Once, Replay Forever (No API Costs)

```csharp
// RECORD once (live API call)
var recorder = new TraceRecordingAgent(realAgent);
await recorder.ExecuteAsync("Book a flight to Paris");
TraceSerializer.Save(recorder.GetTrace(), "booking-trace.json");

// REPLAY forever (no API call, instant, free)
// (load the saved "booking-trace.json" back into `trace` first)
var replayer = new TraceReplayingAgent(trace);
var response = await replayer.ReplayNextAsync();  // Identical every time
```

Save API costs. Run evaluations in CI. Get consistent results.


Red Team Security Evaluation

Is your AI agent secure? AgentEval's Red Team module evaluates against 192 attack probes covering 6 OWASP LLM Top 10 vulnerabilities (60% coverage) with MITRE ATLAS technique mapping.

```csharp
// One-line security scan
var result = await agent.QuickRedTeamScanAsync();

Console.WriteLine($"Security Score: {result.OverallScore}%");
Console.WriteLine($"Verdict: {result.Verdict}");

// Use with fluent assertions
result.Should()
    .HavePassed()
    .And()
    .HaveMinimumScore(80);
```

Attack types included: Prompt Injection, Jailbreaks, PII Leakage, System Prompt Extraction, Indirect Injection, Excessive Agency, Insecure Output Handling, API Abuse, Encoding Evasion.

```csharp
// Advanced: Full pipeline control
var result = await AttackPipeline
    .Create()
    .WithAttack(Attack.PromptInjection)
    .WithAttack(Attack.Jailbreak)
    .WithAttack(Attack.PIILeakage)
    .WithIntensity(Intensity.Comprehensive)
    .ScanAsync(agent);

// Export compliance reports
await result.ExportAsync("security-report.pdf", ExportFormat.Pdf);
```

Red Team Evaluation →


Why AgentEval?

| Challenge | How AgentEval Solves It |
|-----------|-------------------------|
| "What tools did my agent call?" | Full tool timeline with arguments, results, timing |
| "Evaluations fail randomly!" | Stochastic evaluation - assert on pass rate, not single run |
| "Which model should I use?" | Model comparison with cost/quality recommendations |
| "Is my agent compliant?" | Behavioral policies - guardrails as code |
| "Is my agent secure?" | Red team evaluation - 192 OWASP LLM 2025 security probes |
| "Is content safe/unbiased?" | ResponsibleAI metrics - toxicity, bias, misinformation |
| "Is my RAG hallucinating?" | Faithfulness metrics - grounding verification |
| "How do I debug CI failures?" | Trace replay - capture and reproduce executions |

Feature Highlights

  • 🎯 Fluent Assertions

    Tool chains, performance, responses - all with Should() syntax

  • ⚡ Performance Metrics

    TTFT, latency, token counts, and cost estimation with built-in pricing for 8+ models

  • 🔬 Stochastic Evaluation

    Run N times, get statistics, assert on pass rates

  • 🤖 Model Comparison

    Compare models side-by-side with recommendations

  • 🎬 Trace Record/Replay

    Deterministic evaluations without API calls

  • 🛡️ Behavioral Policies

    NeverCallTool, MustConfirmBefore, PII detection (see the sketch after this list)

  • 🔴 Red Team Security

    192 probes, 9 attack types, 60% OWASP LLM 2025 coverage, MITRE ATLAS mapping

  • 🛡️ Responsible AI

    Toxicity detection, bias measurement, misinformation risk

  • 📊 RAG Metrics

    Faithfulness, Relevance, Context Precision/Recall

  • 🔄 Multi-Turn Evaluation

    Full conversation flow evaluation
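
To make the Behavioral Policies item above concrete, here is a minimal sketch of "guardrails as code". The `Policy` factory, `policyEvaluator`, and `HaveNoViolations()` below are assumed names based on the rule names in the list, not confirmed AgentEval API:

```csharp
// Hypothetical sketch only: Policy, policyEvaluator, and HaveNoViolations()
// are assumed names. The shape is the point: declare rules up front,
// evaluate an agent result against them, then assert.
var policies = new Policy[]
{
    Policy.NeverCallTool("DeleteAccount"),      // hard guardrail
    Policy.MustConfirmBefore("IssueRefund"),    // require an explicit confirmation step
    Policy.NoPiiInResponses()                   // PII detection on agent output
};

var report = await policyEvaluator.EvaluateAsync(result, policies);

report.Should().HaveNoViolations();
```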


Who Is AgentEval For?

🏢 .NET Teams Building AI Agents

If you're building production AI agents in .NET and need to verify tool usage, enforce SLAs, handle non-determinism, or compare models—AgentEval is for you.

🚀 Microsoft Agent Framework (MAF) Developers

Native integration with MAF concepts: AIAgent, IChatClient, automatic tool call tracking, and performance metrics with token usage and cost estimation.
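
As a rough sketch of what that wiring could look like (`AgentRunner`, `BuildTravelAgent`, and the tool name below are illustrative assumptions, not confirmed AgentEval API; the assertions reuse the fluent syntax shown earlier):

```csharp
// Illustrative only: AgentRunner and BuildTravelAgent are assumed names.
// `agent` is whatever MAF AIAgent you already construct in your application.
AIAgent agent = BuildTravelAgent();

var runner = new AgentRunner(agent);   // hypothetical evaluation wrapper
var result = await runner.RunAsync("Find me a flight to Paris");

// Tool calls, token usage, and timings are captured during the run,
// so the fluent assertions from the earlier examples apply directly.
result.ToolUsage!.Should().HaveCalledTool("SearchFlights");
result.Performance!.Should().HaveTotalDurationUnder(TimeSpan.FromSeconds(5));
```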

📊 ML Engineers Evaluating LLM Quality

Rigorous evaluation capabilities: RAG metrics (Faithfulness, Relevance, Context Precision), embedding-based similarity, and calibrated judge patterns for consistent evaluation.
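
A hedged sketch of how those RAG metrics might be invoked: `RelevanceMetric(eval)` appears in the model-comparison example above, and the other names below follow that pattern as assumptions rather than confirmed API:

```csharp
// Sketch only: FaithfulnessMetric, ContextPrecisionMetric, IEvaluationMetric,
// and EvaluateAsync are assumed names following the RelevanceMetric(eval)
// pattern shown earlier. question, answer, retrievedContext are your test inputs.
var metrics = new IEvaluationMetric[]
{
    new FaithfulnessMetric(eval),       // is the answer grounded in the retrieved context?
    new RelevanceMetric(eval),          // does the answer address the question?
    new ContextPrecisionMetric(eval)    // how much of the retrieved context was useful?
};

foreach (var metric in metrics)
{
    var score = await metric.EvaluateAsync(question, answer, retrievedContext);
    score.Should().BeGreaterThan(0.8);
}
```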


CLI Tool & Samples

CLI for CI/CD:

```bash
dotnet tool install -g AgentEval.Cli
agenteval eval --dataset tests.yaml --format junit -o results.xml
```

Detailed samples included—from Hello World to Red Team Security. View Samples →


Documentation

| Getting Started | Features | Advanced |
|-----------------|----------|----------|
| Installation | Assertions | Stochastic Evaluation |
| Quick Start | Red Team Security | Model Comparison |
| | Responsible AI | |
| Walkthrough | Metrics Reference | Trace Record/Replay |
| CLI Tool | Benchmarks | Architecture |
| | | Workflows |

The .NET Advantage

| Feature | AgentEval | Python Alternatives |
|---------|-----------|---------------------|
| Language | Native C#/.NET | Python only |
| Type Safety | Compile-time errors | Runtime exceptions |
| IDE Support | Full IntelliSense | Variable |
| MAF Integration | First-class | None |
| Fluent Assertions | Should().HaveCalledTool() | N/A |
| Trace Replay | Built-in | Manual |

Test Coverage

AgentEval maintains a comprehensive test suite running across multiple target frameworks, ensuring reliability.



Community


Forever Open Source

AgentEval is MIT licensed and will remain open source forever.

  • No license changes - MIT today, MIT forever
  • No "open core" - All features are open source
  • Community first - Built for the .NET AI community

Stop guessing if your AI agent works. Start proving it.

Get Started →