# Cross-Framework Evaluation

AgentEval evaluates any AI agent, regardless of the framework used to build it. The key abstraction is `IChatClient` from `Microsoft.Extensions.AI`: a universal interface implemented by Azure OpenAI, OpenAI, Ollama, Semantic Kernel, LM Studio, Groq, and dozens of other providers.
## The Universal Adapter Pattern

```csharp
// One line to make any IChatClient evaluable:
IStreamableAgent agent = chatClient.AsEvaluableAgent(
    name: "MyAgent",
    systemPrompt: "You are a helpful assistant.");
```

The `AsEvaluableAgent()` extension method wraps any `IChatClient` in an `IStreamableAgent`, the interface AgentEval uses for all evaluation operations. No MAF boilerplate and no framework-specific adapters are needed.
## Supported Providers

| Provider | Code | Notes |
|---|---|---|
| Azure OpenAI | `new AzureOpenAIClient(endpoint, key).GetChatClient(model).AsIChatClient()` | Production workloads |
| OpenAI | `new OpenAIClient(key).GetChatClient(model).AsIChatClient()` | Direct OpenAI API |
| Semantic Kernel | `kernel.GetRequiredService<IChatCompletionService>()` | SK agents with tools |
| Ollama | `new OllamaChatClient("http://localhost:11434", "llama3")` | Local models |
| LM Studio | `new OpenAIChatClient(new("lm-studio"), new("http://localhost:1234/v1"))` | Local models |
| Groq | `new OpenAIChatClient(new(key), new("https://api.groq.com/openai/v1"))` | Fast inference |
| Together.ai | `new OpenAIChatClient(new(key), new("https://api.together.xyz/v1"))` | Open models |
| vLLM | `new OpenAIChatClient(new("vllm"), new("http://localhost:8000/v1"))` | Self-hosted |
| Any OpenAI-compat | `EndpointFactory.CreateOpenAICompatible(url, model, key)` | Universal fallback |
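Every row in the table follows the same two-step pattern: construct the provider's `IChatClient`, then wrap it. A minimal sketch using the Ollama row (the `OllamaChatClient` type comes from the `Microsoft.Extensions.AI.Ollama` package; the agent name is an arbitrary example):

```csharp
using Microsoft.Extensions.AI;

// Step 1: any provider's IChatClient — here, a local Ollama model
// from the table above. Swapping providers changes only this line.
IChatClient chatClient = new OllamaChatClient("http://localhost:11434", "llama3");

// Step 2: the same universal wrap, regardless of provider.
IStreamableAgent agent = chatClient.AsEvaluableAgent(
    name: "LocalLlama",
    systemPrompt: "You are a helpful assistant.");
```

From here the agent is passed to a harness exactly as in the examples below; the evaluation side never sees which provider is behind the client.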
## Framework Integration Examples

### Microsoft Agent Framework (MAF)

MAF agents get first-class support through `MAFAgentAdapter`:

```csharp
using AgentEval.MAF;
using Microsoft.Agents.AI;

// Create a MAF agent with tools
var agent = new ChatClientAgent(chatClient, new ChatClientAgentOptions
{
    Name = "TravelAgent",
    ChatOptions = new ChatOptions
    {
        Instructions = "You are a travel booking assistant.",
        Tools =
        [
            AIFunctionFactory.Create(SearchFlights),
            AIFunctionFactory.Create(BookFlight)
        ]
    }
});

// Wrap for evaluation
var evaluable = new MAFAgentAdapter(agent);

// Evaluate with full tool tracking
var harness = new MAFEvaluationHarness(verbose: true);
var result = await harness.RunEvaluationAsync(evaluable, testCase, new EvaluationOptions
{
    TrackTools = true,
    TrackPerformance = true
});

// Assert tools were called correctly
result.ToolUsage!.Should()
    .HaveCalledTool("SearchFlights", because: "must search before booking")
    .BeforeTool("BookFlight")
    .And()
    .HaveNoErrors();
```
### Semantic Kernel

Semantic Kernel plugins bridge to AgentEval via `AIFunctionFactory.Create()`:

```csharp
using Microsoft.SemanticKernel;
using Microsoft.Extensions.AI;

// Build a Semantic Kernel with plugins
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion("gpt-4o", endpoint, key)
    .Build();
kernel.Plugins.AddFromType<FlightPlugin>();

// Bridge SK plugins to M.E.AI tools — same class, both frameworks!
var plugin = new FlightPlugin();
var tools = new List<AITool>
{
    AIFunctionFactory.Create(plugin.SearchFlights),
    AIFunctionFactory.Create(plugin.BookFlight)
};

// Create IChatClient and agent with tools
// (azureClient is the AzureOpenAIClient for the same endpoint used above)
var chatClient = azureClient.GetChatClient("gpt-4o").AsIChatClient();
var agent = new ChatClientAgent(chatClient, new ChatClientAgentOptions
{
    ChatOptions = new ChatOptions { Tools = tools }
});

var adapter = new MAFAgentAdapter(agent);
var result = await harness.RunEvaluationAsync(adapter, testCase);
```
See the NuGetConsumer SK Demo for the full working example.
### Plain IChatClient (No Framework)

For simple Q&A or zero-boilerplate evaluation:

```csharp
using Microsoft.Extensions.AI;

// Any IChatClient works directly
IChatClient client = new AzureOpenAIClient(endpoint, key)
    .GetChatClient("gpt-4o")
    .AsIChatClient();

var agent = client.AsEvaluableAgent(
    name: "GPT-4o",
    systemPrompt: "Answer concisely.");

var result = await harness.RunEvaluationAsync(agent, testCase);
```
### CLI (Any Provider, No Code)

The CLI wraps this pattern for terminal usage:

```bash
# Azure OpenAI
agenteval eval --azure --model gpt-4o --dataset tests.yaml

# Ollama (local)
agenteval eval --endpoint http://localhost:11434/v1 --model llama3 --dataset tests.yaml

# Groq
agenteval eval --endpoint https://api.groq.com/openai/v1 --model llama-3.1-70b \
  --api-key $GROQ_API_KEY --dataset tests.yaml
```
## Multi-Framework Model Comparison

Compare the same test cases across different providers and models:

```csharp
using AgentEval.Comparison;

// Define agent factories for different providers
var factories = new IAgentFactory[]
{
    new AzureOpenAIFactory("gpt-4o", endpoint, key),
    new AzureOpenAIFactory("gpt-4o-mini", endpoint, key),
    new OllamaFactory("llama3.1", "http://localhost:11434"),
};

// Run identical tests across all providers
// (stochasticRunner and the results collection are assumed to be set up earlier)
var comparer = new ModelComparer(harness, statisticsCalculator: null);
foreach (var factory in factories)
{
    var agent = factory.CreateAgent();
    var result = await stochasticRunner.RunStochasticTestAsync(
        agent, testCase,
        new StochasticOptions(Runs: 5, SuccessRateThreshold: 0.8));
    results.Add((factory.ModelName, result));
}

results.PrintComparisonTable();
```
Output:

```
┌──────────────┬───────┬──────────┬──────────┬──────────┐
│ Model        │ Score │ Pass Rate│ Latency  │ Cost     │
├──────────────┼───────┼──────────┼──────────┼──────────┤
│ gpt-4o       │ 94.2  │ 100%     │ 1.2s     │ $0.0045  │
│ gpt-4o-mini  │ 87.5  │ 80%      │ 0.6s     │ $0.0008  │
│ llama3.1     │ 72.1  │ 60%      │ 2.1s     │ $0.0000  │
└──────────────┴───────┴──────────┴──────────┴──────────┘
```
## Evaluation Capabilities by Integration Level

| Feature | Plain `IChatClient` | MAF Agent | Semantic Kernel |
|---|---|---|---|
| Basic pass/fail | Yes | Yes | Yes |
| LLM-as-judge scoring | Yes | Yes | Yes |
| Performance metrics | Yes | Yes | Yes |
| Tool call tracking | No* | Yes | Yes† |
| Tool chain assertions | No* | Yes | Yes† |
| Streaming evaluation | Yes | Yes | Yes |
| Conversation testing | Yes | Yes | Yes |
| Trace record/replay | Yes | Yes | Yes |
| Stochastic evaluation | Yes | Yes | Yes |
| Model comparison | Yes | Yes | Yes |
| Red team security | Yes | Yes | Yes |
\* Tool tracking requires the agent to use `FunctionCallContent` in responses.

† Via the `AIFunctionFactory.Create()` bridge: SK plugin methods become M.E.AI tools tracked by AgentEval.
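The first footnote can be made concrete: tool tracking works by inspecting the `FunctionCallContent` items inside the messages an agent returns. A minimal sketch of that inspection using only `Microsoft.Extensions.AI` types (the helper name is illustrative, not part of AgentEval's API):

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.AI;

// Illustrative helper: extracts the names of tool calls from a response.
// Agents that surface tool use this way are trackable; agents that bury
// tool use inside plain text are not (footnote * above).
static IReadOnlyList<string> GetToolCallNames(ChatResponse response) =>
    response.Messages
        .SelectMany(message => message.Contents)
        .OfType<FunctionCallContent>()
        .Select(call => call.Name)
        .ToList();
```

The MAF and SK integration levels get "Yes" in the table because both emit their tool calls as `FunctionCallContent`, so this signal is always present.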
## The IChatClient Advantage

The `IChatClient` interface from `Microsoft.Extensions.AI` is the key enabler:

```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Azure OpenAI │   │    Ollama    │   │   Semantic   │
│              │   │              │   │    Kernel    │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       ▼                  ▼                  ▼
┌──────────────────────────────────────────────────┐
│               IChatClient interface              │
│ GetResponseAsync() / GetStreamingResponseAsync() │
└────────────────────────┬─────────────────────────┘
                         │
                         ▼
           ┌────────────────────────┐
           │  .AsEvaluableAgent()   │
           │  Zero-boilerplate wrap │
           └───────────┬────────────┘
                       │
                       ▼
           ┌────────────────────────┐
           │  AgentEval Evaluation  │
           │  Metrics, Assertions,  │
           │  Export, Comparison    │
           └────────────────────────┘
```

Any provider that speaks `IChatClient` gets the full AgentEval evaluation suite for free.
## See Also
- Sample 27: Cross-Framework Evaluation — Universal IChatClient adapter demo
- NuGet Consumer SK Demo — Real Semantic Kernel integration
- CLI Reference — Terminal-based evaluation for any provider
- Model Comparison — Compare models across providers
- LLM-as-a-Judge — Scoring with language models
- Agentic Metrics — Tool chain evaluation