# ADR-009: Benchmark Strategy

## Status
Accepted

## Date
2026-01-13

## Context
AgentEval aims to provide comprehensive testing capabilities for AI agents. A key question is: How should AgentEval support industry-standard benchmarks like BFCL, GAIA, ToolBench, and HumanEval?
The agentic AI ecosystem has several established benchmarks:

| Benchmark | Focus | Dataset Source | Complexity |
|---|---|---|---|
| BFCL (Berkeley Function Calling) | Tool/function calling accuracy | HuggingFace | Low |
| GAIA | General AI Assistant capabilities | HuggingFace | Medium |
| ToolBench | Complex API tool usage | GitHub | Medium |
| HumanEval | Code generation correctness | OpenAI | High (requires code execution) |
| WebArena | Web browsing tasks | Requires browser | Very High |

### Key Questions
- Should AgentEval include dataset loaders for these benchmarks?
- Should AgentEval auto-download from HuggingFace?
- How do we handle benchmarks requiring code execution (HumanEval)?
- What's the right abstraction for "running a benchmark"?

## Decision

### Tiered Benchmark Support
We will implement benchmark support in three tiers:

#### Tier 1: Core Infrastructure (In Scope - Phase 1)
- Benchmark abstractions: `IBenchmark`, `IBenchmarkDataset`, `IBenchmarkResult`
- Manual test case patterns: users create BFCL/GAIA-style test cases by hand (see the sketch after this list)
- Result aggregation: Accuracy, pass rate, latency statistics
- Leaderboard comparison: Compare your scores against published results
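
To make the Tier 1 pattern concrete, the sketch below shows a hand-written, BFCL-style function-calling test case and a simple leaderboard comparison. The `ExpectedCall`, `FunctionCallTestCase`, and `Tier1Example` names are illustrative assumptions, not existing AgentEval types.

```csharp
using System.Collections.Generic;

// Illustrative sketch only: these types are assumptions, not committed AgentEval API.
public sealed record ExpectedCall(string FunctionName, IReadOnlyDictionary<string, object> Arguments);

public sealed record FunctionCallTestCase(
    string Id,
    string Prompt,          // user request handed to the agent
    ExpectedCall Expected); // the tool call the agent is expected to produce

public static class Tier1Example
{
    // A hand-written, BFCL-style test case: no dataset download required.
    public static readonly FunctionCallTestCase WeatherCase = new(
        Id: "weather-simple-001",
        Prompt: "What's the weather in Paris right now?",
        Expected: new ExpectedCall(
            "get_current_weather",
            new Dictionary<string, object> { ["location"] = "Paris" }));

    // Leaderboard comparison: local accuracy versus a published reference score.
    public static string Summarize(string benchmark, double localScore, double publishedScore) =>
        $"{benchmark}: local {localScore:P1} vs published {publishedScore:P1} " +
        $"({localScore - publishedScore:+0.0%;-0.0%})";
}
```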

#### Tier 2: Dataset Loaders (In Scope - Phase 2)
- BFCL Loader: Download and parse BFCL dataset from HuggingFace
- GAIA Loader: Download and parse GAIA dataset
- ToolBench Loader: Download and parse ToolBench scenarios
- Caching: Local caching of downloaded datasets
- Subset selection: Run on subsets for quick validation

#### Tier 3: Execution Sandboxes (Out of Scope - Future Consideration)
- HumanEval: Requires secure code execution sandbox
- WebArena: Requires browser automation
- SWE-bench: Requires git/code editing capabilities

## Rationale

**Why Tier 1 First:**
- Provides immediate value with zero external dependencies
- Users can create benchmark-style tests without downloading datasets
- Works in air-gapped environments

**Why Tier 2 In Scope:**
- BFCL, GAIA, ToolBench are static datasets (no execution needed)
- High community value - "run BFCL with one command"
- Manageable implementation effort (weeks, not months)

**Why Tier 3 Out of Scope (for now):**
- Security: Executing untrusted LLM-generated code is dangerous
- Complexity: Sandboxing requires Docker, process isolation, resource limits
- Maintenance: These are moving targets with frequent updates
- Focus: AgentEval's core value is agent testing, not code execution

### HumanEval: If We Did It
If we decided to support HumanEval in the future, the approach would be:
- Docker-based sandbox: Each code execution in isolated container
- Resource limits: CPU, memory, time limits per execution
- Network isolation: No network access from sandbox
- Pass@k scoring: run k samples per problem and estimate the probability that at least one passes (see the estimator sketch below)
- Language support: Python first, then expand
Estimated effort: 4-6 weeks for a secure implementation.
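
For reference on pass@k scoring: the standard unbiased estimator from the HumanEval paper (Chen et al., 2021) draws n >= k samples per problem, counts the c samples that pass, and computes pass@k = 1 - C(n-c, k) / C(n, k), averaged over all problems. A minimal C# sketch of that estimator (the method name is ours, not committed API):

```csharp
// Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
// computed as a running product for numerical stability.
public static double PassAtK(int n, int c, int k)
{
    if (n - c < k)
        return 1.0; // every size-k subset of the n samples contains a passing one

    double failAllK = 1.0; // probability a random size-k subset contains no passing sample
    for (int i = n - c + 1; i <= n; i++)
        failAllK *= 1.0 - (double)k / i; // running product equal to C(n - c, k) / C(n, k)

    return 1.0 - failAllK;
}
```

Per-problem estimates are then averaged to produce the reported benchmark score.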

## Implementation

### Phase 1: Core Infrastructure
```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Core abstractions

// A benchmark that can be executed against any agent under test.
public interface IBenchmark
{
    string Name { get; }
    string Version { get; }
    Task<BenchmarkResult> RunAsync(ITestableAgent agent, BenchmarkOptions options);
}

// A named collection of benchmark test cases, streamed asynchronously.
public interface IBenchmarkDataset
{
    string Name { get; }
    int Count { get; }
    IAsyncEnumerable<BenchmarkTestCase> GetTestCasesAsync();
}

// Aggregated scores and statistics produced by a benchmark run.
public interface IBenchmarkResult
{
    string BenchmarkName { get; }
    double OverallScore { get; }
    Dictionary<string, double> CategoryScores { get; }
    BenchmarkStatistics Statistics { get; }
}
```
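
The supporting types referenced by these interfaces (`BenchmarkOptions`, `BenchmarkTestCase`, `BenchmarkResult`, `BenchmarkStatistics`) are not specified in this ADR; the record shapes below are a minimal sketch offered purely as an assumption.

```csharp
using System;
using System.Collections.Generic;

// Assumed shapes for the supporting types (illustrative, not committed API).
public sealed record BenchmarkOptions(
    int? MaxTestCases = null,          // optional subset size for quick validation runs
    TimeSpan? PerCaseTimeout = null);  // optional per-test-case time limit

public sealed record BenchmarkTestCase(
    string Id,
    string Category,        // e.g. a BFCL category name
    string Input,           // prompt or task handed to the agent
    string ExpectedOutput); // reference answer or expected tool call, serialized

public sealed record BenchmarkStatistics(
    int TotalCases,
    int PassedCases,
    TimeSpan AverageLatency);

// Concrete result type returned by IBenchmark.RunAsync; it could also
// implement IBenchmarkResult if the property types are kept in sync.
public sealed record BenchmarkResult(
    string BenchmarkName,
    double OverallScore,
    Dictionary<string, double> CategoryScores,
    BenchmarkStatistics Statistics);
```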

### Phase 2: Dataset Loaders
```csharp
// BFCL loader example
public class BfclDatasetLoader : IBenchmarkDatasetLoader
{
    private const string DatasetUrl =
        "https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard";

    public async Task<IBenchmarkDataset> LoadAsync(BfclCategory category)
    {
        // 1. Download the requested category from HuggingFace (with local caching).
        // 2. Parse the JSON structure into typed test cases.
        // 3. Return the typed dataset.
        throw new NotImplementedException(); // placeholder until Phase 2 lands
    }
}

// Usage
var loader = new BfclDatasetLoader();
var dataset = await loader.LoadAsync(BfclCategory.SimplePython);
var benchmark = new BfclBenchmark(dataset);
var result = await benchmark.RunAsync(agent, options);
```
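
Caching and subset selection (Tier 2) are not shown in the loader above. The sketch below illustrates one way they could work; `DatasetLoadOptions` and `DatasetCache` are hypothetical names, not committed API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch only: option and helper names are assumptions.
public sealed record DatasetLoadOptions(
    string CacheDirectory,   // where downloaded dataset files are kept between runs
    int? MaxCases = null);   // take only the first N cases for quick validation

public static class DatasetCache
{
    // Download a dataset file once and reuse the cached copy on later runs.
    public static async Task<string> GetOrDownloadAsync(HttpClient http, string url, DatasetLoadOptions options)
    {
        Directory.CreateDirectory(options.CacheDirectory);
        var localPath = Path.Combine(options.CacheDirectory, Path.GetFileName(new Uri(url).LocalPath));

        if (!File.Exists(localPath))
        {
            var bytes = await http.GetByteArrayAsync(url);
            await File.WriteAllBytesAsync(localPath, bytes);
        }

        return localPath;
    }

    // Trim a parsed dataset down to a quick-validation subset when requested.
    public static IReadOnlyList<T> ApplySubset<T>(IReadOnlyList<T> cases, DatasetLoadOptions options)
    {
        if (options.MaxCases is int n && n < cases.Count)
            return cases.Take(n).ToList();
        return cases;
    }
}
```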

### Directory Structure
```
src/AgentEval/
├── Benchmarks/
│   ├── Core/
│   │   ├── IBenchmark.cs
│   │   ├── IBenchmarkDataset.cs
│   │   ├── BenchmarkResult.cs
│   │   └── BenchmarkRunner.cs
│   ├── Datasets/
│   │   ├── BfclDatasetLoader.cs
│   │   ├── GaiaDatasetLoader.cs
│   │   └── ToolBenchDatasetLoader.cs
│   └── Implementations/
│       ├── BfclBenchmark.cs
│       ├── GaiaBenchmark.cs
│       └── ToolBenchBenchmark.cs
```

## Consequences

### Positive
- Clear scope boundaries prevent scope creep
- Phase 1 delivers immediate value
- Phase 2 enables "run BFCL with one line" experience
- Security-conscious approach to code execution

### Negative
- No HumanEval support initially
- Users wanting HumanEval must implement their own sandbox
- Some community pressure for "complete" benchmark support

### Neutral
- Dataset loaders add an optional dependency on HuggingFace downloads
- Caching requires disk space management

## Alternatives Considered

### 1. Full Benchmark Support Including HumanEval
Rejected: the security risks outweigh the benefits, and code execution would require a significant investment in sandboxing.

### 2. No Benchmark Support (Just Patterns)
Rejected: this misses the opportunity to deliver significant value through dataset loaders.

### 3. External Benchmark Tool
Rejected: a fragmented experience; users want one tool, not two.

## References
- BFCL Leaderboard
- GAIA Benchmark
- ToolBench
- HumanEval
- ADR-005: Model Comparison & Stochastic Testing