Roadmap
AgentEval is actively developed. This page outlines completed features and planned enhancements.
Current Status: v1.0.0-alpha
AgentEval is in alpha: all core features are complete and ready for production use.
✅ Completed Features
Core Testing (v1.0.0-alpha)
- [x] Test harness for AI agents (`MAFTestHarness`, `ITestHarness`)
- [x] Fluent assertions for tool usage, performance, and responses
- [x] Multi-turn conversation testing (`ConversationRunner`)
- [x] Snapshot testing for regression detection (`SnapshotComparer`)
- [x] Workflow testing for multi-agent orchestration
Metrics & Evaluation
- [x] RAG metrics: Faithfulness, Relevance, Context Precision/Recall, Answer Correctness
- [x] Agentic metrics: Tool Selection, Tool Arguments, Tool Success, Task Completion, Efficiency
- [x] Embedding-based similarity metrics
- [x] AI-powered response evaluation
Performance & Observability
- [x] Streaming support with real-time callbacks
- [x] Time to First Token (TTFT) tracking
- [x] Per-tool timing and execution waterfall
- [x] Token counting and cost estimation (8+ models)
- [x] Performance benchmarks (latency, throughput, cost)
CI/CD Integration
- [x] CLI tool (`agenteval eval`, `agenteval init`, `agenteval list`)
- [x] Result exporters: JSON, JUnit XML, Markdown, TRX
- [x] Dataset loaders: JSON, JSONL, CSV, YAML
Framework Support
- [x] Microsoft Agent Framework (MAF) adapter
- [x] Generic `IChatClient` adapter
- [x] Microsoft.Extensions.AI.Evaluation integration
🔄 In Progress
Documentation & Community
- [x] Community files (CONTRIBUTING, CODE_OF_CONDUCT, SECURITY)
- [x] GitHub issue and PR templates
- [x] Installation and walkthrough documentation
- [ ] Complete API reference documentation (auto-generated from XML docs)
- [ ] Video tutorials and walkthroughs
- [ ] Community Discord server (deferred until 50+ active users)
📋 Planned Features
Short-term (Q1 2026)
- [x] Workflow assertions P0 enhancements (`because`, structured exceptions)
- [ ] CLI `summary` command: tabular view of runs in a directory
- [ ] CLI `diff` command: side-by-side answer comparison
- [ ] Standardized result directory structure (eval_results.jsonl, summary.json)
- [ ] Console visualization enhancements (Spectre.Console tables, progress)
- [ ] Visual assertion reports (ASCII diagrams)
- [ ] GitHub Actions workflow templates
- [ ] Visual Studio test integration
- [ ] Additional framework adapters (Semantic Kernel)
Medium-term (Q2 2026)
- [ ] Code metrics (ResponseLength, HasCitation, CitationMatch)
- [ ] Refusal quality metric ("dontknowness" for unanswerable questions)
- [ ] Multi-agent orchestration contracts
- [ ] Assertion telemetry (local storage)
- [ ] Self-healing assertions (rule-based)
- [ ] Record/Replay for deterministic testing
- [ ] Experiment management and A/B testing
- [ ] Baseline comparison dashboard
Long-term (Q3-Q4 2026)
- [ ] Visual assertion reports (HTML/interactive) — Premium
- [ ] Assertion telemetry (cloud dashboard) — Premium
- [ ] Self-healing assertions (LLM-powered) — Premium
- [ ] Assertion-driven prompt optimization — Premium
- [ ] Self-hosted dashboard & baseline comparison — Enterprise Self-Host
- [ ] AgentEval Studio (self-hosted option) — Enterprise Self-Host
- [ ] Red-teaming and safety testing
- [ ] Synthetic dataset generation
- [ ] AgentEval Studio (visual workflow editor)
Premium and Enterprise features are planned for future releases. Watch GitHub Releases for announcements.
Feature Requests
Have a feature request? Open an issue on GitHub!
Version History
| Version | Date | Highlights |
|---|---|---|
| 1.0.0-alpha | Jan 2026 | Initial public release with core features |
See CHANGELOG.md for detailed release notes.