ADR-003: CLI Review Commands
Status: Proposed
Date: 2026-01-07
Decision Makers: AgentEval Contributors
Context
Current CLI State
AgentEval CLI provides:
| Command | Purpose | Status |
|---|---|---|
| `agenteval eval` | Run evaluations | ✅ Implemented |
| `agenteval init` | Initialize config | ✅ Implemented |
| `agenteval list` | List metrics/assertions/formats | ✅ Implemented |
Problem
After running multiple evaluations, users need to:
- View summary — Compare aggregate metrics across runs
- Diff runs — See which specific tests changed between versions
- Identify regressions — Quickly spot degraded metrics
Current workflow (manual):
# Run evaluations
agenteval eval --output run1.json
agenteval eval --output run2.json
# Compare manually (no tooling!)
# User must write custom scripts or eyeball JSON files
Equivalent workflow in a comparable tool (ai-rag-chat-evaluator):
# View all runs
python -m evaltools summary ./results
# Compare specific runs
python -m evaltools diff ./results/run1 ./results/run2 --changed=relevance
User Value
| Feature | Value | Effort |
|---|---|---|
| `summary` command | High: quick overview of all runs | Low: table formatting |
| `diff` command | High: identifies regressions | Medium: comparison logic |
| `--changed` filter | Medium: focuses attention | Low: simple filter |
Decision
Add two new CLI commands for reviewing evaluation results:
Command: agenteval summary
agenteval summary ./results
┌─────────────────────┬────────────┬─────────────┬──────────────┬───────┐
│ Run                 │ Pass Rate  │ Avg Latency │ Avg Relevance│ Tests │
├─────────────────────┼────────────┼─────────────┼──────────────┼───────┤
│ 2026-01-07_baseline │ 85%        │ 1.2s        │ 78           │ 50    │
│ 2026-01-07_v2       │ 92%        │ 1.1s        │ 85           │ 50    │
│ 2026-01-08_v3       │ 94%        │ 0.9s        │ 88           │ 50    │
└─────────────────────┴────────────┴─────────────┴──────────────┴───────┘
Options:
- `--format <table|json|markdown>`: Output format (default: table)
- `--highlight <run>`: Highlight a specific run for comparison
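A minimal sketch of how the non-table values of `--format` could be rendered. The `RunSummary` record, its field names, and the `SummaryFormatter` class are hypothetical; the real shape of a run summary is defined by ADR-002, not here. The `table` format is deliberately left out because the CLI would hand it to Spectre.Console.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.Json;

// Hypothetical row shape; field names are assumptions, not the ADR-002 schema.
public sealed record RunSummary(
    string Run, double PassRate, double AvgLatencySeconds, double AvgRelevance, int Tests);

public static class SummaryFormatter
{
    // Renders the machine-readable formats. "table" is not handled in this
    // sketch because the CLI would render it with Spectre.Console.
    public static string Render(IEnumerable<RunSummary> runs, string format) => format switch
    {
        "json" => JsonSerializer.Serialize(runs),
        "markdown" => ToMarkdown(runs),
        _ => throw new ArgumentException($"Format not handled by this sketch: {format}", nameof(format)),
    };

    private static string ToMarkdown(IEnumerable<RunSummary> runs)
    {
        var sb = new StringBuilder();
        sb.AppendLine("| Run | Pass Rate | Avg Latency | Avg Relevance | Tests |");
        sb.AppendLine("|---|---|---|---|---|");
        foreach (var r in runs)
        {
            // Invariant culture keeps the decimal separator stable across CI machines.
            sb.AppendLine(string.Format(
                CultureInfo.InvariantCulture,
                "| {0} | {1:0}% | {2:0.0}s | {3:0} | {4} |",
                r.Run, r.PassRate * 100, r.AvgLatencySeconds, r.AvgRelevance, r.Tests));
        }
        return sb.ToString();
    }
}
```

Keeping the formatter free of console calls means the same rows can feed both the Spectre.Console table and scripted consumers.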
Command: agenteval diff
agenteval diff ./results/run1 ./results/run2
Comparing: 2026-01-07_baseline → 2026-01-07_v2
Changed tests: 7 of 50
┌────────────────────────────────────┬──────────┬──────────┬────────┐
│ Test                               │ run1     │ run2     │ Delta  │
├────────────────────────────────────┼──────────┼──────────┼────────┤
│ customer_returns_question          │ 72       │ 89       │ +17 ⬆️ │
│ password_reset_flow                │ 85       │ 91       │ +6 ⬆️  │
│ billing_inquiry                    │ 90       │ 82       │ -8 ⬇️  │
└────────────────────────────────────┴──────────┴──────────┴────────┘
Options:
- `--changed <metric>`: Show only tests where the metric changed
- `--threshold <n>`: Minimum delta to show (default: 0)
- `--format <table|json|markdown>`: Output format
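The comparison behind `--changed` and `--threshold` could look roughly like the sketch below. `TestResult`, `DiffRow`, and `RunDiff` are hypothetical names. Two things are assumed rather than specified by this ADR: the threshold is applied to the absolute delta, and tests missing from either run are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: one test's metric scores within a single run.
public sealed record TestResult(string Test, IReadOnlyDictionary<string, double> Metrics);

// Hypothetical shape: one row of the diff table.
public sealed record DiffRow(string Test, double Before, double After)
{
    public double Delta => After - Before;
}

public static class RunDiff
{
    // Returns tests present in both runs whose `metric` changed by at least
    // `threshold` (absolute value, an assumption), largest change first.
    public static IReadOnlyList<DiffRow> Compare(
        IEnumerable<TestResult> baseline,
        IEnumerable<TestResult> candidate,
        string metric,
        double threshold = 0)
    {
        var before = baseline.ToDictionary(r => r.Test);

        return candidate
            .Where(r => before.ContainsKey(r.Test))
            .Where(r => r.Metrics.ContainsKey(metric) && before[r.Test].Metrics.ContainsKey(metric))
            .Select(r => new DiffRow(r.Test, before[r.Test].Metrics[metric], r.Metrics[metric]))
            .Where(row => row.Delta != 0 && Math.Abs(row.Delta) >= threshold)
            .OrderByDescending(row => Math.Abs(row.Delta))
            .ToList();
    }
}
```

Sorting by absolute delta puts the largest regressions and improvements at the top of the table, which matches the goal of spotting degraded tests quickly.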
Console Visualization
Use Spectre.Console (already a dependency) for rich output:
// Already referenced in AgentEval.Cli
using Spectre.Console;
// Rich table output
var table = new Table()
    .AddColumn("Run")
    .AddColumn("Pass Rate")
    .AddColumn("Latency");
table.AddRow("baseline", "[green]85%[/]", "1.2s");
table.AddRow("v2", "[green]92%[/]", "1.1s");
AnsiConsole.Write(table);
No additional dependencies required.
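For the diff table, color coding could be applied by wrapping each delta in the same markup tags used in the snippet above. The helper below is a sketch: `DeltaMarkup` is a hypothetical name, and it assumes that a higher score is better, which would need inverting for metrics such as latency.

```csharp
using System.Globalization;

public static class DeltaMarkup
{
    // Wraps a signed delta in Spectre.Console markup: green for gains, red for drops.
    // Assumes higher is better; invert the colors for metrics such as latency.
    public static string Format(double delta)
    {
        // Three-section format: positive values get an explicit "+", zero prints as "0".
        var text = delta.ToString("+0.##;-0.##;0", CultureInfo.InvariantCulture);
        if (delta > 0) return $"[green]{text}[/]";
        if (delta < 0) return $"[red]{text}[/]";
        return text;
    }
}
```

The returned string can be passed directly as a cell to `table.AddRow(...)`.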
Consequences
Positive
- Quick Insights: See run status at a glance
- Regression Detection: Immediately spot degraded tests
- CI-Friendly: `--format json` enables scripted comparison
- No New Dependencies: Uses existing Spectre.Console
- Established Pattern: Mirrors the summary and diff commands of comparable evaluation tools such as ai-rag-chat-evaluator
Negative
- Requires ADR-002 — Needs structured result directories
- CLI Expansion — More commands to maintain
Neutral
- Optional — Users can still use raw JSON if preferred
Alternatives Considered
Alternative A: Web Dashboard Only
Rejected — Adds infrastructure requirements; CLI is simpler for local use.
Alternative B: HTML Report Diff
Rejected — Better as Pro feature; CLI addresses immediate need.
Alternative C: VS Code Extension
Considered for future — Good UX but higher effort.
Alternative D: TUI (Terminal UI)
┌─────────────────────────────────────────────────┐
│ Question: How do I reset my password?           │
├────────────────────┬────────────────────────────┤
│ run1               │ run2                       │
│ Click forgot pass  │ To reset your password...  │
│ relevance: 72      │ relevance: 89 ⬆️           │
├────────────────────┴────────────────────────────┤
│ [Next] [Previous] [Quit]                        │
└─────────────────────────────────────────────────┘
Deferred — Good for v2; requires more effort. Start with simple table output.
Implementation
- Prerequisite: Implement ADR-002 (DirectoryExporter)
- Add `SummaryCommand` reading `summary.json` files
- Add `DiffCommand` reading `results.jsonl` files
- Use Spectre.Console tables with color coding
- Add `--format` option for CI integration
File Dependencies
results/
└── run1/
├── results.jsonl ← DiffCommand reads this
└── summary.json ← SummaryCommand reads this
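
Reading this layout might be sketched as follows. Only the file names `summary.json` and `results.jsonl` come from this ADR; `RunDirectory` and its methods are hypothetical, and the sketch returns raw `JsonElement` values so that it assumes nothing about the field layout defined in ADR-002.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public static class RunDirectory
{
    // Lists run directories under `resultsRoot` that contain a summary.json,
    // in ordinal name order so date-prefixed runs sort chronologically.
    public static IReadOnlyList<string> FindRuns(string resultsRoot) =>
        Directory.EnumerateDirectories(resultsRoot)
            .Where(dir => File.Exists(Path.Combine(dir, "summary.json")))
            .OrderBy(dir => dir, StringComparer.Ordinal)
            .ToList();

    // Parses results.jsonl: one JSON object per non-empty line.
    // Elements are cloned so they remain valid after their JsonDocument is disposed.
    public static IReadOnlyList<JsonElement> ReadResults(string runDir)
    {
        var rows = new List<JsonElement>();
        foreach (var line in File.ReadLines(Path.Combine(runDir, "results.jsonl")))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            using var doc = JsonDocument.Parse(line);
            rows.Add(doc.RootElement.Clone());
        }
        return rows;
    }
}
```

Streaming the file with `File.ReadLines` avoids loading a large results file into memory at once.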