ADR-003: CLI Review Commands

Status: Proposed
Date: 2026-01-07
Decision Makers: AgentEval Contributors

Context

Current CLI State

AgentEval CLI provides:

Command	Purpose	Status
`agenteval eval`	Run evaluations	✅ Implemented
`agenteval init`	Initialize config	✅ Implemented
`agenteval list`	List metrics/assertions/formats	✅ Implemented

Problem

After running multiple evaluations, users need to:

View summary — Compare aggregate metrics across runs
Diff runs — See which specific tests changed between versions
Identify regressions — Quickly spot degraded metrics

Current workflow (manual):

# Run evaluations
agenteval eval --output run1.json
agenteval eval --output run2.json

# Compare manually (no tooling!)
# User must write custom scripts or eyeball JSON files

Prior art: ai-rag-chat-evaluator already offers this workflow through its evaltools module:

# View all runs
python -m evaltools summary ./results

# Compare specific runs
python -m evaltools diff ./results/run1 ./results/run2 --changed=relevance

User Value

Feature	Value	Effort
`summary` command	High — quick overview of all runs	Low — table formatting
`diff` command	High — identifies regressions	Medium — comparison logic
`--changed` filter	Medium — focuses attention	Low — simple filter

Decision

Add two new CLI commands for reviewing evaluation results:

Command: `agenteval summary`

agenteval summary ./results

┌─────────────────────┬────────────┬─────────────┬──────────────┬───────┐
│ Run                 │ Pass Rate  │ Avg Latency │ Avg Relevance│ Tests │
├─────────────────────┼────────────┼─────────────┼──────────────┼───────┤
│ 2026-01-07_baseline │ 85%        │ 1.2s        │ 78           │ 50    │
│ 2026-01-07_v2       │ 92%        │ 1.1s        │ 85           │ 50    │
│ 2026-01-08_v3       │ 94%        │ 0.9s        │ 88           │ 50    │
└─────────────────────┴────────────┴─────────────┴──────────────┴───────┘

Options:

--format <table|json|markdown> — Output format (default: table)
--highlight <run> — Highlight a specific run for comparison

Command: `agenteval diff`

agenteval diff ./results/run1 ./results/run2

Comparing: 2026-01-07_baseline → 2026-01-07_v2

Changed tests: 7 of 50

┌────────────────────────────────────┬──────────┬──────────┬────────┐
│ Test                               │ run1     │ run2     │ Delta  │
├────────────────────────────────────┼──────────┼──────────┼────────┤
│ customer_returns_question          │ 72       │ 89       │ +17 ⬆️ │
│ password_reset_flow                │ 85       │ 91       │ +6  ⬆️ │
│ billing_inquiry                    │ 90       │ 82       │ -8  ⬇️ │
└────────────────────────────────────┴──────────┴──────────┴────────┘

Options:

--changed <metric> — Show only tests where metric changed
--threshold <n> — Minimum delta to show (default: 0)
--format <table|json|markdown> — Output format
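The filtering implied by --changed and --threshold can be sketched as follows. This is a minimal illustration rather than the DiffCommand implementation: the TestScore and DiffRow records and the RunDiff.Compare method are hypothetical names, and the sketch assumes each run exposes one numeric score per test and metric and that --threshold applies to the absolute delta.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Scores taken from the example table above; "shipping_status" is a
// made-up unchanged test added to show that zero deltas are dropped.
var run1 = new[]
{
    new TestScore("customer_returns_question", "relevance", 72),
    new TestScore("password_reset_flow", "relevance", 85),
    new TestScore("billing_inquiry", "relevance", 90),
    new TestScore("shipping_status", "relevance", 80),
};
var run2 = new[]
{
    new TestScore("customer_returns_question", "relevance", 89),
    new TestScore("password_reset_flow", "relevance", 91),
    new TestScore("billing_inquiry", "relevance", 82),
    new TestScore("shipping_status", "relevance", 80),
};

foreach (var row in RunDiff.Compare(run1, run2, changedMetric: "relevance", threshold: 5))
    Console.WriteLine($"{row.Test}: {row.Before} -> {row.After} ({row.Delta:+0;-0})");

// Hypothetical shapes; the real result schema comes from ADR-002.
public record TestScore(string Test, string Metric, double Score);

public record DiffRow(string Test, string Metric, double Before, double After)
{
    public double Delta => After - Before;
}

public static class RunDiff
{
    public static IReadOnlyList<DiffRow> Compare(
        IEnumerable<TestScore> baseline,
        IEnumerable<TestScore> candidate,
        string? changedMetric = null,
        double threshold = 0)
    {
        // Index the baseline by (test, metric) for direct lookup.
        var before = baseline.ToDictionary(s => (s.Test, s.Metric), s => s.Score);

        return candidate
            .Where(s => changedMetric is null || s.Metric == changedMetric)
            .Where(s => before.ContainsKey((s.Test, s.Metric)))
            .Select(s => new DiffRow(s.Test, s.Metric, before[(s.Test, s.Metric)], s.Score))
            // Keep only real changes whose magnitude reaches the threshold.
            .Where(r => r.Delta != 0 && Math.Abs(r.Delta) >= threshold)
            // Most negative delta first, so regressions lead the list.
            .OrderBy(r => r.Delta)
            .ToList();
    }
}
```

Sorting by delta is a design choice made for this sketch, not something the ADR specifies; it puts regressions at the top, which matches the goal of spotting degraded metrics quickly.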

Console Visualization

Use Spectre.Console (already a dependency) for rich output:

// Already referenced in AgentEval.Cli
using Spectre.Console;

// Rich table output
var table = new Table()
    .AddColumn("Run")
    .AddColumn("Pass Rate")
    .AddColumn("Latency");

table.AddRow("baseline", "[green]85%[/]", "1.2s");
table.AddRow("v2", "[green]92%[/]", "1.1s");

AnsiConsole.Write(table);

No additional dependencies required.
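Color coding for deltas can be expressed as plain Spectre.Console markup strings, so the formatting logic stays testable without a terminal. The helper below is a sketch; DeltaMarkup is a hypothetical name, and the choice of green, red, and grey is an assumption rather than a decided palette.

```csharp
using System;

// Console.WriteLine prints the raw markup; pass the same strings to
// table.AddRow(...) and Spectre.Console renders them in color.
Console.WriteLine(DeltaMarkup.Format(17));
Console.WriteLine(DeltaMarkup.Format(-8));
Console.WriteLine(DeltaMarkup.Format(0));

public static class DeltaMarkup
{
    // Wraps a score delta in a Spectre.Console color tag:
    // improvements green, regressions red, no change grey.
    public static string Format(double delta) => delta switch
    {
        > 0 => $"[green]+{delta:0.#}[/]",
        < 0 => $"[red]{delta:0.#}[/]",
        _ => "[grey]0[/]",
    };
}
```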

Consequences

Positive

Quick Insights — See run status at a glance
Regression Detection — Immediately spot degraded tests
CI-Friendly — --format json enables scripted comparison
No New Dependencies — Uses existing Spectre.Console
Industry Best Practice — Matches the summary and diff workflow of comparable tools such as ai-rag-chat-evaluator

Negative

Requires ADR-002 — Needs structured result directories
CLI Expansion — More commands to maintain

Neutral

Optional — Users can still use raw JSON if preferred

Alternatives Considered

Alternative A: Web Dashboard Only

Rejected — Adds infrastructure requirements; CLI is simpler for local use.

Alternative B: HTML Report Diff

Rejected — Better as Pro feature; CLI addresses immediate need.

Alternative C: VS Code Extension

Considered for future — Good UX but higher effort.

Alternative D: TUI (Terminal UI)

┌─────────────────────────────────────────────────┐
│ Question: How do I reset my password?           │
├────────────────────┬────────────────────────────┤
│ run1               │ run2                       │
│ Click forgot pass  │ To reset your password...  │
│ relevance: 72      │ relevance: 89 ⬆️           │
├────────────────────┴────────────────────────────┤
│ [Next] [Previous] [Quit]                        │
└─────────────────────────────────────────────────┘

Deferred — Good for v2; requires more effort. Start with simple table output.

Implementation

Prerequisite: Implement ADR-002 (DirectoryExporter)
Add SummaryCommand reading summary.json files
Add DiffCommand reading results.jsonl files
Use Spectre.Console tables with color coding
Add --format option for CI integration
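One way the --format option could branch between machine-readable and human-readable output is sketched below. SummaryRow and SummaryFormatter are hypothetical names, the row fields are assumptions about what summary.json will contain, and the table format is left out because Spectre.Console would handle it as shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

var rows = new[]
{
    new SummaryRow("2026-01-07_baseline", 0.85, 1.2, 50),
    new SummaryRow("2026-01-07_v2", 0.92, 1.1, 50),
};

Console.WriteLine(SummaryFormatter.Render(rows, "markdown"));
Console.WriteLine(SummaryFormatter.Render(rows, "json"));

// Hypothetical row shape; the real fields depend on the ADR-002 summary.json schema.
public record SummaryRow(string Run, double PassRate, double AvgLatencySeconds, int Tests);

public static class SummaryFormatter
{
    public static string Render(IEnumerable<SummaryRow> rows, string format) => format switch
    {
        // JSON output is what a CI script would consume.
        "json" => JsonSerializer.Serialize(rows, new JsonSerializerOptions { WriteIndented = true }),
        "markdown" => ToMarkdown(rows),
        _ => throw new ArgumentException($"Unsupported format: {format}", nameof(format)),
    };

    private static string ToMarkdown(IEnumerable<SummaryRow> rows)
    {
        var lines = new List<string>
        {
            "| Run | Pass Rate | Avg Latency | Tests |",
            "| --- | --- | --- | --- |",
        };
        // Invariant culture keeps the decimal separator stable across machines.
        lines.AddRange(rows.Select(r => FormattableString.Invariant(
            $"| {r.Run} | {r.PassRate * 100:0}% | {r.AvgLatencySeconds:0.0}s | {r.Tests} |")));
        return string.Join(Environment.NewLine, lines);
    }
}
```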

File Dependencies

results/
└── run1/
    ├── results.jsonl   ← DiffCommand reads this
    └── summary.json    ← SummaryCommand reads this
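Reading results.jsonl amounts to parsing one JSON object per line. The sketch below uses System.Text.Json; the TestResult fields (test, metric, score) are assumed for illustration, since the actual schema is whatever ADR-002 defines, and ResultsReader is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// In-memory stand-in for the lines of a results.jsonl file (assumed schema).
var lines = new[]
{
    "{\"test\":\"billing_inquiry\",\"metric\":\"relevance\",\"score\":82}",
    "",
    "{\"test\":\"password_reset_flow\",\"metric\":\"relevance\",\"score\":91}",
};

foreach (var r in ResultsReader.Parse(lines))
    Console.WriteLine($"{r.Test} {r.Metric} {r.Score}");

public record TestResult(string Test, string Metric, double Score);

public static class ResultsReader
{
    // Case-insensitive matching maps "test" in the JSON to the Test property.
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNameCaseInsensitive = true };

    // JSON Lines: one JSON object per line; blank lines are skipped.
    public static IEnumerable<TestResult> Parse(IEnumerable<string> lines) =>
        lines.Where(l => !string.IsNullOrWhiteSpace(l))
             .Select(l => JsonSerializer.Deserialize<TestResult>(l, Options)
                          ?? throw new InvalidDataException("Unexpected null record"));

    // Convenience wrapper for a file on disk; File.ReadLines streams lazily.
    public static IEnumerable<TestResult> ReadFile(string path) =>
        Parse(File.ReadLines(path));
}
```

Streaming line by line keeps memory use flat for large runs, which is one reason a JSON Lines file suits per-test results better than a single JSON array.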
