CLI Reference

AgentEval ships a standalone CLI tool for running evaluations from the terminal, CI/CD pipelines, and shell scripts — without writing C# code.

Installation

dotnet tool install --global AgentEval.Cli --prerelease

After installation the agenteval command is available system-wide.

Quick Start

# Evaluate with Azure OpenAI
agenteval eval \
  --azure --model gpt-4o \
  --dataset tests.yaml \
  --format json -o results.json

# Evaluate with any OpenAI-compatible endpoint (Ollama, vLLM, Groq, etc.)
agenteval eval \
  --endpoint http://localhost:11434/v1 \
  --model llama3 \
  --dataset tests.yaml

# Pipe JSON to jq for quick inspection
agenteval eval --azure --model gpt-4o --dataset tests.yaml | jq '.passRate'

Commands

`agenteval eval`

Evaluate an AI agent against a dataset of test cases.

agenteval eval [options]

Required Options

Option	Description
`--dataset <file>`	Dataset file (YAML, JSON, JSONL, or CSV)
`--model <name>`	Model or deployment name

Endpoint Options (choose one)

Option	Description
`--endpoint <url>`	OpenAI-compatible API endpoint URL
`--azure`	Use Azure OpenAI (reads `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` env vars)

Authentication

Option	Description
`--api-key <key>`	API key (or set `OPENAI_API_KEY` / `AZURE_OPENAI_API_KEY` env var)

Tip: For local providers like Ollama that don't require authentication, omit --api-key.

Agent Configuration

Option	Default	Description
`--system-prompt <text>`	(none)	System prompt text
`--system-prompt-file <file>`	(none)	Read system prompt from a file
`--temperature <float>`	`0`	Sampling temperature (0 = deterministic)
`--max-tokens <int>`	(none)	Maximum output tokens

LLM-as-Judge

Option	Description
`--judge <url>`	Separate endpoint for LLM-as-judge scoring
`--judge-model <name>`	Model for the judge (defaults to `--model`)

When --judge is provided, the evaluation harness uses a separate LLM to score agent responses using the TaskCompletionMetric and other LLM-backed metrics.

Output

Option	Default	Description
`--format <fmt>`	`json`	Export format (see table below)
`-o, --output <file>`	stdout	Output file path

Supported formats:

Format	Aliases	Description
`json`		JSON report with full details
`junit`	`xml`	JUnit XML for CI/CD integration
`markdown`	`md`	Human-readable Markdown table
`trx`		Visual Studio TRX format
`csv`		Comma-separated values

Verbosity

Option	Description
`--verbose`	Show detailed progress during evaluation
`--quiet`	Suppress all output except the export data

Exit Codes

Code	Meaning
`0`	All tests passed
`1`	One or more tests failed
`2`	Usage error (invalid arguments)
`3`	Runtime error (network, auth, file not found)

Exit codes are designed for CI/CD integration — a non-zero exit code fails the pipeline step.

Dataset Format

The CLI uses DatasetLoaderFactory which auto-detects format by file extension.

YAML (recommended)

data:
  - id: capital_city
    input: What is the capital of France?
    expected: Paris
    expected_tools:
      - lookup_city

  - id: simple_math
    input: What is 7 * 8?
    expected: "56"

JSON

[
  { "input": "What is the capital of France?", "expected": "Paris" },
  { "input": "What is 7 * 8?", "expected": "56" }
]

JSONL

{"input": "What is the capital of France?", "expected": "Paris"}
{"input": "What is 7 * 8?", "expected": "56"}

CSV

input,expected
"What is the capital of France?","Paris"
"What is 7 * 8?","56"

CI/CD Examples

GitHub Actions

- name: Evaluate AI Agent
  run: |
    agenteval eval \
      --azure --model gpt-4o \
      --dataset tests/eval-dataset.yaml \
      --format junit -o results.xml
  env:
    AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
    AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}

- name: Publish Test Results
  uses: dorny/test-reporter@v1
  with:
    name: Agent Evaluation
    path: results.xml
    reporter: java-junit

Azure DevOps

- task: DotNetCoreCLI@2
  displayName: Install AgentEval CLI
  inputs:
    command: custom
    custom: tool
    arguments: install --global AgentEval.Cli --prerelease

- script: |
    agenteval eval \
      --azure --model gpt-4o \
      --dataset $(Build.SourcesDirectory)/tests/eval-dataset.yaml \
      --format trx -o $(Build.ArtifactStagingDirectory)/eval-results.trx
  displayName: Run Agent Evaluation
  env:
    AZURE_OPENAI_ENDPOINT: $(AZURE_OPENAI_ENDPOINT)
    AZURE_OPENAI_API_KEY: $(AZURE_OPENAI_API_KEY)

- task: PublishTestResults@2
  inputs:
    testResultsFormat: VSTest
    testResultsFiles: '**/*.trx'

Provider Examples

Azure OpenAI

agenteval eval --azure --model gpt-4o --dataset tests.yaml

OpenAI

agenteval eval \
  --endpoint https://api.openai.com/v1 \
  --model gpt-4o \
  --api-key $OPENAI_API_KEY \
  --dataset tests.yaml

Ollama (local)

agenteval eval \
  --endpoint http://localhost:11434/v1 \
  --model llama3.1 \
  --dataset tests.yaml

LM Studio

agenteval eval \
  --endpoint http://localhost:1234/v1 \
  --model local-model \
  --dataset tests.yaml

Groq

agenteval eval \
  --endpoint https://api.groq.com/openai/v1 \
  --model llama-3.1-70b-versatile \
  --api-key $GROQ_API_KEY \
  --dataset tests.yaml

Together.ai

agenteval eval \
  --endpoint https://api.together.xyz/v1 \
  --model meta-llama/Llama-3-70b-chat-hf \
  --api-key $TOGETHER_API_KEY \
  --dataset tests.yaml

Piping & Composition

The CLI follows Unix conventions: data goes to stdout, messages to stderr.

# Pipe JSON to jq
agenteval eval --azure --model gpt-4o --dataset tests.yaml | jq '.passRate'

# Save JSON and view Markdown
agenteval eval --azure --model gpt-4o --dataset tests.yaml -o results.json
agenteval eval --azure --model gpt-4o --dataset tests.yaml --format md

# Compare two models (run both, diff)
agenteval eval --azure --model gpt-4o --dataset tests.yaml -o gpt4o.json
agenteval eval --azure --model gpt-4o-mini --dataset tests.yaml -o gpt4omini.json

Troubleshooting

Error	Cause	Fix
`Specify --endpoint <url> or --azure`	No endpoint provided	Add `--endpoint` or `--azure`
`Dataset not found`	File doesn't exist	Check `--dataset` path
`Dataset is empty`	File has no test cases	Ensure YAML/JSON has `data:` entries
`Unknown format 'xxx'`	Unsupported export format	Use: json, junit, xml, markdown, md, trx, csv
`Azure endpoint required`	Missing Azure config	Set `AZURE_OPENAI_ENDPOINT` env var
`Azure API key required`	Missing Azure key	Set `AZURE_OPENAI_API_KEY` env var

Architecture

The CLI is a thin shell over the same libraries you use in C# code:

agenteval eval --azure --model gpt-4o --dataset tests.yaml --format junit
       │
       ▼
  ┌─────────────────┐
  │  EvalCommand     │  Parse options, validate
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ EndpointFactory  │  Create IChatClient (Azure or OpenAI-compatible)
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ AsEvaluableAgent │  IChatClient → IStreamableAgent (zero boilerplate)
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ DatasetLoader    │  YAML/JSON/JSONL/CSV → List<TestCase>
  └────────┬────────┘
           ▼
  ┌─────────────────────────┐
  │ MAFEvaluationHarness    │  Run tests, track tools & performance
  └────────┬────────────────┘
           ▼
  ┌─────────────────┐
  │ ExportHandler    │  Format routing → stdout or file
  └─────────────────┘

All evaluation logic lives in the shared libraries — the CLI adds no custom evaluation code.

Table of Contents