Agentic Benchmark — Cost Guidance
This document describes the cost-tier classification for every evaluator in the AgentEval agentic benchmark suite, explains how to use the --budget-tier CLI flag to control evaluation spend, and provides estimated costs per preset and use-case scenario.
Cost-Tier Definitions
The EvaluatorCostTier enum (in AgentEval.Abstractions/Evals/EvaluatorCostTier.cs) defines four tiers:
| Tier | Enum value | Description | Approx. cost per scenario |
| --- | --- | --- | --- |
| TRIVIAL | EvaluatorCostTier.Trivial | Pure-code computation: no LLM invocation, no API cost. Examples: F1Score, all Telemetry evaluators, JudgeQuality evaluators, CostQualityEfficiency. | ~$0.000 |
| LOW | EvaluatorCostTier.Low | Single short prompt to the judge model: one turn, minimal context. Examples: Relevance, TaskCompletion, VerbosityAppropriateness, DirectInjection. | ~$0.005–$0.010 |
| MEDIUM | EvaluatorCostTier.Medium | Single prompt with moderate context (tool call results, prior turn, plan text). Examples: TurnCoherence, IntermediateStepHallucination, SelfCorrectionQuality. | ~$0.010–$0.050 |
| HIGH | EvaluatorCostTier.High | Single prompt with large context (full conversation history at 10+ turns). Examples: MemoryRecallAccuracy, LongConversationCoherence, GoalTracking. | ~$0.050–$0.200 |
Cost estimates assume a GPT-4o-class judge (approximately $0.0025/1K input tokens, $0.010/1K output tokens). Actual costs depend on your model, provider pricing, and prompt length.
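As a rough worked example of the arithmetic behind these ranges, the sketch below prices a single judge call from its token counts using the per-1K rates quoted above. The function name and the token counts are illustrative assumptions, not measurements taken from AgentEval.

```python
# Approximate GPT-4o-class rates quoted in this document (USD per 1K tokens).
INPUT_USD_PER_1K = 0.0025
OUTPUT_USD_PER_1K = 0.010


def judge_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate USD cost of one judge call."""
    input_cost = (input_tokens / 1000) * INPUT_USD_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_USD_PER_1K
    return input_cost + output_cost


# Hypothetical token counts, chosen only to show where each tier's range comes from.
print(f"short prompt:     ${judge_call_cost(1_500, 200):.5f}")   # $0.00575 (LOW range)
print(f"moderate context: ${judge_call_cost(8_000, 300):.5f}")   # $0.02300 (MEDIUM range)
print(f"full history:     ${judge_call_cost(30_000, 500):.5f}")  # $0.08000 (HIGH range)
```

Input tokens dominate as context grows, which is why the tiers are defined largely by how much history or tool output is placed in the judge prompt.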
Per-Evaluator Cost-Tier Table
Phase 1 — System + Process
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| task_completion | LOW | Single-turn LLM judge |
| task_adherence | MEDIUM | Composite of 5 sub-LLM judges (goal/rule/procedural/presentation/authorization) |
| intent_identification | LOW | Single-turn LLM judge |
| intent_resolution | MEDIUM | Composite of 2 sub-LLM judges (intent_identified + intent_resolved) |
| task_navigation_efficiency | LOW | Hybrid: deterministic edit-distance (free) + 1 LLM judge call |
| tool_selection | LOW | Single-turn LLM judge |
| tool_input_accuracy | LOW | Hybrid: schema check (free) + 1 LLM call |
| tool_output_utilization | LOW | Single-turn LLM judge |
| tool_call_success | TRIVIAL | Deterministic-first; LLM fallback only when status field absent |
| tool_efficiency | LOW | Single-turn LLM judge |
| tool_call_accuracy | MEDIUM | Composite of 5 sub-evaluators; total LLM calls multiply |
Phase 2 — RAG / Quality
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| groundedness | MEDIUM | Composite of 4 sub-LLM judges (claim_support / claim_contradicted / citation_accuracy / evidence_coverage) |
| relevance | LOW | Single-turn LLM judge |
| coherence | LOW | Single-turn LLM judge |
| fluency | LOW | Single-turn LLM judge |
| similarity | LOW | Single-turn LLM judge |
| response_completeness | LOW | Single-turn LLM judge |
| f1_score | TRIVIAL | Pure-code token overlap |
| qa_composite | HIGH | Composite of all 7 above: GroundednessEval alone fans out to 4 LLM judges; 10+ LLM calls per scenario |
Phase 3 — Judge Quality
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| judge_agreement | TRIVIAL | Pure-code Cohen's kappa computation |
| calibration_accuracy | TRIVIAL | Pure-code accuracy computation |
| judge_drift | TRIVIAL | Pure-code max-delta computation |
Phase 4 — Safety
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| prohibited_actions | LOW | Hybrid deterministic policy-as-code + LLM fallback |
| sensitive_data_leakage | LOW | Hybrid regex scan + LLM fallback |
| indirect_attack | LOW | Single-turn LLM judge (XPIA) |
| hate_unfairness | LOW | Hybrid content-safety client + LLM fallback |
| sexual | LOW | Hybrid content-safety client + LLM fallback |
| violence | LOW | Hybrid content-safety client + LLM fallback |
| self_harm | LOW | Hybrid content-safety client + LLM fallback |
| protected_material | LOW | Single-turn LLM judge |
| code_vulnerability | LOW | Single-turn LLM judge |
| ungrounded_attributes | LOW | Single-turn LLM judge |
| system_prompt_leakage | LOW | Hybrid pattern scan + LLM fallback |
| unsafe_tool_use | MEDIUM | LLM-judge in v1 (deterministic short-circuit when no tool calls); v2 will add policy-driven short-circuit |
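Several entries above are described as a hybrid of a deterministic check with an LLM fallback. A minimal sketch of that pattern follows, to show why such evaluators can stay in the LOW tier. The regex, the function names, and the fallback condition (consult the judge only when the scan finds nothing) are assumptions for illustration; this document does not specify the actual rules or when the fallback fires.

```python
import re
from typing import Callable, Tuple

# Hypothetical pattern for illustration; real evaluators ship their own rules.
SECRET_PATTERN = re.compile(r"\b(?:api[_-]?key|password)\s*[:=]\s*\S+", re.IGNORECASE)


def hybrid_leak_check(response: str, llm_judge: Callable[[str], bool]) -> Tuple[bool, int]:
    """Return (leak_detected, llm_calls_made) for one response."""
    if SECRET_PATTERN.search(response):
        # Deterministic hit: the verdict is known without spending a judge call.
        return True, 0
    # Scan found nothing: fall back to a single judge call (assumed behaviour).
    return llm_judge(response), 1


def stub_judge(text: str) -> bool:
    """Stand-in for a real judge call; always reports no leak."""
    return False


print(hybrid_leak_check("password: hunter2", stub_judge))    # (True, 0)
print(hybrid_leak_check("The weather is fine.", stub_judge))  # (False, 1)
```

At most one judge call is made per scenario, and none when the deterministic check is conclusive.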
Phase 5 — Telemetry + Stochastic Stability
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| latency | TRIVIAL | Pure-code telemetry check |
| token_usage | TRIVIAL | Pure-code telemetry check |
| cost | TRIVIAL | Pure-code telemetry check |
| error_rate | TRIVIAL | Pure-code telemetry check |
| retry_rate | TRIVIAL | Pure-code telemetry check |
| tool_latency | TRIVIAL | Pure-code telemetry check |
| stochastic_stability | TRIVIAL | Pure-code statistical analysis over N prior runs |
Phase 6 — Memory
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| memory_recall_accuracy | HIGH | Full conversation history (10+ turns) injected into judge prompt |
| long_conversation_coherence | HIGH | Full conversation history (10+ turns) injected into judge prompt |
Phase 6 — Multi-Turn
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| turn_coherence | MEDIUM | Previous turn only (small context) |
| goal_tracking | HIGH | Full conversation history required |
| clarification_appropriateness | LOW | Single-turn; optional history |
clarification_appropriateness |
LOW |
Single-turn; optional history |
Phase 6 — Reasoning
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| reasoning_correctness | MEDIUM | Single LLM call with reasoning trace context |
| goal_decomposition_quality | LOW | Single LLM call over goal + response |
| plan_formulation_quality | MEDIUM | Single LLM call over plan text + query |
| intermediate_step_hallucination | MEDIUM | LLM cross-check of claims vs. tool call results |
Phase 6 — Calibration
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| confidence_calibration | LOW | Single-turn LLM judge |
| uncertainty_acknowledgment | LOW | Single-turn LLM judge |
| self_correction_quality | MEDIUM | Original exchange + correction turn in judge context |
Phase 6 — Adversarial
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| direct_injection | LOW | Deterministic pattern scan (zero LLM cost); LLM only on match or nuanced case |
| persona_attack | LOW | Deterministic template scan (zero LLM cost); LLM only on match or nuanced case |
| jailbreak_resistance | MEDIUM | Scans combined library; up to N LLM calls per matched pattern |
Phase 6 — UX
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| verbosity_appropriateness | LOW | Single-turn LLM judge |
| tone_appropriateness | LOW | Single-turn LLM judge |
| refusal_quality | LOW | Single-turn LLM judge; fast-pass when not a refusal |
Phase 6 — Efficiency
| Evaluator key | Cost tier | Notes |
| --- | --- | --- |
| cost_quality_efficiency | TRIVIAL | Pure-code formula; no LLM call |
Recommended Budget Per Use Case
Dev-loop iteration (--budget-tier low)
Use during active development for fast feedback without incurring significant API costs.
- Keeps: all TRIVIAL + LOW tier evaluators
- Removes: MEDIUM and HIGH tier evaluators
- Conversational preset result after low filter: 1 component retained (clarification_appropriateness). Weight renormalised to 1.0.
- Recommended presets for dev-loop: agentic-execution, rag-quality, safety, user-experience
agenteval bench agentic --preset agentic-execution --subject MyAgent --budget-tier low
agenteval bench agentic --preset user-experience --subject MyAgent --budget-tier low
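The filter-and-renormalise behaviour described in the bullets above can be sketched as follows. The tiers are those listed for the conversational preset's components in the Phase 6 tables; the equal starting weights are an assumption, since this document does not list the preset's actual weights, and the function name is hypothetical.

```python
TIER_RANK = {"TRIVIAL": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3}

# component -> (tier, weight). Equal weights are assumed purely for illustration.
CONVERSATIONAL = {
    "memory_recall_accuracy": ("HIGH", 0.2),
    "long_conversation_coherence": ("HIGH", 0.2),
    "goal_tracking": ("HIGH", 0.2),
    "turn_coherence": ("MEDIUM", 0.2),
    "clarification_appropriateness": ("LOW", 0.2),
}


def filter_and_renormalise(components, budget):
    """Drop components above the budget tier, then rescale weights to sum to 1."""
    kept = {
        name: weight
        for name, (tier, weight) in components.items()
        if TIER_RANK[tier] <= TIER_RANK[budget]
    }
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing survives the filter
    return {name: weight / total for name, weight in kept.items()}


print(filter_and_renormalise(CONVERSATIONAL, "LOW"))
# {'clarification_appropriateness': 1.0}
print(filter_and_renormalise(CONVERSATIONAL, "MEDIUM"))
# {'turn_coherence': 0.5, 'clarification_appropriateness': 0.5}
```

With a low budget a single component carries the whole score, so a low-tier run of this preset is best read as a smoke test rather than a measure of conversational quality.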
PR build gate (--budget-tier medium)
Use in CI for pull request validation — balances speed and coverage.
- Keeps: all TRIVIAL + LOW + MEDIUM tier evaluators
- Removes: HIGH tier evaluators (full-history conversation evaluators)
- Conversational preset result after medium filter: 2 components retained (turn_coherence, clarification_appropriateness); the three HIGH-tier evaluators (memory_recall_accuracy, long_conversation_coherence, goal_tracking) are removed and remaining weights renormalised.
agenteval bench agentic --preset conversational --subject MyAgent --budget-tier medium
agenteval bench agentic --preset reasoning --subject MyAgent --budget-tier medium
Release-gate audit (no --budget-tier flag, defaults to all)
Use for full coverage at release time or for scheduled quality audits.
- All evaluators run regardless of cost tier.
- No filtering applied.
agenteval bench agentic --preset conversational --subject MyAgent
agenteval bench agentic --preset adversarial-direct --subject MyAgent
Estimated Full-Calibration Costs Per Preset
Estimates assume a GPT-4o-class judge, 10 scenarios per evaluator, and average prompt lengths typical for each tier. Costs are approximate.
| Preset | Components | Tiers | Est. cost per run (10 scenarios) |
| --- | --- | --- | --- |
| agentic-execution | 6 | LOW×6 | ~$0.30–$0.60 |
| tool-call-accuracy | 5 (via aggregate) | LOW×5 | ~$0.25–$0.50 |
| rag-quality | 7 | TRIVIAL×1, LOW×5, HIGH×1 | ~$0.80–$2.00 |
| judge-quality | 3 | TRIVIAL×3 | ~$0.00 |
| safety | 12 | LOW×11, MEDIUM×1 | ~$0.60–$1.20 |
| telemetry | 6 | TRIVIAL×6 | ~$0.00 |
| stochastic-stability | 1 | TRIVIAL×1 | ~$0.00 |
| conversational | 5 | HIGH×3, MEDIUM×1, LOW×1 | ~$1.50–$3.00 |
| reasoning | 4 | MEDIUM×4 | ~$0.40–$2.00 |
| user-experience | 5 | LOW×5 | ~$0.25–$0.50 |
| adversarial-direct | 3 | MEDIUM×1, LOW×2 | ~$0.15–$0.60 |
The conversational preset is the most expensive because its three HIGH-tier evaluators (MemoryRecallAccuracy, LongConversationCoherence, GoalTracking) embed full conversation histories in every judge prompt. For dev-loop use, apply --budget-tier medium or --budget-tier low.
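One plausible way to derive a per-run range is to multiply each tier's per-scenario range by the number of evaluators in that tier and by the scenario count. The sketch below does exactly that; the function name is hypothetical, and the published figures for mixed-tier presets are approximate and do not all equal this simple sum, so treat it as an illustration rather than the method used to produce the table.

```python
# Per-scenario USD ranges from the Cost-Tier Definitions table.
TIER_RANGE = {
    "TRIVIAL": (0.0, 0.0),
    "LOW": (0.005, 0.010),
    "MEDIUM": (0.010, 0.050),
    "HIGH": (0.050, 0.200),
}


def estimate_run_cost(tier_counts, scenarios=10):
    """Sum each tier's per-scenario range over its evaluators and all scenarios."""
    low = sum(TIER_RANGE[tier][0] * n for tier, n in tier_counts.items()) * scenarios
    high = sum(TIER_RANGE[tier][1] * n for tier, n in tier_counts.items()) * scenarios
    return round(low, 2), round(high, 2)


print(estimate_run_cost({"MEDIUM": 4}))  # reasoning preset: (0.4, 2.0)
print(estimate_run_cost({"LOW": 5}))     # user-experience preset: (0.25, 0.5)
print(estimate_run_cost({"TRIVIAL": 3})) # judge-quality preset: (0.0, 0.0)
```

Changing `scenarios` scales both bounds linearly, which is a quick way to budget for larger scenario sets.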
Implementation Reference
- Enum: AgentEval.Evals.EvaluatorCostTier in src/AgentEval.Abstractions/Evals/
- Cost map: AgentEval.Evals.EvaluatorCostMap (static dictionary keyed by evaluator key; covers all shipped atomic evaluators plus the composites and aggregates that consumers reference directly)
- Filter logic: AgentEval.Evals.Agentic.Composition.CostFilteredCompositeBuilder.FilterByBudget
- CLI integration: --budget-tier flag on agenteval bench agentic
The EvaluatorCostMap.IsWithinBudget(tier, budget) method returns true when tier <= budget. Unknown evaluator keys default to Medium (conservative).
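The comparison and the unknown-key fallback described above can be illustrated with a small language-neutral sketch. This is Python, not the library's C# code: the numeric enum values, the `tier_for` helper, and the four-entry map are assumptions for illustration (the map entries are copied from the tables in this document).

```python
from enum import IntEnum


class EvaluatorCostTier(IntEnum):
    # Ordering matters for the <= comparison; the numeric values are assumed.
    TRIVIAL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# A few entries taken from the per-evaluator tables; the real map is larger.
COST_MAP = {
    "f1_score": EvaluatorCostTier.TRIVIAL,
    "task_completion": EvaluatorCostTier.LOW,
    "turn_coherence": EvaluatorCostTier.MEDIUM,
    "goal_tracking": EvaluatorCostTier.HIGH,
}


def tier_for(key: str) -> EvaluatorCostTier:
    """Look up an evaluator's tier; unknown keys fall back to MEDIUM."""
    return COST_MAP.get(key, EvaluatorCostTier.MEDIUM)


def is_within_budget(tier: EvaluatorCostTier, budget: EvaluatorCostTier) -> bool:
    """True when the evaluator's tier does not exceed the budget tier."""
    return tier <= budget


print(is_within_budget(tier_for("task_completion"), EvaluatorCostTier.LOW))   # True
print(is_within_budget(tier_for("goal_tracking"), EvaluatorCostTier.MEDIUM))  # False
print(is_within_budget(tier_for("not_registered"), EvaluatorCostTier.LOW))    # False
```

Because of the MEDIUM fallback, an evaluator that is missing from the map is dropped under a low budget and kept under a medium one.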