Agentic Benchmark — Cost Guidance

This document describes the cost-tier classification for every evaluator in the AgentEval agentic benchmark suite, explains how to use the --budget-tier CLI flag to control evaluation spend, and provides estimated costs per preset and use-case scenario.

Cost-Tier Definitions

The EvaluatorCostTier enum (in src/AgentEval.Abstractions/Evals/EvaluatorCostTier.cs) defines the following cost tiers:

Tier	Enum value	Description	Approx. cost per scenario
TRIVIAL	`EvaluatorCostTier.Trivial`	Pure-code computation — no LLM invocation, no API cost. Examples: F1Score, all Telemetry evaluators, JudgeQuality evaluators, CostQualityEfficiency.	~$0.000
LOW	`EvaluatorCostTier.Low`	Single short prompt to the judge model — one turn, minimal context. Examples: Relevance, TaskCompletion, VerbosityAppropriateness, DirectInjection.	~$0.005–$0.010
MEDIUM	`EvaluatorCostTier.Medium`	Single prompt with moderate context (tool call results, prior turn, plan text). Examples: TurnCoherence, IntermediateStepHallucination, SelfCorrectionQuality.	~$0.010–$0.050
HIGH	`EvaluatorCostTier.High`	Single prompt with large context (full conversation history at 10+ turns). Examples: MemoryRecallAccuracy, LongConversationCoherence, GoalTracking.	~$0.050–$0.200

Cost estimates assume a GPT-4o-class judge (approximately $0.0025 per 1K input tokens and $0.010 per 1K output tokens). Actual costs depend on your model, provider pricing, and prompt length.
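As a rough check on the per-scenario figures, the cost of one judge call follows directly from the token prices quoted above. The sketch below is illustrative Python and not part of AgentEval; the function name `judge_call_cost` and the token counts are assumptions chosen for the example, not measured values.

```python
# Illustrative only: prices are the GPT-4o-class figures quoted above.
INPUT_PRICE_PER_1K = 0.0025   # USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.010   # USD per 1K output tokens


def judge_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single judge call."""
    input_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost


# Assumed short prompt: 2,000 input tokens, 200 output tokens.
low_example = judge_call_cost(2_000, 200)
# Assumed long-history prompt: 30,000 input tokens, 500 output tokens.
high_example = judge_call_cost(30_000, 500)

print(f"short prompt: ${low_example:.4f}")
print(f"long prompt:  ${high_example:.4f}")
```

Under these assumed token counts, the short prompt falls inside the LOW band and the long prompt inside the HIGH band, which is the main reason context size drives the tier assignment.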

Per-Evaluator Cost-Tier Table

Phase 1 — System + Process

Evaluator key	Cost tier	Notes
`task_completion`	LOW	Single-turn LLM judge
`task_adherence`	MEDIUM	Composite of 5 sub-LLM judges (goal/rule/procedural/presentation/authorization)
`intent_identification`	LOW	Single-turn LLM judge
`intent_resolution`	MEDIUM	Composite of 2 sub-LLM judges (intent_identified + intent_resolved)
`task_navigation_efficiency`	LOW	Hybrid: deterministic edit-distance (free) + 1 LLM judge call
`tool_selection`	LOW	Single-turn LLM judge
`tool_input_accuracy`	LOW	Hybrid: schema check (free) + 1 LLM call
`tool_output_utilization`	LOW	Single-turn LLM judge
`tool_call_success`	TRIVIAL	Deterministic-first; LLM fallback only when status field absent
`tool_efficiency`	LOW	Single-turn LLM judge
`tool_call_accuracy`	MEDIUM	Composite of 5 sub-evaluators — total LLM calls multiply

Phase 2 — RAG / Quality

Evaluator key	Cost tier	Notes
`groundedness`	MEDIUM	Composite of 4 sub-LLM judges (claim_support / claim_contradicted / citation_accuracy / evidence_coverage)
`relevance`	LOW	Single-turn LLM judge
`coherence`	LOW	Single-turn LLM judge
`fluency`	LOW	Single-turn LLM judge
`similarity`	LOW	Single-turn LLM judge
`response_completeness`	LOW	Single-turn LLM judge
`f1_score`	TRIVIAL	Pure-code token overlap
`qa_composite`	HIGH	Composite of all 7 above — GroundednessEval alone fans out to 4 LLM judges; 10+ LLM calls per scenario

Phase 3 — Judge Quality (Meta)

Evaluator key	Cost tier	Notes
`judge_agreement`	TRIVIAL	Pure-code Cohen's kappa computation
`calibration_accuracy`	TRIVIAL	Pure-code accuracy computation
`judge_drift`	TRIVIAL	Pure-code max-delta computation

Phase 4 — Safety

Evaluator key	Cost tier	Notes
`prohibited_actions`	LOW	Hybrid deterministic policy-as-code + LLM fallback
`sensitive_data_leakage`	LOW	Hybrid regex scan + LLM fallback
`indirect_attack`	LOW	Single-turn LLM judge (XPIA)
`hate_unfairness`	LOW	Hybrid content-safety client + LLM fallback
`sexual`	LOW	Hybrid content-safety client + LLM fallback
`violence`	LOW	Hybrid content-safety client + LLM fallback
`self_harm`	LOW	Hybrid content-safety client + LLM fallback
`protected_material`	LOW	Single-turn LLM judge
`code_vulnerability`	LOW	Single-turn LLM judge
`ungrounded_attributes`	LOW	Single-turn LLM judge
`system_prompt_leakage`	LOW	Hybrid pattern scan + LLM fallback
`unsafe_tool_use`	MEDIUM	LLM-judge in v1 (deterministic short-circuit when no tool calls); v2 will add policy-driven short-circuit

Phase 5 — Telemetry + Stochastic Stability

Evaluator key	Cost tier	Notes
`latency`	TRIVIAL	Pure-code telemetry check
`token_usage`	TRIVIAL	Pure-code telemetry check
`cost`	TRIVIAL	Pure-code telemetry check
`error_rate`	TRIVIAL	Pure-code telemetry check
`retry_rate`	TRIVIAL	Pure-code telemetry check
`tool_latency`	TRIVIAL	Pure-code telemetry check
`stochastic_stability`	TRIVIAL	Pure-code statistical analysis over N prior runs

Phase 6 — Memory

Evaluator key	Cost tier	Notes
`memory_recall_accuracy`	HIGH	Full conversation history (10+ turns) injected into judge prompt
`long_conversation_coherence`	HIGH	Full conversation history (10+ turns) injected into judge prompt

Phase 6 — Multi-Turn

Evaluator key	Cost tier	Notes
`turn_coherence`	MEDIUM	Previous turn only (small context)
`goal_tracking`	HIGH	Full conversation history required
`clarification_appropriateness`	LOW	Single-turn; optional history

Phase 6 — Reasoning

Evaluator key	Cost tier	Notes
`reasoning_correctness`	MEDIUM	Single LLM call with reasoning trace context
`goal_decomposition_quality`	LOW	Single LLM call over goal + response
`plan_formulation_quality`	MEDIUM	Single LLM call over plan text + query
`intermediate_step_hallucination`	MEDIUM	LLM cross-check of claims vs. tool call results

Phase 6 — Calibration

Evaluator key	Cost tier	Notes
`confidence_calibration`	LOW	Single-turn LLM judge
`uncertainty_acknowledgment`	LOW	Single-turn LLM judge
`self_correction_quality`	MEDIUM	Original exchange + correction turn in judge context

Phase 6 — Adversarial

Evaluator key	Cost tier	Notes
`direct_injection`	LOW	Deterministic pattern scan (zero LLM cost); LLM only on match or nuanced case
`persona_attack`	LOW	Deterministic template scan (zero LLM cost); LLM only on match or nuanced case
`jailbreak_resistance`	MEDIUM	Scans combined library; up to N LLM calls per matched pattern
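
The "deterministic scan first, LLM only on match" pattern noted for direct_injection and persona_attack can be sketched as follows. This is a minimal Python illustration, not the AgentEval implementation: the patterns, the function name `evaluate_direct_injection`, and the judge callable are all hypothetical, and the "nuanced case" path is omitted.

```python
import re
from typing import Callable, Tuple

# Hypothetical patterns for illustration; the real pattern library is not
# shown in this document.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]


def evaluate_direct_injection(
    text: str, llm_judge: Callable[[str], bool]
) -> Tuple[bool, int]:
    """Return (passed, llm_calls); the judge runs only on a pattern match."""
    if not any(p.search(text) for p in INJECTION_PATTERNS):
        return True, 0           # deterministic pass: zero LLM cost
    return llm_judge(text), 1    # pattern matched: one paid judge call


def stub_judge(text: str) -> bool:
    """Stand-in for a real judge model; always reports a failure."""
    return False


print(evaluate_direct_injection("What is the weather today?", stub_judge))
print(evaluate_direct_injection("Please ignore previous instructions.", stub_judge))
```

Because most benign inputs never reach the judge, the average cost per scenario stays near the bottom of the LOW band even though a matched input costs a full judge call.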

Phase 6 — UX

Evaluator key	Cost tier	Notes
`verbosity_appropriateness`	LOW	Single-turn LLM judge
`tone_appropriateness`	LOW	Single-turn LLM judge
`refusal_quality`	LOW	Single-turn LLM judge; fast-pass when not a refusal

Phase 6 — Efficiency

Evaluator key	Cost tier	Notes
`cost_quality_efficiency`	TRIVIAL	Pure-code formula — no LLM call

Recommended Budget Per Use Case

Dev-loop iteration (`--budget-tier low`)

Use during active development for fast feedback without incurring significant API costs.

Keeps: all TRIVIAL + LOW tier evaluators
Removes: MEDIUM and HIGH tier evaluators
Conversational preset result after low filter: 1 component retained (clarification_appropriateness). Weight renormalised to 1.0.
Recommended presets for dev-loop: agentic-execution, rag-quality, safety, user-experience

agenteval bench agentic --preset agentic-execution --subject MyAgent --budget-tier low
agenteval bench agentic --preset user-experience --subject MyAgent --budget-tier low

PR build gate (`--budget-tier medium`)

Use in CI for pull request validation — balances speed and coverage.

Keeps: all TRIVIAL + LOW + MEDIUM tier evaluators
Removes: HIGH tier evaluators (full-history conversation evaluators)
Conversational preset result after medium filter: 2 components retained (turn_coherence, clarification_appropriateness); the three HIGH-tier evaluators (memory_recall_accuracy, long_conversation_coherence, goal_tracking) are removed and remaining weights renormalised.

agenteval bench agentic --preset conversational --subject MyAgent --budget-tier medium
agenteval bench agentic --preset reasoning --subject MyAgent --budget-tier medium
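
The conversational preset results described for the low and medium filters can be reproduced with a small model of the filter-and-renormalise step. This is a Python sketch under stated assumptions, not the CostFilteredCompositeBuilder code: the tiers come from the per-evaluator table, but the equal weights of 0.2 and the names `CostTier` and `filter_by_budget` are hypothetical.

```python
from enum import IntEnum


class CostTier(IntEnum):
    TRIVIAL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Tiers follow the per-evaluator table; the equal weights are hypothetical.
CONVERSATIONAL = {
    "memory_recall_accuracy": (CostTier.HIGH, 0.2),
    "long_conversation_coherence": (CostTier.HIGH, 0.2),
    "goal_tracking": (CostTier.HIGH, 0.2),
    "turn_coherence": (CostTier.MEDIUM, 0.2),
    "clarification_appropriateness": (CostTier.LOW, 0.2),
}


def filter_by_budget(components, budget):
    """Keep components within budget and rescale their weights to sum to 1."""
    kept = {key: w for key, (tier, w) in components.items() if tier <= budget}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {key: w / total for key, w in kept.items()}


low = filter_by_budget(CONVERSATIONAL, CostTier.LOW)
medium = filter_by_budget(CONVERSATIONAL, CostTier.MEDIUM)
print(low)     # one component, weight renormalised to 1.0
print(medium)  # two components, weights renormalised to sum to 1.0
```

Renormalising keeps the composite score on the same scale after filtering, but a score computed from one or two components is not directly comparable to a full-preset score.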

Release-gate audit (no `--budget-tier` flag, defaults to `all`)

Use for full coverage at release time or for scheduled quality audits.

All evaluators run regardless of cost tier.
No filtering applied.

agenteval bench agentic --preset conversational --subject MyAgent
agenteval bench agentic --preset adversarial-direct --subject MyAgent

Estimated Full-Calibration Costs Per Preset

Estimates assume a GPT-4o-class judge, 10 scenarios per evaluator, and average prompt lengths typical for each tier. Costs are approximate.

Preset	Components	Tiers	Est. cost per run (10 scenarios)
`agentic-execution`	6	LOW×6	~$0.30–$0.60
`tool-call-accuracy`	5 (via aggregate)	LOW×5	~$0.25–$0.50
`rag-quality`	7	TRIVIAL×1, LOW×5, HIGH×1	~$0.80–$2.00
`judge-quality`	3	TRIVIAL×3	~$0.00
`safety`	12	LOW×11, MEDIUM×1	~$0.60–$1.20
`telemetry`	6	TRIVIAL×6	~$0.00
`stochastic-stability`	1	TRIVIAL×1	~$0.00
`conversational`	5	HIGH×3, MEDIUM×1, LOW×1	~$1.50–$3.00
`reasoning`	4	MEDIUM×4	~$0.40–$2.00
`user-experience`	5	LOW×5	~$0.25–$0.50
`adversarial-direct`	3	MEDIUM×1, LOW×2	~$0.15–$0.60

The conversational preset is the most expensive because its three HIGH-tier evaluators (MemoryRecallAccuracy, LongConversationCoherence, GoalTracking) embed the full conversation history in every judge prompt. For dev-loop use, apply --budget-tier medium or --budget-tier low.
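
A simple way to sanity-check a row of the table is to multiply each tier's per-scenario band by the number of evaluators in that tier and the number of scenarios. The Python sketch below does this naive sum; the function name `estimate_run_cost` is hypothetical. It matches the agentic-execution and judge-quality rows, but the table's figures are approximate, so it will not reproduce every row.

```python
from typing import Dict, Tuple

# Per-scenario cost bands (USD) from the Cost-Tier Definitions table.
TIER_COST = {
    "TRIVIAL": (0.0, 0.0),
    "LOW": (0.005, 0.010),
    "MEDIUM": (0.010, 0.050),
    "HIGH": (0.050, 0.200),
}


def estimate_run_cost(
    tier_counts: Dict[str, int], scenarios: int = 10
) -> Tuple[float, float]:
    """Sum each tier's (low, high) band over evaluator and scenario counts."""
    low = sum(TIER_COST[t][0] * n * scenarios for t, n in tier_counts.items())
    high = sum(TIER_COST[t][1] * n * scenarios for t, n in tier_counts.items())
    return round(low, 2), round(high, 2)


agentic_execution = estimate_run_cost({"LOW": 6})
judge_quality = estimate_run_cost({"TRIVIAL": 3})
print(agentic_execution)
print(judge_quality)
```

The same function can be used to budget a custom evaluator mix or a larger scenario count before committing to a full run.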

Implementation Reference

Enum: AgentEval.Evals.EvaluatorCostTier in src/AgentEval.Abstractions/Evals/
Cost map: AgentEval.Evals.EvaluatorCostMap (static dictionary keyed by evaluator key — covers all shipped atomic evaluators plus the composites and aggregates that consumers reference directly)
Filter logic: AgentEval.Evals.Agentic.Composition.CostFilteredCompositeBuilder.FilterByBudget
CLI integration: --budget-tier flag on agenteval bench agentic

The EvaluatorCostMap.IsWithinBudget(tier, budget) method returns true when tier <= budget. Unknown evaluator keys default to Medium (conservative).
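
The comparison and the conservative default can be modelled in a few lines. This is a Python sketch of the behaviour described above, not the C# source: it assumes the tiers are ordered Trivial < Low < Medium < High, the map is a small excerpt of the per-evaluator table, and the helper name `tier_for` is hypothetical.

```python
from enum import IntEnum


class EvaluatorCostTier(IntEnum):
    # Assumed ordering, so that "tier <= budget" is a plain integer comparison.
    TRIVIAL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Small excerpt of the per-evaluator table, for illustration only.
COST_MAP = {
    "f1_score": EvaluatorCostTier.TRIVIAL,
    "task_completion": EvaluatorCostTier.LOW,
    "turn_coherence": EvaluatorCostTier.MEDIUM,
    "goal_tracking": EvaluatorCostTier.HIGH,
}


def tier_for(key: str) -> EvaluatorCostTier:
    """Look up an evaluator's tier; unknown keys default to MEDIUM."""
    return COST_MAP.get(key, EvaluatorCostTier.MEDIUM)


def is_within_budget(tier: EvaluatorCostTier, budget: EvaluatorCostTier) -> bool:
    """Return True when the evaluator's tier does not exceed the budget."""
    return tier <= budget


for key in ("f1_score", "goal_tracking", "my_custom_evaluator"):
    tier = tier_for(key)
    print(key, tier.name, is_within_budget(tier, EvaluatorCostTier.LOW))
```

Defaulting unknown keys to Medium means an unregistered evaluator is excluded from low-budget runs rather than silently adding judge cost, while still running under medium and unfiltered runs.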
