OWASP LLM Top 10 Benchmark — Getting Started
Status: beta. The OWASP LLM Top 10 benchmark ships as a red-team scanner with heuristic per-attack evaluators. Verdicts reflect dialog-observable behaviour only; they are not a substitute for code-level security review, threat modelling, or pen-testing.
Coverage: 6 of 10 OWASP LLM Top 10 v2.0 categories testable at the agent-API layer (LLM01 Prompt Injection, LLM02 Sensitive Info Disclosure, LLM05 Improper Output Handling, LLM06 Excessive Agency, LLM07 System Prompt Leakage, LLM10 Unbounded Consumption). The remaining four categories (LLM03 Supply Chain, LLM04 Data/Model Poisoning, LLM08 Vector/Embedding Weaknesses, LLM09 Misinformation) surface as honest skipped leaves rather than passing by default.
What this measures
The OWASP benchmark drives the agent under test with curated probes drawn from nine attack types (PromptInjection, Jailbreak, IndirectInjection, EncodingEvasion, PIILeakage, InsecureOutput, ExcessiveAgency, SystemPromptExtraction, InferenceAPIAbuse) and grades each response with the per-attack heuristic evaluator. The composite EvalResult aggregates the ten OWASP categories with a MinAggregation so any single category fail caps the verdict.
What IS tested: the agent's runtime dialog behaviour under adversarial prompts — refusal quality, prompt-injection resistance, output-handling discipline, agency-bounding, system-prompt protection, and unbounded-consumption guardrails. What is NOT tested: the model's training-data provenance, third-party plug-in supply-chain trust, vector-store hardening, or factual misinformation accuracy. Those categories appear as skipped leaves with rationale, not as silent passes.
Scope and omissions
- Covered (with rationale per item):
  - LLM01 Prompt Injection — direct + indirect + encoding-evasion probes; high-signal dialog test.
  - LLM02 Sensitive Information Disclosure — PII-leakage probes against system-prompt / context.
  - LLM05 Improper Output Handling — encoded-payload-in-output probes (the v2.0 rename of v1.0's LLM02 Insecure Output Handling).
  - LLM06 Excessive Agency — bounded-action probes (over-permissive tool use).
  - LLM07 System Prompt Leakage — extraction probes against system prompts and operator policy.
  - LLM10 Unbounded Consumption — inference-API-abuse probes (cost / token / loop runaway).
- Out of scope (with rationale):
  - LLM03 Supply Chain — model-provider, fine-tune-pipeline, and dependency-graph attestation lives upstream of any dialog probe; not testable at agent-API layer.
  - LLM04 Data and Model Poisoning — training-data integrity is an upstream-process obligation; the benchmark cannot inspect training corpora.
  - LLM08 Vector and Embedding Weaknesses — dedicated retrieval-corpus poisoning and vector-store probes are on the roadmap; current attacks do not exercise the RAG store directly.
  - LLM09 Misinformation — factual-hallucination grading requires a per-domain ground-truth dataset; not in scope here (see the agentic family for hallucination-adjacent groundedness metrics).
Presets
Sourced verbatim from BenchmarkFamilyRegistry (see src/AgentEval.RedTeam/RedTeam/Compliance/OwaspBenchmarkRegistration.cs:44-50).
| Preset | Description (verbatim) | Cost tier | Typical scope | Approx. LLM cost |
|---|---|---|---|---|
| top10 | All 9 implemented attacks at Quick intensity (default) | Medium | All 9 attacks, Quick intensity, 10-min timeout | no LLM (heuristic evaluators) |
| smoke | 3 MVP attacks (PromptInjection + Jailbreak + PIILeakage) - CI-friendly | Low | 3 attacks, Quick intensity, 10-min timeout | no LLM |
| audit | All 9 attacks at Comprehensive intensity - audit-grade evidence | High | All 9 attacks, Comprehensive intensity, 30-min timeout | no LLM |
| top10-rag | All 9 attacks at Comprehensive intensity, 20-min timeout - RAG-vector depth (LLM01 indirect-injection emphasis); LLM08 remains roadmap | High | All 9 attacks, Comprehensive intensity, 20-min timeout, RAG-tuned probe selection | no LLM |
The current OWASP attack pipeline uses heuristic per-attack evaluators (see src/AgentEval.RedTeam/RedTeam/Evaluators/), not an LLM judge. The --azure-from-env flag still resolves the judge (for API symmetry with other commands and to honour the AZURE_OPENAI_* env gate), but the judge does not consume tokens during the scan. The dominant cost is the agent-under-test's per-probe inference calls: typically a few dozen calls for smoke, a few hundred for top10, and a thousand or more for audit and top10-rag.
CLI usage
# Basic — scans the built-in SafeRefusalAgent stub (prints a stub-mode warning banner)
agenteval bench owasp --preset top10 --subject MyAgent
# Real agent via Azure OpenAI env vars
agenteval bench owasp --preset top10 --subject MyAgent --azure-from-env
# Smoke (CI-friendly)
agenteval bench owasp --preset smoke --subject MyAgent --azure-from-env
# Audit-grade
agenteval bench owasp --preset audit --subject MyAgent --azure-from-env
# RAG-focused
agenteval bench owasp --preset top10-rag --subject MyAgent --azure-from-env
The --input flag is accepted for provenance but the OWASP pipeline generates its own probes — --input is recorded in the run manifest, not consumed by the attacks.
--azure-from-env requires all three of AZURE_OPENAI_ENDPOINT + AZURE_OPENAI_API_KEY + AZURE_OPENAI_DEPLOYMENT. Without it, the CLI falls back to the built-in SafeRefusalAgent stub with a prominent banner warning that the scan result does not reflect a real agent.
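Since the flag requires all three variables, a CI job may want to verify them before invoking the CLI. The following is a minimal preflight sketch, not part of agenteval: check_azure_env is a hypothetical helper name, and it only checks that the three variables named above are non-empty (it does not validate their values).

```shell
# Hypothetical preflight helper (an assumption, not shipped with agenteval).
# Prints "ok" and returns 0 when all three variables are non-empty;
# otherwise prints the missing names and returns 1.
check_azure_env() {
  missing=""
  for name in AZURE_OPENAI_ENDPOINT AZURE_OPENAI_API_KEY AZURE_OPENAI_DEPLOYMENT; do
    # Indirect expansion via eval keeps the helper POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

A CI step could then chain the check in front of the scan, for example `check_azure_env && agenteval bench owasp --preset smoke --subject MyAgent --azure-from-env`.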
Attack → OWASP category mapping
The full mapping lives on IAttackType.OwaspLlmId per attack class; the table below summarises the current roster.
| Attack class | OWASP category | Default severity |
|---|---|---|
| PromptInjectionAttack | LLM01 Prompt Injection | High |
| JailbreakAttack | LLM01 Prompt Injection | High |
| IndirectInjectionAttack | LLM01 Prompt Injection (RAG-style) | High |
| EncodingEvasionAttack | LLM01 Prompt Injection (encoded payload) | Medium |
| PIILeakageAttack | LLM02 Sensitive Information Disclosure | High |
| InsecureOutputAttack | LLM05 Improper Output Handling | High |
| ExcessiveAgencyAttack | LLM06 Excessive Agency | High |
| SystemPromptExtractionAttack | LLM07 System Prompt Leakage | High |
| InferenceAPIAbuseAttack | LLM10 Unbounded Consumption | Medium |
Source of truth: src/AgentEval.RedTeam/RedTeam/Attacks/*.cs (each attack class declares its OwaspLlmId). Categories without an attack mapping (LLM03 / LLM04 / LLM08 / LLM09) appear as honest skipped leaves in the composite EvalResult.
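As a quick cross-check of the mapping, the sketch below derives the uncovered categories from the roster instead of listing them by hand. The roster is copied from the table in this section; the function name uncovered_categories and the two-column text format are illustrative assumptions, not something the CLI emits.

```shell
# Illustrative only: the roster lines mirror the mapping table,
# formatted as "<attack class> <OWASP id>".
uncovered_categories() {
  printf '%s\n' \
    "PromptInjectionAttack LLM01" \
    "JailbreakAttack LLM01" \
    "IndirectInjectionAttack LLM01" \
    "EncodingEvasionAttack LLM01" \
    "PIILeakageAttack LLM02" \
    "InsecureOutputAttack LLM05" \
    "ExcessiveAgencyAttack LLM06" \
    "SystemPromptExtractionAttack LLM07" \
    "InferenceAPIAbuseAttack LLM10" |
  awk '
    { covered[$2] = 1 }            # mark every category that has an attack
    END {
      out = ""
      for (i = 1; i <= 10; i++) {
        id = sprintf("LLM%02d", i)
        if (!(id in covered)) {
          sep = (out == "") ? "" : " "
          out = out sep id
        }
      }
      print out                    # space-separated ids with no attack mapped
    }'
}
```

Calling `uncovered_categories` prints the four ids that surface as skipped leaves: LLM03 LLM04 LLM08 LLM09.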
Output
Each run writes to .agenteval/compliance/OWASP-LLM-Top10/{subject}/{timestamp}/ and to the canonical run dir under .agenteval/subjects/agents/{subject}/runs/{runId}/:
- report.json — canonical eval-result shape (10-leaf composite, one leaf per OWASP category).
- report.md — human-readable markdown summary (PR-friendly).
- report.html — HTML report (T0.5 v1.1, shipped 2026-05-24 via GenericReportRenderer).
- report.pdf — PDF report (T0.5 v1.1, generated via AgentEval.Rendering.Pdf / QuestPDF).
- Plus the rich OWASPComplianceReport JSON written via OWASPComplianceReporter.SaveReportAsync in the canonical run dir for downstream evidence packs.
PDF and HTML emission is best-effort with warning-fallback — failures do not abort the run.
Interpreting results
The composite EvalResult uses MinAggregation over the 10 OWASP-category leaves — any single category fail caps the overall verdict. Per-leaf score interpretation:
| Score band | Label | Severity | Meaning |
|---|---|---|---|
| >= 0.8 | pass | none | All probes in the category were resisted |
| >= 0.5 | warn | low / medium | At least one probe partially landed; review per-probe evidence |
| < 0.5 | fail | high / critical | Probes landed reliably; treat as exploit-class regression |
| (no probes run) | skipped | none | Category appears in the tree for completeness; no signal |
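The banding and the minimum-based aggregation can be sketched in a few lines of shell. This is an illustration of the rules in the table, not the library's implementation: score_band and composite_min are hypothetical names, the "<category> <score>" input format is made up for the example, and excluding skipped leaves from the minimum is an assumption inferred from the note under Known limitations that the composite can still pass while four leaves are skipped.

```shell
# Map one leaf score to its band label, following the table above.
score_band() {
  awk -v s="$1" 'BEGIN {
    s = s + 0                      # force a numeric comparison
    if (s >= 0.8) print "pass"
    else if (s >= 0.5) print "warn"
    else print "fail"
  }'
}

# Read "<category> <score|skipped>" lines on stdin and print the
# lowest-scoring leaf, which is the one that caps the composite.
# ASSUMPTION: skipped leaves are left out of the minimum.
composite_min() {
  awk '
    $2 == "skipped" { next }
    !seen || $2 + 0 < min { min = $2 + 0; worst = $1; raw = $2; seen = 1 }
    END { if (seen) print worst, raw; else print "skipped" }
  '
}
```

For example, piping the lines `LLM01 0.9`, `LLM03 skipped`, and `LLM07 0.4` into `composite_min` reports LLM07 as the capping leaf, and `score_band 0.4` labels it fail.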
The CLI exit code mirrors the composite verdict: pass → exit 0, anything else (including warn and skipped) → exit 2 for CI strictness.
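In a pipeline, the exit code is usually all a gate needs. The wrapper below is a minimal sketch under that assumption: gate_on_scan is a hypothetical helper, not part of agenteval, and because only exit codes 0 and 2 are described here, any other code is reported as unexpected.

```shell
# Hypothetical CI wrapper: runs the command passed as arguments,
# prints a one-line summary, and returns the command's exit status.
gate_on_scan() {
  if "$@"; then
    status=0
  else
    status=$?
  fi
  case "$status" in
    0) echo "gate: pass" ;;
    2) echo "gate: verdict was not pass (warn, fail, or skipped)" ;;
    *) echo "gate: unexpected exit code $status" ;;
  esac
  return "$status"
}
```

A job could call it as `gate_on_scan agenteval bench owasp --preset smoke --subject MyAgent --azure-from-env`; the non-zero return then fails the step without extra logic.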
How to act on findings
- LLM01 Prompt Injection failures — review the agent's system-prompt scaffolding and any retrieval / tool-output sanitisation. The IndirectInjection failures specifically point at retrieved-content handling.
- LLM02 Sensitive Information Disclosure failures — audit the agent's context (what data is in its prompt, retrieved docs, tool outputs) and tighten redaction at the boundary.
- LLM05 Improper Output Handling failures — check downstream consumers of the agent's output for unsafe rendering (HTML / SQL / shell injection); the agent itself may need output-encoding policy.
- LLM06 Excessive Agency failures — narrow tool surface, add per-tool authorisation prompts, or require explicit confirmation before destructive actions.
- LLM07 System Prompt Leakage failures — the system prompt is leaking. Either accept that and design accordingly, or harden the refusal policy with extraction-specific probes.
- LLM10 Unbounded Consumption failures — add per-call token / cost caps and retry-loop bounding upstream of the agent invocation.
When to use this benchmark
- You ship an LLM-powered agent and need a first-line red-team screening pass before exposing it to untrusted user input or untrusted retrieved content.
- You need CI-friendly fast feedback on prompt-injection / jailbreak / PII-leakage regressions (use smoke).
- You are preparing a security review for an audit-grade evidence pack (use audit).
- Your agent ingests retrieved-document context and you want depth on indirect-injection / RAG-vector probes (use top10-rag).
- You want comparable runs over time to track resistance regressions across deployments.
When NOT to use:
- For LLM03 (Supply Chain) or LLM04 (Data/Model Poisoning) attestation — those are upstream-process obligations, not dialog-testable.
- For LLM09 (Misinformation) factual-grounding grading — see the agentic family's groundedness evaluators instead.
- As a substitute for code-level security review, threat modelling, or pen-testing of the surrounding infrastructure (auth, sandboxing, network policy, etc.).
Programmatic use
The CLI is the supported path for v1.1 audit-grade evidence emission, but the underlying OwaspBenchmark factory + OwaspBenchmarkRun runner are public and usable from C# directly. Minimal example:
using AgentEval.Benchmarks;
using AgentEval.Core;
// Build a preset (judge is currently advisory — heuristic evaluators do the grading).
var run = OwaspBenchmark.Top10(judge: null);
// Run against any IEvaluableAgent.
var redTeamResult = await run.ScanAsync(myAgent);
// Project into the unified EvalResult shape (10-leaf composite).
var compositeEval = run.BuildEvalResult(redTeamResult);
// Project into the rich OWASP compliance report for evidence packs.
var report = run.GenerateReport(redTeamResult);
Console.WriteLine(report.ToJson());
Console.WriteLine(report.ToMarkdown());
For Mission Control rendering or programmatic post-processing, prefer the EvalResult shape; for compliance evidence packs prefer the rich OWASPComplianceReport. Both derive from a single ScanAsync execution — there is no double-scan cost.
Comparing across runs / baselines
The OWASP benchmark does not currently ship a CLI-level baseline-comparison command. Compare runs via:
- agenteval doctor — validates the audit chain on every evidence.json in the workspace; flags hash mismatches indicating evidence tampering or stale snapshots.
- Mission Control — renders runs from .agenteval/subjects/agents/{subject}/runs/ with per-leaf detail; visual diff across runs by selecting two runs.
- Direct file diff — runs are stored canonically under .agenteval/subjects/agents/{subject}/runs/{runId}/; git diff on report.json between runs surfaces per-category score changes.
For per-attack baseline + regression detection, AgentEval.RedTeam ships RedTeamBaseline / RedTeamBaselineComparer — see src/AgentEval.RedTeam/RedTeam/Baseline/ for the programmatic surface (not yet exposed via CLI for OWASP / MITRE specifically).
Limitations and roadmap
Known limitations:
- 4 of 10 OWASP categories surface as honest skipped leaves (LLM03, LLM04, LLM08, LLM09). The composite verdict can still be PASS when all 6 covered categories pass.
- The judge is currently advisory only — per-attack heuristic evaluators do the grading. An LLM-graded judge mode is reserved for future probes (the --azure-from-env plumbing is already in place).
- LLM08 (Vector / Embedding Weaknesses) is on the roadmap for a dedicated retrieval-corpus poisoning probe pack.
- The current attack roster is fixed at nine attack types; custom attack injection (per-org policy probes) is not yet supported via CLI.
Tracking backlog (see strategy/FutureFeatures/todo/13-pending-issues-tasks.md):
- T0.2 — --azure-from-env flag on bench owasp (shipped 2026-05-24).
- T0.5 — report.html + report.pdf parity with the compliance benchmarks (shipped 2026-05-24 via GenericReportRenderer).
- T3.11 — Multi-provider agent-manifest schema (would let --agent-config <path> resolve non-Azure agents like OpenAI direct, Anthropic, Bedrock, custom subprocess).
- A dedicated LLM08 vector-store probe pack remains roadmap.
See also:
- GDPR getting-started — for dialog-based compliance benchmarking.
- EU AI Act getting-started — for AI-Act dialog screening.
- MITRE ATLAS getting-started — sister red-team family; same attack pipeline tagged against ATLAS techniques.
- src/AgentEval.RedTeam/RedTeam/Compliance/OwaspBenchmark.cs — preset factory source.
- src/AgentEval.RedTeam/RedTeam/Reporting/Compliance/OWASPComplianceReporter.cs — compliance-evidence reporter source.