[CVE-AGENT-BENCH]

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.

See benchmark results →

[BENCHMARKS]

15 agents benchmarked on 128 real vulnerabilities

Outcome: Pick the right agent before you deploy. See which ones produce fixes that pass.

Mechanism: 1,920 evaluations. Pass rates, cost per fix, and difficulty scores for every agent.

Proof: Best pass rate: 62.7%. Cheapest verified fix: $2.64.

Agent Rankings

Pass rate (primary metric)
Cost per verified fix
Difficulty-weighted performance
Trend over time

Cost Economics

Fixing a CVE with an agent costs $2.64–$52 per verified fix. Incident response costs thousands. Scale agent fixes before production instead.

Difficulty Scoring

Vulnerabilities range from trivial (syntax errors) to hard (architectural refactors). The difficulty score helps you gauge agent capability across different threat classes.
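One way to make difficulty-weighted performance concrete is sketched below. This is an illustration under stated assumptions, not the benchmark's own scoring: difficulty is taken here as the share of agents that failed a sample, and the function names and the `results` layout are hypothetical. The benchmark's own difficulty model may differ.

```python
def empirical_difficulty(results):
    """Map each sample to the share of agents that failed it (0.0 to 1.0).

    `results` is a hypothetical layout: agent name -> {sample_id: passed}.
    """
    agents = list(results)
    samples = results[agents[0]].keys()
    return {
        s: sum(not results[a][s] for a in agents) / len(agents)
        for s in samples
    }


def difficulty_weighted_score(agent_results, difficulty):
    """Share of total difficulty mass earned by the samples this agent passed."""
    total = sum(difficulty.values())
    if total == 0:
        return 0.0
    earned = sum(d for s, d in difficulty.items() if agent_results[s])
    return earned / total


# Toy data, not benchmark results.
results = {
    "agent-a": {"s1": True, "s2": True, "s3": False, "s4": False},
    "agent-b": {"s1": True, "s2": False, "s3": False, "s4": False},
    "agent-c": {"s1": True, "s2": True, "s3": True, "s4": False},
}

difficulty = empirical_difficulty(results)
scores = {a: difficulty_weighted_score(r, difficulty) for a, r in results.items()}
for agent, score in scores.items():
    print(f"{agent}: {score:.3f}")
```

With this weighting, passing a sample that every agent passes adds nothing to the score, so two agents with the same raw pass rate can rank differently.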

128 CVE samples

1,920 verified evaluations

15 agent configurations

62.7% best agent pass rate

Current test dataset

128 real bugs tested, 1,920 test runs across 15 agent configurations. Growing to 6,138+ vulnerabilities across 250+ production codebases.

CVE-Agent-Bench evaluates how well AI agents generate security patches for real vulnerabilities. This is not a toy benchmark: the bugs come from a curated dataset of real vulnerabilities in production codebases that teams actively maintain. Agents run in isolated containers, and patches are verified with automated tests.

How it works

Reproduce each bug with a known way to trigger it and a known-good fix.
Run each agent in an isolated environment.
Apply the agent's fix and check if the bug is gone.
Record pass/fail results and categorize failures.
Adjust scores for bug difficulty so results are fair.

The process is deterministic. Every agent receives the same environment, the same inputs, and the same constraints. We measure what agents actually do, not what they claim to do. If an agent fails to generate a fix, or generates a broken patch, or times out, all of those count as failures.
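The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark harness: the `Evaluation` record, its field names, and the outcome strings are hypothetical, and cost per verified fix is assumed to mean total spend divided by the number of passing runs.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    agent: str
    sample_id: str
    outcome: str      # "pass", or any failure label such as "no_patch", "broken_patch", "timeout"
    cost_usd: float   # API spend for this run (assumed field)


def pass_rate(evals):
    """Fraction of runs that passed; every non-pass outcome counts as a failure."""
    if not evals:
        return 0.0
    return sum(e.outcome == "pass" for e in evals) / len(evals)


def cost_per_verified_fix(evals):
    """Total spend, including failed runs, divided by passing runs (assumed definition)."""
    passes = sum(e.outcome == "pass" for e in evals)
    if passes == 0:
        return float("inf")
    return sum(e.cost_usd for e in evals) / passes


# Toy data, not benchmark results.
runs = [
    Evaluation("agent-a", "sample-1", "pass", 1.50),
    Evaluation("agent-a", "sample-2", "timeout", 2.00),
    Evaluation("agent-a", "sample-3", "broken_patch", 0.50),
    Evaluation("agent-a", "sample-4", "pass", 1.00),
]
print(f"pass rate: {pass_rate(runs):.1%}")
print(f"cost per verified fix: ${cost_per_verified_fix(runs):.2f}")
```

Charging failed runs to the numerator is a design choice in this sketch: an agent that burns budget on timeouts looks more expensive per fix, even if its passing runs are cheap.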

What's in the report

Agent leaderboard by pass rate and cost
Failure categories and why fixes fail
Guide for choosing the right agent and model
Fix examples and test results

The report goes beyond rankings. We document every failure mode: agents whose patches compile but do not fix the bug, agents that refuse to patch, and infrastructure failures. We analyze patch semantics to understand different fixing approaches, and we map which agents agree and disagree on the same bugs.

Why this matters

Engineering leaders need proof before scaling AI fixes to hundreds of developers. Security leaders need audit-ready evidence. XOR delivers independent, tested results that both teams can trust.

Most agent benchmarks measure generic coding tasks. This one measures security patching specifically, and the skills are different. An agent with a high pass rate on general coding tasks can still fail when the fix depends on security context. We test what matters for your pipeline.

How to use this data

For RLHF / DPO training

Each evaluation is a labeled example (+1 pass, 0 fail, -1 build-fail). Use difficulty scores for curriculum ordering.
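A rough sketch of how those labels could become preference pairs, ordered easiest first. The record keys (`sample_id`, `patch`, `outcome`), the outcome strings, and the pairing rule (every passing patch against every non-passing patch on the same sample) are assumptions made for illustration; the published dataset schema may differ.

```python
from collections import defaultdict
from itertools import product

# Label scheme from the text: +1 pass, 0 fail, -1 build-fail.
LABELS = {"pass": 1, "fail": 0, "build_fail": -1}


def build_preference_pairs(evaluations, difficulty):
    """Pair each passing patch with each non-passing patch for the same sample.

    Pairs are sorted by sample difficulty, lowest first, as a simple curriculum.
    """
    by_sample = defaultdict(list)
    for ev in evaluations:
        by_sample[ev["sample_id"]].append(ev)

    pairs = []
    for sample_id, evs in by_sample.items():
        chosen = [e for e in evs if LABELS[e["outcome"]] == 1]
        rejected = [e for e in evs if LABELS[e["outcome"]] < 1]
        for c, r in product(chosen, rejected):
            pairs.append(
                {"sample_id": sample_id, "chosen": c["patch"], "rejected": r["patch"]}
            )

    pairs.sort(key=lambda p: difficulty[p["sample_id"]])
    return pairs


# Toy data, not benchmark records.
evaluations = [
    {"sample_id": "s1", "patch": "A", "outcome": "pass"},
    {"sample_id": "s1", "patch": "B", "outcome": "fail"},
    {"sample_id": "s2", "patch": "C", "outcome": "pass"},
    {"sample_id": "s2", "patch": "D", "outcome": "build_fail"},
    {"sample_id": "s2", "patch": "E", "outcome": "fail"},
]
difficulty = {"s1": 0.8, "s2": 0.2}

pairs = build_preference_pairs(evaluations, difficulty)
for pair in pairs:
    print(pair)
```

Samples where every agent passed or every agent failed produce no pairs under this rule, so they drop out of the training set.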

For benchmarking your agent

Run your agent on the same 128 CVEs. Log results to W&B Weave. Compare against 15 baselines.

For pre-training data

1,920 labeled vulnerability-patching examples across 40 production codebases. Patches are surgical: 74% are 10 lines or fewer.

For research

Empirical difficulty calibration, cross-agent agreement (kappa), behavioral trajectory clusters, ensemble analysis.
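As an example of the agreement analysis, Cohen's kappa for two agents can be computed directly from their pass/fail vectors over the same samples. The function below is a plain-Python sketch of the standard two-rater formula; the input vectors are toy data, and whether the report uses this exact variant of kappa is an assumption.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of booleans (pass/fail per sample)."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # pass rate of the first agent
    p_b = sum(b) / n  # pass rate of the second agent
    # Agreement expected by chance if the two agents were independent.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both agents are constant and identical
    return (observed - expected) / (1 - expected)


# Toy data: outcomes for two agents on the same eight samples.
agent_x = [True, True, True, False, False, True, False, True]
agent_y = [True, True, False, False, False, True, True, True]

kappa = cohens_kappa(agent_x, agent_y)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 means two agents pass and fail the same samples; a value near 0 means they agree no more than chance would predict, which is where an ensemble has the most to gain.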

Browse the data

[INTERACTIVE]

Benchmark Explorer

Charts, heatmaps, agent trajectories, and W&B integration.

Results & rankings
Economics overview
Agent profiles
Bug complexity
Agent strategies
Execution metrics
Cost analysis
Validation process
Pricing transparency
DPO Training Pairs
Dataset Schema
W&B Weave Integration

Full benchmark report

Enter your email below to access agent configurations, patch examples, failure analysis, and full methodology.

Agent Configurations

128 real bugs tested across 15 agent configurations. Growing to 6,138+ vulnerabilities. Each agent runs in an isolated container with automated safety checks. See how verification works.


Agent	Model	Pass rate
codex	gpt-5.2	62.7%
cursor	opus-4.6	62.5%
claude	claude-opus-4-6	61.6%
gemini31	gemini-3.1-pro-preview	58.7%
opencode	gemini-gemini-3.1-pro-preview	54.9%
cursor	gpt-5.2	51.6%
opencode	gpt-5.2	51.6%
cursor	gpt-5.3-codex	50.4%
codex	gpt-5.2-codex	49.2%
opencode	claude-opus-4-6	47.5%
claude	claude-opus-4-5	45.7%
cursor	composer-1.5	45.2%
gemini	gemini-3-pro-preview	43%
opencode	gpt-5.2-codex	37.8%
opencode	claude-opus-4-5	36.8%

[SAMPLE VIEW]

Agent: opencode-o3  │  CVE-2024-XXXXX  │  OpenSSL
──────────────────────────────────────────────
Step 1: Clone repository            [2.1s]
Step 2: Reproduce vulnerability     [4.7s]  ← test case triggers crash
Step 3: Analyze root cause          [8.3s]  ← bounds check missing
Step 4: Generate patch              [3.2s]  ← adds size validation
Step 5: Verify fix (safety check)   [5.1s]  ← test case no longer crashes
──────────────────────────────────────────────
Result: PASS  │  Time: 23.4s  │  IRT Score: 0.73

Sample Distribution

128 evaluated / 6,138+ dataset

Current evaluation: 128 healthchecked samples across 27 codebases. Full dataset: 6,138+ vulnerabilities across 250+ codebases.

Codebase	Language
text-shaping/engine	C++
archive-library/handler	C
git-library/core	C
image-processor/raw-decoder	C++
industrial-protocol/opc-ua	C
network-switch/ovs	C
data-processing/arrow	C++
js-engine/runtime	C
cryptocurrency/node	C++
data-compressor/c-codec	C
disassembler/engine	C
embedded-server/networking	C
analytics-db/engine	C++
coverage-tool/engine	C++
3d-codec/decoder	C++
serialization/buffers	C++
rpc-framework/rpc	C++
image-codec/jxl	C++
sip-server/proxy	C
mesh-networking/thread	C++
language-runtime/cpython	C
reverse-engineering/framework	C
unicode-processing/simd	C++
unicode-support/icu	C++
system-utilities/core	C
malware-detection/rules	C
statistics/reader	C

Patch Examples

Real vulnerability fixes from CVE-Agent-Bench samples. Each patch is CI-verified.

[file]

[VULNERABILITY]

Uninitialized memory read in regex match buffer. The `pmatch` array was allocated but not zeroed, causing memory safety checks to flag undefined behavior on partial match paths.

[FIX]

Added `memset(pmatch, 0, sizeof(regmatch_t) * nmatch)` before the regex match call to initialize the buffer.

[VERIFY]

Safety check passes: no uninitialized memory access. Regression tests unchanged.

[packet analyzer]

[VULNERABILITY]

Out-of-bounds read in DOF protocol dissector (`packet-dof.c`). Insufficient bounds checking on packet length allowed reading past buffer end.

[FIX]

Added bounds check before accessing packet data to verify remaining buffer length covers the expected field size.

[VERIFY]

Safety check passes: no out-of-bounds read. Existing dissector tests pass.

[text shaping]

[VULNERABILITY]

Buffer overflow in OpenType layout table processing. Font shaping with malformed GPOS/GSUB tables triggered writes past allocated buffer.

[FIX]

Added length validation on subtable offsets before processing, rejecting malformed tables early.

[VERIFY]

Safety check passes: no buffer overflow. Text shaping test suite passes.

Failure Taxonomy

10 layers

[AGENT] Agent capability issues

L1 Infrastructure failures - Empty patches, missing files, agent produced no output.
L2 Agent behavior failures - Code reformatting, wrong file location, partial patch.
L3 Vulnerability understanding failures - Agent misunderstood root cause, fixed wrong issue.

[BUILD] Build and verification issues

L4 Build environment failures - Syntax errors, missing includes, incompatible types.
L5 Verification failures - Build succeeds but safety check still fires, crash not fixed.

[INFRA] Infrastructure and timeout issues

L6 Trajectory errors - Billing failures, rate limits, authentication errors during agent run.
L7 Timeout subcategories - Context window exhaustion, reasoning loops, agent exceeded time limit.

[SYSTEM] Systemic and composite patterns

L8 Convergent failure patterns - All agents produce empty patch, all agents fail same sample.
L9 Cloud Run job status - Job scheduling failures, resource limits, container crashes.
L10 Composite diagnostic scoring - Aggregate across layers to classify overall failure mode.

[WHERE PATCHES FAIL]

1,920 total attempts

978 failed (50.9%) · 942 passed (49.1%)

L1 Infrastructure failures: 213
L2 Agent behavior failures: 178
L3 Vulnerability understanding failures: 237
L4 Build environment failures: 142
L5 Verification failures: 113
L6 Trajectory errors
L7 Timeout subcategories
L8 Convergent failure patterns
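The published counts can be turned into shares of all failures with a short calculation. The sketch below uses only the five layer counts shown on this page and assumes each failed attempt is counted in exactly one layer, which the page does not state. The remainder is reported as unattributed, not assigned to any layer.

```python
TOTAL_ATTEMPTS = 1920
FAILED = 978

# Counts as published for layers L1 to L5; no counts are shown for the other layers.
published = {
    "L1 Infrastructure failures": 213,
    "L2 Agent behavior failures": 178,
    "L3 Vulnerability understanding failures": 237,
    "L4 Build environment failures": 142,
    "L5 Verification failures": 113,
}


def failure_shares(counts, failed):
    """Share of all failed attempts that falls in each layer."""
    return {layer: n / failed for layer, n in counts.items()}


shares = failure_shares(published, FAILED)
for layer, share in shares.items():
    print(f"{layer}: {share:.1%} of failures")

# Failures not covered by the five published counts (assumes no double counting).
unattributed = FAILED - sum(published.values())
print(f"Not attributed to L1-L5: {unattributed} ({unattributed / FAILED:.1%})")
print(f"Overall failure rate: {FAILED / TOTAL_ATTEMPTS:.1%}")
```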

FAQ

Which agent has the highest pass rate?

Codex GPT-5.2 at 62.7% on the CVE benchmark dataset. See the full rankings with cost breakdowns.

How much does it cost to fix a vulnerability?

Costs range from $2.64 to $52 per verified fix, depending on agent and model. Pre-production fixing via agents is 100x cheaper than incident response.

Are these costs real or estimates?

Real. Calculated from 1,920 evaluations at actual API costs (no rounding, no statistical assumptions).

Do pass rates change?

Yes. As new models ship, the benchmark is updated. We re-run tests regularly so rankings stay current.

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.