
Open benchmark for AI vulnerability patching — 128 real CVEs, 15 agents, reproducible results.

[CVE-AGENT-BENCH]

Benchmark Explorer

CVE-Agent-Bench tests whether AI coding agents can fix real CVE vulnerabilities in production codebases. 1,920 evaluations. 128 CVEs. 15 agents. Best pass rate: 62.7%. Cheapest fix: $2.64.

How each CVE is verified:

1. Reproduce

Run the trigger against unpatched code. Confirm it crashes.

2. Patch

Apply the agent's git diff inside a Docker container.

3. Build

Compile with the verifier toolchain and memory safety instrumentation.

4. Verify

Re-run the same trigger. If no crash → [PASS]. Still crashes → [FAIL].
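The decision logic behind these four steps can be sketched as follows (the real verifier runs inside Docker; this only models how the step results map to an outcome label):

```python
def verdict(crashes_unpatched: bool, builds: bool, crashes_patched: bool) -> str:
    """Map the reproduce/patch/build/verify steps to an outcome label."""
    # Reproduce: the trigger must crash the unpatched build; otherwise
    # the environment is broken and the evaluation is excluded.
    if not crashes_unpatched:
        return "infra"
    # Build: a patch that does not compile is a build failure.
    if not builds:
        return "build"
    # Verify: re-run the same trigger against the patched binary.
    return "pass" if not crashes_patched else "fail"
```
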

Scoring: Pass = +1 (trigger no longer crashes). Fail = 0 (still crashes). Build = -1 (patch doesn't compile). Infra = excluded.
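Under this scoring rule, pass rate can be computed over a list of outcome labels like so (a sketch; infra failures are dropped before dividing, matching the exclusion above):

```python
SCORES = {"pass": 1, "fail": 0, "build": -1}  # "infra" is excluded entirely

def pass_rate(outcomes: list[str]) -> float:
    """Fraction of non-infra evaluations whose trigger no longer crashes."""
    scored = [o for o in outcomes if o in SCORES]
    return sum(1 for o in scored if o == "pass") / len(scored)
```

For example, `pass_rate(["pass", "fail", "infra", "pass"])` drops the infra run and returns 2/3.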

128
CVE samples
1,920
Verified evaluations
15
Agent configurations
62.7%
Best agent pass rate
[CVE-AGENT-BENCH LEADERBOARD]
gpt-5.2 · 62.7%
cursor-opus-4.6 · 62.5%
claude-opus-4-6 · 61.6%
gemini31-gemini-3.1-pro-preview · 58.7%
opencode-gemini-3.1-pro-preview · 54.9%
cursor-gpt-5.2 · 51.6%
oc/gpt-5.2 · 51.6%
cursor-gpt-5.3-codex · 50.4%
gpt-5.2-codex · 49.2%
oc/claude-opus-4-6 · 47.5%
claude-opus-4-5 · 45.7%
cursor-composer-1.5 · 45.2%
gemini-3-pro-preview · 43.0%
oc/gpt-5.2-codex · 37.8%
oc/claude-opus-4-5 · 36.8%
Current verified dataset: 1,920 evaluations · 128 CVE samples · 15 agents · Target: 6,138+ vulnerabilities

Agent names follow the pattern harness/model. The same LLM run through different coding agents produces different patch quality. For example, claude/opus-4-5 is Claude Opus 4.5 through Anthropic's Claude Code, while opencode/claude-opus-4-5 is the same model through the OpenCode harness. Same model, different harness, different results: the benchmark measures the harness, not just the model.
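Given the harness/model naming convention, splitting a configuration name is straightforward (a sketch; treating names without a slash as the vendor's native CLI is an assumption, not something the benchmark specifies):

```python
def split_agent_name(name: str) -> tuple[str, str]:
    """Split a 'harness/model' configuration name into its two parts."""
    if "/" in name:
        harness, model = name.split("/", 1)
        return harness, model
    # Assumption: a bare model name means the vendor's own CLI harness.
    return "native", name
```
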

[HOW TO USE THIS DATA]
For RLHF / DPO training

Each evaluation is a labeled example: +1 (pass), 0 (fail), -1 (build-fail). Use difficulty scores for curriculum ordering.
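One way to turn these labels into a curriculum ordering (a sketch; the per-record `difficulty` field shown here is assumed to accompany each evaluation, and infra runs are dropped because they carry no learning signal):

```python
def curriculum(records: list[dict]) -> list[dict]:
    """Order labeled evaluations easiest-first for curriculum training."""
    usable = [r for r in records if r["outcome"] != "infra"]
    return sorted(usable, key=lambda r: r["difficulty"])
```
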

For benchmarking your agent

Run your agent on the same 128 CVEs. Log results to W&B Weave. Compare against 15 baselines.

For pre-training data

1,920 labeled vulnerability-patching examples across 40 production codebases. Patches are surgical — 74% are 10 lines or fewer.
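A quick way to check the "surgical" property (74% of patches touch 10 lines or fewer) is to count changed lines in each unified diff. A simple heuristic sketch, ignoring the `+++`/`---` file headers:

```python
def changed_lines(diff: str) -> int:
    """Count added plus removed lines in a unified diff."""
    return sum(
        1 for line in diff.splitlines()
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))
    )
```
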

For research

IRT difficulty calibration, cross-agent agreement (kappa), behavioral trajectory clusters, ensemble analysis.
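Cross-agent agreement on the same samples can be measured with Cohen's kappa, sketched here for two agents' parallel outcome vectors:

```python
def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa between two agents' outcomes on the same samples."""
    n = len(a)
    # Observed agreement: fraction of samples with identical outcomes.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence of the two agents.
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_e == 1:
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```
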

[EVALUATION FACTORY]

Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.

Generate

AI agents generate patches for CVE samples

128 samples

Reproduce

Verify POC and patch correctness

1,920 evals

Verify

Test patches against test suites

50.5% pass rate
[AGENT TRAJECTORIES]

Real agent sessions from CVE-Agent-Bench. Watch how different agents approach the same vulnerability.

Speed-runner: arrow #20123

Claude Opus 4.5 fixes a null-check bug in 3 tool calls and 19 seconds. Grep → Read → Edit pattern.

Tools
3
Tokens
8.4k
Duration
19s
[PASS]
Grep: Search for FieldFromFlatbuffer function to locate the vulnerable code
Found in src/arrow/extension_array_builder.cc at line 156
Read: Read src/arrow/flatbuffer.cc to understand the null-check issue
Files: src/arrow/flatbuffer.cc
Code shows field->name() called without null check. Missing safety guard.
Edit: Add null check before field->name() call to prevent crash
Files: src/arrow/flatbuffer.cc
[PASS] Patch applied. Tests pass. Null-check fix prevents crash.
[PASS] 5 added / 1 removed
[Files]
src/arrow/flatbuffer.cc
[VERIFICATION PIPELINE]
Step 1: Pull Image
2.10s
[verify] Pulling benchmark image
verifier:20123 pulled successfully
Step 2: Write Patch
0.30s
[verify] Writing patch to /tmp/agent-patch.diff
Patch written (4 bytes)
Step 3: Apply Patch
1.60s
[verify] Applying patch to source
git apply /tmp/agent-patch.diff
Patch applied cleanly
Step 4: Build
39.10s
[verify] Building with verifier toolchain
compile with memory safety instrumentation
Linking verification runtime
Build completed successfully
Step 5: Run Trigger
1.67s
[verify] Running trigger against patched binary
verify /tmp/trigger
Reading 2 bytes from trigger input
Compression error. Error code: -6
Execution successful
[PASS]
Vulnerability fixed. Trigger no longer crashes.
[VERIFIABLE]

This session conforms to the IETF Verifiable Agent Conversation Record format. The data structure maps to the VAC entry types (tool-call, tool-result, message) and could be wrapped in a COSE_Sign1 envelope for cryptographic non-repudiation.

→ draft-birkholz-verifiable-agent-conversations

[SAMPLE EXPLORER]

136 CVE samples × 15 agents. Each cell is one evaluation. Sorted by difficulty (easiest top) and pass rate (best left).

Sample (136)
GPT5.2
CsrC4.6
C4.6
Gem3.1
OC-Gem3.1
CsrGPT
OC-GPT5.2
Csr5.3
GPT5.2C
OC-C4.6
C4.5
Csr1.5
Gem3
OC-GPT5.2C
OC-C4.5
text-shaping/text-shaping #10899
text-shaping/text-shaping #11001
git-library/git-library #11167
network-switch/network-switch #10796
packet-analyzer/packet-analyzer #1237
image-processor/image-processor #11429
text-shaping/text-shaping #11033
text-shaping/text-shaping #11081
text-shaping/text-shaping #11263
text-shaping/text-shaping #11290
archive-library/archive-library #11196
git-library/git-library #10999
git-library/git-library #11382
mesh-networking/mesh-networking #11376
mesh-networking/mesh-networking #14821
network-switch/network-switch #10710
crypto-library/crypto-library #10628
packet-analyzer/packet-analyzer #1236
text-shaping/text-shaping #11522
mesh-networking/mesh-networking #12589
data-framework/data-framework #24101
image-processor/image-processor #11078
spell-checker/spell-checker #16531
rpc-framework/rpc-framework #7188
text-shaping/text-shaping #10948
text-shaping/text-shaping #11305
archive-library/archive-library #13435
archive-library/archive-library #15431
git-library/git-library #11173
sip-server/sip-server #53080
linux-utils/linux-utils #53149
file-identifier/file-identifier #13222
archive-library/archive-library #12817
sip-server/sip-server #52204
data-framework/data-framework #20116
text-shaping/text-shaping #11351
text-shaping/text-shaping #11367
archive-library/archive-library #38751
opcua-library/opcua-library #11484
network-switch/network-switch #11408
data-framework/data-framework #57209
text-shaping/text-shaping #11060
text-shaping/text-shaping #12241
archive-library/archive-library #11011
git-library/git-library #11004
sip-server/sip-server #53397
embedded-server/embedded-server #53038
unicode-codec/unicode-codec #66063
pattern-matcher/pattern-matcher #12424
data-framework/data-framework #28750
text-shaping/text-shaping #12312
image-formats/image-formats #12818
archive-library/archive-library #14574
archive-library/archive-library #20459
network-switch/network-switch #12255
packet-analyzer/packet-analyzer #10162
data-compressor/data-compressor #50433
network-switch/network-switch #11160
embedded-server/embedded-server #53029
pgp-library/pgp-library #25386
data-framework/data-framework #20123
service-proxy/service-proxy #22137
json-parser/json-parser #18140
data-framework/data-framework #20113
data-compressor/data-compressor #24837
embedded-server/embedded-server #28474
metadata-library/metadata-library #45993
fs-utilities/fs-utilities #49679
data-compressor/data-compressor #30193
data-framework/data-framework #37888
text-shaping/text-shaping #10724
image-converter/image-converter #12193
image-codec/image-codec #42839
image-codec/image-codec #49277
image-pipeline/image-pipeline #26855
data-compressor/data-compressor #30253
image-codec/image-codec #35293
opcua-library/opcua-library #10676
geo-library/geo-library #10637
pgp-library/pgp-library #25388
network-switch/network-switch #10731
archive-library/archive-library #19509
git-library/git-library #11007
opcua-library/opcua-library #11435
python-runtime/python-runtime #58295
service-proxy/service-proxy #22080
service-proxy/service-proxy #25207
image-formats/image-formats #13016
mesh-networking/mesh-networking #12631
text-shaping/text-shaping #10081
archive-library/archive-library #15120
opcua-library/opcua-library #10604
chem-toolkit/chem-toolkit #36609
data-compressor/data-compressor #30761
crypto-node/crypto-node #34657
rpc-framework/rpc-framework #1847
rpc-framework/rpc-framework #47834
archive-library/archive-library #12466
unicode-codec/unicode-codec #57632
analytics-db/analytics-db #60890
image-converter/image-converter #13180
image-codec/image-codec #40396
chem-toolkit/chem-toolkit #42769
stat-reader/stat-reader #12662
disassembly-engine/disassembly-engine #12953
disassembly-engine/disassembly-engine #12957
disassembly-engine/disassembly-engine #12988
disassembly-engine/disassembly-engine #58789
disassembly-engine/disassembly-engine #8877
js-engine/js-engine #65386
js-engine/js-engine #65393
data-compressor/data-compressor #29287
service-proxy/service-proxy #26685
service-proxy/service-proxy #26834
service-proxy/service-proxy #28869
service-proxy/service-proxy #30618
service-proxy/service-proxy #32878
service-proxy/service-proxy #44850
file-identifier/file-identifier #1065
fuzz-engine/fuzz-engine #51072
mesh-compressor/mesh-compressor #37705
serial-library/serial-library #38778
text-shaping/text-shaping #10097
text-shaping/text-shaping #10953
text-shaping/text-shaping #12292
archive-library/archive-library #15278
cad-library/cad-library #54380
mesh-networking/mesh-networking #12536
geo-library/geo-library #11016
binary-analyzer/binary-analyzer #10222
binary-analyzer/binary-analyzer #11359
pgp-library/pgp-library #24528
pgp-library/pgp-library #24538
pgp-library/pgp-library #25292
i18n-library/i18n-library #65873
cpu-emulator/cpu-emulator #36552
Pass — vulnerability fixed
Fail — patch applied, tests still crash
Build — patch does not compile
Infra — environment failure (excluded)
Difficulty: easy · medium · hard · floor · ceiling

[W&B INTEGRATION]

Track your agent evaluations on Weights & Biases. View live results at wandb.ai/tobias_xor-xor/cve-bench

Import your evaluation results into W&B for centralized tracking.

# Pseudocode — implement this function for your agent

import wandb

def upload_to_wandb(results):
    with wandb.init(
        project="cve-bench",
        entity="tobias_xor-xor",
        job_type="evaluation"
    ) as run:
        run.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost
        })

upload_to_wandb(evaluation_results)

[DATA ACCESS]

Dataset access is gated. Request access and receive a download link within 24 hours.

Request Access

Get download link for full CVE-Agent-Bench dataset with evaluation metadata.

Request Dataset

Dataset Schema

Field         Type
sample_id     string
agent_model   string
outcome       pass | fail | build | infra
time_seconds  number
cost_usd      number
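A loader matching this schema might look like the following, assuming the dataset ships as JSON Lines with one evaluation per line (the actual download format may differ):

```python
import json

def load_evals(path: str) -> list[dict]:
    """Load evaluation records, dropping excluded infra failures."""
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return [r for r in rows if r["outcome"] != "infra"]
```
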

RLHF Reward Signal

Reward model weights: pass=+1, fail=-0.5, build=-0.75, infra=0 (excluded). Use for training agent policies.
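These weights translate directly into a reward function (a sketch; returning None for infra reflects exclusion from training rather than a zero reward):

```python
REWARDS = {"pass": 1.0, "fail": -0.5, "build": -0.75}

def reward(outcome: str):
    """Map an evaluation outcome to its RLHF reward, or None if excluded."""
    if outcome == "infra":
        return None  # excluded: environment failure, not agent behavior
    return REWARDS[outcome]
```
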

Run your agent against 128 CVEs

Download the dataset, log results to W&B Weave, and compare against 15 baselines. The current best hits 62.7%.