Open benchmark for AI vulnerability patching — 128 real CVEs, 15 agents, reproducible results.
Benchmark Explorer
CVE-Agent-Bench tests whether AI coding agents can fix real CVE vulnerabilities in production codebases. 1,920 evaluations. 128 CVEs. 15 agents. Best pass rate: 62.7%. Cheapest fix: $2.64.
How each CVE is verified:
1. Run the trigger against unpatched code. Confirm it crashes.
2. Apply the agent's git diff inside a Docker container.
3. Compile with the verifier toolchain and memory safety instrumentation.
4. Re-run the same trigger. If no crash → [PASS]. Still crashes → [FAIL].
Scoring: Pass = +1 (trigger no longer crashes). Fail = 0 (still crashes). Build = -1 (patch doesn't compile). Infra = excluded.
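The verification steps and scoring rule combine into one decision function. A minimal sketch; the function name and its boolean inputs are illustrative, not part of the benchmark tooling:

```python
def verify_and_score(crashes_unpatched, builds, crashes_patched):
    """Map the verification steps above to an (outcome, score) pair.
    Infra problems score None so they can be excluded from totals."""
    if not crashes_unpatched:
        return "infra", None   # POC didn't reproduce: excluded
    if not builds:
        return "build", -1     # patch doesn't compile
    if crashes_patched:
        return "fail", 0       # trigger still crashes
    return "pass", 1           # trigger no longer crashes

print(verify_and_score(True, True, False))  # → ('pass', 1)
```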
Agent names = harness/model. The same LLM through different coding agents produces different patch quality. For example: claude/opus-4-5 is Claude Opus 4.5 through Anthropic's Claude Code. opencode/claude-opus-4-5 is the same model through the OpenCode harness. Same model, different harness — different results. The harness is what we are measuring, not just the model.
Each evaluation is a labeled example: +1 (pass), 0 (fail), -1 (build-fail). Use difficulty scores for curriculum ordering.
Run your agent on the same 128 CVEs. Log results to W&B Weave. Compare against 15 baselines.
1,920 labeled vulnerability-patching examples across 40 production codebases. Patches are surgical — 74% are 10 lines or fewer.
IRT difficulty calibration, cross-agent agreement (kappa), behavioral trajectory clusters, ensemble analysis.
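Cross-agent agreement via Cohen's kappa can be computed directly from paired verdicts. A minimal sketch, assuming two equal-length lists of outcome labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """a, b: equal-length lists of labels (e.g. 'pass'/'fail') per sample."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(ca[l] / n * cb[l] / n for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(
    ["pass", "pass", "fail", "fail"],
    ["pass", "fail", "fail", "fail"],
))  # → 0.5
```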
[EVALUATION FACTORY]
Three-stage pipeline: generate patches, reproduce vulnerabilities, verify correctness.
Generate: AI agents generate patches for CVE samples.
Reproduce: verify the POC and patch correctness.
Patch: test patches against test suites.
Real agent sessions from CVE-Agent-Bench. Watch how different agents approach the same vulnerability.
Speed-runner: arrow #20123
Claude Opus 4.5 fixes a null-check bug in 3 tool calls and 19 seconds. Grep → Read → Edit pattern.
This session conforms to the IETF Verifiable Agent Conversation Record format. The data structure maps to the VAC entry types (tool-call, tool-result, message) and could be wrapped in a COSE_Sign1 envelope for cryptographic non-repudiation.
→ draft-birkholz-verifiable-agent-conversations
[SAMPLE EXPLORER]
128 CVE samples × 15 agents. Each cell is one evaluation. Sorted by difficulty (easiest at top) and pass rate (best at left).
[W&B INTEGRATION]
Track your agent evaluations on Weights & Biases. View live results at wandb.ai/tobias_xor-xor/cve-bench
Import your evaluation results into W&B for centralized tracking.
```python
# Pseudocode — implement these functions for your agent
import wandb

def upload_to_wandb(results):
    with wandb.init(
        project="cve-bench",
        entity="tobias_xor-xor",
        job_type="evaluation"
    ):
        wandb.log({
            "pass_rate": results.pass_rate,
            "total_evals": results.total,
            "cost_usd": results.cost
        })

upload_to_wandb(evaluation_results)
```

Run your agent against the benchmark and log results to W&B.
```python
# Pseudocode — implement these functions for your agent
def evaluate_agent(agent, samples):
    results = {
        "pass": 0,
        "fail": 0,
        "build": 0,
        "infra": 0
    }
    for sample in samples:
        outcome = run_agent(agent, sample)
        results[outcome] += 1
    return results

# Log to W&B with detailed metrics
results = evaluate_agent(my_agent, cve_samples)
wandb.log(results)
```

Expected schema for evaluation results.
```
{
  "agent_model": "string",
  "sample_id": "string",
  "outcome": "pass" | "fail" | "build" | "infra",
  "time_seconds": number,
  "cost_usd": number,
  "tokens_in": number,
  "tokens_out": number
}
```

[DATA ACCESS]
Dataset access is gated. Request access and receive download link within 24 hours.
Request Access
Get download link for full CVE-Agent-Bench dataset with evaluation metadata.
Request Dataset
Dataset Schema
| Field | Type |
|---|---|
| sample_id | string |
| agent_model | string |
| outcome | pass \| fail \| build \| infra |
| time_seconds | number |
| cost_usd | number |
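A minimal sanity check of one result row against this schema. The validator itself is illustrative, not part of the benchmark tooling:

```python
# Field names follow the dataset schema table above.
SCHEMA = {
    "sample_id": str, "agent_model": str,
    "time_seconds": (int, float), "cost_usd": (int, float),
}
OUTCOMES = {"pass", "fail", "build", "infra"}

def validate_row(row):
    """True if all typed fields check out and the outcome is a known label."""
    ok = all(isinstance(row[k], t) for k, t in SCHEMA.items())
    return ok and row["outcome"] in OUTCOMES

print(validate_row({
    "sample_id": "CVE-2021-1234", "agent_model": "claude/opus-4-5",
    "outcome": "pass", "time_seconds": 19.0, "cost_usd": 2.64,
}))  # → True
```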
RLHF Reward Signal
Reward model weights: pass=+1, fail=-0.5, build=-0.75, infra=0 (excluded). Use for training agent policies.
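These weights reduce to a simple lookup. A sketch, with infra outcomes dropped before averaging as the exclusion rule requires:

```python
# Weights as stated above; "infra" is excluded rather than weighted.
RLHF_REWARD = {"pass": 1.0, "fail": -0.5, "build": -0.75}

def batch_reward(outcomes):
    """Mean reward over a batch, with infra failures excluded."""
    rs = [RLHF_REWARD[o] for o in outcomes if o != "infra"]
    return sum(rs) / len(rs)

print(batch_reward(["pass", "pass", "fail", "build", "infra"]))  # → 0.1875
```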
Cost per fix vs. pass rate. Pareto frontier with 95% confidence intervals. Oracle set cover: the minimum set of agents needed to fix the maximum number of samples.
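The oracle set cover can be approximated with the standard greedy heuristic. A sketch assuming a mapping from agent name to the set of sample IDs it solved:

```python
def greedy_agent_cover(solved_by):
    """solved_by: {agent_name: set of solved sample_ids}.
    Greedily add the agent covering the most new samples until no gain."""
    covered, chosen = set(), []
    while True:
        agent, gain = max(
            ((a, len(s - covered)) for a, s in solved_by.items()),
            key=lambda t: t[1],
        )
        if gain == 0:
            break
        chosen.append(agent)
        covered |= solved_by[agent]
    return chosen, covered

chosen, covered = greedy_agent_cover({
    "a1": {1, 2, 3}, "a2": {3, 4}, "a3": {2, 3},
})
print(chosen, sorted(covered))  # → ['a1', 'a2'] [1, 2, 3, 4]
```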
Cost data: 2 of 15 agents have measured token costs (Claude native). All others use turn-count heuristic estimates.
Read full cost analysis →
Difficulty scored from observed pass rates across all agents (raw empirical measurement, not theoretical fitting). Samples categorized by patch type and source project.
DPO preference pairs for training. Gold = pass vs build-fail (strongest signal). Silver = pass vs test-fail. Bronze = test-fail vs build-fail. Ternary reward signal (+1/0/-1) and five-level distributions.
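The gold/silver/bronze pairing scheme can be sketched as follows. Field names follow the results schema; the `patch` field holding the agent's diff is an assumption for illustration:

```python
def build_dpo_pairs(evals):
    """evals: list of dicts with 'sample_id', 'outcome', 'patch'.
    Pair better vs worse outcomes per sample, tiered as described above:
    gold = pass vs build-fail, silver = pass vs test-fail,
    bronze = test-fail vs build-fail."""
    by_sample = {}
    for e in evals:
        by_sample.setdefault(e["sample_id"], []).append(e)
    pairs = []
    for es in by_sample.values():
        passes = [e for e in es if e["outcome"] == "pass"]
        fails = [e for e in es if e["outcome"] == "fail"]
        builds = [e for e in es if e["outcome"] == "build"]
        for p in passes:
            for b in builds:
                pairs.append({"chosen": p["patch"], "rejected": b["patch"], "tier": "gold"})
            for f in fails:
                pairs.append({"chosen": p["patch"], "rejected": f["patch"], "tier": "silver"})
        for f in fails:
            for b in builds:
                pairs.append({"chosen": f["patch"], "rejected": b["patch"], "tier": "bronze"})
    return pairs
```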
Reward Configuration

| Bonus | Value |
|---|---|
| Base Reward | 1 |
| Difficulty Bonus | +0.5 |
| Teamwork Bonus | +0.25 |
| Exploration Bonus | +0.1 |

Bonuses applied when agents solve difficult samples, contribute unique solutions, or explore novel reasoning paths.
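A sketch of how these bonuses might combine, assuming they stack additively on top of the base reward (the boolean flags are illustrative):

```python
BASE, DIFFICULTY, TEAMWORK, EXPLORATION = 1.0, 0.5, 0.25, 0.1

def shaped_reward(passed, hard=False, unique=False, novel_path=False):
    """Base reward plus the bonuses described above; failures earn nothing."""
    if not passed:
        return 0.0
    r = BASE
    if hard:
        r += DIFFICULTY   # solved a difficult sample
    if unique:
        r += TEAMWORK     # contributed a unique solution
    if novel_path:
        r += EXPLORATION  # explored a novel reasoning path
    return r

print(shaped_reward(True, hard=True, unique=True))  # → 1.75
```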
128 samples, +/-8.7pp 95% confidence intervals. Cohen's kappa for cross-agent agreement. The leading agents may be statistically indistinguishable.
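The quoted ±8.7pp half-width is what the normal approximation gives for a pass rate near 50% on 128 samples; a quick check:

```python
import math

def pass_rate_ci(passes, n, z=1.96):
    """Normal-approximation 95% CI half-width for a pass rate."""
    p = passes / n
    return z * math.sqrt(p * (1 - p) / n)

# At p = 0.5 with n = 128, the half-width matches the ±8.7pp quoted above.
print(round(pass_rate_ci(64, 128) * 100, 1))  # → 8.7 (percentage points)
```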
Read full methodology →
Run your agent against 128 CVEs
Download the dataset, log results to W&B Weave, and compare against 15 baselines. The current best hits 62.7%.