[EVALUATION]

Patch verification

XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.

See how verification works →

[PATCH + VERIFY]

Every fix tested against the vulnerability

Outcome: Confirm the agent's patch resolves the CVE before it ships.

Mechanism: XOR writes a verifier for the specific CVE, applies the agent's patch in an isolated environment, and re-runs the verifier. Pass or fail.

Proof: 1,736 evaluations with pass/fail evidence. Infrastructure failures are included in that total.

Verify the fix, not just the code

Finding bugs is solved. Confirming the fix works is not. XOR writes a verifier for each CVE, applies the agent's patch, and re-runs the verifier. Only confirmed fixes ship. Failures become learning signal for the next run.

How it works

Agent tools and permissions are checked before the run
Patch is applied in an isolated vulnerable environment
The verifier is re-run to confirm the CVE is resolved
A pass/fail report is attached to the PR
Results feed back into the agent harness for continuous learning

Fixes passed: 811
Fixes failed: 370
Build failures: 455
Infrastructure failures: 100
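As a quick arithmetic check, the four tallies above add up to the 1,736 evaluations quoted on this page. The sketch below uses only those reported counts; the dictionary keys are illustrative names, and the printed shares are fractions of all evaluations, which is not necessarily how the pass rate quoted elsewhere is defined.

```python
# Outcome tallies as reported on this page; key names are illustrative.
tallies = {
    "fixes_passed": 811,
    "fixes_failed": 370,
    "build_failures": 455,
    "infrastructure_failures": 100,
}

total = sum(tallies.values())
print(total)  # 1736

# Share of all evaluations per outcome (denominator: every run, infra included).
for name, count in tallies.items():
    print(f"{name}: {count / total:.1%}")
```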

How XOR checks every agent PR

The XOR GitHub App writes a verifier for each bug, applies the agent's fix, and runs safety checks on every PR. Each run produces a report with a pass or fail verdict.

No report, no merge.

How verification works

When a coding agent generates a fix for a bug, three things happen:

The fix is applied to an isolated container running the vulnerable code. The container is an exact reproduction of the original environment — same compiler flags, same dependencies, same OS.
A verifier runs against the fixed code. This is the test that checks whether the vulnerability still triggers. If it no longer triggers, the fix works.
Safety checks run automatically. Memory safety tools detect buffer overflows, use-after-free, and other issues the fix may have introduced or missed.

If the bug no longer triggers AND safety checks pass, the fix is verified. Everything else is a failure.

What counts as verified

Bug no longer triggers after the fix
No new memory safety issues introduced
Build passes in the reproduced environment

What fails

Bug still triggers after the fix
Safety checks find new issues
Build or infrastructure failures during testing
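The criteria above reduce to a single conjunction. The following is a minimal sketch of that rule, not XOR's implementation; `RunResult` and its field names are hypothetical and chosen only to mirror the three conditions listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one verification run."""
    build_ok: bool          # build passed in the reproduced environment
    bug_triggers: bool      # verifier still triggers the vulnerability
    new_safety_issues: int  # memory safety issues reported after the fix


def is_verified(result: RunResult) -> bool:
    # Verified only if all three conditions hold; everything else is a failure.
    return (
        result.build_ok
        and not result.bug_triggers
        and result.new_safety_issues == 0
    )


print(is_verified(RunResult(build_ok=True, bug_triggers=False, new_safety_issues=0)))  # True
print(is_verified(RunResult(build_ok=True, bug_triggers=False, new_safety_issues=1)))  # False
```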

What pass and fail look like

PASS — harfbuzz/harfbuzz#11033

$ docker run --rm xor-verify harfbuzz-11033

applying patch... 23 lines changed

building with safety checks...

running verifier... clean exit

exit 0 — fix verified ✓

FAIL — libarchive/libarchive#12466

$ docker run --rm xor-verify libarchive-12466

applying patch... 18 lines changed

building with safety checks...

ERROR: memory safety issue detected

exit 1 — bug still present ✗

BUILD FAIL — envoy/envoy#28190

$ docker run --rm xor-verify envoy-28190

applying patch... 41 lines changed

ERROR: compilation failed — missing include

exit 2 — patch does not compile ✗

Four possible outcomes

[PASS]

Bug no longer triggers. Safety checks pass. The fix works.

[FAIL]

Bug still triggers after the fix. The agent's code change didn't resolve the issue.

[BUILD]

Code doesn't compile. Missing files, syntax errors, type mismatches.

[INFRA]

Container timeout, sandbox error, network failure. Excluded from agent scoring.
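One way to picture the four outcomes is as a mapping from the verifier's exit code. The sample runs above show exit codes 0, 1, and 2; treating any other code as an infrastructure failure is an assumption made here for illustration, and the `Outcome` and `classify` names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    BUILD = "build"
    INFRA = "infra"


# Exit codes 0, 1, 2 are taken from the sample output on this page.
EXIT_CODES = {0: Outcome.PASS, 1: Outcome.FAIL, 2: Outcome.BUILD}


def classify(exit_code: int) -> Outcome:
    # Assumption: any exit code not shown in the samples is an infra failure.
    return EXIT_CODES.get(exit_code, Outcome.INFRA)


for code in (0, 1, 2, 137):
    print(code, classify(code).name)
```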

What gets rejected

Fixes with no bug reproduction
Runs with missing or unsigned audit logs
Build failures or infrastructure errors during testing
Agent tools that fail security checks
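The rejection rules above could be expressed as a gate that collects every reason a run is refused. This is a sketch under assumptions: `RunRecord`, its fields, and the outcome strings are hypothetical and stand in for whatever the real report contains.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one agent run."""
    bug_reproduced: bool
    audit_log_signed: bool       # False also covers a missing log
    outcome: str                 # "pass", "fail", "build", or "infra"
    tools_passed_security: bool


def rejection_reasons(run: RunRecord) -> List[str]:
    # One entry per rejection rule listed above; empty list means not rejected.
    reasons = []
    if not run.bug_reproduced:
        reasons.append("no bug reproduction")
    if not run.audit_log_signed:
        reasons.append("missing or unsigned audit log")
    if run.outcome in ("build", "infra"):
        reasons.append("build or infrastructure failure")
    if not run.tools_passed_security:
        reasons.append("agent tools failed security checks")
    return reasons


run = RunRecord(bug_reproduced=True, audit_log_signed=False,
                outcome="build", tools_passed_security=True)
print(rejection_reasons(run))
```

Returning all reasons rather than stopping at the first makes the report more useful to whoever debugs the run.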

Infrastructure failures are excluded

Infrastructure failures (timeouts, network errors, CI flakes) are logged for debugging but excluded from agent pass-rate calculations. This prevents environment instability from penalizing otherwise functional agents.
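A minimal sketch of that exclusion, assuming the pass rate is passes divided by all non-infrastructure runs (the page does not spell out the exact formula). The outcome labels and the sample list are hypothetical, not benchmark data.

```python
from typing import List, Optional


def pass_rate(outcomes: List[str]) -> Optional[float]:
    # Drop infrastructure failures before scoring; they stay in the logs only.
    scored = [o for o in outcomes if o != "infra"]
    if not scored:
        return None  # nothing left to score
    return sum(o == "pass" for o in scored) / len(scored)


# Hypothetical run history for one agent.
history = ["pass", "pass", "fail", "build", "infra"]
print(pass_rate(history))  # 0.5
```

Without the exclusion the same history would score 2 of 5 rather than 2 of 4, so a flaky environment would lower the agent's rate.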

CI integration

Verification runs as a GitHub Check. Install the XOR GitHub App. Every PR from a coding agent gets a pass/fail result with a link to the full test report.
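For readers unfamiliar with GitHub Checks, the sketch below builds a request body of the kind GitHub's "create a check run" REST endpoint accepts. The field names (`name`, `head_sha`, `status`, `conclusion`, `details_url`, `output`) are GitHub's; the check name, SHA, and report URL are placeholders, and nothing here is taken from XOR's actual integration.

```python
import json


def check_run_payload(head_sha: str, outcome: str, report_url: str) -> dict:
    # Only a verified fix is reported as success; everything else fails the check.
    conclusion = "success" if outcome == "pass" else "failure"
    return {
        "name": "xor-verify",  # hypothetical check name
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": conclusion,
        "details_url": report_url,
        "output": {
            "title": f"Verification: {outcome.upper()}",
            "summary": f"Full report: {report_url}",
        },
    }


# Placeholder SHA and URL for illustration only.
payload = check_run_payload("0123abc", "build", "https://example.com/report/1")
print(json.dumps(payload, indent=2))
```

Marking the check as required in branch protection is what would turn a `failure` conclusion into a blocked merge.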

[NEXT STEPS]

See verification results

Agent leaderboard →

Install GitHub App →

Agent safety →

FAQ

How does agent verification work?

Agents are wrapped in observation harnesses. When an agent writes a fix for a CVE, XOR applies the fix in an isolated environment and re-runs the verifier for the original vulnerability. If the vulnerability no longer triggers and safety checks pass, the fix is verified. Results are logged and attached to the PR.

What if the agent fix causes a regression?

Regressions are caught in the verification harness. The agent can see the regression and try again. Failed runs are primary learning signals that feed back into the agent training pipeline.

Which agents are compatible?

Any agent that writes code: Claude Code, Codex, Gemini CLI, Cursor, or custom agents with code generation. No lock-in. The GitHub App monitors the code change and runs verification automatically.

[RELATED TOPICS]

Automated vulnerability patching

AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,736 evaluations.

Agent Cost Economics

Fix vulnerabilities for $2.64–$87 with agents. 100x cheaper than incident response. Real cost data.

Agent Configurations

13 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.

Benchmark Methodology

How CVE-Agent-Bench evaluates 13 coding agents on 136 real vulnerabilities. Deterministic, reproducible, open methodology.

Agent Environment Security

AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.

Security Economics for Agentic Patching

ROI models backed by verified pass/fail data and business-impact triage.

Automated Vulnerability Patching and PR Review

Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.

Continuous Learning from Verified Agent Runs

A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.

Signed Compliance Evidence for AI Agents

A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.

Compliance Evidence and Standards Alignment

How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.

See which agents produce fixes that work

136 CVEs. 13 agents. 1,736 evaluations. Agents learn from every run.