Patch verification
XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.
Every fix tested against the vulnerability
Outcome: Confirm the agent's patch resolves the CVE before it ships.
Mechanism: XOR writes a verifier for the specific CVE, applies the agent's patch in an isolated environment, and re-runs the verifier. Pass or fail.
Proof: 1,736 evaluations with pass/fail evidence. Infrastructure failures count.
Verify the fix, not just the code
Finding bugs is solved. Confirming the fix works is not. XOR writes a verifier for each CVE, applies the agent's patch, and re-runs the verifier. Only confirmed fixes ship. Failures become learning signal for the next run.
How it works
- Agent tools and permissions are checked before the run
- Patch is applied in an isolated vulnerable environment
- The verifier is re-run to confirm the CVE is resolved
- A pass/fail report is attached to the PR
- Results feed back into the agent harness for continuous learning
How XOR checks every agent PR
The XOR GitHub App writes a verifier for each bug, applies the agent's fix, and runs safety checks on every PR. Each run produces a report with a pass or fail verdict.
No report, no merge.
How verification works
When a coding agent generates a fix for a bug, three things happen:
- The fix is applied to an isolated container running the vulnerable code. The container is an exact reproduction of the original environment: same compiler flags, same dependencies, same OS.
- A verifier runs against the fixed code. This is the test that checks whether the vulnerability still triggers. If it no longer triggers, the fix works.
- Safety checks run automatically. Memory safety tools detect buffer overflows, use-after-free, and other issues the fix may have introduced or missed.
If the bug no longer triggers AND safety checks pass, the fix is verified. Everything else is a failure.
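The rule above is a plain conjunction, which a minimal sketch can make explicit. The `VerifierRun` structure and its field names are hypothetical, chosen only for illustration; they are not XOR's actual data model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class VerifierRun:
    """Observed results of one verification run (hypothetical structure)."""
    bug_triggers: bool                    # did the verifier still trigger the vulnerability?
    safety_issues: Tuple[str, ...] = ()   # findings reported by memory safety tools


def is_verified(run: VerifierRun) -> bool:
    # Verified only when the bug no longer triggers AND no safety issue was found.
    # Every other combination is treated as a failure.
    return (not run.bug_triggers) and len(run.safety_issues) == 0
```

A run with a silent verifier but one reported safety finding still fails under this rule, which matches "everything else is a failure".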
What counts as verified
- Bug no longer triggers after the fix
- No new memory safety issues introduced
- Build passes in the reproduced environment
What fails
- Bug still triggers after the fix
- Safety checks find new issues
- Build or infrastructure failures during testing
What pass and fail look like
PASS — harfbuzz/harfbuzz#11033
$ docker run --rm xor-verify harfbuzz-11033
applying patch... 23 lines changed
building with safety checks...
running verifier... clean exit
exit 0 — fix verified ✓
FAIL — libarchive/libarchive#12466
$ docker run --rm xor-verify libarchive-12466
applying patch... 18 lines changed
building with safety checks...
ERROR: memory safety issue detected
exit 1 — bug still present ✗
BUILD FAIL — envoy/envoy#28190
$ docker run --rm xor-verify envoy-28190
applying patch... 41 lines changed
ERROR: compilation failed — missing include
exit 2 — patch does not compile ✗
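A caller scripting around these runs might translate the exit code into a label. The sketch below uses only the three codes that appear in the transcripts above (0, 1, 2); the function name is hypothetical, and any other code is reported as unknown rather than guessed.

```python
# Exit codes as they appear in the example transcripts above.
EXIT_CODE_MEANING = {
    0: "fix verified",
    1: "bug still present",
    2: "patch does not compile",
}


def describe_exit(code: int) -> str:
    """Return a human-readable label for a verifier exit code (hypothetical helper)."""
    # Codes not shown in the examples are deliberately left unmapped.
    return EXIT_CODE_MEANING.get(code, "unknown exit code")
```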
Four possible outcomes
[PASS]
Bug no longer triggers. Safety checks pass. The fix works.
[FAIL]
Bug still triggers after the fix. The agent's code change didn't resolve the issue.
[BUILD]
Code doesn't compile. Missing files, syntax errors, type mismatches.
[INFRA]
Container timeout, sandbox error, network failure. Excluded from agent scoring.
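The four outcomes can be read as a classification with a precedence order. The sketch below is one way to express that, not a description of XOR's internals: the order (infrastructure first, then build, then the verifier and safety checks) is an assumption, as are the function and parameter names.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    BUILD = "BUILD"
    INFRA = "INFRA"


def classify(infra_error: bool, build_ok: bool,
             bug_triggers: bool, safety_ok: bool) -> Outcome:
    """Map the signals from one run to an outcome (assumed precedence)."""
    if infra_error:
        # Timeouts, sandbox errors, network failures: nothing else is trustworthy.
        return Outcome.INFRA
    if not build_ok:
        # The patched code did not compile, so the verifier never ran.
        return Outcome.BUILD
    if bug_triggers or not safety_ok:
        # The bug still triggers, or safety checks found an issue.
        return Outcome.FAIL
    return Outcome.PASS
```

Checking infrastructure first keeps an environment problem from being recorded as a build or verifier failure.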
What gets rejected
- Fixes with no bug reproduction
- Runs with missing or unsigned audit logs
- Build failures or infrastructure errors during testing
- Agent tools that fail security checks
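The rejection criteria above amount to a gate that collects every reason a run cannot be accepted. A minimal sketch follows; the `RunRecord` fields are hypothetical names for the four conditions in the list, not a real XOR schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    """Facts about one agent run (hypothetical field names)."""
    bug_reproduced: bool
    audit_log_present: bool
    audit_log_signed: bool
    build_ok: bool
    infra_ok: bool
    tools_pass_security_checks: bool


def rejection_reasons(run: RunRecord) -> List[str]:
    """Return every reason the run is rejected; an empty list means none apply."""
    reasons = []
    if not run.bug_reproduced:
        reasons.append("no bug reproduction")
    if not (run.audit_log_present and run.audit_log_signed):
        reasons.append("missing or unsigned audit log")
    if not (run.build_ok and run.infra_ok):
        reasons.append("build failure or infrastructure error")
    if not run.tools_pass_security_checks:
        reasons.append("agent tools failed security checks")
    return reasons
```

Returning all reasons rather than stopping at the first gives the report a complete picture of why a run was turned away.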
Infrastructure failures are excluded
Infrastructure failures (timeouts, network errors, CI flakes) are logged for debugging but excluded from agent pass-rate calculations. This prevents environment instability from penalizing otherwise functional agents.
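As a worked illustration of the exclusion, the sketch below computes a pass rate over outcome labels while leaving INFRA runs out of the denominator. The function name is hypothetical, and counting BUILD results against the agent is an assumption based on the text naming only infrastructure failures as excluded.

```python
from collections import Counter
from typing import Iterable, Optional


def pass_rate(outcomes: Iterable[str]) -> Optional[float]:
    """Fraction of scored runs that passed, ignoring INFRA runs."""
    counts = Counter(outcomes)
    # INFRA runs are logged elsewhere but do not count for or against the agent.
    scored = sum(n for label, n in counts.items() if label != "INFRA")
    if scored == 0:
        return None  # nothing to score, so no rate is reported
    return counts["PASS"] / scored
```

For example, two PASS, one FAIL, one BUILD, and one INFRA run give 2 passes out of 4 scored runs, not out of 5.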
CI integration
Verification runs as a GitHub Check. Install the XOR GitHub App. Every PR from a coding agent gets a pass/fail result with a link to the full test report.
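For readers unfamiliar with GitHub Checks, the sketch below builds a request body of the shape accepted by GitHub's "create a check run" REST endpoint (`name`, `head_sha`, `status`, `conclusion`, `details_url`, `output`). How XOR actually populates its check is not documented here, so the check name, the mapping from outcome to conclusion, and the report URL are all assumptions.

```python
# Assumed mapping from verification outcome to a GitHub check conclusion.
CONCLUSION_FOR_OUTCOME = {
    "PASS": "success",
    "FAIL": "failure",
    "BUILD": "failure",
    "INFRA": "neutral",   # assumption: infrastructure problems are not blamed on the PR
}


def build_check_run(head_sha: str, outcome: str, report_url: str) -> dict:
    """Build a check-run request body; nothing is sent over the network."""
    return {
        "name": "XOR verification",        # hypothetical check name
        "head_sha": head_sha,
        "status": "completed",
        "conclusion": CONCLUSION_FOR_OUTCOME[outcome],
        "details_url": report_url,         # link to the full test report
        "output": {
            "title": f"Verification: {outcome}",
            "summary": f"Verifier outcome for {head_sha}: {outcome}",
        },
    }
```

Usage might look like `build_check_run("abc123", "PASS", "https://example.invalid/report/1")`, with the resulting dictionary serialized as JSON by whatever HTTP client the integration uses.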
FAQ
How does agent verification work?
Agents are wrapped in observation harnesses. When an agent writes a fix for a CVE, XOR applies the fix in an isolated environment and re-runs the verifier for that vulnerability. If the vulnerability no longer triggers and safety checks pass, the fix is verified. Results are logged and attached to the PR.
What if the agent fix causes a regression?
Regressions are caught in the verification harness. The agent can see the regression and try again. Failed runs are a primary learning signal and feed back into the agent harness.
Which agents are compatible?
Any agent that writes code: Claude Code, Codex, Gemini CLI, Cursor, or custom agents with code generation. No lock-in. The GitHub App monitors the code change and runs verification automatically.
Automated vulnerability patching
AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.
Benchmark Results
62.7% pass rate. $2.64 per fix. Real data from 1,736 evaluations.
Agent Cost Economics
Fix vulnerabilities for $2.64–$87 with agents. 100x cheaper than incident response. Real cost data.
Agent Configurations
13 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.
Benchmark Methodology
How CVE-Agent-Bench evaluates 13 coding agents on 136 real vulnerabilities. Deterministic, reproducible, open methodology.
Agent Environment Security
AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.
Security Economics for Agentic Patching
ROI models backed by verified pass/fail data and business-impact triage.
Automated Vulnerability Patching and PR Review
Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.
Continuous Learning from Verified Agent Runs
A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.
Signed Compliance Evidence for AI Agents
A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.
Compliance Evidence and Standards Alignment
How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.
See which agents produce fixes that work
136 CVEs. 13 agents. 1,736 evaluations. Agents learn from every run.