[COUNCIL]

Council Deliberations

25 strategic questions put to multi-model deliberation, with consensus scores and synthesized conclusions for each completed run.

Deliberation process

Each question is analyzed independently by multiple models. Responses are compared for agreement, and a final conclusion is synthesized with a confidence rating.

Using the results

High-consensus answers inform benchmark design decisions. Low-consensus questions identify areas where the benchmark needs more data or where reasonable people disagree.

Deliberations completed: 14
Independent personas: 3
High confidence: 1
Medium confidence: 9

Three analysts argued about every finding

Each of the 25 research questions went through a structured deliberation process. Three independent personas (a mathematician, a creative analyst, and a skeptic) each evaluated the evidence and wrote their opinion. A separate review panel ranked the opinions, then a synthesis produced the final conclusion with a confidence rating.

14 deliberations produced usable conclusions. 11 failed due to upstream infrastructure errors during the council generation run and will be re-run in the next evaluation cycle.

[KEY INSIGHT]

1 high-confidence conclusion

from 14 completed deliberations. A conclusion earns high confidence only when all three personas agree and corroborating evidence chains support the verdict.
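The confidence rule described above can be sketched as a small function. This is a hypothetical illustration of the stated rule, not the council's actual scoring code, and the MEDIUM band (partial agreement) is inferred rather than documented:

```python
def rate_confidence(persona_verdicts, evidence_corroborated):
    """Assign a confidence rating to a synthesized conclusion.

    Sketch of the rule described in the text: HIGH requires all three
    personas to agree AND corroborating evidence chains; partial
    agreement (an assumed interpretation) yields MEDIUM; otherwise LOW.
    """
    unique = set(persona_verdicts)
    if len(unique) == 1 and evidence_corroborated:
        return "HIGH"
    if len(unique) <= 2:  # at least two of the three personas agree
        return "MEDIUM"
    return "LOW"

# Mathematician, creative analyst, and skeptic all agree, with
# corroborating evidence chains:
print(rate_confidence(["supports", "supports", "supports"], True))  # HIGH
```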

All deliberations

Questions, confidence levels, and synthesized conclusions. Each row represents a full three-persona deliberation cycle.

1. Are the per-agent rankings stable across different random samples of CVEs, or would different sample selection produce completely different orderings? [MEDIUM]
Underdetermined from the provided evidence: we cannot claim rankings are stable across different random CVE samples, and different sample selection could plausibly change at least the close-call order...

2. Does the OpenCode wrapper add genuine agent value, or is it primarily passing through to the same underlying model with overhead? [HIGH]
Based on the provided benchmark evidence, OpenCode is primarily passing through to the same underlying models while adding substantial overhead (higher cost, more build failures) and yielding worse pa...

3. If training contamination IS present in the benchmark, which specific CVE patterns would show the strongest signals? [LOW]
You cannot identify which *specific CVE patterns* show the strongest contamination signals from the provided aggregates; to do so you need a per-CVE table (CVE→CWE/vendor/language/etc.) joined to per-...

4. Assuming the per-sample difficulty ratings (floor/ceiling/hard/medium/easy) are accurate, what agent characteristics predict success on hard vs easy CVEs? [MEDIUM]
With the current aggregates, we cannot identify which agent characteristics predict success on hard vs easy CVEs; we can only say agent-model identity predicts overall success, and hypothesize that mo...

5. If cost-effectiveness is the primary optimization metric (not just accuracy), which agent configuration offers the best bang-for-buck? [MEDIUM]
Choose **gemini-gemini-3-pro-preview** for best bang-for-buck (lowest **Cost/Pass $3.52**, **Efficiency Rank 1**). If you need ≥~60% pass rate while staying cost-efficient, pick **codex-gpt-5.2** (**$...

6. If behavioral patterns extracted from RLM data correlate with outcomes, which patterns are most predictive of success? [LOW]
Underdetermined from the current evidence; no single behavioral pattern is demonstrably “most predictive of success.” The only defensible takeaway is that interaction intensity (turns/tool calls) is n...

7. Can the benchmark methodology be extended to measure patch quality beyond binary pass/fail (e.g., minimal changes, semantic preservation)? [MEDIUM]
Yes, the methodology can be extended to score patch quality beyond pass/fail for metrics like minimality and stability using the stored patch/session artifacts, but robust semantic-preservation measur...

8. How would results change if agents were given access to the project's existing test suite during patch generation (not just for evaluation)? [LOW]
Results would *likely* improve primarily by reducing “Test Fail” (and increasing passes) for cases where tests can be executed during generation, but the magnitude—and any effect on build/infra failur...

9. Could the RLM behavioral analysis be used to create an early-stopping criterion that saves compute on doomed attempts? [MEDIUM]
Yes in principle, but it’s **underdetermined from the current evidence** whether RLM behavioral analysis will yield a reliable early-stopping criterion that saves compute without harming pass rate; yo...

10. What would CVE-Bench v2 look like if it controlled for training contamination and infrastructure variance using double-blind evaluation? [MEDIUM]
CVE-Bench v2 would look like a double-blind, sealed-holdout + standardized-sandbox benchmark that anonymizes model/sample identities during runs and scoring, and that separates “can it fix the CVE?” f...

11. At what infrastructure failure rate does retry cost dominate agent selection decisions? [LOW]
From this report alone, the dominance threshold is **underdetermined** due to missing retry and cost-accounting details; under a standard “retry-until-non-infra” assumption, it would be on the order o...

12. Build a cost-performance Pareto frontier for production agent selection — which agents are dominated and should never be deployed? [MEDIUM]
On the (Cost/Eval, Pass Rate) Pareto frontier, keep **gemini-gemini-3-pro-preview** and **codex-gpt-5.2**; never deploy **claude-claude-opus-4-5**, **claude-claude-opus-4-6**, **codex-gpt-5.2-codex**,...

13. Design a sequential waterfall dispatch protocol (cheapest first, escalate on failure) and model its expected cost per fix [MEDIUM]
Dispatch **gemini-3-pro-preview → codex-gpt-5.2 → claude-opus-4-6**, escalating only on non-pass and stopping on first pass; under an independence approximation the expected spend is **≈ $3.96 per sam...

14. Model the break-even point where automated agentic patching becomes cheaper than manual developer fixes (assume $150/hr developer cost) [MEDIUM]
Automated agentic patching is cheaper than manual fixes at $150/hr whenever your average manual time per successful fix exceeds \(t^*=0.4\cdot\text{Cost/Pass}\) minutes—e.g., **1.41 min** (gemini $3.5...
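Two of the economics conclusions above reduce to short calculations. The break-even threshold in question 14 follows from a $150/hr developer rate ($2.50/min): automation wins once manual minutes per successful fix exceed Cost/Pass divided by $2.50/min, i.e. 0.4 × Cost/Pass. The waterfall model in question 13 is the standard escalation expectation under independence; the per-stage costs and pass rates below are placeholders for illustration, not the report's figures:

```python
# Break-even (question 14): manual developer cost is $150/hr = $2.50/min.
# Automation is cheaper when manual minutes per successful fix exceed
# t* = cost_per_pass / 2.50 = 0.4 * cost_per_pass.
def break_even_minutes(cost_per_pass, hourly_rate=150.0):
    return cost_per_pass / (hourly_rate / 60.0)

print(round(break_even_minutes(3.52), 2))  # 1.41 min at $3.52 Cost/Pass

# Waterfall dispatch (question 13): run the cheapest agent first and
# escalate only on failure. Under an independence approximation the
# expected spend per sample is c1 + (1-p1)*(c2 + (1-p2)*c3 + ...).
def expected_waterfall_cost(stages):
    """stages: list of (cost_per_attempt, pass_rate) in dispatch order."""
    expected, p_reach = 0.0, 1.0
    for cost, pass_rate in stages:
        expected += p_reach * cost          # pay this stage if reached
        p_reach *= (1.0 - pass_rate)        # probability of escalating
    return expected

# Placeholder stage costs and pass rates (not report data):
print(round(expected_waterfall_cost([(2.5, 0.7), (4.0, 0.6), (8.0, 0.5)]), 2))
```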


[NEXT STEPS]

Read what the deliberations produced

The economics questions led to 10 actionable findings. The validation questions stress-tested every claim in the benchmark.

Explore more

FAQ

What is the benchmark council?

A structured deliberation process where multiple AI models independently analyze the same question, then a synthesis step identifies consensus, disagreements, and confidence levels.

How is consensus measured?

Each question receives a consensus score (0-1) based on agreement across models. Scores above 0.7 indicate strong agreement. Below 0.4 signals open questions that need more evidence.
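The thresholds above can be written out directly. A minimal sketch, assuming the bands stated in the FAQ; the label for the middle band (0.4–0.7) is not named in the text and is my assumption:

```python
def classify_consensus(score):
    """Map a 0-1 consensus score to the bands described in the FAQ.

    > 0.7  -> strong agreement
    < 0.4  -> open question needing more evidence
    middle -> "partial agreement" (assumed label; not named in the FAQ)
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("consensus score must be in [0, 1]")
    if score > 0.7:
        return "strong agreement"
    if score < 0.4:
        return "open question"
    return "partial agreement"

print(classify_consensus(0.85))  # strong agreement
```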

[RELATED TOPICS]

Patch verification

XOR writes a verifier for each vulnerability, then tests agent-generated patches against it. If the fix passes, it ships. If not, the failure feeds back into the agent harness.

Automated vulnerability patching

AI agents generate fixes for known CVEs. XOR verifies each fix and feeds outcomes back into the agent harness so future patches improve.

Benchmark Results

62.7% pass rate. $2.64 per fix. Real data from 1,920 evaluations.

Agent Cost Economics

Fix vulnerabilities for $2.64–$52 with agents. 100x cheaper than incident response. Real cost data.

Agent Configurations

15 agent-model configurations evaluated on real CVEs. Compare Claude Code, Codex, Gemini CLI, Cursor, and OpenCode.

Benchmark Methodology

How CVE-Agent-Bench evaluates 15 coding agents on 128 real vulnerabilities. Deterministic, reproducible, open methodology.

Agent Environment Security

AI agents run with real permissions. XOR verifies tool configurations, sandbox boundaries, and credential exposure.

Security Economics for Agentic Patching

Security economics for agentic patching. ROI models backed by verified pass/fail data and business-impact triage.

Validation Process

25 questions we ran against our own data before publishing. Challenges assumptions, explores implications, extends findings.

Cost Analysis

10 findings on what AI patching costs and whether it is worth buying. 1,920 evaluations analyzed.

Bug Complexity

128 vulnerabilities scored by difficulty. Floor = every agent fixes it. Ceiling = no agent can.

Agent Strategies

How different agents approach the same bug. Strategy matters as much as model capability.

Execution Metrics

Per-agent session data: turns, tool calls, tokens, and timing. See what happens inside an agent run.

Pricing Transparency

Every cost number has a source. Published pricing models, measurement methods, and provider rates.

Automated Vulnerability Patching and PR Review

Automated code review, fix generation, GitHub Actions hardening, safety checks, and learning feedback. One-click install on any GitHub repository.

Getting Started with XOR GitHub App

Install in 2 minutes. First result in 15. One-click GitHub App install, first auto-review walkthrough, and engineering KPI triad.

Platform Capabilities

One install. Seven capabilities. Prompt-driven. CVE autopatch, PR review, CI hardening, guardrail review, audit packets, and more.

Dependabot Verification

Dependabot bumps versions. XOR verifies they're safe to merge. Reachability analysis, EPSS/KEV enrichment, and structured verdicts.

Compliance Evidence

Machine-readable evidence for every triaged vulnerability. VEX statements, verification reports, and audit trails produced automatically.

Compatibility and Prerequisites

Languages, build systems, CI platforms, and repository types supported by XOR. What you need to get started.

Command Reference

Every @xor-hardener command on one page. /review, /describe, /ask, /patch_i, /issue_spec, /issue_implement, and more.

Continuous Learning from Verified Agent Runs

A signed record of every agent run. See what the agent did, verify it independently, and feed the data back so agents improve.

Signed Compliance Evidence for AI Agents

A tamper-proof record of every AI agent action. Produces evidence for SOC 2, EU AI Act, PCI DSS, and more. Built on open standards so auditors verify independently.

Compliance Evidence and Standards Alignment

How XOR signed audit trails produce evidence for SOC 2, EU AI Act, PCI DSS, NIST, and other compliance frameworks.

Agentic Third-Party Risk

33% of enterprise software will be agentic by 2028. 40% of those projects will be canceled due to governance failures. A risk overview for CTOs.

MCP Server Security

17 attack types across 4 surfaces. 7.2% of 1,899 open-source MCP servers contain vulnerabilities. Technical deep-dive with defense controls.

How Agents Get Attacked

20% jailbreak success rate. 42 seconds average. 90% of successful attacks leak data. Threat data grounded in published research.

Governing AI Agents in the Enterprise

92% of AI vendors claim broad data usage rights. 17% commit to regulatory compliance. Governance frameworks from NIST, OWASP, EU CRA, and Stanford CodeX.

OWASP Top 10 for Agentic Applications

The OWASP Agentic Top 10 mapped to real-world attack data and XOR capabilities. A reference page for security teams.

See which agents produce fixes that work

128 CVEs. 15 agents. 1,920 evaluations. Agents learn from every run.