CVE-Factory: Multi-Agent Vulnerability Reproduction at Scale

This skill teaches Claude to apply the CVE-Factory methodology — a six-agent pipeline that transforms sparse CVE advisory metadata into fully executable, Docker-containerized vulnerability tasks with exploit tests, functional tests, and verified patches. The technique achieves 95% solution correctness and 96% environment fidelity compared to human security experts, while reducing per-CVE reproduction time from 5–24 hours to ~48 minutes. Claude can use this approach to build reproducible security environments, construct vulnerability benchmarks, and design multi-agent workflows for security automation.

When to Use

When the user wants to reproduce a specific CVE in an isolated, testable Docker environment
When building a vulnerability benchmark or training dataset for security agents
When designing a multi-agent pipeline that decomposes a complex security task into decoupled stages
When the user needs exploit proof-of-concept tests alongside functional regression tests for a vulnerability
When containerizing a vulnerable application with multi-service orchestration (app + database + dependencies)
When validating that a security patch actually fixes a vulnerability without breaking functionality
When automating the conversion of CVE advisories into structured, agent-solvable tasks

Key Technique

CVE-Factory's core insight is decoupled agent design with information asymmetry. The six-agent pipeline splits into two phases: a Decoupling Phase (Stages 1–3) where agents generate logical artifacts independently, and a Coupling Phase (Stages 4–6) where agents validate and repair those artifacts against the live environment. Critically, the Builder agent that constructs Docker environments is blinded from test files and solutions — it only sees the task description and dependency requirements. This prevents the Builder from producing mock implementations or leaking fix information into the environment, which is the dominant failure mode (52 of 187 failures) in naive approaches.
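One way to picture the blinding is as a per-role read filter applied before an agent is handed its workspace. The sketch below is illustrative only: the `VISIBLE_TO` table and the `visible_files` helper are hypothetical names, and only the Builder entry (task description plus docker-reqs.md) is taken from the text; the Solver entry is an assumption.

```python
from fnmatch import fnmatch
from typing import Dict, List

# Hypothetical visibility rules: each role maps to the glob patterns it may read.
# Only the Builder entry comes from the text; the Solver entry is an assumption.
VISIBLE_TO: Dict[str, List[str]] = {
    "builder": ["task.yaml", "docker-reqs.md"],
    "solver": ["task.yaml", "solution.sh", "test/*"],
}


def visible_files(role: str, files: List[str]) -> List[str]:
    """Return only the files the given role is allowed to read."""
    patterns = VISIBLE_TO.get(role, [])
    return [f for f in files if any(fnmatch(f, p) for p in patterns)]


if __name__ == "__main__":
    workspace = ["task.yaml", "docker-reqs.md", "solution.sh", "test/test_vuln.py"]
    # The Builder never receives solution.sh or anything under test/.
    print(visible_files("builder", workspace))
```

Filtering at hand-off time, rather than asking the agent to ignore files, means a leak requires a bug in the filter instead of a lapse in the prompt.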

The second key insight is structured verification at each stage boundary. Rather than trusting agent self-assessment, the Orchestrator runs objective shell scripts (check_env_ready, check_fix_ready, check_cve_ready) that execute real tests inside the container. A task is verified only when: (1) functional tests pass AND vulnerability tests fail in the vulnerable state, and (2) ALL tests pass after applying the solution patch. Agents communicate completion status via structured XML signals (continue, error, pause) and the Orchestrator routes revision requests back to the original file creator using a file ownership map.
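The signalling and routing described above could be sketched roughly as follows. The three signal names come from the text; the `<signal>` tag, the `FILE_OWNER` entries, and the role names in it are assumptions made for illustration, not the project's actual wire format.

```python
import re
from typing import Optional

# Assumed wire format: the agent's final message contains one tag such as
# <signal>pause</signal>. Only the three signal names come from the text.
SIGNAL_RE = re.compile(r"<signal>\s*(continue|error|pause)\s*</signal>")

# Hypothetical file ownership map: file path -> role that created it.
FILE_OWNER = {
    "task.yaml": "analyzer",
    "test/test_vuln.py": "tester",
    "solution.sh": "solver",
    "Dockerfile": "builder",
}


def parse_signal(message: str) -> str:
    """Extract the completion signal; a missing tag is treated as an error."""
    match = SIGNAL_RE.search(message)
    return match.group(1) if match else "error"


def route(message: str, disputed_file: Optional[str] = None) -> str:
    """Decide what the Orchestrator does next for one agent message."""
    signal = parse_signal(message)
    if signal == "continue":
        return "next_stage"
    if signal == "pause" and disputed_file in FILE_OWNER:
        # Send the revision request back to whoever created the file.
        return "revise:" + FILE_OWNER[disputed_file]
    return "abort"
```

Treating an absent or malformed tag as `error` keeps the Orchestrator from advancing on free-text claims of success, which matches the point about not trusting agent self-assessment.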

The third insight is task framing as first-person bug reports rather than technical CVE advisories. This makes tasks solvable by general-purpose code agents without security-specific prompting, and produces more realistic evaluation conditions. The task description reads like a developer reporting a bug, not a security researcher disclosing a vulnerability.

Step-by-Step Workflow

1. Collect and filter CVE metadata

Gather the CVE ID, affected software/version, referenced patches, advisory links, and CWE classification. Discard CVEs that reference proprietary software, deleted repositories, or lack sufficient source code access. Produce a public.md summarizing the vulnerability's technical details and a role-specific briefing for each downstream agent.
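A minimal sketch of the filtering step, assuming the advisory metadata has already been parsed into a record. The `CveRecord` fields and the `rejection_reason` and `partition` helpers are hypothetical names; the three discard rules mirror the ones listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CveRecord:
    """Subset of advisory metadata; field names are illustrative."""
    cve_id: str
    software: str
    version: str
    cwe: str = ""
    patch_urls: List[str] = field(default_factory=list)
    proprietary: bool = False
    repo_deleted: bool = False
    source_available: bool = True


def rejection_reason(cve: CveRecord) -> Optional[str]:
    """Return why a CVE should be discarded, or None to keep it."""
    if cve.proprietary:
        return "proprietary software"
    if cve.repo_deleted:
        return "deleted repository"
    if not cve.source_available:
        return "insufficient source code access"
    return None


def partition(cves: List[CveRecord]) -> Tuple[List[CveRecord], Dict[str, str]]:
    """Split records into those kept and a map of cve_id -> rejection reason."""
    kept: List[CveRecord] = []
    dropped: Dict[str, str] = {}
    for cve in cves:
        reason = rejection_reason(cve)
        if reason is None:
            kept.append(cve)
        else:
            dropped[cve.cve_id] = reason
    return kept, dropped
```

Recording the reason alongside each dropped ID makes it cheap to audit the filter later.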

2. Generate the task description as a first-person bug report

Write task.yaml with an instruction field that describes the vulnerability symptoms from a developer's perspective — not as a CVE advisory. Include the technology stack, observed behavior, and suspected root cause. Tag with difficulty, category, and relevant technologies.

instruction: |
  I'm running the Acme Widget plugin v2.3.1 on WordPress 6.4.
  When I submit a form with '../' in the template parameter,
  I see PHP warnings about file inclusion from unexpected paths.
  I suspect the template parameter isn't being sanitized...
difficulty: medium
category: security
tags: [php, wordpress, path-traversal]
parser_name: pytest

3. Write dual test suites (functional + vulnerability)

Create test_func.py to verify that the application works correctly (forms submit, pages render, APIs respond). Create test_vuln.py to assert that the exploit is blocked: these tests must fail against the vulnerable build and pass once the fix is applied, which is exactly what the verification scripts check. Use pytest format. Test through actual HTTP requests or CLI invocations; never mock the vulnerable code path.
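As a rough illustration, helpers for a path traversal check in test_vuln.py might look like the sketch below. The `TARGET_URL` variable, the `layout` parameter, the payload, and the `/etc/passwd` marker are all assumptions chosen for the example; `fetch` issues a real HTTP request, and the optional `fetch` argument exists only so the helper can be exercised without a running container.

```python
import os
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable

# Assumed values for illustration; adapt to the real endpoint under test.
BASE_URL = os.environ.get("TARGET_URL", "http://localhost:8080")
PAYLOAD = "../../../../etc/passwd"


def exploit_url(base: str, payload: str) -> str:
    """Build the request URL with the traversal payload URL-encoded."""
    return base + "/?" + urllib.parse.urlencode({"layout": payload})


def leaks_passwd(body: str) -> bool:
    """True if the response body looks like the contents of /etc/passwd."""
    return "root:" in body and ":/bin/" in body


def fetch(url: str) -> str:
    """Perform a real HTTP GET; HTTP error pages still count as a response."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.read().decode("utf-8", errors="replace")


def traversal_blocked(
    base_url: str,
    payload: str = PAYLOAD,
    fetch: Callable[[str], str] = fetch,
) -> bool:
    """False on a vulnerable build (file leaked), True once the fix is in."""
    return not leaks_passwd(fetch(exploit_url(base_url, payload)))
```

Inside test_vuln.py, a pytest function would then reduce to `assert traversal_blocked(BASE_URL)`, which fails before the patch and passes after it.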

4. Write the reference solution patch

Create solution.sh containing targeted line-level edits (typically sed commands or git apply patches) that fix the vulnerability. Prefer minimal, surgical fixes over wholesale file replacements. The patch should modify only the vulnerable code path without upgrading dependencies or restructuring the application.
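To make "minimal, surgical" checkable, one option is to compare the file before and after solution.sh runs and count the changed lines. This is only a sketch: the `changed_line_count` and `is_surgical` helpers and the five-line threshold are assumptions, not part of the described pipeline.

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count lines replaced, inserted, or deleted between two file versions."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def is_surgical(before: str, after: str, max_lines: int = 5) -> bool:
    """True if the patch changed something, but only a handful of lines."""
    return 0 < changed_line_count(before, after) <= max_lines


if __name__ == "__main__":
    vulnerable = "<?php\n$layout = $_GET['layout'];\ninclude($layout);\n"
    patched = vulnerable.replace("include($layout)", "include(basename($layout))")
    print(changed_line_count(vulnerable, patched), is_surgical(vulnerable, patched))
```

A check like this would also flag a solution.sh that silently does nothing, since zero changed lines is not counted as surgical.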

5. Build the Docker environment under blind constraints

Construct Dockerfile and docker-compose.yaml WITHOUT access to tests or solution files. Use only the task description and docker-reqs.md (a dependency manifest). Set up the vulnerable version of the software with all required services (databases, caches, message queues). Add health checks with retry logic to ensure all services are ready before test execution.

FROM wordpress:6.4-php8.1-apache
RUN apt-get update && apt-get install -y curl unzip
# Install vulnerable plugin version from official archive
RUN curl -o /tmp/plugin.zip https://downloads.wordpress.org/plugin/acme-widget.2.3.1.zip \
    && unzip /tmp/plugin.zip -d /var/www/html/wp-content/plugins/
COPY entrypoint.sh /usr/local/bin/
HEALTHCHECK --interval=5s --retries=12 CMD curl -f http://localhost/ || exit 1

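services:
  wordpress:
    build: .
    ports: ["8080:80"]
    environment:
      # Point WordPress at the db service (variables read by the official image)
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_USER: root
      WORDPRESS_DB_PASSWORD: root
      WORDPRESS_DB_NAME: wordpress
    depends_on:
      db: { condition: service_healthy }
  db:
    image: mariadb:10.11
    environment:
      MYSQL_ROOT_PASSWORD: root
      MYSQL_DATABASE: wordpress
    healthcheck:
      test: ["CMD", "mariadb-admin", "ping", "-h", "localhost"]
      interval: 5s
      retries: 10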
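The same retry idea can be applied on the test-runner side, so that tests do not start before the services answer. Below is a minimal sketch; `wait_until_ready` and `http_probe` are hypothetical helpers, and the defaults simply echo the 5-second interval and 12 retries used in the HEALTHCHECK above.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    retries: int = 12,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or the retry budget is spent."""
    for attempt in range(1, retries + 1):
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused or reset: the service is still starting
        if attempt < retries:
            sleep(interval)
    return False


def http_probe(url: str) -> Callable[[], bool]:
    """Build a probe that succeeds when the URL answers without an HTTP error."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status < 400
    return probe
```

A test runner could call `wait_until_ready(http_probe("http://localhost:8080/"))` before invoking pytest and abort early if it returns False.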

6. Validate the vulnerable environment

Run check_env_ready inside the container: functional tests must pass (the application works) and vulnerability tests must fail, confirming that the exploit still succeeds against the unpatched code. If validation fails, diagnose the issue (missing dependencies, incorrect versions, network configuration) and retry up to 3 times. The Validator agent can modify Docker files but NOT test files or solutions.
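The retry behaviour could be sketched as a small loop that feeds each failure log back into the repair step. The `check` and `repair` callables are placeholders for running the verification script and invoking the Validator. Reading "retry up to 3 times" as one initial attempt plus at most three retries is an assumption.

```python
from typing import Callable, List, Tuple


def validate_with_retries(
    check: Callable[[], Tuple[bool, str]],
    repair: Callable[[str], None],
    max_retries: int = 3,
) -> Tuple[bool, List[str]]:
    """Run check(); on failure pass its log to repair() and try again.

    Returns (verified, logs), where logs holds one entry per check run.
    """
    logs: List[str] = []
    for attempt in range(max_retries + 1):  # first attempt plus retries
        ok, log = check()
        logs.append(log)
        if ok:
            return True, logs
        if attempt < max_retries:
            repair(log)  # diagnostic feedback drives the next repair
    return False, logs
```

Returning the accumulated logs lets the caller record the failure mode when every retry is exhausted.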

7. Validate the solution

Apply solution.sh inside the container and run check_fix_ready: ALL tests (functional + vulnerability) must pass. If the solution fails, the Solver agent adjusts the patch or test expectations, but cannot modify the Docker environment. This separation prevents cross-contamination between environment and logic fixes.
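Taken together, the vulnerable-state and fixed-state checks amount to a simple predicate over four booleans. The sketch below encodes that rule; the `SuiteResult` type and the function names are illustrative, and the "possible fix leakage" hint is an assumption derived from the failure modes discussed in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Outcome of running both suites in one environment state."""
    func_passed: bool
    vuln_passed: bool


def env_ready(vulnerable: SuiteResult) -> bool:
    """Vulnerable state: the app works and the vulnerability tests fail."""
    return vulnerable.func_passed and not vulnerable.vuln_passed


def fix_ready(fixed: SuiteResult) -> bool:
    """Fixed state: every test passes."""
    return fixed.func_passed and fixed.vuln_passed


def task_verified(vulnerable: SuiteResult, fixed: SuiteResult) -> bool:
    """A task counts as verified only if both states behave as expected."""
    return env_ready(vulnerable) and fix_ready(fixed)


def diagnose(vulnerable: SuiteResult, fixed: SuiteResult) -> str:
    """Give a coarse hint about which side of the check went wrong."""
    if not vulnerable.func_passed:
        return "environment broken"
    if vulnerable.vuln_passed:
        return "possible fix leakage"  # exploit blocked before patching
    if not fixed.func_passed:
        return "solution breaks functionality"
    if not fixed.vuln_passed:
        return "solution does not fix the vulnerability"
    return "verified"
```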

8. Run end-to-end quality check

Execute check_cve_ready which rebuilds the environment from scratch, runs the full vulnerable-state check, applies the solution, and runs the full fixed-state check. The Checker agent removes any mock implementations, static assertions, or hardcoded test values introduced during earlier stages. This is the final gate.
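Part of the Checker's clean-up could be supported by a static scan of the test files. The sketch below is a heuristic under stated assumptions, not the Checker's actual logic: `suspicious_nodes` is a hypothetical helper that flags imports of a mock library and assertions on constant values such as `assert True`.

```python
import ast
from typing import List, Tuple


def _is_mock_import(node: ast.AST) -> bool:
    """True for 'import unittest.mock', 'from unittest import mock', etc."""
    if isinstance(node, ast.Import):
        return any(a.name.split(".")[-1] == "mock" for a in node.names)
    if isinstance(node, ast.ImportFrom):
        module = node.module or ""
        return module.split(".")[-1] == "mock" or any(
            a.name == "mock" for a in node.names
        )
    return False


def suspicious_nodes(source: str) -> List[Tuple[int, str]]:
    """Return (line, reason) pairs for constructs worth a manual look."""
    findings: List[Tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if _is_mock_import(node):
            findings.append((node.lineno, "imports a mock library"))
        elif isinstance(node, ast.Assert) and isinstance(node.test, ast.Constant):
            findings.append((node.lineno, "asserts on a constant"))
    return sorted(findings)
```

A scan like this only narrows down where to look; whether a flagged test really sidesteps the vulnerable code path still needs the dynamic checks described above.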

9. Package the reproducible task

Organize all artifacts into the standard directory structure:

CVE-2025-XXXXX/
├── task.yaml
├── Dockerfile
├── docker-compose.yaml
├── task-deps/
├── solution.sh
└── test/
    ├── test_func.py
    ├── test_vuln.py
    └── run-tests.sh
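A packaging step could confirm that layout before a task is published. In this sketch `missing_artifacts` is a hypothetical helper; treating task-deps/ as optional is an assumption, based on the example outputs later in this document omitting it.

```python
from pathlib import Path
from typing import List, Union

# Files every packaged task is expected to contain, per the tree above.
# task-deps/ is treated as optional here (an assumption).
REQUIRED_FILES = [
    "task.yaml",
    "Dockerfile",
    "docker-compose.yaml",
    "solution.sh",
    "test/test_func.py",
    "test/test_vuln.py",
    "test/run-tests.sh",
]


def missing_artifacts(task_dir: Union[str, Path]) -> List[str]:
    """List required files that are absent from a packaged task directory."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```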

10. Run in isolation and verify end-to-end in a clean DinD environment

Execute the entire task inside a Docker-in-Docker container to confirm reproducibility on a clean machine. This catches host-dependent assumptions (cached images, network state, filesystem artifacts) that would make the task non-portable.

Concrete Examples

Example 1: Reproducing a path traversal vulnerability in a WordPress plugin

User: "Reproduce CVE-2025-10686 — a path traversal in the Creta Testimonial Showcase WordPress plugin"

Approach:

Fetch the CVE advisory and identify the affected plugin version, vulnerable endpoint, and patch commit
Write task.yaml framing it as: "When I change the layout value from 'layout-5' to something with '../' in it, I see error messages suggesting file inclusion from unexpected paths"
Create test_vuln.py that sends HTTP requests with ../ in the layout parameter and asserts the server leaks file contents or returns path-related errors
Create test_func.py that submits valid layout values and asserts the plugin renders correctly
Write solution.sh with a sed command that adds basename() sanitization to the layout parameter input
Build a docker-compose.yaml with WordPress + MariaDB, installing the vulnerable plugin version
Validate: functional tests pass, exploit test confirms traversal works, then apply patch and confirm all tests pass

Output:

CVE-2025-10686/
├── task.yaml          # First-person bug report about path traversal
├── Dockerfile         # WordPress 6.4 + vulnerable plugin v1.2
├── docker-compose.yaml # WordPress + MariaDB with health checks
├── solution.sh        # sed -i 's/include($layout)/include(basename($layout))/' ...
└── test/
    ├── test_func.py   # Valid layout rendering tests
    ├── test_vuln.py   # Path traversal exploit tests
    └── run-tests.sh   # pytest orchestration

Example 2: Building a multi-service vulnerability environment for an API

User: "Create a reproducible environment for a SQL injection in a Python Flask API that uses PostgreSQL"

Approach:

Identify the vulnerable endpoint, affected query construction, and the specific SQLi vector
Write the task description as a developer noticing unexpected database behavior when special characters appear in API input
Build docker-compose.yaml with three services: Flask app, PostgreSQL database, and an init container that seeds test data
Write test_vuln.py that sends crafted SQL payloads through the API and asserts data exfiltration or error-based injection succeeds
Write test_func.py that exercises the API with normal inputs and validates correct CRUD operations
Write solution.sh that replaces string formatting with parameterized queries
Validate the full vulnerable-then-fixed cycle

Output:

CVE-2025-XXXXX/
├── task.yaml
├── docker-compose.yaml   # flask + postgres + seed-data init
├── Dockerfile            # Python 3.11 + Flask + psycopg2
├── solution.sh           # sed replacing f-strings with parameterized queries
└── test/
    ├── test_func.py      # Normal API CRUD operations
    ├── test_vuln.py      # SQL injection payloads + exfiltration checks
    └── run-tests.sh

Example 3: Designing a multi-agent pipeline for batch CVE processing

User: "I need to process 50 CVEs into executable tasks automatically"

Approach:

Set up a Docker-in-Docker environment (dev-env/docker-compose.yaml) to isolate all vulnerable containers from the host
Configure the Orchestrator with concurrency limits: max_concurrent_cves: 3, builder_concurrency: 2 to prevent Docker resource exhaustion
Place CVE markdown files in original_cves_md/ as CVE-YYYY-NNNNN.md
Run the pipeline: python -m orchestrator.run — the Orchestrator spawns agents asynchronously, allowing faster tasks to complete without blocking slower ones
Monitor structured XML signals from agents: continue (stage complete), error (CVE irreproducible), pause (needs upstream revision)
Collect results: expect ~66% verified success rate on recent CVEs; failed tasks produce diagnostic logs identifying whether the failure is due to missing source code, complex multi-step reproduction, or environment limitations
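The concurrency limit in the approach above can be sketched with an asyncio semaphore. This is not the project's orchestrator: `run_batch`, `process_cve`, and the `work` coroutine are hypothetical stand-ins, and only the limit of three concurrent CVEs comes from the text.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple

MAX_CONCURRENT_CVES = 3  # value taken from the text


async def process_cve(
    cve_id: str,
    limit: asyncio.Semaphore,
    work: Callable[[str], Awaitable[str]],
) -> Tuple[str, str]:
    """Run one CVE under the shared limit; failures become a status string."""
    async with limit:
        try:
            return cve_id, await work(cve_id)
        except Exception as exc:  # one failing CVE must not stop the batch
            return cve_id, "error: %s" % exc


async def run_batch(
    cve_ids: Iterable[str],
    work: Callable[[str], Awaitable[str]],
    max_concurrent: int = MAX_CONCURRENT_CVES,
) -> Dict[str, str]:
    """Process all CVEs, never running more than max_concurrent at once."""
    limit = asyncio.Semaphore(max_concurrent)
    pairs = await asyncio.gather(
        *(process_cve(cve_id, limit, work) for cve_id in cve_ids)
    )
    return dict(pairs)
```

Because each task holds the semaphore only while it is working, fast CVEs release their slot early and slower ones do not block the queue behind them.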

Output:

cve_tasks/
├── CVE-2025-00001/  # ✓ Verified
├── CVE-2025-00002/  # ✓ Verified
├── CVE-2025-00003/  # ✗ Error: proprietary software
├── ...
└── summary.json     # 33/50 verified, 12 irreproducible, 5 failed validation

Best Practices

Do:

Blind the environment builder from tests and solutions — this is the single most important design decision to prevent mock implementations and solution leakage
Use real HTTP requests, CLI commands, or library calls in tests — never mock the vulnerable code path or use static assertions
Frame task descriptions as first-person bug reports, not CVE advisories — this produces more realistic and generalizable tasks
Add health checks with retry logic to all Docker services — vulnerable applications often have slow startup and flaky initialization
Prefer sed-based line-level patches over wholesale file replacements in solution.sh — surgical fixes are more verifiable and closer to real developer behavior

Avoid:

Letting the agent that builds the environment also see the tests or the solution: information asymmetry between agents prevents circular dependencies and mock solutions
Trusting agent self-assessment for completion — always run objective verification scripts that execute real tests inside the container
Including version upgrades or dependency bumps as "fixes" — solutions must patch the vulnerable code path directly
Running vulnerable containers outside Docker-in-Docker isolation — always use a DinD wrapper to prevent accidental host compromise
Skipping the end-to-end rebuild check (pipeline Stage 6, workflow step 8): cached Docker layers and leftover filesystem state can mask reproducibility failures

Error Handling

| Failure Mode | Frequency | Resolution |
|---|---|---|
| Mock implementations: Builder creates fake services instead of real ones | ~28% of failures | Enforce blind building: Builder never sees tests. Checker detects mocks by verifying actual network traffic and process execution |
| Static test assertions: Tests check hardcoded values instead of dynamic behavior | ~27% of failures | Checker rewrites tests to make real requests and assert on response content, not static strings |
| Fix leakage: Environment already contains the patch | ~13% of failures | Verify vulnerable state BEFORE applying solution: exploit test must succeed in the unpatched environment |
| Service startup failures: Database or app not ready when tests run | Common | Add health checks with interval: 5s, retries: 12, and depends_on: condition: service_healthy |
| Resource exhaustion: Too many concurrent Docker builds | Common | Limit concurrency per agent type: builder: 2, max_concurrent_cves: 3 |
| Irreproducible CVEs: Deleted repos, proprietary software, hardware-dependent bugs | ~20% of CVEs | Detect early in the Analyzer stage and emit error signal; do not waste downstream compute |

When a validation stage fails, retry up to 3 times with diagnostic feedback from the verification script. If all retries fail, terminate the task and log the failure mode for analysis.

Limitations

Proprietary and closed-source software cannot be reproduced — the pipeline requires access to source code and build systems
Hardware-dependent vulnerabilities (firmware, device drivers, side-channel attacks) are out of scope — Docker cannot simulate hardware
Complex multi-step exploits requiring human interaction (social engineering, browser UI manipulation) have low success rates
Very large monorepo applications may exceed Docker build time limits or require impractical amounts of disk space
The 66.2% verified success rate means roughly 1 in 3 generated tasks will have issues — always run the end-to-end validation and manually inspect a sample
Language coverage is uneven — Python, JavaScript, PHP, and Go have the highest success rates; less common languages (Rust, Erlang) have fewer training examples
Cost: each CVE reproduction costs roughly one Claude API call per agent stage (~6 calls total), plus Docker compute time

Reference

Paper: CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability — Luo et al., 2026. Focus on Section 3 (pipeline architecture), Section 4.1 (cross-validation methodology), and Table 2 (failure mode analysis) for implementation details.

Code: github.com/livecvebench/CVE-Factory — Full pipeline with agent definitions, orchestrator, and 1,000+ generated tasks.
