skills/cve-factory-scaling-expert-level-agentic/SKILL.md
Build multi-agent pipelines that transform CVE metadata into fully executable vulnerability reproduction environments with Docker, automated tests, and verified patches. Use this skill when:
- "Set up a CVE reproduction environment"
- "Create an executable security task from a CVE"
- "Build a vulnerability benchmark with Docker"
- "Reproduce CVE-2025-XXXXX in an isolated container"
- "Generate exploit tests and patch verification for a vulnerability"
- "Design a multi-agent pipeline for security task automation"
npx skillsauth add ndpvt-web/arxiv-claude-skills cve-factory-scaling-expert-level-agentic

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to apply the CVE-Factory methodology — a six-agent pipeline that transforms sparse CVE advisory metadata into fully executable, Docker-containerized vulnerability tasks with exploit tests, functional tests, and verified patches. The technique achieves 95% solution correctness and 96% environment fidelity compared to human security experts, while reducing per-CVE reproduction time from 5–24 hours to ~48 minutes. Claude can use this approach to build reproducible security environments, construct vulnerability benchmarks, and design multi-agent workflows for security automation.
CVE-Factory's core insight is decoupled agent design with information asymmetry. The six-agent pipeline splits into two phases: a Decoupling Phase (Stages 1–3), where agents generate logical artifacts independently, and a Coupling Phase (Stages 4–6), where agents validate and repair those artifacts against the live environment. Critically, the Builder agent that constructs Docker environments is blinded from test files and solutions: it sees only the task description and dependency requirements. Blinding keeps the Builder from producing mock implementations or leaking fix information into the environment. Mock implementations are the dominant failure mode in naive approaches (52 of 187 failures, about 28%).
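One way to enforce the blinding described above is to filter the artifact set before it reaches the Builder. This is only an illustrative sketch: `BUILDER_VISIBLE` and `builder_view` are hypothetical names, and the allow-list assumes the Builder's inputs are exactly `task.yaml` and `docker-reqs.md`.

```python
from pathlib import PurePosixPath

# Assumption: the Builder may read only the task description and the
# dependency manifest; tests and the solution stay hidden from it.
BUILDER_VISIBLE = frozenset({"task.yaml", "docker-reqs.md"})


def builder_view(artifacts: dict[str, str]) -> dict[str, str]:
    """Return only the artifacts the Builder is allowed to see.

    `artifacts` maps a relative path to the file's content.
    """
    return {
        path: content
        for path, content in artifacts.items()
        if PurePosixPath(path).name in BUILDER_VISIBLE
    }
```

An allow-list is used rather than a deny-list so that any new artifact type is hidden from the Builder by default.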
The second key insight is structured verification at each stage boundary. Rather than trusting agent self-assessment, the Orchestrator runs objective shell scripts (check_env_ready, check_fix_ready, check_cve_ready) that execute real tests inside the container. A task is verified only when: (1) functional tests pass AND vulnerability tests fail in the vulnerable state, and (2) ALL tests pass after applying the solution patch. Agents communicate completion status via structured XML signals (continue, error, pause) and the Orchestrator routes revision requests back to the original file creator using a file ownership map.
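The verification rule and the routing step can be sketched in a few lines. Everything named here is an assumption made for illustration: the `<signal>` tag format, the `SuiteResult` type, and the agent labels in `FILE_OWNERS` do not come from the CVE-Factory code base.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SuiteResult:
    functional_pass: bool      # every functional test passed
    vulnerability_pass: bool   # every vulnerability test passed


def is_verified(vulnerable: SuiteResult, fixed: SuiteResult) -> bool:
    """Apply the two-part rule: the app works and the vulnerability tests
    fail before the patch; every test passes after the patch."""
    vulnerable_ok = vulnerable.functional_pass and not vulnerable.vulnerability_pass
    fixed_ok = fixed.functional_pass and fixed.vulnerability_pass
    return vulnerable_ok and fixed_ok


# Assumed wire format for the completion signal: <signal>continue</signal>
SIGNAL_PATTERN = re.compile(r"<signal>\s*(continue|error|pause)\s*</signal>")


def parse_signal(agent_output: str) -> str:
    """Return the last completion signal found in an agent's output."""
    found = SIGNAL_PATTERN.findall(agent_output)
    if not found:
        raise ValueError("agent output contains no completion signal")
    return found[-1]


# Hypothetical ownership map: file name -> agent that created the file.
FILE_OWNERS = {
    "task.yaml": "describer",
    "test_func.py": "test_writer",
    "test_vuln.py": "test_writer",
    "solution.sh": "solver",
    "Dockerfile": "builder",
    "docker-compose.yaml": "builder",
}


def route_revision(path: str) -> str:
    """Return the agent that should receive a revision request for `path`."""
    return FILE_OWNERS[PurePosixPath(path).name]
```

If the vulnerability tests already pass before the patch is applied, `is_verified` returns `False`. This is how fix leakage shows up at this gate.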
The third insight is task framing as first-person bug reports rather than technical CVE advisories. This makes tasks solvable by general-purpose code agents without security-specific prompting, and produces more realistic evaluation conditions. The task description reads like a developer reporting a bug, not a security researcher disclosing a vulnerability.
Gather the CVE ID, affected software/version, referenced patches, advisory links, and CWE classification. Discard CVEs that reference proprietary software, deleted repositories, or lack sufficient source code access. Produce a public.md summarizing the vulnerability's technical details and a role-specific briefing for each downstream agent.
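The discard criteria above amount to a small triage function. The sketch below assumes the metadata has already been collected into a record; `CveRecord`, its field names, and `triage` are hypothetical and only mirror the fields listed in the paragraph.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CveRecord:
    cve_id: str
    software: str
    version: str
    cwe: Optional[str] = None
    patch_urls: list[str] = field(default_factory=list)
    advisory_urls: list[str] = field(default_factory=list)
    source_available: bool = False   # vulnerable source can be obtained
    proprietary: bool = False        # closed-source product
    repo_deleted: bool = False       # upstream repository no longer exists


def triage(record: CveRecord) -> Optional[str]:
    """Return a rejection reason, or None when the CVE can proceed."""
    if record.proprietary:
        return "proprietary software"
    if record.repo_deleted:
        return "deleted repository"
    if not record.source_available:
        return "insufficient source code access"
    return None
```

Returning the reason instead of a bare boolean lets the caller log why a CVE was dropped before any downstream compute is spent on it.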
Write task.yaml with an instruction field that describes the vulnerability symptoms from a developer's perspective — not as a CVE advisory. Include the technology stack, observed behavior, and suspected root cause. Tag with difficulty, category, and relevant technologies.
instruction: |
  I'm running the Acme Widget plugin v2.3.1 on WordPress 6.4.
  When I submit a form with '../' in the template parameter,
  I see PHP warnings about file inclusion from unexpected paths.
  I suspect the template parameter isn't being sanitized...
difficulty: medium
category: security
tags: [php, wordpress, path-traversal]
parser_name: pytest
Create test_func.py to verify the application works correctly (forms submit, pages render, APIs respond). Create test_vuln.py with two test classes: one that asserts the vulnerability IS exploitable before patching (expected to fail post-fix), and one that asserts it is NOT exploitable after patching. Use pytest format. Test through actual HTTP requests or CLI invocations — never mock the vulnerable code path.
Create solution.sh containing targeted line-level edits (typically sed commands or git apply patches) that fix the vulnerability. Prefer minimal, surgical fixes over wholesale file replacements. The patch should modify only the vulnerable code path without upgrading dependencies or restructuring the application.
Construct Dockerfile and docker-compose.yaml WITHOUT access to tests or solution files. Use only the task description and docker-reqs.md (a dependency manifest). Set up the vulnerable version of the software with all required services (databases, caches, message queues). Add health checks with retry logic to ensure all services are ready before test execution.
FROM wordpress:6.4-php8.1-apache
RUN apt-get update && apt-get install -y curl unzip
# Install vulnerable plugin version from official archive
RUN curl -o /tmp/plugin.zip https://downloads.wordpress.org/plugin/acme-widget.2.3.1.zip \
&& unzip /tmp/plugin.zip -d /var/www/html/wp-content/plugins/
COPY entrypoint.sh /usr/local/bin/
HEALTHCHECK --interval=5s --retries=12 CMD curl -f http://localhost/ || exit 1
services:
  wordpress:
    build: .
    ports: ["8080:80"]
    depends_on:
      db: { condition: service_healthy }
  db:
    image: mariadb:10.11
    environment:
      MYSQL_ROOT_PASSWORD: root
      MYSQL_DATABASE: wordpress
    healthcheck:
      test: ["CMD", "mariadb-admin", "ping", "-h", "localhost"]
      interval: 5s
      retries: 10
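The same retry logic can be applied from the test-runner side, so tests do not start before the services answer. The helper below is a sketch, not part of CVE-Factory: `wait_until_ready` and `http_probe` are hypothetical names, and the defaults simply mirror the `HEALTHCHECK` values in the Dockerfile above.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    retries: int = 12,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `retries` times, pausing `interval` seconds
    between attempts. Return True as soon as the probe succeeds."""
    for attempt in range(1, retries + 1):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, timeouts and HTTP errors all derive from
            # OSError in urllib; treat them as "not ready yet".
            pass
        if attempt < retries:
            sleep(interval)
    return False


def http_probe(url: str, timeout: float = 3.0) -> Callable[[], bool]:
    """Build a probe that succeeds when `url` answers with a 2xx or 3xx status."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    return probe
```

A runner could call `wait_until_ready(http_probe("http://localhost:8080/"))` before invoking pytest. The address is the port published in the compose file above.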
Run check_env_ready inside the container: functional tests must pass (application works) and vulnerability tests must confirm the exploit succeeds. If validation fails, diagnose the issue (missing dependencies, incorrect versions, network configuration) and retry up to 3 times. The Validator agent can modify Docker files but NOT test files or solutions.
Apply solution.sh inside the container and run check_fix_ready: ALL tests (functional + vulnerability) must pass. If the solution fails, the Solver agent adjusts the patch or test expectations, but cannot modify the Docker environment. This separation prevents cross-contamination between environment and logic fixes.
Execute check_cve_ready which rebuilds the environment from scratch, runs the full vulnerable-state check, applies the solution, and runs the full fixed-state check. The Checker agent removes any mock implementations, static assertions, or hardcoded test values introduced during earlier stages. This is the final gate.
Organize all artifacts into the standard directory structure:
CVE-2025-XXXXX/
├── task.yaml
├── Dockerfile
├── docker-compose.yaml
├── task-deps/
├── solution.sh
└── test/
    ├── test_func.py
    ├── test_vuln.py
    └── run-tests.sh
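A quick structural check can catch missing artifacts before the costlier Docker validation runs. The file list below is copied from the tree above. The helper name `missing_artifacts` is hypothetical.

```python
from pathlib import Path

# Files and directories expected in every task folder, per the layout above.
REQUIRED_FILES = (
    "task.yaml",
    "Dockerfile",
    "docker-compose.yaml",
    "solution.sh",
    "test/test_func.py",
    "test/test_vuln.py",
    "test/run-tests.sh",
)
REQUIRED_DIRS = ("task-deps",)


def missing_artifacts(task_dir: str) -> list[str]:
    """Return the relative paths that are absent from `task_dir`."""
    root = Path(task_dir)
    missing = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    missing += [f"{rel}/" for rel in REQUIRED_DIRS if not (root / rel).is_dir()]
    return missing
```

An empty list means the folder has the expected shape. It says nothing about whether the contents are correct.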
Execute the entire task inside a Docker-in-Docker container to confirm reproducibility on a clean machine. This catches host-dependent assumptions (cached images, network state, filesystem artifacts) that would make the task non-portable.
Example 1: Reproducing a path traversal vulnerability in a WordPress plugin
User: "Reproduce CVE-2025-10686 — a path traversal in the Creta Testimonial Showcase WordPress plugin"
Approach:
- task.yaml framing it as: "When I change the layout value from 'layout-5' to something with '../' in it, I see error messages suggesting file inclusion from unexpected paths"
- test_vuln.py that sends HTTP requests with ../ in the layout parameter and asserts the server leaks file contents or returns path-related errors
- test_func.py that submits valid layout values and asserts the plugin renders correctly
- solution.sh with a sed command that adds basename() sanitization to the layout parameter input
- docker-compose.yaml with WordPress + MariaDB, installing the vulnerable plugin version

Output:
CVE-2025-10686/
├── task.yaml # First-person bug report about path traversal
├── Dockerfile # WordPress 6.4 + vulnerable plugin v1.2
├── docker-compose.yaml # WordPress + MariaDB with health checks
├── solution.sh # sed -i 's/include($layout)/include(basename($layout))/' ...
└── test/
    ├── test_func.py        # Valid layout rendering tests
    ├── test_vuln.py        # Path traversal exploit tests
    └── run-tests.sh        # pytest orchestration
Example 2: Building a multi-service vulnerability environment for an API
User: "Create a reproducible environment for a SQL injection in a Python Flask API that uses PostgreSQL"
Approach:
- docker-compose.yaml with three services: Flask app, PostgreSQL database, and an init container that seeds test data
- test_vuln.py that sends crafted SQL payloads through the API and asserts data exfiltration or error-based injection succeeds
- test_func.py that exercises the API with normal inputs and validates correct CRUD operations
- solution.sh that replaces string formatting with parameterized queries

Output:
CVE-2025-XXXXX/
├── task.yaml
├── docker-compose.yaml # flask + postgres + seed-data init
├── Dockerfile # Python 3.11 + Flask + psycopg2
├── solution.sh # sed replacing f-strings with parameterized queries
└── test/
    ├── test_func.py        # Normal API CRUD operations
    ├── test_vuln.py        # SQL injection payloads + exfiltration checks
    └── run-tests.sh
Example 3: Designing a multi-agent pipeline for batch CVE processing
User: "I need to process 50 CVEs into executable tasks automatically"
Approach:
- dev-env/docker-compose.yaml, to isolate all vulnerable containers from the host
- max_concurrent_cves: 3 and builder_concurrency: 2, to prevent Docker resource exhaustion
- original_cves_md/, with entries stored as CVE-YYYY-NNNNN.md
- python -m orchestrator.run: the Orchestrator spawns agents asynchronously, allowing faster tasks to complete without blocking slower ones
- continue (stage complete), error (CVE irreproducible), pause (needs upstream revision)

Output:
cve_tasks/
├── CVE-2025-00001/ # ✓ Verified
├── CVE-2025-00002/ # ✓ Verified
├── CVE-2025-00003/ # ✗ Error: proprietary software
├── ...
└── summary.json # 33/50 verified, 12 irreproducible, 5 failed validation
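The two concurrency limits in this example can be modelled with nested semaphores. This is a sketch of the idea, not the Orchestrator's actual implementation: `run_batch`, `process_cve`, and the `build` and `finish` callables are hypothetical stand-ins for the real agent stages.

```python
import asyncio
from typing import Awaitable, Callable

Stage = Callable[[str], Awaitable[object]]


async def process_cve(
    cve_id: str,
    cve_slots: asyncio.Semaphore,
    builder_slots: asyncio.Semaphore,
    build: Stage,
    finish: Stage,
) -> object:
    """Run one CVE while holding a pipeline slot. The Docker build
    additionally needs one of the scarcer builder slots."""
    async with cve_slots:
        async with builder_slots:
            await build(cve_id)
        return await finish(cve_id)


async def run_batch(
    cve_ids: list[str],
    build: Stage,
    finish: Stage,
    max_concurrent_cves: int = 3,
    builder_concurrency: int = 2,
) -> dict[str, object]:
    """Process all CVEs concurrently within the two limits and return
    a mapping from CVE ID to its result (or the exception it raised)."""
    cve_slots = asyncio.Semaphore(max_concurrent_cves)
    builder_slots = asyncio.Semaphore(builder_concurrency)
    results = await asyncio.gather(
        *(process_cve(c, cve_slots, builder_slots, build, finish) for c in cve_ids),
        return_exceptions=True,
    )
    return dict(zip(cve_ids, results))
```

Because `gather` is called with `return_exceptions=True`, one failing CVE does not cancel the others. Its exception is stored as that CVE's result, so it can be tallied later.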
Do:
- Prefer sed-based line-level patches over wholesale file replacements in solution.sh; surgical fixes are more verifiable and closer to real developer behavior

Avoid:
| Failure Mode | Frequency | Resolution |
|---|---|---|
| Mock implementations — Builder creates fake services instead of real ones | ~28% of failures | Enforce blind building: Builder never sees tests. Checker detects mocks by verifying actual network traffic and process execution |
| Static test assertions — Tests check hardcoded values instead of dynamic behavior | ~27% of failures | Checker rewrites tests to make real requests and assert on response content, not static strings |
| Fix leakage — Environment already contains the patch | ~13% of failures | Verify vulnerable state BEFORE applying solution: exploit test must succeed in the unpatched environment |
| Service startup failures — Database or app not ready when tests run | Common | Add health checks with interval: 5s, retries: 12, and depends_on: condition: service_healthy |
| Resource exhaustion — Too many concurrent Docker builds | Common | Limit concurrency per agent type: builder: 2, max_concurrent_cves: 3 |
| Irreproducible CVEs — Deleted repos, proprietary software, hardware-dependent bugs | ~20% of CVEs | Detect early in the Analyzer stage and emit error signal; do not waste downstream compute |
When a validation stage fails, retry up to 3 times with diagnostic feedback from the verification script. If all retries fail, terminate the task and log the failure mode for analysis.
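The retry policy can be sketched as a small loop that feeds each failure's diagnostics into the next repair attempt. `run_with_retries`, `check`, and `repair` are hypothetical names. The sketch also assumes that "retry up to 3 times" means one initial check plus at most three repair-and-recheck rounds.

```python
from typing import Callable


def run_with_retries(
    check: Callable[[], tuple[bool, str]],
    repair: Callable[[str], None],
    max_retries: int = 3,
) -> tuple[bool, list[str]]:
    """Run `check`; on failure pass its diagnostics to `repair` and try again.

    Returns (verified, diagnostics), where diagnostics holds the output of
    every failed check so the failure mode can be logged afterwards.
    """
    diagnostics: list[str] = []
    ok, message = check()
    attempts = 0
    while not ok and attempts < max_retries:
        diagnostics.append(message)
        repair(message)          # the agent sees the verifier's output
        attempts += 1
        ok, message = check()
    if not ok:
        diagnostics.append(message)  # record the final failure as well
    return ok, diagnostics
```

The pass/fail decision comes only from `check`, never from the repairing agent, which matches the rule that verification scripts rather than agent self-assessment decide whether a stage is done.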
Paper: CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability — Luo et al., 2026. Focus on Section 3 (pipeline architecture), Section 4.1 (cross-validation methodology), and Table 2 (failure mode analysis) for implementation details.
Code: github.com/livecvebench/CVE-Factory — Full pipeline with agent definitions, orchestrator, and 1,000+ generated tasks.