skills/davinci-agency-unlocking-long-horizon-agency/SKILL.md
Decompose complex, long-horizon coding tasks into PR-like chains of verifiable subtasks with cross-stage dependency tracking and iterative refinement. Use when: 'break this feature into PRs', 'plan a multi-step implementation', 'decompose this project into verifiable stages', 'help me build this feature incrementally with tests at each step', 'create a chain of PRs for this task', 'plan an implementation with bug-fix iterations'.
npx skillsauth add ndpvt-web/arxiv-claude-skills davinci-agency-unlocking-long-horizon-agency
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to tackle complex, multi-step software engineering tasks by structuring work as chains of Pull Requests--mirroring how experienced developers naturally decompose large objectives into verifiable, causally-linked submission units. Derived from the daVinci-Agency framework (Jiang et al., 2026), the core insight is that PR sequences from real software evolution encode three properties essential for long-horizon success: progressive decomposition through commits, consistency enforcement through unified functional objectives, and verifiable refinement through bug-fix trajectories. Claude applies this structure to plan and execute multi-stage implementations where each stage is independently testable yet contributes to a coherent whole.
Chain-of-PRs as Supervision Structure. The daVinci-Agency paper demonstrates that real Pull Request sequences are a natural source of long-horizon supervision signals. Unlike synthetic step-by-step plans that treat each action independently, PR chains preserve causal dependencies (PR-3 fixes a bug introduced by PR-2's interaction with PR-1's schema), iterative refinement (bug-fix commits within a PR demonstrate diagnosis-hypothesis-fix-validate loops), and functional coherence (all PRs in the chain serve a unified objective). The key insight: decomposition should produce units that are individually verifiable yet collectively coherent.
Three Interlocking Mechanisms. The framework operates through: (1) Progressive task decomposition -- breaking the objective into a sequence of commits/PRs where each builds on the last, analogous to how a developer submits incremental work. (2) Long-term consistency enforcement -- maintaining a unified functional objective across all stages so that local decisions align with the global goal. (3) Verifiable refinement -- explicitly modeling the bug-fix cycle where test failures after a stage trigger targeted corrections before proceeding, rather than hoping each stage is perfect on the first attempt.
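The third mechanism, verifiable refinement, can be sketched as a gated loop: a stage is implemented, its verification gate runs, and failures trigger targeted fixes before the chain advances. This is a minimal illustration and is not taken from the daVinci-Agency paper; the `Stage` structure, the `run_chain` function, and the `max_fixes` cap are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Stage:
    """One PR-like unit of work (hypothetical structure for illustration)."""
    name: str
    implement: Callable[[], None]   # applies the stage's change
    verify: Callable[[], bool]      # verification gate: True means it passes
    fix: Callable[[], None]         # one targeted correction attempt
    fix_commits: List[str] = field(default_factory=list)


def run_chain(stages: List[Stage], max_fixes: int = 3) -> List[str]:
    """Run stages in order; never advance past a failing gate."""
    completed: List[str] = []
    for stage in stages:
        stage.implement()
        attempts = 0
        while not stage.verify():
            if attempts >= max_fixes:
                raise RuntimeError(
                    f"{stage.name}: gate still failing after {max_fixes} fixes"
                )
            attempts += 1
            stage.fix()
            # Track each fix as an explicit commit within the current stage.
            stage.fix_commits.append(f"{stage.name} fix #{attempts}")
        completed.append(stage.name)
    return completed


# Toy usage: PR-2 is implemented with a bug that one fix resolves.
state = {"value": 0}


def implement_buggy() -> None:
    state["value"] = 1  # the gate below expects 2


def fix_value() -> None:
    state["value"] = 2


stages = [
    Stage("PR-1", implement=lambda: None, verify=lambda: True, fix=lambda: None),
    Stage("PR-2", implement=implement_buggy,
          verify=lambda: state["value"] == 2, fix=fix_value),
]
completed = run_chain(stages)
print(completed, stages[1].fix_commits)
```

The cap on fix attempts is a design choice of this sketch: it turns a gate that never passes into an explicit error instead of an endless loop.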
Data-Efficient Execution. The paper shows that even 239 well-structured trajectories (averaging 85k tokens and 116 tool calls each) can yield a 47% relative gain on complex benchmarks. For Claude, this translates to a practical principle: invest heavily in the structure of the plan rather than the volume of output. A well-decomposed chain of 5 PRs with clear verification at each stage is more likely to succeed than a monolithic implementation attempt.
1. Extract the unified functional objective. Before any decomposition, state the single overarching goal in one sentence. This is the "north star" that every subsequent PR must serve. Example: "Enable users to authenticate via GitHub OAuth and access role-gated API endpoints."
2. Identify the dependency graph. Map out which components depend on which: database schema before ORM models, models before API routes, routes before frontend integration. Sketch this as a DAG (directed acyclic graph) of work units.
3. Decompose into a chain of PRs. Convert the DAG into a linear (or minimally-branching) sequence of PRs, each representing the smallest unit of work that is independently testable and mergeable. Each PR should have: a title, a 1-2 sentence description of what it adds, files touched, and an explicit verification criterion (test, assertion, or observable behavior).
4. Define verification gates for each PR. For every PR in the chain, specify what "done" looks like: a passing test suite, a curl command that returns expected JSON, a UI element that renders correctly, or a type-check that passes. These gates are non-negotiable checkpoints.
5. Implement PR-1: the foundation. Start with the PR that has zero dependencies. Write the code, then immediately run verification. Do not proceed to PR-2 until PR-1's gate passes.
6. Run the bug-fix refinement loop. If verification fails after implementing a PR, enter the refinement cycle: diagnose the failure, form a hypothesis about the root cause, implement a targeted fix, and re-verify. This mirrors the authentic bug-fix trajectories that make daVinci-Agency's training data effective. Track each fix as an explicit commit within the current PR.
7. Enforce cross-stage consistency before advancing. Before starting the next PR, review the unified objective and confirm that the current state of the codebase still aligns with it. Check for unintended side effects: does the new code break assumptions that later PRs depend on?
8. Implement subsequent PRs sequentially. For each remaining PR in the chain, repeat steps 5-7. Carry forward context from prior PRs -- explicitly reference which files/functions from earlier stages are being extended or consumed.
9. Run integration verification after the final PR. Once all PRs are implemented, run a full integration check that exercises the entire chain end-to-end. This validates that the cross-stage dependencies hold and the unified objective is met.
10. Document the chain. Produce a summary listing each PR in order, what it accomplished, and any refinement iterations that occurred. This serves as both a review artifact and a reusable template for similar tasks.
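The dependency-mapping and chain-decomposition steps above can be sketched with Python's standard `graphlib` module (Python 3.9+), which orders a DAG so that every unit comes after its dependencies. The `PlannedPR` record and `linearize` helper are hypothetical names for illustration, and the file paths in the usage example are invented placeholders.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import List, Tuple


@dataclass(frozen=True)
class PlannedPR:
    """Planning record with the fields each PR should carry (illustrative)."""
    title: str
    description: str
    files: Tuple[str, ...]
    verification: str
    depends_on: Tuple[str, ...] = ()


def linearize(prs: List[PlannedPR]) -> List[PlannedPR]:
    """Order PRs so that each one appears after everything it depends on."""
    by_title = {pr.title: pr for pr in prs}
    for pr in prs:
        for dep in pr.depends_on:
            if dep not in by_title:
                raise ValueError(f"{pr.title!r} depends on unknown PR {dep!r}")
    graph = {pr.title: set(pr.depends_on) for pr in prs}
    # Iterating static_order() raises graphlib.CycleError if the plan has a cycle.
    return [by_title[title] for title in TopologicalSorter(graph).static_order()]


# Usage: the dependency example from the workflow, listed out of order.
plan = [
    PlannedPR("Frontend integration", "Call the new routes from the UI.",
              ("web/app.ts",), "UI element renders", ("API routes",)),
    PlannedPR("ORM models", "Map tables to model classes.",
              ("models.py",), "model CRUD tests pass", ("Database schema",)),
    PlannedPR("API routes", "Expose models over HTTP.",
              ("routes.py",), "curl returns expected JSON", ("ORM models",)),
    PlannedPR("Database schema", "Create tables via a migration.",
              ("migrations/0001.sql",), "migration runs"),
]
chain = linearize(plan)
for position, pr in enumerate(chain, start=1):
    print(f"PR-{position}: {pr.title} | gate: {pr.verification}")
```

For a plan with independent branches, several valid orders exist and the sorter returns one of them; the only property relied on here is that dependencies precede dependents.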
Example 1: Multi-stage API feature
User: "Build a notification system that supports email and Slack,
with user preferences and rate limiting."
Approach (Chain-of-PRs):
PR-1: Notification preferences schema and model
- Add `notification_preferences` table with columns: user_id, channel
(email/slack), enabled, frequency_limit
- Migration file + model definition
- Verification: migration runs, model CRUD tests pass
PR-2: Core notification dispatcher
- NotificationService class with send(user, event) method
- Channel routing based on preferences from PR-1
- Verification: unit tests with mocked channels pass
PR-3: Email channel implementation
- EmailAdapter implementing ChannelInterface from PR-2
- Template rendering for notification body
- Verification: integration test sends email via test SMTP
PR-4: Slack channel implementation
- SlackAdapter implementing ChannelInterface from PR-2
- Webhook configuration and message formatting
- Verification: integration test posts to Slack test channel
PR-5: Rate limiting middleware
- RateLimiter wrapping NotificationService from PR-2
- Uses frequency_limit from preferences (PR-1)
- Verification: test that >N sends within window are throttled
PR-6: API endpoints and integration
- REST endpoints for preference CRUD + manual notification trigger
- End-to-end test: set preferences, trigger notification, verify delivery
- Verification: full API test suite passes
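As an illustration of what a verification gate can look like in code, here is a minimal sketch of the PR-5 gate ("test that >N sends within window are throttled"). It assumes a sliding-window policy and an injectable clock; both are choices made for this sketch, and the wrapping of `NotificationService` and the lookup of `frequency_limit` from preferences are left out.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class RateLimiter:
    """Sliding-window limiter: at most `limit` sends per user per window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests need not sleep
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        sent = self._sent[user_id]
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self.window:
            sent.popleft()
        if len(sent) >= self.limit:
            return False  # throttled
        sent.append(now)
        return True


# Verification gate, driven by a fake clock.
fake_now = [0.0]
limiter = RateLimiter(limit=2, window_seconds=60.0, clock=lambda: fake_now[0])

results = [limiter.allow("user-1") for _ in range(3)]
assert results == [True, True, False]   # third send in the window is throttled

fake_now[0] = 60.0                      # window has elapsed
assert limiter.allow("user-1") is True
assert limiter.allow("user-2") is True  # limits are tracked per user
```

Passing the clock in as a parameter keeps the gate fast and deterministic, which matters when the same check is re-run after every fix.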
Example 2: Refactoring with preserved behavior
User: "Refactor our monolithic request handler into middleware layers
without breaking existing API contracts."
Approach (Chain-of-PRs):
PR-1: Extract authentication into middleware
- Move auth logic from handler to AuthMiddleware class
- Wire into request pipeline at same position
- Verification: existing auth test suite passes unchanged
PR-2: Extract validation into middleware
- Move input validation to ValidationMiddleware
- Depends on PR-1 (auth runs before validation)
- Verification: all validation edge-case tests pass
PR-3: Extract rate limiting into middleware
- Move rate limiting to RateLimitMiddleware
- Verification: rate limit tests pass, load test shows same thresholds
PR-4: Slim down the core handler
- Remove extracted logic, handler now only does business logic
- Verification: full integration test suite passes, no behavior change
Bug-fix refinement example (within PR-2):
- Initial implementation breaks when auth middleware rejects --
validation middleware receives null user context
- Fix: add early-return guard in validation middleware for
unauthenticated requests
- Re-run tests: all pass
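The fix described in the refinement example can be sketched as follows. The `Request` and `Response` types, the required `name` field, and the 401/422 status codes are assumptions made for illustration; the source names only the early-return guard itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Request:
    body: Dict[str, Any]
    user: Optional[str] = None  # None when authentication did not succeed


@dataclass
class Response:
    status: int
    detail: str = ""


Handler = Callable[[Request], Response]


def validation_middleware(next_handler: Handler) -> Handler:
    """Validate input, but only for authenticated requests."""
    def handle(request: Request) -> Response:
        # Early-return guard: without it, the checks below would run
        # against a request that carries no user context.
        if request.user is None:
            return Response(401, "unauthenticated")
        if "name" not in request.body:  # hypothetical validation rule
            return Response(422, "missing field: name")
        return next_handler(request)
    return handle


# Re-run the checks after the fix.
handler = validation_middleware(lambda r: Response(200, f"hello {r.user}"))

assert handler(Request({"name": "x"})).status == 401          # guard triggers
assert handler(Request({}, user="ada")).status == 422         # validation fails
assert handler(Request({"name": "x"}, user="ada")).status == 200
```

Keeping the regression case (a request with no user) in the test suite records the fix as part of PR-2's gate, so later PRs in the chain cannot silently undo it.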
Example 3: Greenfield project bootstrapping
User: "Set up a new CLI tool that fetches data from an API,
caches it locally, and generates reports."
Approach (Chain-of-PRs):
PR-1: Project scaffolding and CLI argument parsing
- Initialize project, configure CLI framework (e.g., argparse/click)
- Verification: `tool --help` prints usage, `tool --version` works
PR-2: API client with error handling
- HTTP client wrapping the target API, retry logic, auth
- Verification: unit tests with mocked responses cover success,
timeout, and auth-failure cases
PR-3: Local caching layer
- SQLite-backed cache keyed by query parameters
- Cache invalidation by TTL
- Verification: test cache hit/miss/expiry behavior
PR-4: Report generation engine
- Transform cached data into formatted output (CSV, JSON, table)
- Verification: snapshot tests comparing generated reports to fixtures
PR-5: End-to-end integration
- Wire CLI args -> API client -> cache -> report generator
- Verification: `tool fetch --format csv` produces expected output
against a local mock server
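For PR-3, a minimal sketch of a SQLite-backed cache with TTL expiry and its hit/miss/expiry gate might look like this. The table layout, the JSON-based key derivation, and the `SqliteCache` name are assumptions made for illustration, built only on the standard `sqlite3` and `json` modules.

```python
import json
import sqlite3
import time
from typing import Any, Callable, Dict, Optional


class SqliteCache:
    """Cache keyed by query parameters, with entries expiring after a TTL."""

    def __init__(self, path: str = ":memory:", ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, stored_at REAL NOT NULL)"
        )

    @staticmethod
    def _key(params: Dict[str, Any]) -> str:
        # sort_keys makes the key independent of parameter order.
        return json.dumps(params, sort_keys=True)

    def put(self, params: Dict[str, Any], value: Any) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, value, stored_at) VALUES (?, ?, ?)",
            (self._key(params), json.dumps(value), self.clock()),
        )
        self.db.commit()

    def get(self, params: Dict[str, Any]) -> Optional[Any]:
        key = self._key(params)
        row = self.db.execute(
            "SELECT value, stored_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None  # miss
        value, stored_at = row
        if self.clock() - stored_at >= self.ttl:
            self.db.execute("DELETE FROM cache WHERE key = ?", (key,))
            self.db.commit()
            return None  # expired
        return json.loads(value)  # hit


# Verification gate: miss, hit, then expiry, driven by a fake clock.
fake_now = [1000.0]
cache = SqliteCache(ttl_seconds=60.0, clock=lambda: fake_now[0])

assert cache.get({"q": "sales", "year": 2024}) is None
cache.put({"q": "sales", "year": 2024}, {"rows": [1, 2, 3]})
assert cache.get({"year": 2024, "q": "sales"}) == {"rows": [1, 2, 3]}
fake_now[0] = 1060.0
assert cache.get({"q": "sales", "year": 2024}) is None
```

Values are stored as JSON text, so this sketch only handles JSON-serializable data; a real tool may need a different encoding.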