skills/deepplanning-benchmarking-long-horizon-agentic/SKILL.md
Solve long-horizon planning tasks with verifiable constraints using the DeepPlanning methodology: proactive information gathering, local constraint reasoning, and global constrained optimization. Use when asked to 'plan a multi-day trip with budget constraints', 'build a shopping optimizer with coupons and sizing', 'create an agent that handles complex multi-step planning', 'design a constraint-satisfaction planner', 'optimize across interdependent decisions with budgets', or 'build a planning benchmark with verifiable solutions'.
Install this skill globally with one command: `npx skillsauth add ndpvt-web/arxiv-claude-skills deepplanning-benchmarking-long-horizon-agentic`. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to tackle complex, long-horizon planning problems that require simultaneously satisfying local constraints (e.g., business hours, seat availability, item sizing) and global optimization objectives (e.g., time budgets, financial budgets, cross-task dependencies). Based on the DeepPlanning benchmark, it teaches a three-layer planning discipline: proactive information acquisition before committing to decisions, fine-grained local constraint checking at each step, and global feasibility verification across the entire solution. The methodology applies to any domain where an agent must gather information from APIs/tools, reason about overlapping constraints, and produce a holistic plan rather than a sequence of locally-greedy choices.
DeepPlanning identifies three capability layers that separate effective planners from failing ones. Layer 1: Proactive Information Acquisition -- the agent must recognize what information it lacks and actively query for it before committing. The benchmark's error analysis shows "insufficient search" is a dominant failure mode: agents skip critical API calls (e.g., checking seat availability, querying transit times) and then build plans on false assumptions. The fix is to treat information gathering as a first-class planning phase, not an afterthought.
Layer 2: Local Constrained Reasoning -- each individual decision must satisfy both explicit constraints (user preferences like "4-star hotel", "brand X only") and implicit environmental constraints (the hotel is fully booked, the attraction closes at 5 PM, only 2 flight seats remain). Models fail here by treating user requirements as the only constraints, ignoring what the environment actually permits. The solution is to validate every choice against both the user's stated preferences and the retrieved environmental state.
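A minimal sketch of the Layer 2 check, assuming a hypothetical `HotelOption` record whose `rooms_left` field was filled in by an availability query. The names and fields are illustrative and not part of the DeepPlanning benchmark. It reports explicit and implicit violations separately, then picks the first ranked option that passes both.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HotelOption:
    name: str
    stars: int
    rooms_left: int  # environmental state from a (hypothetical) availability query


def local_violations(option: HotelOption, required_stars: int, rooms_needed: int) -> list[str]:
    """List the explicit and implicit constraints that this option violates."""
    violations = []
    if option.stars != required_stars:
        violations.append(f"explicit: {option.stars}-star, user asked for {required_stars}-star")
    if option.rooms_left < rooms_needed:
        violations.append(f"implicit: {option.rooms_left} room(s) left, need {rooms_needed}")
    return violations


def first_feasible(ranked_options, required_stars, rooms_needed):
    """Return the best-ranked option with no local violations, or None."""
    for option in ranked_options:
        if not local_violations(option, required_stars, rooms_needed):
            return option
    return None


options = [
    HotelOption("A", stars=4, rooms_left=0),  # matches the preference but is fully booked
    HotelOption("B", stars=3, rooms_left=5),  # available but not 4-star
    HotelOption("C", stars=4, rooms_left=2),  # satisfies both
]
chosen = first_feasible(options, required_stars=4, rooms_needed=1)
```

Keeping the two violation types distinct makes it easier to see whether a rejected option failed on a user preference or on environment state.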
Layer 3: Global Constrained Optimization -- individual valid choices can still produce an infeasible plan when combined. A hotel check-in at 3 PM is fine locally, but not if the preceding flight lands at 5 PM. A coupon saves money on item A, but using it there prevents applying it to item B where the savings are larger. This layer requires the agent to verify cross-decision consistency (temporal overlaps, budget summation, dependency chains) and backtrack when local optimality conflicts with global feasibility. The DeepPlanning findings show this is where even frontier models fail most often (101 of 120 travel task errors, 52 of 120 shopping errors involve global optimization failures).
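The flight and check-in example can be turned into a small temporal-chain check. This is a sketch under assumptions: the `Event` record, the 13:00 departure, and the 45-minute transfer are made up for illustration; only the 5 PM landing and 3 PM check-in come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    start: int  # minutes since midnight
    end: int


def temporal_conflicts(events, transit_minutes):
    """Return (previous, next) name pairs where the next event starts too early.

    `events` must be in planned order; transit_minutes[i] is the travel time
    between events[i] and events[i + 1].
    """
    conflicts = []
    for i in range(len(events) - 1):
        prev, nxt = events[i], events[i + 1]
        earliest_start = prev.end + transit_minutes[i]
        if nxt.start < earliest_start:
            conflicts.append((prev.name, nxt.name))
    return conflicts


flight = Event("flight", start=13 * 60, end=17 * 60)               # lands at 5 PM
early_check_in = Event("hotel check-in", start=15 * 60, end=15 * 60 + 30)
late_check_in = Event("hotel check-in", start=18 * 60, end=18 * 60 + 30)

conflicts = temporal_conflicts([flight, early_check_in], transit_minutes=[45])
resolved = temporal_conflicts([flight, late_check_in], transit_minutes=[45])
```

Each event is valid on its own; the conflict only appears once the two are chained, which is the point of the global layer.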
1. Decompose the planning problem into decision variables and constraint categories. Identify what must be decided (e.g., which flights, which hotels, which products), what the local constraints are per decision (availability, hours, sizing, preferences), and what global constraints span across decisions (total budget, time continuity, dependency ordering).
2. Map required information sources for each decision variable. Before any planning, list every API call or data query needed. For travel: transit times, flight/train schedules, hotel availability, attraction hours, restaurant ratings. For shopping: product catalogs, user profile (size, address), coupon rules, shipping times. Err on the side of over-gathering: "insufficient search" is a dominant failure mode.
3. Execute information-gathering calls in parallel where independent. Group API calls that don't depend on each other and issue them concurrently. For example, querying flight options and hotel options for the same city can happen in parallel; querying hotels for city B must wait until the arrival time in city B is known from the flight selection. This parallel-then-sequential pattern achieves the best effectiveness-efficiency trade-off.
4. Build a candidate solution skeleton using gathered information. Create a draft plan structure (day-by-day itinerary, or cart contents) by selecting the most promising option for each decision variable, respecting local constraints. Do not finalize -- this is a draft.
5. Validate each local decision against explicit AND implicit constraints. For every choice in the skeleton, verify: (a) it satisfies stated user preferences, (b) it is actually available/feasible given the environment state retrieved in step 3. Flag any violations. Replace violated choices with the next-best feasible alternative.
6. Run global constraint verification across the entire solution. Check cross-decision consistency: Do time windows overlap? Does the total cost exceed budget? Are there dependency violations (arriving after a venue closes, selecting a coupon that conflicts with another)? Sum up all costs, verify temporal chains, and check coupling constraints.
7. If global verification fails, identify the conflicting decisions and backtrack. Do not patch locally -- re-examine which combination of decisions is globally infeasible. Consider alternative allocations (e.g., swap the order of two attractions, choose a different coupon assignment, pick a later flight to unlock a cheaper hotel). Re-validate from step 5.
8. Produce the final plan with explicit verification evidence. Output the plan with per-decision justification: which constraints each choice satisfies, total budget consumed, time feasibility proof. This makes the solution auditable and debuggable.
9. If building a benchmark or evaluation system, use solution-centric reverse generation. Design tasks by starting from a known-feasible solution, then generating constraints that make it the unique optimum. Inject environmental constraints (limited availability, closures) to ensure only one valid solution exists, enabling automated rule-based verification instead of LLM-based scoring.
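The parallel-then-sequential gathering pattern described in the steps above might look like the following sketch. The three `query_*` coroutines are hypothetical stand-ins that return canned data; a real agent would replace them with its own tool calls. Only `asyncio.gather` and `asyncio.run` are real library functions here.

```python
import asyncio


# Hypothetical tool stubs (assumptions, not real APIs); each returns canned data.
async def query_flights(origin, dest):
    await asyncio.sleep(0)
    return [{"flight": "F1", "arrives": "14:30"}]


async def query_attraction_hours(city):
    await asyncio.sleep(0)
    return {"Museum": ("09:00", "17:00")}


async def query_hotels(city, checkin_after):
    await asyncio.sleep(0)
    return [{"hotel": "H1", "queried_for": checkin_after}]


async def gather_information(origin, dest):
    # Phase 1: these queries do not depend on each other, so run them concurrently.
    flights, hours = await asyncio.gather(
        query_flights(origin, dest),
        query_attraction_hours(dest),
    )
    # Phase 2: the hotel query needs the arrival time of the selected flight.
    arrival = flights[0]["arrives"]
    hotels = await query_hotels(dest, checkin_after=arrival)
    return {"flights": flights, "hours": hours, "hotels": hotels}


info = asyncio.run(gather_information("City A", "City B"))
```

Selecting `flights[0]` is a placeholder for whatever flight-selection logic the planner uses; the point is only that the dependent query waits for that choice.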
Example 1: Multi-Day Travel Planning
User: "Plan a 3-day trip to Tokyo for 2 people. Budget: $2000 total. Must visit TeamLab, Tsukiji Market, and Meiji Shrine. One person is vegetarian. Prefer 4-star hotels."
Approach:
Output:
Day 1: Tsukiji Outer Market (7:00-11:00) → [40min transit] → Lunch at Afuri Ramen Shinjuku
(vegetarian options, 11:40-12:40) → TeamLab Borderless (13:30-18:00)
Day 2: Meiji Shrine (9:00-11:00) → Harajuku exploration → Ain Soph Ripple (vegetarian, lunch)
→ Shibuya afternoon → Hotel
Day 3: Morning free → Airport departure
Budget verification: Flights $800 + Hotel $450 + Attractions $120 + Dining $280 + Transit $95 = $1,745 / $2,000
Time verification: All transitions verified with transit queries. No overlaps detected.
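The budget line above can be re-derived mechanically. A small sketch, where `verify_budget` is an illustrative helper name and the amounts are copied from the example output:

```python
def verify_budget(costs: dict, budget: int) -> dict:
    """Sum the cost categories and compare the total against the budget cap."""
    total = sum(costs.values())
    return {"total": total, "remaining": budget - total, "within_budget": total <= budget}


tokyo_costs = {"Flights": 800, "Hotel": 450, "Attractions": 120, "Dining": 280, "Transit": 95}
report = verify_budget(tokyo_costs, budget=2000)
print(report)  # {'total': 1745, 'remaining': 255, 'within_budget': True}
```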
Example 2: Multi-Product Shopping Optimization
User: "I need running shoes, a waterproof jacket, and a fitness tracker. Budget: $350. I have a 15% off coupon for electronics and a $20-off-$100 coupon for apparel. Optimize total cost."
Approach:
Output:
Optimized Cart:
1. Nike Pegasus 41 Running Shoes — $120.00
2. Columbia Watertight II Jacket — $130.00 → $110.00 ($20-off-$100 apparel coupon)
3. Fitbit Inspire 3 Fitness Tracker — $110.00 → $93.50 (15% electronics coupon)
Total: $323.50 / $350.00 budget (saved $36.50 via coupons)
Coupon assignment verified: no stacking conflicts, both coupons applied to eligible categories.
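One way to check the coupon assignment is to enumerate every way of giving each coupon to a distinct item and keep the cheapest total. This is a sketch under assumptions: prices are in cents to avoid floating-point rounding, the category labels (including treating the shoes as "footwear") are guesses, and the no-stacking rule is modeled as at most one coupon per item.

```python
from itertools import permutations

# (category, price in cents); categories are assumptions for illustration.
ITEMS = {
    "shoes": ("footwear", 12000),
    "jacket": ("apparel", 13000),
    "tracker": ("electronics", 11000),
}


def electronics_15_percent(category, price):
    """Discount in cents: 15% off electronics, nothing otherwise."""
    return price * 15 // 100 if category == "electronics" else 0


def apparel_20_off_100(category, price):
    """Discount in cents: $20 off apparel priced at $100 or more."""
    return 2000 if category == "apparel" and price >= 10000 else 0


COUPONS = [electronics_15_percent, apparel_20_off_100]


def best_assignment(items, coupons):
    """Try every coupon-to-distinct-item assignment; return (total, assignment).

    An ineligible pairing yields a zero discount, which is equivalent to leaving
    that coupon unused. Assumes there are at least as many items as coupons.
    """
    subtotal = sum(price for _, price in items.values())
    best = None
    for picked in permutations(items, len(coupons)):
        discount = sum(coupon(*items[name]) for coupon, name in zip(coupons, picked))
        candidate = (subtotal - discount, dict(zip((c.__name__ for c in coupons), picked)))
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


total_cents, assignment = best_assignment(ITEMS, COUPONS)
```

Exhaustive enumeration is fine for a cart this small; larger carts or stacking rules would need a proper assignment or search formulation.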
Example 3: Building a Planning Evaluation Benchmark
User: "I want to create a benchmark to test if my LLM agent can handle constrained planning. How should I design the tasks?"
Approach:
Output:
Benchmark Task Template:
- Scenario: [domain description with 3-5 subtasks]
- Explicit constraints: [user-stated preferences, 4-8 per task]
- Implicit constraints: [discoverable only via API calls, 2-4 per task]
- Global constraints: [budget cap, time window, dependency ordering]
- Tools available: [list of 8-15 APIs with input/output schemas]
- Gold solution: [the unique valid plan, used for automated scoring]
- Scoring: [rule-based checklist across all constraint categories]
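The template's gold solution and rule-based scoring could be represented roughly as follows. This is a hedged sketch, not the benchmark's actual schema: `Rule`, `BenchmarkTask`, and `score` are hypothetical names, and the sample rules are invented. Following the reverse-generation idea, a rule is rejected unless the gold solution satisfies it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    category: str  # "explicit", "implicit", or "global"
    description: str
    check: Callable[[dict], bool]


@dataclass
class BenchmarkTask:
    scenario: str
    gold_solution: dict
    rules: list = field(default_factory=list)

    def add_rule(self, rule: Rule) -> None:
        # Constraints are generated from the solution, so gold must pass each one.
        if not rule.check(self.gold_solution):
            raise ValueError(f"gold solution violates: {rule.description}")
        self.rules.append(rule)


def score(task: BenchmarkTask, plan: dict) -> dict:
    """Rule-based checklist: a plan passes only if every rule holds."""
    checks = {rule.description: rule.check(plan) for rule in task.rules}
    return {"passed": all(checks.values()), "checks": checks}


task = BenchmarkTask("2-night city trip", gold_solution={"hotel": "H2", "total_cost": 900})
task.add_rule(Rule("explicit", "hotel is 4-star", lambda p: p["hotel"] in {"H1", "H2"}))
task.add_rule(Rule("implicit", "hotel is not sold out", lambda p: p["hotel"] != "H1"))
task.add_rule(Rule("global", "total cost within 1000", lambda p: p["total_cost"] <= 1000))

gold_result = score(task, task.gold_solution)
bad_result = score(task, {"hotel": "H1", "total_cost": 900})
```

This sketch only checks that the gold solution is valid; confirming that it is the unique valid plan would require enumerating the alternatives in the task's environment.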
| Error Pattern | Cause | Fix |
|---|---|---|
| Insufficient search | Agent skips API calls, plans on assumptions | Add a pre-planning checklist: list every unknown, query each one |
| Tool misuse | Wrong API called or wrong parameters | Validate API schemas before calls; retry with corrected params |
| Fact displacement | Agent retrieves info but uses wrong values later | Anchor each decision to the specific retrieved data point; cite sources |
| Explicit constraint violation | User preference ignored | Post-check every choice against the original requirement list |
| Implicit constraint failure | Environmental limit missed (sold out, closed) | Always query availability/status; never assume from past data |
| Global optimization failure | Valid parts, infeasible whole | Run end-to-end verification: sum budgets, check time chains, validate dependencies |
Paper: DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints (Zhang et al., 2026). Focus on Section 4 (error taxonomy: insufficient search, implicit constraint failures, global optimization breakdowns) and Section 5 (the effectiveness-efficiency frontier showing reasoning models with parallel tool use dominate).