skills/adareasoner-dynamic-tool-orchestration/SKILL.md
Adaptive multi-step tool orchestration for complex reasoning tasks. Dynamically selects, sequences, and composes tools based on task context and intermediate results rather than fixed pipelines. Use when: 'orchestrate tools for this task', 'figure out which tools to use', 'multi-step reasoning with tools', 'adaptive tool pipeline', 'dynamic tool selection', 'chain tools together intelligently'.
npx skillsauth add ndpvt-web/arxiv-claude-skills adareasoner-dynamic-tool-orchestration
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to orchestrate multiple tools adaptively across multi-step reasoning chains, inspired by the AdaReasoner framework (arXiv:2601.18631v2). Instead of following rigid, pre-defined tool pipelines, Claude learns to select tools based on task context, evaluate intermediate results, suppress irrelevant tools, adopt beneficial ones mid-chain, and backtrack when a tool call fails or produces unhelpful output. The core insight: treat tool use as a general reasoning skill where each invocation is conditioned on cumulative observations, not as a fixed mapping from task type to tool sequence.
Adaptive Tool Orchestration replaces static tool pipelines with a dynamic reasoning loop. At each step, the agent observes the current state (task description + all prior tool outputs), decides whether to invoke a tool or reason directly, selects which tool based on functional relevance (not name memorization), and evaluates the result before deciding the next action. This mirrors AdaReasoner's state-action-observation tuples: trajectory = {(s0, a0, o0), (s1, a1, o1), ..., (sT, aT, oT)} where each action is conditioned on the full history.
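The state-action-observation loop described above can be sketched as a small data structure. This is an illustration only, not AdaReasoner's implementation: the names `Step`, `Trajectory`, `record`, and `context` are hypothetical, and the "state" is simplified to the task description plus a plain-text summary of prior observations.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One (state, action, observation) tuple. Field names are illustrative."""
    state: str              # task description plus all prior observations
    action: Optional[str]   # tool name, or None when the agent reasons directly
    observation: str        # tool output, or the conclusion of direct reasoning


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)

    def context(self) -> str:
        """Render the task and every observation so far as the current state."""
        lines = [f"task: {self.task}"]
        for i, step in enumerate(self.steps):
            lines.append(f"o{i} ({step.action or 'reasoning'}): {step.observation}")
        return "\n".join(lines)

    def record(self, action: Optional[str], observation: str) -> Step:
        # The state is built from the full history, so each action is
        # conditioned on all earlier observations, not only the latest one.
        step = Step(state=self.context(), action=action, observation=observation)
        self.steps.append(step)
        return step


trajectory = Trajectory(task="find why test_login_redirect fails")
trajectory.record("read", "test expects 302 to /dashboard")
trajectory.record("bash", "got 302 to /home")
print(trajectory.context())
```

Storing the rendered history in each step keeps the sketch simple. A real agent would more likely keep structured observations and summarize them when the context grows.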
Three emergent behaviors distinguish this from naive tool chaining: (1) Tool Adoption — when a tool proves beneficial mid-chain, increase its usage frequency; (2) Tool Suppression — when a tool adds no value or introduces noise, stop calling it even if it seems topically relevant; (3) Frequency Modulation — adjust how many times a tool is called based on task complexity (a simple query needs one search; a complex investigation needs iterative refinement). These behaviors arise naturally from evaluating intermediate results rather than from explicit rules.
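The text describes these three behaviors as emergent. The sketch below only approximates them with explicit per-task counters, which may help when reasoning about what each behavior looks like in practice. The class name `ToolUsagePolicy` and the thresholds `suppress_after` and `max_calls` are assumptions made for illustration.

```python
from collections import defaultdict


class ToolUsagePolicy:
    """Tracks per-tool usefulness for a single task. Thresholds are illustrative."""

    def __init__(self, suppress_after: int = 2, max_calls: int = 5):
        self.suppress_after = suppress_after  # consecutive unhelpful calls before suppression
        self.max_calls = max_calls            # cap on calls per tool (frequency modulation)
        self.calls = defaultdict(int)
        self.unhelpful_streak = defaultdict(int)
        self.adopted = set()

    def record(self, tool: str, helpful: bool) -> None:
        self.calls[tool] += 1
        if helpful:
            self.unhelpful_streak[tool] = 0
            self.adopted.add(tool)            # adoption: the tool has proven useful
        else:
            self.unhelpful_streak[tool] += 1

    def is_suppressed(self, tool: str) -> bool:
        return self.unhelpful_streak[tool] >= self.suppress_after

    def may_call(self, tool: str) -> bool:
        return not self.is_suppressed(tool) and self.calls[tool] < self.max_calls


policy = ToolUsagePolicy()
policy.record("web_search", helpful=False)
policy.record("web_search", helpful=False)
policy.record("grep", helpful=True)
print(policy.may_call("web_search"), policy.may_call("grep"))
```

Whether a call counts as "helpful" is left to the caller here; in the skill, that judgment comes from evaluating each intermediate result against its subgoal.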
Backtracking and Fallback is critical. When a tool returns an error, produces garbage, or yields results that contradict other evidence, the agent must recognize this, discard the failed result, and either retry with different parameters, switch to an alternative tool, or fall back to direct reasoning. The data curation pipeline in AdaReasoner explicitly trains on failure trajectories — this skill encodes that same resilience pattern.
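A minimal sketch of the discard-and-fall-back pattern, assuming each tool attempt can be wrapped as a zero-argument callable. The helper `run_with_fallback`, the `is_usable` check, and the two fake fetch functions are hypothetical and stand in for real tool calls.

```python
from typing import Callable, Optional, Sequence


def run_with_fallback(
    attempts: Sequence[tuple[str, Callable[[], str]]],
    is_usable: Callable[[str], bool],
) -> tuple[Optional[str], Optional[str], list[str]]:
    """Try each (label, call) in order; return (label, result, failure_log)."""
    failures: list[str] = []
    for label, call in attempts:
        try:
            result = call()
        except Exception as exc:  # tool error: discard and try the next option
            failures.append(f"{label}: error: {exc}")
            continue
        if is_usable(result):
            return label, result, failures
        failures.append(f"{label}: unusable output")  # empty or garbage result
    return None, None, failures  # nothing worked: caller falls back to direct reasoning


def fetch_without_auth() -> str:
    raise PermissionError("403 Forbidden")


def fetch_with_auth() -> str:
    return '[{"user_id": 1, "purchase_total": 42.0}]'


label, result, failures = run_with_fallback(
    [("api (no auth)", fetch_without_auth), ("api (with key)", fetch_with_auth)],
    is_usable=lambda out: out.strip().startswith("["),
)
print(label, failures)
```

Keeping the failure log alongside the result lets the final report show where backtracking occurred.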
1. Decompose the task into observable subgoals. Break the user's request into a sequence of intermediate states, each with a clear success criterion. Write these as a checklist. Example: for "find performance bottlenecks in this API," subgoals are: identify endpoint handlers → profile each → rank by latency → trace root causes → propose fixes.
2. Inventory available tools and assess relevance. List every tool/capability available (bash, grep, read, web fetch, code execution, etc.). For each subgoal, rate which tools could help, but do not commit to a fixed mapping yet. Focus on functional descriptions, not tool names.
3. Execute the first subgoal with the most promising tool. Pick the tool most likely to produce useful output for the first subgoal. Invoke it with specific, well-formed parameters. Capture the full output.
4. Evaluate the intermediate result against the subgoal criterion. Ask: Did this tool call succeed? Is the output complete, partial, or erroneous? Does it change my understanding of subsequent subgoals? Score the result: fully satisfactory, partially useful, or failed.
5. Adapt the plan based on observations. If the result is fully satisfactory, advance to the next subgoal. If partially useful, decide whether to re-invoke with refined parameters or supplement with another tool. If failed, backtrack: try an alternative tool, adjust the subgoal decomposition, or fall back to direct reasoning.
6. Apply tool suppression for irrelevant capabilities. If a tool was invoked but added no value (e.g., a web search returned nothing useful for a local code question), mark it as suppressed for this task. Do not re-invoke it unless the task context shifts significantly.
7. Apply tool adoption for newly-valuable capabilities. If an intermediate result reveals that a tool not originally planned is now relevant (e.g., discovering a CSV file mid-analysis means a data-parsing tool is now useful), adopt it into the active plan.
8. Modulate tool frequency by complexity. For simple subgoals, one tool call suffices. For complex subgoals (e.g., searching a large codebase), iterate the same tool with progressively refined queries, but set a maximum iteration count (typically 3-5) to avoid infinite loops.
9. Verify the final result against the original task. After all subgoals are addressed, synthesize the accumulated observations into a coherent answer. Cross-check: does the final output actually answer what the user asked? If gaps remain, run one targeted verification step.
10. Report the reasoning chain transparently. Show the user which tools were used at each step, why alternatives were chosen or rejected, and where backtracking occurred. This builds trust and allows the user to refine the approach.
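The execute, evaluate, and adapt steps above can be sketched as a loop with a bounded retry count. This is one possible shape under stated assumptions, not a definitive implementation: `orchestrate`, `Verdict`, and the `run_tool` and `evaluate` callables are hypothetical names, and the fake tool and evaluator exist only to make the sketch runnable.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    SATISFACTORY = "fully satisfactory"
    PARTIAL = "partially useful"
    FAILED = "failed"


def orchestrate(
    subgoals: list[str],
    candidates: dict[str, list[str]],          # subgoal -> tools to try, best first
    run_tool: Callable[[str, str, int], str],  # (tool, subgoal, attempt) -> output
    evaluate: Callable[[str, str], Verdict],   # (subgoal, output) -> verdict
    max_iterations: int = 3,
) -> list[tuple[str, str, int, Verdict]]:
    """Return a log of (subgoal, tool, attempt, verdict) for transparent reporting."""
    log = []
    for subgoal in subgoals:
        done = False
        for tool in candidates.get(subgoal, []):
            for attempt in range(1, max_iterations + 1):
                verdict = evaluate(subgoal, run_tool(tool, subgoal, attempt))
                log.append((subgoal, tool, attempt, verdict))
                if verdict is Verdict.SATISFACTORY:
                    done = True
                if verdict is not Verdict.PARTIAL:
                    break  # success: advance; failure: backtrack to the next tool
            if done:
                break
        # If no tool satisfied the subgoal, the caller falls back to direct reasoning.
    return log


def fake_run(tool: str, subgoal: str, attempt: int) -> str:
    if tool == "web_search":
        return ""
    return "8 matches" if attempt >= 2 else "47 noisy matches"


def fake_eval(subgoal: str, output: str) -> Verdict:
    if not output:
        return Verdict.FAILED
    return Verdict.SATISFACTORY if output.startswith("8") else Verdict.PARTIAL


log = orchestrate(
    ["find rate-limiting code"],
    {"find rate-limiting code": ["web_search", "grep"]},
    fake_run,
    fake_eval,
)
for entry in log:
    print(entry)
```

A partial result retries the same tool (the attempt number stands in for refined parameters) up to `max_iterations`, while a failure moves straight to the next candidate. The returned log is what the final reporting step would draw on.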
Example 1: Diagnosing a failing test in an unfamiliar codebase
User: "This test is failing but I have no idea why. Can you figure it out? tests/integration/test_auth.py::test_login_redirect"
Approach:
1. Decompose: Read the failing test → Identify what it asserts → Find the code under test →
   Trace the execution path → Identify the discrepancy → Propose a fix.
2. Inventory: Read (files), Grep (search), Bash (run test with verbose output), Glob (find files).
3. Step 1: Read the test file to understand assertions.
   Result: Test expects a 302 redirect to "/dashboard" after login.
4. Step 2: Run the test with verbose output (Bash: pytest -xvs).
   Result: Getting 302 to "/home" instead of "/dashboard". Partially useful: the symptom is now known.
5. Step 3: Grep for the redirect URL in the auth module.
   Result: Found redirect logic in src/auth/views.py:47, which redirects to settings.LOGIN_REDIRECT_URL.
6. Step 4: Grep for LOGIN_REDIRECT_URL in settings.
   Result: settings/base.py has "/dashboard", but settings/test.py overrides to "/home".
   ** Tool Adoption: didn't plan to read settings/test.py but intermediate result pointed here. **
7. Step 5: Verify by reading settings/test.py.
   Result: Confirmed. LOGIN_REDIRECT_URL = "/home" in test settings.
8. Synthesize: The test expects "/dashboard" but test settings override the redirect to "/home".
Output:
The test fails because `settings/test.py:12` overrides `LOGIN_REDIRECT_URL = "/home"`,
while the test asserts a redirect to "/dashboard". Either update the test expectation
to match the test settings, or remove the override in test settings.
Example 2: Building a data pipeline from heterogeneous sources
User: "I need to combine data from a CSV file, a JSON API endpoint, and a SQLite database
into a single summary report."
Approach:
1. Decompose: Parse CSV → Fetch API data → Query SQLite → Normalize schemas →
   Merge datasets → Generate report.
2. Inventory: Read (CSV), WebFetch (API), Bash (sqlite3 queries), Write (output report).
3. Step 1: Read the CSV file, identify columns and data types.
   Result: CSV has columns [user_id, name, signup_date]. 500 rows. Success.
4. Step 2: Fetch JSON API endpoint.
   Result: API returns 403 Forbidden.
   ** Backtrack: API needs auth. **
   Fallback: Ask the user for the API key via AskUserQuestion.
   Re-invoke with auth header. Result: JSON array with [user_id, purchase_total]. Success.
5. Step 3: Query SQLite for user activity.
   Result: Table "activity" has [user_id, last_login, session_count]. Success.
6. Step 4: Normalize: all three sources share user_id as join key. Convert dates
   to ISO format. No schema conflicts.
7. Step 5: Write a Python script that merges on user_id using pandas.
   ** Tool Suppression: considered using bash join/awk but pandas is cleaner
   for three-way merge with type coercion. **
8. Step 6: Execute the script, write output to summary_report.csv.
9. Verify: spot-check 5 merged rows against all three sources. All match.
Output:
Generated summary_report.csv with 500 rows and columns:
[user_id, name, signup_date, purchase_total, last_login, session_count]
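The merge in Example 2 can be sketched as follows. The example itself uses pandas; this version swaps in the Python standard library (`csv`, `json`, `sqlite3`) so it runs without extra dependencies. The inline CSV text, JSON payload, and in-memory database are hypothetical stand-ins for the three real sources.

```python
import csv
import io
import json
import sqlite3

# Hypothetical stand-ins for the three sources in the example.
csv_text = "user_id,name,signup_date\n1,Ada,2024-01-05\n2,Lin,2024-02-11\n"
api_json = '[{"user_id": 1, "purchase_total": 42.5}, {"user_id": 2, "purchase_total": 0}]'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id INTEGER, last_login TEXT, session_count INTEGER)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [(1, "2024-03-01", 7), (2, "2024-03-02", 3)],
)

# Normalize: coerce the join key to int in every source, then merge on it.
merged: dict[int, dict] = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    merged[int(row["user_id"])] = {"name": row["name"], "signup_date": row["signup_date"]}

for item in json.loads(api_json):
    merged.setdefault(int(item["user_id"]), {})["purchase_total"] = item["purchase_total"]

for user_id, last_login, session_count in conn.execute(
    "SELECT user_id, last_login, session_count FROM activity"
):
    merged.setdefault(user_id, {}).update(last_login=last_login, session_count=session_count)
conn.close()

columns = ["user_id", "name", "signup_date", "purchase_total", "last_login", "session_count"]
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, restval="")  # blank cell for missing fields
writer.writeheader()
for user_id in sorted(merged):
    writer.writerow({"user_id": user_id, **merged[user_id]})
report = out.getvalue()
print(report)
```

This behaves like an outer join: a user missing from one source still appears, with blank cells for the missing columns. With pandas, two chained `merge` calls on `user_id` would express the same thing more compactly.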
Example 3: Adaptive search across a large monorepo
User: "Where is the rate limiting logic implemented? I need to change the default limit."
Approach:
1. Decompose: Find rate-limiting code → Understand the configuration → Locate default value.
2. Step 1: Grep for "rate.limit" across the repo.
   Result: 47 matches, mostly in tests and docs. Partially useful but too noisy.
3. Step 2: Refine: Grep for "RateLimit" in source files only (exclude tests/).
   Result: 8 matches. Found src/middleware/rate_limiter.py and src/config/defaults.py.
   ** Frequency Modulation: narrowed search from broad to targeted in 2 iterations. **
4. Step 3: Read src/middleware/rate_limiter.py.
   Result: Class RateLimiter reads limit from config, defaults to self.default_limit = 100.
5. Step 4: Read src/config/defaults.py.
   Result: RATE_LIMIT_PER_MINUTE = 100 at line 34.
   ** Tool Suppression: no need to search further; found the authoritative source. **
6. Verify: Grep for RATE_LIMIT_PER_MINUTE to confirm no other overrides.
   Result: Only referenced in defaults.py and rate_limiter.py. Confirmed.
Output:
The default rate limit is set at `src/config/defaults.py:34` as
`RATE_LIMIT_PER_MINUTE = 100`. The middleware at `src/middleware/rate_limiter.py:18`
reads this value. Change the value in defaults.py to update the limit.
| Failure Mode | Response Strategy |
|---|---|
| Tool returns an error (e.g., file not found, permission denied) | Log the error, assess whether the path/parameter was wrong vs. a systemic issue. Retry with corrected input or switch to an alternative tool. |
| Tool returns empty or irrelevant results | Broaden or narrow the query. If two attempts yield nothing, suppress the tool and try a different approach entirely. |
| Tool output contradicts prior observations | Do not silently accept. Re-verify the earlier observation with a different tool. Resolve the contradiction before proceeding. |
| Tool call hangs or times out | Set explicit timeouts. If a tool is unresponsive, fall back to direct reasoning or a lighter-weight alternative. |
| Circular reasoning (tool A suggests tool B, tool B suggests tool A) | Detect the loop by tracking the last 3-5 actions. Break out by either making a direct decision or asking the user for guidance. |
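The circular-reasoning row can be illustrated with a small detector over the most recent actions. This is a sketch under assumptions: the class name `LoopDetector` and the default window of 4 are illustrative, and only the simplest cycle (two actions alternating, or one action repeating) is detected.

```python
from collections import deque


class LoopDetector:
    """Flags short cycles in the most recent actions. Window size is illustrative."""

    def __init__(self, window: int = 4):
        self.recent = deque(maxlen=window)

    def record(self, action: str) -> bool:
        """Record an action; return True when the last four actions alternate
        between the same two actions (or repeat a single action)."""
        self.recent.append(action)
        items = list(self.recent)
        if len(items) < 4:
            return False
        a, b, c, d = items[-4:]
        return a == c and b == d


detector = LoopDetector()
flags = [detector.record(action) for action in ["grep", "read", "grep", "read"]]
print(flags)
```

When `record` returns True, the agent would stop calling tools and either make a direct decision or ask the user for guidance, as the table suggests.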
Paper: AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning (Song et al., 2026)
Key insight to extract: The Tool-GRPO reward structure (Section 3.2) and the three emergent behaviors (tool adoption, suppression, and frequency modulation), which demonstrate that adaptive tool use can be learned as a general reasoning skill rather than hard-coded per task. The token-level randomization technique (Section 3.3) is particularly relevant for building tool-agnostic orchestration that generalizes to unfamiliar tools.