skills/evolving-tool-user-creator/SKILL.md
Transform Claude from a static tool user into a dynamic tool creator using the UCT (User-to-Creator Transformation) framework. Harvests reasoning traces from problem-solving sessions and distills them into reusable utility functions, scripts, and helpers that grow a persistent tool library over time. Trigger phrases: 'create a reusable tool from this solution', 'build a helper I can reuse', 'evolve my toolset', 'extract a utility from this workflow', 'self-improving agent pipeline', 'turn this reasoning into a tool'.
`npx skillsauth add ndpvt-web/arxiv-claude-skills evolving-tool-user-creator`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to go beyond using existing tools by creating new reusable tools on-the-fly during problem-solving. Based on the UCT (User-to-Creator Transformation) framework, Claude harvests its own reasoning traces—the step-by-step logic it produces while solving a problem—and distills them into tested, reusable code artifacts (functions, scripts, CLI utilities). A memory consolidation mechanism maintains this growing tool library: merging duplicates, deprecating broken tools, and ensuring high reuse rates across future tasks.
The UCT framework reframes the agent's relationship with tools. Traditional tool-integrated reasoning treats tools as a fixed inventory—the agent picks from what exists. UCT instead treats every reasoning trace as a potential tool blueprint. When Claude works through a complex problem (geometric calculation, data transformation, API orchestration), the intermediate steps encode implicit problem-solving capabilities. UCT harvests these into explicit, executable code.
The framework operates in three coupled phases. The Online Task Loop uses ReAct-style reasoning where, at each step, the agent decides whether to use an existing tool, create a new one, or reason further. When creation is triggered, a Build Loop enters an isolated environment: it generates both the implementation code and test scripts simultaneously, then iterates using sandbox execution feedback and critic review until the tool passes. Finally, an Offline Memory Consolidation phase periodically cleans the library—merging similar tools, eliminating duplicates, and deprecating tools with high failure rates or low reuse.
The critical insight is the dual-verification gate in the Build Loop. Tools are not accepted based on generation alone: they must pass runtime sandbox tests AND a critic review that checks for edge cases, correctness, and API contract adherence. This prevents erroneous tool outputs from poisoning downstream reasoning, a key failure mode in naive tool-creation approaches. Reported results: 93%+ tool reuse rates and 20-23% performance gains on math and science benchmarks.
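The Build Loop and its dual gate can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: `generate`, `run_tests`, and `critic` are hypothetical callables standing in for the model call, the sandbox run, and the critic review, and the default cap of three iterations is taken from the failure-mode table in this document.
```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Outcome of one verification step (hypothetical type)."""
    ok: bool
    feedback: str = ""


def build_tool(
    generate: Callable[[str], str],       # feedback -> candidate source (assumed)
    run_tests: Callable[[str], Verdict],  # sandbox test run (assumed)
    critic: Callable[[str], Verdict],     # critic review (assumed)
    max_iters: int = 3,
) -> Optional[str]:
    """Return tool source that passed BOTH gates, or None if none did."""
    feedback = ""
    for _ in range(max_iters):
        source = generate(feedback)
        tests = run_tests(source)
        if not tests.ok:
            feedback = tests.feedback   # feed execution errors back in
            continue
        review = critic(source)
        if review.ok:
            return source               # accepted: tests AND critic passed
        feedback = review.feedback      # feed critic notes back in
    return None                         # caller falls back to inline reasoning
```
Returning `None` instead of the last candidate is a deliberate choice in this sketch: a tool that failed either gate never reaches the library.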
1. Receive the task and assess the tool gap. Before jumping to a solution, inventory what existing tools/functions are available. Identify sub-problems where no current tool fits; these are tool-creation candidates. Ask: "Would a reusable function here save effort on this or future tasks?"
2. Solve the task using ReAct-style reasoning. Work through the problem step-by-step, explicitly documenting each reasoning step and intermediate computation. Keep the reasoning trace detailed, because it becomes the blueprint for tool extraction.
3. Identify reusable sub-capabilities in the reasoning trace. Scan the trace for steps that are (a) self-contained, (b) parameterizable, and (c) likely to recur. Examples: a coordinate geometry calculation, a specific data parsing routine, a validation check pattern.
4. Generate a build ticket for each tool candidate. Write a concise specification: function name, purpose, input parameters with types, expected output, and 2-3 concrete test cases derived from the problem you just solved.
5. Implement the tool as executable code with co-generated tests. Write the function AND a test script in the same pass. The function should be pure where possible (no hidden state), well-typed, and documented with a one-line docstring.
6. Run sandbox validation. Execute the test script. If tests fail, analyze the failure, apply critic feedback (check edge cases, boundary conditions, type mismatches), and iterate. Do not accept a tool that fails any test case.
7. Register the tool in the library with metadata. Store the tool with: name, category, description, usage signature, creation context (what problem spawned it), and initial usage count of 1.
8. Resume the original task using the newly created tool. Invoke the tool to complete the step that triggered its creation. Verify the tool's output matches expectations in context.
9. Consolidate the library periodically. After completing a session or batch of tasks, review the tool library: merge tools with overlapping functionality, flag tools that have never been reused, and deprecate tools that produced errors in subsequent use.
10. Retrieve before creating on subsequent tasks. For every new sub-problem, search the existing library first (by category and description match). Only trigger the Build Loop if no existing tool fits. Log each retrieval to track reuse rates.
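The registration and retrieve-before-create steps above might look something like the sketch below. `ToolRecord` and `ToolLibrary` are hypothetical names, and the keyword-count matching is only an assumed stand-in for "category and description match"; the source does not specify a retrieval method.
```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToolRecord:
    """Metadata stored with each tool (field names are assumptions)."""
    name: str
    category: str
    description: str
    signature: str
    created_from: str      # the problem that spawned the tool
    func: Callable
    uses: int = 1          # initial usage count of 1, as in the registration step
    failures: int = 0
    status: str = "active"


class ToolLibrary:
    def __init__(self) -> None:
        self._tools: Dict[str, ToolRecord] = {}
        self.retrieval_log: List[str] = []   # one entry per successful retrieval

    def register(self, record: ToolRecord) -> None:
        self._tools[record.name] = record

    def retrieve(self, category: str, keywords: List[str]) -> Optional[ToolRecord]:
        """Return the active tool in `category` whose description matches the
        most keywords, or None if nothing matches (caller may then build one)."""
        best, best_hits = None, 0
        for rec in self._tools.values():
            if rec.status != "active" or rec.category != category:
                continue
            text = rec.description.lower()
            hits = sum(kw.lower() in text for kw in keywords)
            if hits > best_hits:
                best, best_hits = rec, hits
        if best is not None:
            best.uses += 1                       # track reuse
            self.retrieval_log.append(best.name)
        return best
```
A `None` result is the signal to trigger the Build Loop; any hit increments `uses`, so reuse rates can be read directly from the records.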
Example 1: Extracting a geometry utility from a math problem
User: "I need to calculate the area of intersection between two circles given their centers and radii. I'll have many similar problems."
Approach:
`circle_intersection_area(x1, y1, r1, x2, y2, r2) -> float`
```python
# Tool: circle_intersection_area
# Category: geometry/area
# Created from: circle intersection reasoning trace
import math

def circle_intersection_area(x1: float, y1: float, r1: float,
                             x2: float, y2: float, r2: float) -> float:
    """Compute area of intersection of two circles."""
    d = math.hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:
        return 0.0  # disjoint
    if d + min(r1, r2) <= max(r1, r2):
        return math.pi * min(r1, r2) ** 2  # one inside the other
    a = (r1**2 - r2**2 + d**2) / (2 * d)
    h = math.sqrt(r1**2 - a**2)
    sector1 = r1**2 * math.acos(a / r1)
    sector2 = r2**2 * math.acos((d - a) / r2)
    triangle = d * h
    return sector1 + sector2 - triangle

# Co-generated tests
assert circle_intersection_area(0, 0, 1, 10, 0, 1) == 0.0  # disjoint
assert abs(circle_intersection_area(0, 0, 5, 1, 0, 2) - math.pi * 4) < 1e-6  # contained
assert circle_intersection_area(0, 0, 1, 1, 0, 1) > 0  # overlapping
```
Example 2: Building a data parsing tool during CSV analysis
User: "Parse this messy CSV where some rows have quoted fields with commas inside, some have missing columns, and dates are in mixed formats. I'll process 50 similar files."
Approach:
`robust_csv_parse(filepath, date_columns) -> DataFrame` and `normalize_date(date_str) -> datetime`
```python
# Tool: normalize_date
# Category: parsing/datetime
# Created from: mixed-format CSV parsing trace
from datetime import datetime

FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y", "%Y%m%d"]

def normalize_date(date_str: str) -> datetime:
    """Parse a date string in common formats into a datetime object."""
    date_str = date_str.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_str!r}")

# Tests
assert normalize_date("2025-03-15").day == 15
assert normalize_date("03/15/2025").month == 3
assert normalize_date("15-Mar-2025").year == 2025
```
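The example goes on to mention a date this tool cannot parse, "March 15th, 2025". Adding another entry to `FORMATS` would not be enough on its own, because `strptime` has no directive for ordinal suffixes such as "th". One possible self-update, shown here as a sketch rather than the skill's prescribed fix, strips the suffix before matching; the `_ORDINAL` pattern is an assumption introduced for illustration.
```python
import re
from datetime import datetime

FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y", "%Y%m%d"]

# Assumed helper: matches st/nd/rd/th only when directly after a digit,
# so month names such as "August" are left untouched.
_ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b", re.IGNORECASE)


def normalize_date(date_str: str) -> datetime:
    """Parse a date string in common formats, tolerating ordinal day suffixes."""
    cleaned = _ORDINAL.sub("", date_str.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_str!r}")


# Existing tests are kept; the failing input becomes a regression test.
assert normalize_date("2025-03-15").day == 15
assert normalize_date("03/15/2025").month == 3
assert normalize_date("15-Mar-2025").year == 2025
assert normalize_date("March 15th, 2025") == datetime(2025, 3, 15)
```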
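`normalize_date` fails on a new format ("March 15th, 2025"). Self-update: extend the tool to handle the new format (the ordinal suffix "th" has no `strptime` directive, so it must be stripped before matching), re-run tests, re-register.
Example 3: Agent pipeline that accumulates tools across tasks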
User: "I'm building an agent that answers science questions. Start with this physics problem, but design it so the agent gets better over time."
Approach:
- `projectile_range(v0, angle_deg, g=9.81) -> float`
- `snells_law(n1, theta1, n2) -> float`
- Retrieve `projectile_range` from the library and find it needs extension. Self-update the tool to accept an `incline_deg` parameter. Run existing + new tests.
Consolidation check:
Tool Library Status:
- projectile_range | physics/kinematics | uses: 3 | failures: 0 | status: active
- snells_law | physics/optics | uses: 1 | failures: 0 | status: active
- (no merges needed, no deprecations)
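Example 3 names `projectile_range` and `snells_law` but does not show their bodies. The following is a minimal sketch of what they could look like, assuming no air resistance, angles in degrees for both tools (the source does not state the unit of `theta1`), and `incline_deg` added as a trailing keyword parameter so existing calls keep working. With an incline, the returned value is the distance measured along the slope.
```python
import math


def projectile_range(v0: float, angle_deg: float, g: float = 9.81,
                     incline_deg: float = 0.0) -> float:
    """Range of a projectile launched at angle_deg above the horizontal,
    landing on a plane inclined at incline_deg (0 = level ground)."""
    if angle_deg < incline_deg:
        raise ValueError("launch angle must not be below the incline")
    theta = math.radians(angle_deg)
    alpha = math.radians(incline_deg)
    # R = 2 v0^2 cos(theta) sin(theta - alpha) / (g cos^2(alpha));
    # for alpha = 0 this reduces to v0^2 sin(2 theta) / g.
    return 2 * v0**2 * math.cos(theta) * math.sin(theta - alpha) / (g * math.cos(alpha) ** 2)


def snells_law(n1: float, theta1: float, n2: float) -> float:
    """Refraction angle in degrees for incidence angle theta1 (degrees, assumed)."""
    s = n1 * math.sin(math.radians(theta1)) / n2
    if abs(s) > 1:
        raise ValueError("total internal reflection: no refracted ray")
    return math.degrees(math.asin(s))


# Existing test (level ground) plus a new test for the incline extension.
assert abs(projectile_range(10, 45) - 100 / 9.81) < 1e-9
assert abs(projectile_range(10, 60, incline_deg=30) - 50 / (9.81 * 0.75)) < 1e-9
```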
Do:
- `normalize_date` + `parse_csv_row` is better than a single `do_everything` function
Avoid:
| Failure Mode | Response |
|---|---|
| Generated tool fails sandbox tests | Iterate: feed execution errors + critic feedback back into the Build Loop. Cap at 3 iterations; if still failing, fall back to inline reasoning for this task and log the failed attempt. |
| Tool produces wrong output during reuse on a new task | Check if the input is within the tool's designed domain. If yes, fix the tool (self-update) and add the new case as a regression test. If no, create a new tool instead. |
| Tool library grows too large (>50 tools) | Trigger immediate consolidation: merge tools with >70% functional overlap, deprecate tools with 0 reuses after 10+ tasks, archive rather than delete. |
| Retrieval returns the wrong tool for a sub-problem | Improve tool metadata (add usage examples to descriptions). If retrieval consistently fails, switch to keyword + category-based lookup instead of purely semantic matching. |
| Circular dependency between tools | Flatten: inline the dependency or restructure into a shared utility. Tools should be independently executable. |
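The deprecation part of the consolidation policy in the table can be sketched as below. `ToolStats` and `consolidate` are hypothetical names, and the 0.5 failure-rate threshold is an assumption, since the source only says "high failure rates". The merge step is left out because the source does not define how functional overlap is measured.
```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolStats:
    name: str
    uses: int                  # includes the creating use, so 1 means never reused
    failures: int
    tasks_since_creation: int
    status: str = "active"


def consolidate(tools: List[ToolStats], max_failure_rate: float = 0.5) -> List[str]:
    """Archive (not delete) stale or unreliable tools; return their names."""
    archived = []
    for t in tools:
        if t.status != "active":
            continue
        never_reused = t.uses <= 1 and t.tasks_since_creation >= 10
        unreliable = t.uses > 0 and t.failures / t.uses > max_failure_rate
        if never_reused or unreliable:
            t.status = "archived"   # kept in the library, excluded from retrieval
            archived.append(t.name)
    return archived
```
Because archived tools keep their records, a deprecation made by this sketch can be reversed by setting `status` back to `"active"`.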
Paper: Evolving from Tool User to Creator via Training-Free Experience Reuse in Multimodal Reasoning — Shen et al., 2026. Look for: Algorithm 1 (Online Task Loop), Algorithm 2 (Build Loop with dual verification), and Table 2 (ablation showing each component's contribution). The memory consolidation formalization in Section 3.3 provides the theoretical grounding for the library maintenance protocol.