skills/skillcraft/SKILL.md
Evaluate and analyze LLM agents' ability to form, abstract, and reuse higher-level tool compositions. Use when researching agent skill discovery, tool composition patterns, or evaluating skill caching efficiency.
npx skillsauth add dwsy/agent skillcraft
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Evaluate and analyze LLM agents' ability to form, abstract, and reuse higher-level tool compositions (Skills). Use this skill when researching agent skill discovery, tool composition patterns, or evaluating skill caching efficiency.
Trigger words: SkillCraft, skill discovery, tool composition, agent skills, skill caching, LLM benchmark, 技能发现 (skill discovery), 工具组合 (tool composition)
Traditional: Tool → LLM → Result
SkillCraft: Tool → LLM → Skill Abstract → Skill Cache → Reuse
Pattern: read → analyze → edit → test → commit
Skill: One-click workflow execution
Pattern: load → clean → analyze → visualize → report
Skill: Data type-specific analysis templates
Pattern: search → filter → summarize → compare
Skill: Literature review automation
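The Tool → Skill Abstract → Skill Cache → Reuse flow above can be sketched in a few lines. This is a minimal illustration, not SkillCraft's implementation: the `SkillCache` class, its `get_or_compose` method, and the three toy tools are hypothetical names introduced here, and a "skill" is assumed to be nothing more than a cached composition of tools applied in order.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[object], object]


class SkillCache:
    """Hypothetical cache that maps a tool-name pattern to a composed skill."""

    def __init__(self, tools: Dict[str, Tool]) -> None:
        self.tools = tools
        self.skills: Dict[Tuple[str, ...], Tool] = {}
        self.hits = 0

    def get_or_compose(self, pattern: List[str]) -> Tool:
        key = tuple(pattern)
        if key in self.skills:
            self.hits += 1  # Reuse: the pattern was already abstracted.
            return self.skills[key]

        steps = [self.tools[name] for name in pattern]

        def skill(value: object) -> object:
            # Run each tool on the previous tool's output.
            for step in steps:
                value = step(value)
            return value

        self.skills[key] = skill
        return skill


# Toy stand-ins for the search -> filter -> summarize pattern (assumptions).
tools: Dict[str, Tool] = {
    "search": lambda query: [f"{query} paper {i}" for i in range(4)],
    "filter": lambda docs: [d for d in docs if d[-1] in "02"],
    "summarize": lambda docs: f"{len(docs)} relevant results",
}

cache = SkillCache(tools)
pattern = ["search", "filter", "summarize"]
first = cache.get_or_compose(pattern)("agents")   # composes and caches
second = cache.get_or_compose(pattern)("tools")   # served from the cache
print(first, "|", second, "| cache hits:", cache.hits)
```

The second lookup returns the cached composition instead of rebuilding it, which is the kind of reuse the skill-mode evaluation is meant to measure.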
Prerequisites: uv package manager, npx
# Clone
git clone https://github.com/shiqichen17/SkillCraft
cd SkillCraft
# Install
uv sync
# Configure .env
TOOLATHLON_OPENAI_API_KEY=YOUR_KEY
TOOLATHLON_OPENAI_BASE_URL=https://openrouter.ai/api/v1
TOOLATHLON_MODEL=deepseek-v3.2-exp
TOOLATHLON_PROVIDER=openrouter
# Run complete evaluation
uv run python test_all_tasks.py \
  --scaled-tasks \
  --mode base,skill \
  --model deepseek-v3.2-exp \
  --provider openrouter
# Base mode
bash run.sh scaled_tasks/cat-facts-collector/e1 base --model deepseek-v3.2-exp --provider openrouter
# Skill mode
bash run.sh scaled_tasks/cat-facts-collector/e1 skill --model deepseek-v3.2-exp --provider openrouter
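To run the same task in both modes without retyping flags, the two `run.sh` invocations above can be generated from one helper. This is a sketch that only mirrors the argument order shown above; `build_run_command` is a hypothetical helper name, and no `run.sh` options beyond `--model` and `--provider` are assumed.

```python
import shlex
from typing import List


def build_run_command(task_dir: str, mode: str, model: str, provider: str) -> List[str]:
    """Build the argv for one run.sh call, following the form shown above."""
    if mode not in ("base", "skill"):
        raise ValueError(f"mode must be 'base' or 'skill', got {mode!r}")
    return ["bash", "run.sh", task_dir, mode, "--model", model, "--provider", provider]


commands = [
    build_run_command(
        "scaled_tasks/cat-facts-collector/e1",
        mode,
        model="deepseek-v3.2-exp",
        provider="openrouter",
    )
    for mode in ("base", "skill")
]

for cmd in commands:
    # Print the shell-quoted form; nothing is executed here.
    print(shlex.join(cmd))
```

To actually launch a run, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root.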
Scaled tasks include:
- gitlab-deep-analysis
- countries-encyclopedia
- tvmaze-series-analyzer
- pokeapi-pokedex
- cat-facts-collector

Each run produces:
test_runs/run_YYYYMMDD_HHMMSS/
├── run_info.json
├── test_results_<provider>_<model>.json
├── summary_<provider>_<model>.json
├── dumps_base_test/
└── dumps_skill_test/
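Given the directory layout above, the most recent run and its summary files can be located with the standard library alone. This sketch assumes only the naming shown in the tree (`run_YYYYMMDD_HHMMSS` directories and `summary_*.json` files); the JSON schema is not documented here, so the files are loaded as opaque data. `latest_run` and `load_summaries` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def latest_run(root: Path) -> Optional[Path]:
    """Return the newest run_* directory, or None if there is none.

    The timestamp format sorts chronologically as plain text.
    """
    runs = sorted(p for p in root.glob("run_*") if p.is_dir())
    return runs[-1] if runs else None


def load_summaries(run_dir: Path) -> Dict[str, object]:
    """Load every summary_*.json in a run directory, keyed by file name."""
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(run_dir.glob("summary_*.json"))
    }


run_dir = latest_run(Path("test_runs"))
if run_dir is None:
    print("No runs found under test_runs/")
else:
    for name, data in load_summaries(run_dir).items():
        # Only the top-level type is printed because the schema is unknown.
        print(run_dir.name, name, type(data).__name__)
```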
| Level | Type | SkillCraft Test |
|-------|------|-----------------|
| 1 | Knowledge | ❌ Not tested |
| 2 | Understanding | ❌ Not tested |
| 3 | Application | ⚠️ Prerequisite |
| 4 | Analysis | ⚠️ Prerequisite |
| 5 | Synthesis | ✅ Core test |
| 6 | Evaluation | ⚠️ Implicit |
Core test: Can the agent synthesize new skills from tool combinations?
skill:
  type: human-defined        # Current: human-written
  # type: auto-discovered    # Future: pattern mining from trajectories
  source:
    - pattern_mining         # Discover from success trajectories
    - composition            # Abstract tool combinations
    - optimization           # Auto-optimize existing skills
  metrics:
    - token_savings          # Efficiency gain
    - success_rate           # Task completion
    - transferability        # Cross-domain applicability
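Two of the metrics listed above, `token_savings` and `success_rate`, reduce to simple ratios between a base-mode run and a skill-mode run. The sketch below shows one plausible definition; the `ModeResult` fields and the sample numbers are invented for illustration and do not reflect SkillCraft's actual output schema or any measured result.

```python
from dataclasses import dataclass


@dataclass
class ModeResult:
    """Hypothetical per-mode totals; field names are assumptions."""
    total_tokens: int
    tasks_passed: int
    tasks_total: int


def success_rate(result: ModeResult) -> float:
    """Fraction of tasks completed in one mode."""
    return result.tasks_passed / result.tasks_total if result.tasks_total else 0.0


def token_savings(base: ModeResult, skill: ModeResult) -> float:
    """Relative token reduction of skill mode compared with base mode."""
    if base.total_tokens == 0:
        return 0.0
    return 1 - skill.total_tokens / base.total_tokens


# Made-up numbers, for illustration only.
base = ModeResult(total_tokens=120_000, tasks_passed=7, tasks_total=10)
skill = ModeResult(total_tokens=90_000, tasks_passed=8, tasks_total=10)

print(f"token_savings = {token_savings(base, skill):.1%}")
print(f"success_rate  = base {success_rate(base):.0%}, skill {success_rate(skill):.0%}")
```

Transferability is omitted because it depends on how cross-domain tasks are paired, which the configuration above does not specify.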
@misc{chen2026skillcraftllmagentslearn,
title={SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?},
author={Shiqi Chen and Jingze Gai and Ruochen Zhou and Jinghan Zhang and Tongyao Zhu and Junlong Li and Kangrui Wang and Zihan Wang and Zhengyu Chen and Klara Kaleb and Ning Miao and Siyang Gao and Cong Lu and Manling Li and Junxian He and Yee Whye Teh},
year={2026},
eprint={2603.00718},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.00718},
}
SkillCraft tests whether agents can evolve from "tool users" to "skill creators" - a qualitative leap from execution to learning.
Last updated: 2026-03-21