dwsy/skillcraft

Name: skillcraft
Author: dwsy

skills/skillcraft/SKILL.md

npx skillsauth add dwsy/agent skillcraft

SkillCraft - LLM Agent Tool Composition Benchmark

Description

Evaluate and analyze LLM agents' ability to form, abstract, and reuse higher-level tool compositions (Skills). Use this skill when researching agent skill discovery, tool composition patterns, or evaluating skill caching efficiency.

Trigger words: SkillCraft, skill discovery, tool composition, agent skills, skill caching, LLM benchmark, 技能发现 (skill discovery), 工具组合 (tool composition)

Core Concepts

Problem Statement

Traditional benchmarks test "can the agent call the right tool?"
SkillCraft tests "can the agent abstract and reuse tool combinations?"
This is the difference between tool usage and tool mastery

Dual Difficulty Dimensions

Quantitative Scaling - Increase number of entities/items to process
Structural Scaling - Compose subtasks into longer, more complex tool chains
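
The two scaling axes can be pictured as independent knobs on a task template. The sketch below is illustrative only: `ScaledTask`, `scale_quantity`, and `scale_structure` are hypothetical names, not part of the SkillCraft codebase, and the "one call per subtask per entity" cost model is a simplifying assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScaledTask:
    """Hypothetical task description, not the benchmark's real schema."""
    name: str
    entities: tuple[str, ...]   # quantitative axis: items to process
    subtasks: tuple[str, ...]   # structural axis: steps chained per item

    def tool_calls(self) -> int:
        # Assumption: without reuse, every subtask runs once per entity.
        return len(self.entities) * len(self.subtasks)


def scale_quantity(task: ScaledTask, extra_entities: list[str]) -> ScaledTask:
    """Quantitative scaling: same chain, more items."""
    return replace(task, entities=task.entities + tuple(extra_entities))


def scale_structure(task: ScaledTask, extra_subtasks: list[str]) -> ScaledTask:
    """Structural scaling: same items, longer chain."""
    return replace(task, subtasks=task.subtasks + tuple(extra_subtasks))


base = ScaledTask("cat-facts", ("siamese", "persian"), ("fetch", "summarize"))
wider = scale_quantity(base, ["sphynx", "bengal"])
deeper = scale_structure(base, ["translate"])
print(base.tool_calls(), wider.tool_calls(), deeper.tool_calls())  # 4 8 6
```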

Key Findings

Up to 80% token reduction through skill caching
Success rate correlates with tool composition ability
Real-world stress test: targets the recurring patterns found in long-horizon workflows
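
As a rough illustration of how a token-reduction figure like the one above could be computed from a base-mode run and a skill-mode run, here is a minimal sketch. The function name and the token counts are made up for the example; they are not results or code from the paper.

```python
def token_reduction(base_tokens: int, skill_tokens: int) -> float:
    """Fraction of tokens saved by skill mode relative to base mode."""
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    return (base_tokens - skill_tokens) / base_tokens


# Made-up numbers for illustration only.
print(f"{token_reduction(50_000, 10_000):.0%}")  # 80%
```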

Architecture Paradigm Shift

Traditional: Tool → LLM → Result

SkillCraft:  Tool → LLM → Skill Abstraction → Skill Cache → Reuse
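
The second pipeline can be sketched as a small cache that stores a composed tool chain under a name and replays it later. This is a toy model under stated assumptions: `SkillCache`, `abstract`, and `reuse` are hypothetical names chosen for the example, and plain string methods stand in for real tools.

```python
from typing import Callable

Tool = Callable[[object], object]


class SkillCache:
    """Hypothetical cache mapping a skill name to a composed tool chain."""

    def __init__(self) -> None:
        self._skills: dict[str, Tool] = {}
        self.hits = 0

    def abstract(self, name: str, steps: list[Tool]) -> Tool:
        # Compose the steps once and store the result under `name`.
        def skill(value: object) -> object:
            for step in steps:
                value = step(value)
            return value

        self._skills[name] = skill
        return skill

    def reuse(self, name: str, value: object) -> object:
        # Later inputs call the stored composition instead of re-planning it.
        self.hits += 1
        return self._skills[name](value)


cache = SkillCache()
cache.abstract("normalize", [str.strip, str.lower])
print([cache.reuse("normalize", s) for s in ["  Cat ", "DOG"]])  # ['cat', 'dog']
```

In this picture, the token savings come from the `reuse` path: the chain is described once and then invoked by name.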

Practical Applications

Code Agent Skills

Pattern: read → analyze → edit → test → commit
Skill: One-click workflow execution

Data Analysis Skills

Pattern: load → clean → analyze → visualize → report
Skill: Data type-specific analysis templates

Research Skills

Pattern: search → filter → summarize → compare
Skill: Literature review automation
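
The arrow-separated patterns above can be treated as data. The sketch below parses them into step lists and looks for steps shared between patterns, which is one simple way to spot candidates for reuse. `PATTERNS`, `parse_pattern`, and `shared_steps` are hypothetical helpers, not part of SkillCraft.

```python
PATTERNS = {
    "code": "read → analyze → edit → test → commit",
    "data": "load → clean → analyze → visualize → report",
    "research": "search → filter → summarize → compare",
}


def parse_pattern(pattern: str) -> list[str]:
    """Split an arrow-separated pattern into ordered step names."""
    return [step.strip() for step in pattern.split("→")]


def shared_steps(a: str, b: str) -> set[str]:
    """Steps that appear in both named patterns."""
    return set(parse_pattern(PATTERNS[a])) & set(parse_pattern(PATTERNS[b]))


print(parse_pattern(PATTERNS["research"]))  # ['search', 'filter', 'summarize', 'compare']
print(shared_steps("code", "data"))         # {'analyze'}
```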

Reproduction

Prerequisites

Linux (recommended)
Python 3.10+
uv package manager
Node.js 22+ with npx
OpenRouter/Toolathlon API endpoint

Install & Run

# Clone
git clone https://github.com/shiqichen17/SkillCraft
cd SkillCraft

# Install
uv sync

# Configure .env (add these lines to the .env file; they are not shell commands)
TOOLATHLON_OPENAI_API_KEY=YOUR_KEY
TOOLATHLON_OPENAI_BASE_URL=https://openrouter.ai/api/v1
TOOLATHLON_MODEL=deepseek-v3.2-exp
TOOLATHLON_PROVIDER=openrouter

# Run complete evaluation
uv run python test_all_tasks.py \
  --scaled-tasks \
  --mode base,skill \
  --model deepseek-v3.2-exp \
  --provider openrouter
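
Before launching a long evaluation it can help to check that the `.env` values listed above are all present. The following is a minimal standard-library sketch; it assumes simple `KEY=VALUE` lines, and `parse_env` and `missing_keys` are hypothetical helpers rather than anything shipped with the repository.

```python
REQUIRED = (
    "TOOLATHLON_OPENAI_API_KEY",
    "TOOLATHLON_OPENAI_BASE_URL",
    "TOOLATHLON_MODEL",
    "TOOLATHLON_PROVIDER",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Required keys that are absent or empty."""
    return [key for key in REQUIRED if not env.get(key)]


sample = """
TOOLATHLON_OPENAI_API_KEY=YOUR_KEY
TOOLATHLON_OPENAI_BASE_URL=https://openrouter.ai/api/v1
TOOLATHLON_MODEL=deepseek-v3.2-exp
"""
print(missing_keys(parse_env(sample)))  # ['TOOLATHLON_PROVIDER']
```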

Single Task Test

# Base mode
bash run.sh scaled_tasks/cat-facts-collector/e1 base --model deepseek-v3.2-exp --provider openrouter

# Skill mode
bash run.sh scaled_tasks/cat-facts-collector/e1 skill --model deepseek-v3.2-exp --provider openrouter

Benchmark Tasks

Scaled tasks include:

gitlab-deep-analysis
countries-encyclopedia
tvmaze-series-analyzer
pokeapi-pokedex
cat-facts-collector
And more...

Output Structure

Each run produces:

test_runs/run_YYYYMMDD_HHMMSS/
├── run_info.json
├── test_results_<provider>_<model>.json
├── summary_<provider>_<model>.json
├── dumps_base_test/
└── dumps_skill_test/
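
Given that layout, a short script can locate the most recent run and load its summary files. This is a sketch under assumptions: the directory and file name patterns are taken from the tree above, but the JSON schema is not documented here, so the code only loads the files and reports their top-level type. `latest_run` and `load_summaries` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Optional


def latest_run(root: Path) -> Optional[Path]:
    """Return the newest run_YYYYMMDD_HHMMSS directory, or None."""
    if not root.is_dir():
        return None
    # The timestamp format sorts lexicographically in chronological order.
    runs = sorted(p for p in root.glob("run_*") if p.is_dir())
    return runs[-1] if runs else None


def load_summaries(run_dir: Path) -> dict[str, object]:
    """Load every summary_<provider>_<model>.json in a run directory."""
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(run_dir.glob("summary_*.json"))
    }


run = latest_run(Path("test_runs"))
if run is None:
    print("no runs found")
else:
    for name, summary in load_summaries(run).items():
        print(name, type(summary).__name__)
```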

Cognitive Level Mapping

| Level | Type | SkillCraft Test |
|-------|------|-----------------|
| 1 | Knowledge | ❌ Not tested |
| 2 | Understanding | ❌ Not tested |
| 3 | Application | ⚠️ Prerequisite |
| 4 | Analysis | ⚠️ Prerequisite |
| 5 | Synthesis | ✅ Core test |
| 6 | Evaluation | ⚠️ Implicit |

Core test: Can the agent synthesize new skills from tool combinations?

Implications for Agent Development

Current Agent Skills Limitations

Depend on human pre-definition (SKILL.md)
Cannot auto-discover skill patterns
Lack cross-task reuse mechanism

Future Evolution

skill:
  type: human-defined        # Current: human-written
  # type: auto-discovered    # Future: pattern mining from trajectories

  source:
    - pattern_mining       # Discover from success trajectories
    - composition          # Abstract tool combinations
    - optimization         # Auto-optimize existing skills

  metrics:
    - token_savings        # Efficiency gain
    - success_rate         # Task completion
    - transferability      # Cross-domain applicability

Actionable Insights for pi Skills

Trajectory Analysis - Mine pi logs for high-frequency tool combinations
Skill Extraction - Auto-generate SKILL.md candidates
Effect Validation - Use SkillCraft methodology for quality assessment
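
The trajectory-analysis step could start as simply as counting contiguous tool-call sequences across logged runs. The sketch below assumes each trajectory is already available as a list of tool names; the log format, the tool names, and `frequent_tool_ngrams` are all hypothetical, and real pi logs would need their own parser.

```python
from collections import Counter


def frequent_tool_ngrams(
    trajectories: list[list[str]], n: int = 3, min_count: int = 2
) -> list[tuple[tuple[str, ...], int]]:
    """Count contiguous length-n tool sequences across trajectories."""
    counts: Counter = Counter()
    for calls in trajectories:
        for i in range(len(calls) - n + 1):
            counts[tuple(calls[i:i + n])] += 1
    # Keep only sequences seen at least min_count times, most frequent first.
    return [(seq, c) for seq, c in counts.most_common() if c >= min_count]


# Made-up trajectories for illustration.
logs = [
    ["read", "grep", "edit", "test"],
    ["read", "grep", "edit", "commit"],
    ["search", "read", "grep", "edit"],
]
print(frequent_tool_ngrams(logs))  # [(('read', 'grep', 'edit'), 3)]
```

Sequences that clear the threshold are the natural candidates to draft into a SKILL.md and then validate.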

Resources

Paper: SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?
Code: github.com/shiqichen17/SkillCraft
Project: skillcraft-website.github.io/page

Citation

@misc{chen2026skillcraftllmagentslearn,
      title={SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?},
      author={Shiqi Chen and Jingze Gai and Ruochen Zhou and Jinghan Zhang and Tongyao Zhu and Junlong Li and Kangrui Wang and Zihan Wang and Zhengyu Chen and Klara Kaleb and Ning Miao and Siyang Gao and Cong Lu and Manling Li and Junxian He and Yee Whye Teh},
      year={2026},
      eprint={2603.00718},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.00718},
}

Key Insight

SkillCraft tests whether agents can evolve from "tool users" to "skill creators" - a qualitative leap from execution to learning.

Last updated: 2026-03-21

10 stars

tools

Updated Apr 21, 2026

npx skillsauth add dwsy/agent skillcraft

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 21, 2026, 10:01 AM (55.5s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/dwsy/agent.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r agent/skills/skillcraft ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

dwsy/agent

10 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT