skills/pinchbench/pinchbench/SKILL.md
Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.
Install this skill globally with one command: `npx skillsauth add aiskillstore/marketplace pinchbench`. Works with Claude Code, Cursor, and Windsurf.
PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.
```bash
cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
```
| Task | Category | Description |
|------|----------|-------------|
| task_00_sanity | Basic | Verify agent works |
| task_01_calendar | Productivity | Calendar event creation |
| task_02_stock | Research | Stock price lookup |
| task_03_blog | Writing | Blog post creation |
| task_04_weather | Coding | Weather script |
| task_05_summary | Analysis | Document summarization |
| task_06_events | Research | Conference research |
| task_07_email | Writing | Email drafting |
| task_08_memory | Memory | Context retrieval |
| task_09_files | Files | File structure creation |
| task_10_workflow | Integration | Multi-step API workflow |
| task_11_clawdhub | Skills | ClawHub interaction |
| task_12_skill_search | Skills | Skill discovery |
| task_13_image_gen | Creative | Image generation |
| task_14_humanizer | Writing | Text humanization |
| task_15_daily_summary | Productivity | Daily digest |
| task_16_email_triage | Email | Inbox triage |
| task_17_email_search | Email | Email search |
| task_18_market_research | Research | Market analysis |
| task_19_spreadsheet_summary | Analysis | Spreadsheet analysis |
| task_20_eli5_pdf_summary | Analysis | PDF simplification |
| task_21_openclaw_comprehension | Knowledge | OpenClaw docs comprehension |
| task_22_second_brain | Memory | Knowledge management |
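Since `--suite` accepts comma-separated task IDs, a category-based selection can be built from the task table above. The following is a minimal sketch, not part of PinchBench itself: the `TASK_CATEGORIES` mapping and the `suite_for_categories` helper are hypothetical names, and the mapping copies only a few rows of the table for illustration.

```python
# Hypothetical helper: not shipped with PinchBench.
# A subset of the task table, copied by hand as task ID -> category.
TASK_CATEGORIES = {
    "task_01_calendar": "Productivity",
    "task_02_stock": "Research",
    "task_06_events": "Research",
    "task_15_daily_summary": "Productivity",
    "task_16_email_triage": "Email",
    "task_17_email_search": "Email",
    "task_18_market_research": "Research",
}


def suite_for_categories(categories):
    """Return a comma-separated task-ID string covering the given categories."""
    wanted = set(categories)
    task_ids = [tid for tid, cat in TASK_CATEGORIES.items() if cat in wanted]
    # Sort so the resulting --suite value is stable between runs.
    return ",".join(sorted(task_ids))


if __name__ == "__main__":
    # The printed string can be passed as the value of --suite.
    print(suite_for_categories({"Email"}))
```

The returned string is meant to be pasted after `--suite`, in the same form as the `task_01_calendar,task_02_stock` example.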
| Option | Description |
|--------|-------------|
| --model | Model identifier (e.g., anthropic/claude-sonnet-4) |
| --suite | all, automated-only, or comma-separated task IDs |
| --output-dir | Results directory (default: results/) |
| --timeout-multiplier | Scale task timeouts for slower models |
| --runs | Number of runs per task for averaging |
| --no-upload | Skip uploading to leaderboard |
| --register | Request new API token for submissions |
| --upload FILE | Upload previous results JSON |
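When comparing several models, it can help to assemble the command line programmatically from the options above. The sketch below is an illustration only: `build_command` is a hypothetical helper, and it assumes that `--runs`, `--timeout-multiplier`, and `--output-dir` each take their value as the following argument, which the table implies but does not spell out.

```python
import shlex


def build_command(model, suite=None, runs=None, timeout_multiplier=None,
                  output_dir=None, no_upload=False):
    """Assemble an argument list for `uv run benchmark.py` from the documented flags."""
    cmd = ["uv", "run", "benchmark.py", "--model", model]
    if suite is not None:
        cmd += ["--suite", suite]
    if runs is not None:
        # Assumption: the flag takes an integer count as its next argument.
        cmd += ["--runs", str(runs)]
    if timeout_multiplier is not None:
        # Assumption: the flag takes a numeric factor as its next argument.
        cmd += ["--timeout-multiplier", str(timeout_multiplier)]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if no_upload:
        cmd.append("--no-upload")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    cmd = build_command("anthropic/claude-sonnet-4",
                        suite="automated-only", runs=3, no_upload=True)
    print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` from the skill directory once the printed command looks right.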
To submit results to the leaderboard:
```bash
# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4
```
Results are saved as JSON in the output directory:
```bash
# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
```
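The jq queries above can also be expressed in Python when results need further processing. This is a minimal sketch that assumes only the fields the queries already use (`tasks`, `task_id`, and `grading.mean`); the `summarize` and `failed_tasks` helpers are hypothetical names, and the rest of the results schema is not assumed.

```python
import json
from pathlib import Path


def summarize(result):
    """Return (per-task scores, overall mean) for one parsed results object."""
    scores = {t["task_id"]: t["grading"]["mean"] for t in result["tasks"]}
    # Unlike the jq expression, an empty task list yields 0.0 instead of an error.
    average = sum(scores.values()) / len(scores) if scores else 0.0
    return scores, average


def failed_tasks(scores, threshold=0.5):
    """Return the IDs of tasks scoring below the threshold, sorted by ID."""
    return sorted(tid for tid, score in scores.items() if score < threshold)


if __name__ == "__main__":
    # Summarize every results file in the default output directory.
    for path in sorted(Path("results").glob("*.json")):
        scores, average = summarize(json.loads(path.read_text()))
        print(f"{path.name}: average={average:.3f} failed={failed_tasks(scores)}")
```

The 0.5 threshold mirrors the "failed tasks" query; pass a different `threshold` to change the cutoff.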
Create a markdown file in tasks/ following TASK_TEMPLATE.md. Each task needs:
View results at pinchbench.com. The leaderboard shows: