skills-experimental/benchmark-analyzer/SKILL.md
# benchmark-analyzer
## Description

Analyze benchmark test results with failure classification, statistics reporting, and retry command generation.

## Usage

This skill is used to analyze Terminal-Bench 2.0 or other benchmark test results, classify failures, generate statistics, and provide retry recommendations.

## How It Works

### Failure Classification

The analyzer classifies failures into these categories:
| Type | Description |
|------|-------------|
| `rate_limit` | API rate limit exceeded (429) |
| `timeout` | Agent or task timed out |
| `missing_tool` | Required tool/command not found |
| `missing_module` | Python module not installed |
| `docker_conflict` | Docker container name conflict |
| `api_error` | API connection error |
| `instant_exit` | Agent exited immediately (<10s) |
| `task_failure` | Task logic failed |
| `other_exception` | Unclassified exception |
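
The categories above can be pictured as an ordered rule table. The following is a minimal sketch, not the analyzer's actual code: the function name `classifyFromText`, the regular expressions, and the `hadException` flag are all assumptions made for illustration, and the real `classifyFailure(trialDir)` reads a trial directory rather than a string.

```javascript
// Hypothetical rule table: each entry maps a failure type from the table
// to a pattern that is ASSUMED to appear in the trial's log text.
const FAILURE_RULES = [
  { type: 'rate_limit', pattern: /\b429\b|rate limit/i },
  { type: 'timeout', pattern: /timed out|timeout/i },
  { type: 'missing_module', pattern: /ModuleNotFoundError|No module named/ },
  { type: 'missing_tool', pattern: /command not found/i },
  { type: 'docker_conflict', pattern: /container name .* is already in use/i },
  { type: 'api_error', pattern: /APIConnectionError|ECONNREFUSED|ECONNRESET/ },
];

// Classify one failed trial from its log text, its duration in seconds
// (null when unknown), and whether an exception was recorded.
function classifyFromText(text, durationSec, hadException) {
  // The first matching rule wins, so more specific rules go first.
  for (const { type, pattern } of FAILURE_RULES) {
    if (pattern.test(text)) return type;
  }
  // No known pattern: a run shorter than 10s counts as an instant exit.
  if (durationSec !== null && durationSec < 10) return 'instant_exit';
  return hadException ? 'other_exception' : 'task_failure';
}
```

Keeping the rules in a single array means a new category only needs one new entry, at the position that reflects its priority.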
- `impl/bin/benchmark-analyzer.js`: Main analyzer script
- `classifyFailure(trialDir)`: Determine failure type
- `analyzeJob(jobDir, failedOnly)`: Analyze all trials
- `generateRetryCmd(trials, jobDir)`: Create retry command
- `generateStats(trials)`: JSON statistics output

Use this skill:
Don't use this skill:
```yaml
- name: benchmark-analysis-check
  interval: 6h
  priority: low
  prompt: "Run benchmark-analyzer.js on latest job directory. Report failure stats and retry recommendations."
```
```bash
# Analyze job directory
node impl/bin/benchmark-analyzer.js jobs/2026-04-17__15-00-00

# Show only failed tasks
node impl/bin/benchmark-analyzer.js jobs/... --failed-only

# Generate retry command
node impl/bin/benchmark-analyzer.js jobs/... --retry-cmd

# Output JSON stats
node impl/bin/benchmark-analyzer.js jobs/... --json
```
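
The commands above take one job directory plus optional flags. The following is a rough sketch of how such arguments might be parsed; `parseCliArgs` and the returned field names are hypothetical, and the real script may handle its arguments differently.

```javascript
// Parse a job directory and the three flags shown in the usage examples.
// The function name and the result shape are assumptions for illustration.
function parseCliArgs(argv) {
  const flags = { failedOnly: false, retryCmd: false, json: false };
  let jobDir = null;
  for (const arg of argv) {
    if (arg === '--failed-only') flags.failedOnly = true;
    else if (arg === '--retry-cmd') flags.retryCmd = true;
    else if (arg === '--json') flags.json = true;
    else if (arg.startsWith('--')) throw new Error(`Unknown option: ${arg}`);
    else if (jobDir === null) jobDir = arg; // first positional argument
    else throw new Error(`Unexpected argument: ${arg}`);
  }
  if (jobDir === null) throw new Error('Missing job directory');
  return { jobDir, ...flags };
}
```

In a script, this would typically be called as `parseCliArgs(process.argv.slice(2))`.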
```
============================================================
JOB: 2026-04-17__15-00-00
============================================================
Total: 89  Passed: 45  Failed: 44  Rate: 50.6%

FAILURE BREAKDOWN:
  timeout      : 15 tasks
    - circuit-fibsqrt
    - compile-compcert
    ...
  rate_limit   : 10 tasks
    - frontend-design
    ...
  task_failure : 19 tasks
    ...

ALL TASKS (89):
  ✅ chess-best-move        120s
  ✅ distribution-search    85s
  ❌ circuit-fibsqrt        N/A  [timeout]
  ❌ frontend-design        15s  [rate_limit]
  ...
```
```
RETRY COMMAND (29 tasks, excluding timeouts):
harbor run -d "[email protected]" \
  --agent-import-path benchmarks.harbor_agent:HarnessAgent \
  -k 1 \
  --n-concurrent 1 \
  --agent-setup-timeout-multiplier 2 \
  --max-retries 3 \
  --retry-include DaytonaError \
  --retry-include AgentSetupTimeoutError \
  --retry-include AddTestsDirError \
  --task-name frontend-design \
  --task-name distribution-search \
  ...
```
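
The retry command lists every failed task except timeouts, one `--task-name` flag per task. The following sketch shows one way to assemble such a command. It is not the actual `generateRetryCmd`: the function name `buildRetryCommand`, the trial shape `{ name, passed, failureType }`, and the `dataset` parameter are assumptions, and only a subset of the flags from the sample output is reproduced.

```javascript
// Build a retry command for failed trials, skipping timeouts.
// `dataset` is supplied by the caller; the trial shape is assumed.
function buildRetryCommand(trials, dataset) {
  const retryable = trials.filter(
    (t) => !t.passed && t.failureType !== 'timeout'
  );
  if (retryable.length === 0) return null; // nothing worth retrying

  const parts = [
    `harbor run -d "${dataset}"`,
    '--agent-import-path benchmarks.harbor_agent:HarnessAgent',
    '-k 1',
    '--n-concurrent 1',
    '--max-retries 3',
    ...retryable.map((t) => `--task-name ${t.name}`),
  ];
  // Join with a backslash-newline so the result pastes cleanly into a shell.
  return parts.join(' \\\n  ');
}
```

Timeouts are left out on the assumption that a task which ran out of time is unlikely to pass on a plain retry, whereas rate limits and setup errors are often transient.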
```json
{
  "summary": {
    "total": 89,
    "passed": 45,
    "failed": 44,
    "passRate": "50.6%"
  },
  "failures": {
    "timeout": 15,
    "rate_limit": 10,
    "task_failure": 19
  },
  "timing": {
    "avgDuration": 85,
    "minDuration": 15,
    "maxDuration": 1200
  },
  "timestamp": 1776409234567
}
```
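
A statistics object with this shape can be derived from a flat list of trial records. The following is a minimal sketch rather than the real `generateStats`: the name `computeStats` and the trial fields `passed`, `failureType`, and `durationSec` are assumptions, and rounding the average to a whole second is a guess based on the sample values.

```javascript
// Compute summary, failure counts, and timing from trial records.
// Assumed trial shape: { passed: boolean, failureType: string|null,
// durationSec: number|null }.
function computeStats(trials) {
  const total = trials.length;
  const passed = trials.filter((t) => t.passed).length;

  // Count failed trials per failure type.
  const failures = {};
  for (const t of trials) {
    if (!t.passed) failures[t.failureType] = (failures[t.failureType] || 0) + 1;
  }

  // Trials without a recorded duration (shown as N/A) are skipped.
  const durations = trials
    .map((t) => t.durationSec)
    .filter((d) => typeof d === 'number');
  const timing = durations.length === 0 ? null : {
    avgDuration: Math.round(durations.reduce((a, b) => a + b, 0) / durations.length),
    minDuration: Math.min(...durations),
    maxDuration: Math.max(...durations),
  };

  const passRate = total === 0 ? '0.0%' : `${((passed / total) * 100).toFixed(1)}%`;
  return {
    summary: { total, passed, failed: total - passed, passRate },
    failures,
    timing,
    timestamp: Date.now(),
  };
}
```

With 45 passes out of 89 trials, the same formula yields the `"50.6%"` shown in the sample output.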
Extend classifyFailure() to detect additional patterns:
```javascript
if (text.includes('out of memory')) {
  return 'memory_limit';
}
if (text.includes('disk full')) {
  return 'disk_space';
}
```
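```yaml
- name: Analyze results
  run: |
    node impl/bin/benchmark-analyzer.js jobs/latest --json > stats.json
- name: Check pass rate
  run: |
    # passRate is a string such as "50.6%". Strip the % sign and compare
    # as a float, because the shell's -lt operator only accepts integers.
    PASS_RATE=$(jq -r '.summary.passRate' stats.json | sed 's/%//')
    if awk -v r="$PASS_RATE" 'BEGIN { exit !(r < 50) }'; then
      echo "Pass rate too low: $PASS_RATE%"
      exit 1
    fi
```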
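
Because `passRate` is emitted as a string with a percent sign and a decimal part, the threshold check can also be done in Node, where the value is parsed as a float. The following is a small sketch; the helper names `parsePassRate` and `meetsThreshold` are hypothetical and not part of the analyzer.

```javascript
// Parse a passRate string such as "50.6%" into a number.
// Returns NaN when the value is not numeric.
function parsePassRate(passRate) {
  return Number.parseFloat(String(passRate).replace('%', ''));
}

// True when the stats object reports a pass rate of at least minPercent.
// A malformed or missing rate is treated as a failure.
function meetsThreshold(stats, minPercent) {
  const rate = parsePassRate(stats.summary.passRate);
  return Number.isFinite(rate) && rate >= minPercent;
}
```

A CI step could load `stats.json`, call `meetsThreshold(stats, 50)`, and exit with a non-zero code when it returns `false`.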
Derived from Harness Engineering's analyze_results.py:
https://github.com/lazyFrogLOL/Harness_Engineering
Reference: scripts/analyze_results.py