bianhaifeng789-hue/skills-experimental/benchmark-analyzer

Name: skills-experimental/benchmark-analyzer
Author: bianhaifeng789-hue

skills-experimental/benchmark-analyzer/SKILL.md

npx skillsauth add bianhaifeng789-hue/openclaw-config skills-experimental/benchmark-analyzer

benchmark-analyzer

Description

Analyze benchmark test results with failure classification, statistics reporting, and retry command generation.

Usage

Use this skill to analyze results from Terminal-Bench 2.0 or other benchmarks: it classifies failures, generates statistics, and recommends retries.

How It Works

Failure Classification

The analyzer classifies failures into these categories:

| Type | Description |
|------|-------------|
| rate_limit | API rate limit exceeded (429) |
| timeout | Agent or task timed out |
| missing_tool | Required tool/command not found |
| missing_module | Python module not installed |
| docker_conflict | Docker container name conflict |
| api_error | API connection error |
| instant_exit | Agent exited immediately (<10s) |
| task_failure | Task logic failed |
| other_exception | Unclassified exception |
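As a rough illustration of how the categories above could be assigned, here is a minimal sketch. The function name classifyFromText, the RULES list, and every match pattern are assumptions made for this example; they are not taken from impl/bin/benchmark-analyzer.js, whose classifyFailure(trialDir) reads the trial files itself.

```javascript
// Hypothetical sketch: the match patterns below are assumptions, not the
// rules used by impl/bin/benchmark-analyzer.js.
const RULES = [
  ['rate_limit', /\b429\b|rate limit/i],
  ['timeout', /timed? ?out/i],
  ['missing_tool', /command not found/i],
  ['missing_module', /ModuleNotFoundError|No module named/i],
  ['docker_conflict', /container name .* is already in use/i],
  ['api_error', /connection (error|refused|reset)/i],
];

// exceptionText: contents of exception.txt ('' if absent).
// durationSec: trial duration in seconds, or null if unknown.
function classifyFromText(exceptionText, durationSec) {
  const text = exceptionText || '';
  // First matching rule wins, so more specific rules are listed first.
  for (const [type, pattern] of RULES) {
    if (pattern.test(text)) return type;
  }
  // No exception text: separate very short runs from ordinary task failures.
  if (text.trim() === '') {
    return durationSec !== null && durationSec < 10
      ? 'instant_exit'
      : 'task_failure';
  }
  return 'other_exception';
}

console.log(classifyFromText('HTTP 429 Too Many Requests', 15)); // rate_limit
```

Keeping the rules in an ordered list means a new category can be added by appending one entry, at the cost of making the result depend on rule order.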

Analysis Process

Read job directory: Find all trial subdirectories
Parse results: Extract reward, duration, task_name
Classify failures: Analyze exception.txt and result.json
Generate stats: Pass rate, failure breakdown, timing
Retry command: Generate harbor retry command (excluding timeouts)
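The first two steps listed above (finding trial subdirectories and parsing each result) might look roughly like the sketch below. The function name readTrials is hypothetical, and the assumptions that result.json holds task_name, reward, and duration as top-level fields and that a reward above zero means a pass are illustrative only.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical sketch: file layout and field names are assumptions based on
// the step list above, not on the actual analyzer source.
function readTrials(jobDir) {
  return fs
    .readdirSync(jobDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory()) // each subdirectory is one trial
    .map((entry) => {
      const trialDir = path.join(jobDir, entry.name);
      const resultPath = path.join(trialDir, 'result.json');
      const exceptionPath = path.join(trialDir, 'exception.txt');

      let result = {};
      if (fs.existsSync(resultPath)) {
        try {
          result = JSON.parse(fs.readFileSync(resultPath, 'utf8'));
        } catch {
          // Leave result empty when the file is not valid JSON.
        }
      }

      const reward = result.reward ?? 0;
      return {
        trial: entry.name,
        taskName: result.task_name ?? entry.name,
        reward,
        duration: result.duration ?? null,
        passed: reward > 0, // assumption: any positive reward counts as a pass
        exceptionText: fs.existsSync(exceptionPath)
          ? fs.readFileSync(exceptionPath, 'utf8')
          : '',
      };
    });
}
```

Returning plain objects keeps the later steps (classification, statistics, retry command) independent of the file system.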

Implementation

Files

impl/bin/benchmark-analyzer.js: Main analyzer script

Key Functions

classifyFailure(trialDir): Determine failure type
analyzeJob(jobDir, failedOnly): Analyze all trials
generateRetryCmd(trials, jobDir): Create retry command
generateStats(trials): JSON statistics output

When to Use

Use this skill:

After running benchmark tests
When you need to understand failure patterns
When you want retry recommendations
When generating test reports

Don't use this skill:

When there are no benchmark results to analyze
For real-time monitoring (use heartbeat instead)

Integration

Heartbeat Task

- name: benchmark-analysis-check
  interval: 6h
  priority: low
  prompt: "Run benchmark-analyzer.js on latest job directory. Report failure stats and retry recommendations."

CLI Commands

# Analyze job directory
node impl/bin/benchmark-analyzer.js jobs/2026-04-17__15-00-00

# Show only failed tasks
node impl/bin/benchmark-analyzer.js jobs/... --failed-only

# Generate retry command
node impl/bin/benchmark-analyzer.js jobs/... --retry-cmd

# Output JSON stats
node impl/bin/benchmark-analyzer.js jobs/... --json
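For orientation, the flags shown above could be read from the argument list along these lines. This is a sketch only: parseCliArgs is a hypothetical name, and the actual script may parse its arguments differently.

```javascript
// Hypothetical sketch: recognizes only the three flags shown in the commands
// above and treats the first non-flag argument as the job directory.
function parseCliArgs(argv) {
  const flags = new Set(argv.filter((arg) => arg.startsWith('--')));
  const positionals = argv.filter((arg) => !arg.startsWith('--'));
  return {
    jobDir: positionals[0] ?? null,
    failedOnly: flags.has('--failed-only'),
    retryCmd: flags.has('--retry-cmd'),
    json: flags.has('--json'),
  };
}

// In a script, the user-supplied arguments start at index 2 of process.argv.
console.log(parseCliArgs(process.argv.slice(2)));
```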

Example Output

Analysis Report

============================================================
JOB: 2026-04-17__15-00-00
============================================================
Total: 89  Passed: 45  Failed: 44  Rate: 50.6%

FAILURE BREAKDOWN:
  timeout              : 15 tasks
    - circuit-fibsqrt
    - compile-compcert
    ...
  rate_limit           : 10 tasks
    - frontend-design
    ...
  task_failure         : 19 tasks
    ...

ALL TASKS (89):
  ✅ chess-best-move                          120s
  ✅ distribution-search                       85s
  ❌ circuit-fibsqrt                          N/A [timeout]
  ❌ frontend-design                          15s  [rate_limit]
  ...

Retry Command

RETRY COMMAND (29 tasks, excluding timeouts):

harbor run -d "[email protected]" \
  --agent-import-path benchmarks.harbor_agent:HarnessAgent \
  -k 1 \
  --n-concurrent 1 \
  --agent-setup-timeout-multiplier 2 \
  --max-retries 3 \
  --retry-include DaytonaError \
  --retry-include AgentSetupTimeoutError \
  --retry-include AddTestsDirError \
  --task-name frontend-design \
  ...
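The selection logic behind a command like the one above (retry failed tasks, skip timeouts) can be sketched as follows. The function name buildRetryCommand and the trial fields passed, failureType, and taskName are hypothetical. Only two flags from the example are reproduced, and the dataset string is passed in as a parameter because its value is not legible in the example.

```javascript
// Hypothetical sketch: assumes task names contain only shell-safe characters.
function buildRetryCommand(trials, dataset) {
  // Keep failed trials whose failure type is anything other than a timeout.
  const retryable = trials.filter(
    (t) => !t.passed && t.failureType !== 'timeout'
  );
  // De-duplicate task names while preserving their original order.
  const taskNames = [...new Set(retryable.map((t) => t.taskName))];
  if (taskNames.length === 0) return null;

  const parts = [
    `harbor run -d ${JSON.stringify(dataset)}`,
    '--max-retries 3',
    ...taskNames.map((name) => `--task-name ${name}`),
  ];
  // Join with a trailing backslash so the command spans several lines.
  return parts.join(' \\\n  ');
}
```

Returning null when nothing is retryable lets the caller print a short message instead of an empty command.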

JSON Stats

{
  "summary": {
    "total": 89,
    "passed": 45,
    "failed": 44,
    "passRate": "50.6%"
  },
  "failures": {
    "timeout": 15,
    "rate_limit": 10,
    "task_failure": 19
  },
  "timing": {
    "avgDuration": 85,
    "minDuration": 15,
    "maxDuration": 1200
  },
  "timestamp": 1776409234567
}
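One way to produce a structure shaped like the JSON above is sketched below. The output field names mirror the example; the function name summarize and the input fields passed, failureType, and duration are assumptions, and the real generateStats(trials) may differ.

```javascript
// Hypothetical sketch of the statistics step.
function summarize(trials) {
  const total = trials.length;
  const passed = trials.filter((t) => t.passed).length;

  // Count failed trials per failure type.
  const failures = {};
  for (const t of trials) {
    if (!t.passed) {
      failures[t.failureType] = (failures[t.failureType] || 0) + 1;
    }
  }

  // Trials without a numeric duration (shown as N/A in the report) are skipped.
  const durations = trials
    .map((t) => t.duration)
    .filter((d) => typeof d === 'number');

  return {
    summary: {
      total,
      passed,
      failed: total - passed,
      passRate:
        total === 0 ? '0.0%' : `${((passed / total) * 100).toFixed(1)}%`,
    },
    failures,
    timing:
      durations.length === 0
        ? null
        : {
            avgDuration: Math.round(
              durations.reduce((sum, d) => sum + d, 0) / durations.length
            ),
            minDuration: Math.min(...durations),
            maxDuration: Math.max(...durations),
          },
    timestamp: Date.now(),
  };
}
```

With 45 passes out of 89 trials, this formatting yields the "50.6%" pass rate shown in the example.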

Benefits

Insight into failures: Understand why tasks failed
Retry optimization: Exclude tasks unlikely to pass on retry
Performance analysis: Identify slow tasks
Reporting: Generate statistics for documentation

Advanced Usage

Custom Failure Classification

Extend classifyFailure() to detect additional patterns:

if (text.includes('out of memory')) {
  return 'memory_limit';
}
if (text.includes('disk full')) {
  return 'disk_space';
}

Integration with CI/CD

- name: Analyze results
  run: |
    node impl/bin/benchmark-analyzer.js jobs/latest --json > stats.json

- name: Check pass rate
  run: |
    # passRate looks like "50.6%"; strip the percent sign, then compare the
    # integer part, because [ -lt ] only accepts integers.
    PASS_RATE=$(jq -r '.summary.passRate' stats.json | sed 's/%//')
    if [ "${PASS_RATE%.*}" -lt 50 ]; then
      echo "Pass rate too low: $PASS_RATE%"
      exit 1
    fi
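If the threshold check is done in Node instead of the shell, the percentage string can be compared numerically. The helper below is a sketch with a hypothetical name, isBelowThreshold; it assumes the stats object has the shape shown in the JSON Stats example.

```javascript
// Hypothetical helper: parses a "50.6%" style string and compares it as a
// number, so fractional pass rates are handled.
function isBelowThreshold(stats, thresholdPercent) {
  // parseFloat reads the leading number and ignores the trailing "%".
  const rate = Number.parseFloat(stats.summary.passRate);
  if (Number.isNaN(rate)) {
    throw new Error(`Unparseable passRate: ${stats.summary.passRate}`);
  }
  return rate < thresholdPercent;
}

console.log(isBelowThreshold({ summary: { passRate: '50.6%' } }, 50)); // false
```

A CI step could load stats.json, call this helper, and exit with a non-zero status when it returns true.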

Source

Derived from Harness Engineering's analyze_results.py: https://github.com/lazyFrogLOL/Harness_Engineering

Reference: scripts/analyze_results.py

development

Updated Apr 20, 2026

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 20, 2026, 3:06 PM (4.5s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/bianhaifeng789-hue/openclaw-config.git

# Copy into Claude Code skills folder (global)
cp -r openclaw-config/skills-experimental/benchmark-analyzer ~/.claude/skills/

Repository: bianhaifeng789-hue/openclaw-config

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT