skills-catalog/ln-840-benchmark-compare/SKILL.md
Use when benchmarking hex-line MCP against Claude built-in tools with scenario manifests, activation checks, and diff-based correctness.
Paths: File paths (shared/, references/) are relative to the skills repo root. Locate this SKILL.md directory and go up one level for the repo root.
Type: L3 Worker Category: 8XX Optimization -> 840 Benchmark
Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, manifest-driven, and runtime-backed. It measures activation, correctness, time, cost, and tokens. The current runner is intentionally scoped to this internal A/B. It does not, by itself, prove best-in-class against external alternatives.
| Direction | Content |
|-----------|----------|
| Input | Repo checkout containing mcp/hex-line-mcp/, optional references/goals.md, optional references/expectations.json |
| Output | Comparison report in skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md plus machine-readable benchmark summary artifact |
- claude --version succeeds
- git succeeds
- mcp/hex-line-mcp/server.mjs exists
- mcp/hex-line-mcp/hook.mjs exists
- skills-catalog/ln-840-benchmark-compare/references/goals.md exists
- skills-catalog/ln-840-benchmark-compare/references/expectations.json exists
- skills-catalog/ln-840-benchmark-compare/references/mcp-bench.json exists

bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh \
[skills-catalog/ln-840-benchmark-compare/references/goals.md] \
[skills-catalog/ln-840-benchmark-compare/references/expectations.json]
Optional extra session profile:
EXTRA_SESSION_ID=other-mcp \
EXTRA_SESSION_LABEL="Other MCP" \
EXTRA_MCP_CONFIG=/abs/path/to/other-mcp.json \
EXTRA_SETTINGS='{"disableAllHooks":true}' \
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh
MANDATORY READ: Load shared/references/monitor_integration_pattern.md
Stream benchmark progress:
Monitor(command="bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh 2>&1 | grep --line-buffered -E 'scenario|PASS|FAIL|error|session'", timeout_ms=3600000, description="benchmark run")
Fallback: Bash(run_in_background=true).
The runner handles:
- goals.md

Current scope:
- hex-line
- EXTRA_SESSION_* environment variables

External baseline note:
- goals.md and expectations.json
- stream-json log shape and diff artifacts

Use one canonical pair owned by this skill:
- skills-catalog/ln-840-benchmark-compare/references/goals.md
- skills-catalog/ln-840-benchmark-compare/references/expectations.json

Rules:
- hex-line.
- Each scenario in goals.md must have a matching entry in expectations.json.
- expectations.json is the source of truth for correctness.

Supported expectation fields per scenario:
| Field | Meaning |
|-------|---------|
| id | Scenario identifier used in result filenames |
| expectedChangedFiles | Files that must change |
| forbiddenChangedFiles | Files that must not change |
| requiredDiffPatterns | Regex patterns required in the saved diff |
| forbiddenDiffPatterns | Regex patterns that must not appear in the diff |
| requiredResultPatterns | Regex patterns required in the final assistant result text |
| requiredCommands | Regex patterns that must match at least one Bash command |
| exactChangedFiles | If true, no extra changed files are allowed |
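The expectation fields in the table above can be read as a set of independent checks. The following is a minimal sketch of such an evaluator, not the actual logic of parse-results.mjs: the function name `evaluateScenario` and the shape of the `observed` argument (`changedFiles`, `diff`, `resultText`, `bashCommands`) are assumptions made for illustration.

```javascript
// Hypothetical evaluator: checks one scenario's observed outcome against
// its expectation entry. Field names come from the table above; the shape
// of `observed` is an assumption made for this sketch.
function evaluateScenario(expectation, observed) {
  const failures = [];
  const changed = new Set(observed.changedFiles);
  // True when the regex pattern matches at least one of the given texts.
  const anyMatch = (pattern, texts) => texts.some((t) => new RegExp(pattern).test(t));

  for (const file of expectation.expectedChangedFiles ?? []) {
    if (!changed.has(file)) failures.push(`expected change missing: ${file}`);
  }
  for (const file of expectation.forbiddenChangedFiles ?? []) {
    if (changed.has(file)) failures.push(`forbidden file changed: ${file}`);
  }
  if (expectation.exactChangedFiles) {
    const allowed = new Set(expectation.expectedChangedFiles ?? []);
    for (const file of changed) {
      if (!allowed.has(file)) failures.push(`unexpected extra change: ${file}`);
    }
  }
  for (const p of expectation.requiredDiffPatterns ?? []) {
    if (!anyMatch(p, [observed.diff])) failures.push(`diff lacks pattern: ${p}`);
  }
  for (const p of expectation.forbiddenDiffPatterns ?? []) {
    if (anyMatch(p, [observed.diff])) failures.push(`diff contains forbidden pattern: ${p}`);
  }
  for (const p of expectation.requiredResultPatterns ?? []) {
    if (!anyMatch(p, [observed.resultText])) failures.push(`result lacks pattern: ${p}`);
  }
  for (const p of expectation.requiredCommands ?? []) {
    if (!anyMatch(p, observed.bashCommands)) failures.push(`no Bash command matches: ${p}`);
  }
  return { id: expectation.id, pass: failures.length === 0, failures };
}
```

Collecting every failure rather than stopping at the first keeps the report useful when a scenario violates several fields at once.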
The runner must pass:
- node --check server.mjs
- node --check hook.mjs
- node --check extract-scenarios.mjs
- node --check parse-results.mjs
- hook.mjs

If preflight fails, the benchmark is invalid and must stop before scenarios run.
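The syntax checks in the preflight list can be sketched as a small Node helper. This is an illustration under assumptions, not the runner's own code: `nodeSyntaxCheck` and `runPreflight` are hypothetical names, and callers are assumed to pass full paths because the source does not state where each script lives.

```javascript
// Runs `node --check <file>` and reports whether the file parses.
async function nodeSyntaxCheck(file) {
  // Dynamic import works from both CommonJS and ES module files.
  const { spawnSync } = await import("node:child_process");
  const result = spawnSync(process.execPath, ["--check", file], { encoding: "utf8" });
  return { ok: result.status === 0, detail: (result.stderr || "").trim() };
}

// Hypothetical helper: runs every check and reports all failures at once.
// The `check` parameter is injectable so the flow can be exercised
// without touching the file system.
async function runPreflight(files, check = nodeSyntaxCheck) {
  const failures = [];
  for (const file of files) {
    const { ok, detail } = await check(file);
    if (!ok) failures.push({ file, detail });
  }
  return { ok: failures.length === 0, failures };
}
```

A caller would stop the benchmark as soon as `runPreflight` returns `ok: false`, which matches the rule that a failed preflight invalidates the run before any scenario starts.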
For each ## scenario in goals.md:
- .jsonl logs and .diff.txt artifacts

Built-in session:

Hex-line session:
- server.mjs
- outputStyle: "hex-line"
- PreToolUse hook through hook.mjs

parse-results.mjs evaluates each scenario for both sessions.
Scenario pass requires:
The final report has these sections:
Interpretation rules:
- invalid run means setup/adoption failure, not product performance
- FAIL means correctness contract was not met
- hex-line, not external noise

skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md must answer:
- Did hex-line activate cleanly without discovery drift?

Do not treat raw time/cost as sufficient without scenario correctness.

- If runs compare hex-line against external alternatives, they must reuse the same goals.md, expectations.json, and diff-based evaluation rules.
- hex-line must say so explicitly.

MANDATORY READ: Load shared/references/benchmark_worker_runtime_contract.md, shared/references/coordinator_summary_contract.md
Runtime CLI:
node shared/scripts/benchmark-worker-runtime/cli.mjs start --skill ln-840-benchmark-compare --identifier suite-default --manifest-file <file>
node shared/scripts/benchmark-worker-runtime/cli.mjs checkpoint --skill ln-840-benchmark-compare --identifier suite-default --phase PHASE_0_CONFIG --payload '{...}'
node shared/scripts/benchmark-worker-runtime/cli.mjs record-summary --skill ln-840-benchmark-compare --identifier suite-default --payload '{...}'
node shared/scripts/benchmark-worker-runtime/cli.mjs complete --skill ln-840-benchmark-compare --identifier suite-default
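The checkpoint command above takes a phase name and a JSON payload. As a sketch, a helper can build its argument list and reject phase names that are not among this skill's domain checkpoints. The helper name `checkpointArgs` and the payload contents are assumptions; the flags and phase names are copied from this document.

```javascript
// Phase names as listed under "Domain checkpoints" in this document.
const PHASES = [
  "PHASE_0_CONFIG",
  "PHASE_1_PREFLIGHT",
  "PHASE_2_LOAD_SUITE",
  "PHASE_3_RUN_SCENARIOS",
  "PHASE_4_PARSE_RESULTS",
  "PHASE_5_WRITE_REPORT",
  "PHASE_6_WRITE_SUMMARY",
  "PHASE_7_SELF_CHECK",
];

// Hypothetical helper: builds the argv for one checkpoint call.
// Flags mirror the commands above; the payload shape is an assumption.
function checkpointArgs(phase, payload, identifier = "suite-default") {
  if (!PHASES.includes(phase)) throw new Error(`unknown phase: ${phase}`);
  return [
    "shared/scripts/benchmark-worker-runtime/cli.mjs",
    "checkpoint",
    "--skill", "ln-840-benchmark-compare",
    "--identifier", identifier,
    "--phase", phase,
    "--payload", JSON.stringify(payload),
  ];
}
```

Passing the array to a process spawner as separate arguments, rather than joining it into a shell string, avoids quoting problems with the JSON payload.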
Required state fields:
- report_ready
- summary_recorded
- final_result
- self_check_passed

Domain checkpoints:
- PHASE_0_CONFIG
- PHASE_1_PREFLIGHT
- PHASE_2_LOAD_SUITE
- PHASE_3_RUN_SCENARIOS
- PHASE_4_PARSE_RESULTS
- PHASE_5_WRITE_REPORT
- PHASE_6_WRITE_SUMMARY
- PHASE_7_SELF_CHECK

Guard rules:
- benchmark-worker summary is recorded
- runId and exact summaryArtifactPath.
- benchmark-worker family.

MANDATORY READ: Load shared/references/coordinator_summary_contract.md
Emit a benchmark-worker summary envelope after the comparison report is written.
Managed mode:
- summaryArtifactPath

Standalone mode:
- .hex-skills/runtime-artifacts/runs/{run_id}/benchmark-worker/ln-840-benchmark-compare--{identifier}.json

Recommended payload:
- scenarios_total
- scenarios_passed
- scenarios_failed
- activation_valid
- validity_verdict
- report_path
- warnings
- metrics

| Pitfall | Solution |
|---------|----------|
| SessionStart not present in hex-line run | Fail preflight and stop |
| Agent drifts into ToolSearch before hex-line use | Treat as activation problem and capture in report |
| Worktree already exists from prior crash | Remove it before adding a new one |
| Diff artifacts missing | Treat scenario correctness as failed |
| Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking |
| External comparison uses edited scenarios or relaxed expectations | Treat the comparison as invalid |
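The recommended payload fields listed earlier in this section can be assembled from per-scenario results. This is a sketch only: the builder name `buildSummaryPayload`, the input shape (an array of `{ id, pass }` results), and the `"valid"` / `"invalid"` verdict strings are assumptions, since the source names the fields but not their values.

```javascript
// Hypothetical builder for the recommended payload fields. The input shape
// and the verdict strings are assumptions made for this sketch.
function buildSummaryPayload({ results, activationValid, reportPath, warnings = [], metrics = {} }) {
  const passed = results.filter((r) => r.pass).length;
  return {
    scenarios_total: results.length,
    scenarios_passed: passed,
    scenarios_failed: results.length - passed,
    activation_valid: activationValid,
    // Assumed mapping: a run with failed activation is not a valid benchmark.
    validity_verdict: activationValid ? "valid" : "invalid",
    report_path: reportPath,
    warnings,
    metrics,
  };
}
```

The result can be serialized with `JSON.stringify` and passed as the `--payload` value of the `record-summary` command.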

- goals.md defines the canonical balanced suite
- expectations.json fully describes scenario correctness
- skills-catalog/ln-840-benchmark-compare/results/
- benchmark-worker summary artifact is written to the managed or standalone runtime path

Version: 2.0.0 Last Updated: 2026-03-24