skills/auto-benchmark/SKILL.md
Use this skill when building or operating a continuous, automated benchmarking system that tracks competitor performance, ingests the latest research, generates improvement hypotheses, runs experiments autonomously, and keeps a solution ranked #1.
A continuous, autonomous benchmarking system that monitors the competitive landscape, extracts insights from the latest research, proposes and runs improvement experiments, and keeps your solution ranked #1 — so engineers and researchers can focus on building product rather than running benchmarks manually.
The system operates as a closed loop that runs on a schedule (daily, weekly, or on trigger):
┌─────────────────────────────────────────────────────────────────┐
│                         CONTINUOUS LOOP                         │
│                                                                 │
│  [1] Monitor           [2] Ingest           [3] Hypothesize     │
│      Competitors   →       Research     →       Improvements    │
│      & Leaderboards        Papers               from Gap        │
│        ↑                                           ↓            │
│  [6] Report        ←   [5] Promote      ←   [4] Experiment      │
│      Stakeholders          Winners              Autonomously    │
└─────────────────────────────────────────────────────────────────┘
Each iteration of the loop answers one question: "What can we do right now to move from our current rank to #1?"
Do this once at system initialization; update the registry whenever new competitors emerge.
Store a competitive_registry.yaml that is the system's single source of truth:
domain: memory  # e.g., memory, retrieval, reasoning, vision
target_leaderboards:
  - name: MemoryBench
    url: https://...
    scrape_method: html_table  # or api, rss, manual
    primary_metric: accuracy
    higher_is_better: true
  - name: LongContextEval
    url: https://...
    scrape_method: api
    primary_metric: f1_score
    higher_is_better: true
competitors:
  - name: CompetitorA
    latest_score: 0.847
    source: MemoryBench
    last_updated: 2026-02-01
  - name: CompetitorB
    latest_score: 0.831
    source: MemoryBench
    last_updated: 2026-02-05
our_solution:
  name: OurSystem
  current_scores:
    MemoryBench: 0.823
    LongContextEval: 0.791
  promotion_threshold: 0.005  # minimum improvement over current score to promote
State explicitly what "#1" means for each leaderboard:
schedule:
  leaderboard_scrape: "0 6 * * *"  # daily at 6am
  research_ingest: "0 7 * * 1"     # weekly on Monday
  experiment_sweep: "0 8 * * *"    # daily at 8am
  report_digest: "0 9 * * 1"       # weekly digest on Monday
On each scheduled run, update the competitive landscape before doing anything else.
For each leaderboard in the registry:
- Update competitive_registry.yaml with the latest scores.

Emit a competitive delta report on any change:
[ALERT] CompetitorA improved on MemoryBench: 0.847 → 0.861 (+0.014)
[ALERT] New entrant: StartupX at 0.855 — now ranked #2, ahead of us
[STATUS] Our rank: #3 | Gap to #1: -0.038
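The alert lines above can be derived by diffing two score snapshots of the same leaderboard. The following is a minimal sketch, assuming each snapshot is a plain `{system: score}` dict and that higher is better; the function name `competitive_alerts` and the exact message wording are illustrative, not part of the skill.

```python
def competitive_alerts(previous, current, our_name="OurSystem"):
    """Compare two {system: score} snapshots of one leaderboard.

    Returns alert lines for competitors that improved and for new
    entrants, ordered by current rank (assumes higher is better).
    """
    ranked = sorted(current, key=current.get, reverse=True)
    alerts = []
    for rank, name in enumerate(ranked, start=1):
        if name == our_name:
            continue  # our own score changes are handled by promotion, not alerts
        score = current[name]
        if name not in previous:
            alerts.append(f"[ALERT] New entrant: {name} at {score:.3f} (rank #{rank})")
        elif score > previous[name]:
            delta = score - previous[name]
            alerts.append(
                f"[ALERT] {name} improved: {previous[name]:.3f} -> {score:.3f} ({delta:+.3f})"
            )
    return alerts
```

Competitors whose scores dropped are deliberately ignored here; whether a drop deserves an alert is a policy choice the skill does not specify.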
Produce a ranked table after every scrape:
| Rank | System | MemoryBench | LongContextEval | Δ to Our Score |
|------|--------------|-------------|-----------------|----------------|
| #1 | CompetitorA | 0.861 | 0.812 | -0.038 |
| #2 | StartupX | 0.855 | 0.798 | -0.032 |
| #3 | OurSystem | 0.823 | 0.791 | — |
| #4 | CompetitorB | 0.801 | 0.764 | +0.022 |
The gap to #1 on each leaderboard is the primary input to Phase 3.
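As a rough illustration of how the ranked table and the gap to #1 could be computed for a single leaderboard, here is a small sketch. It assumes scores are held in a `{system: score}` dict and that higher is better; `ranked_rows` and `gap_to_first` are hypothetical helper names.

```python
def ranked_rows(scores, our_name="OurSystem"):
    """Return (rank, system, score, delta) tuples, best score first.

    delta is our score minus the other system's score, so a negative
    value means that system is ahead of us. Our own row gets None.
    """
    ours = scores[our_name]
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    for rank, (name, score) in enumerate(ordered, start=1):
        delta = None if name == our_name else round(ours - score, 3)
        rows.append((rank, name, score, delta))
    return rows


def gap_to_first(scores, our_name="OurSystem"):
    """Our score minus the best score; 0.0 when we hold (or tie for) #1."""
    return round(scores[our_name] - max(scores.values()), 3)
```

With the MemoryBench column from the table above, `gap_to_first` yields -0.038, the value passed on to hypothesis generation.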
Continuously pull the latest research and translate it into actionable improvement candidates.
Configure sources in research_config.yaml:
research_sources:
  arxiv_queries:
    - "memory augmented neural networks"
    - "long context transformers 2026"
    - "retrieval augmented generation benchmark"
  venues:
    - ICLR 2026
    - NeurIPS 2025
    - ICML 2026
  competitor_blogs:
    - https://competitor-a.ai/research
  citation_tracking:
    - track papers that cite our core method
For each new paper found:
Log an entry in research_log.md:

## [2026-02-10] Paper: "HyperMemory: Hierarchical State Spaces for Long-Context Recall"
**Source:** arXiv:2602.XXXXX
**Relevant leaderboard:** MemoryBench
**Reported gain:** +4.2% on MemoryBench vs prior SOTA
**Techniques extracted:**
- Hierarchical state compression (effort: Medium, impact: High) ← PRIORITIZED
- Cosine decay + warmup schedule (effort: Low, impact: Low)
- Synthetic data augmentation for long-range dependencies (effort: High, impact: Medium)
**Action:** Generate hypothesis for hierarchical state compression — schedule experiment.
When a competitor publishes a technical report or open-sources code:
Translate the competitive gap and research findings into a ranked experiment queue.
Each hypothesis must state:
hypothesis:
  id: H-042
  title: "Hierarchical state compression reduces long-context forgetting"
  claim: "Applying hierarchical state compression to our memory module will
    improve MemoryBench accuracy from 0.823 to ≥0.850"
  # Quoted so that " #1" is not parsed as the start of a YAML comment.
  source: "arXiv:2602.XXXXX + gap analysis (gap to #1 = 0.038)"
  target_leaderboards: [MemoryBench]
  implementation:
    change: "Replace flat KV cache with 3-level hierarchical compression"
    effort: Medium
    estimated_gain: +0.027
  priority_score: 8.1  # (estimated_gain / effort_score) × urgency_multiplier
  status: queued
Maintain an experiment_queue.yaml ranked by priority_score.
Limit the queue to the top 10 active hypotheses. Archive superseded ones.
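The scoring formula from the hypothesis template, `(estimated_gain / effort_score) × urgency_multiplier`, and the top-10 cap can be sketched as follows. The numeric `EFFORT_SCORE` mapping and the default urgency of 1.0 are assumptions made for illustration; the skill does not define these scales, so the numbers here will not reproduce the 8.1 shown in the example.

```python
# Assumed mapping from effort label to a numeric cost (not defined by the skill).
EFFORT_SCORE = {"Low": 1.0, "Medium": 2.0, "High": 4.0}


def priority_score(estimated_gain, effort, urgency_multiplier=1.0):
    """(estimated_gain / effort_score) * urgency_multiplier."""
    return (estimated_gain / EFFORT_SCORE[effort]) * urgency_multiplier


def build_queue(hypotheses, max_active=10):
    """Split hypotheses into (active, to_archive), highest priority first.

    Each hypothesis is a dict with a precomputed "priority_score" key.
    """
    ranked = sorted(hypotheses, key=lambda h: h["priority_score"], reverse=True)
    return ranked[:max_active], ranked[max_active:]
```

Anything returned in the second list is a candidate for archiving as superseded.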
For tweaks that don't come from papers (e.g., hyperparameter tuning):
experiments/
├── queue/              # pending hypotheses (YAML files)
├── active/             # currently running
├── results/
│   └── <hypothesis_id>/
│       ├── config.yaml
│       ├── metrics.json
│       ├── train.log
│       └── eval_on_leaderboard.json
├── promoted/           # configs promoted to production
└── archived/           # failed or superseded experiments
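Given the layout above, a runner needs little more than two filesystem helpers: one that records metrics under `results/<hypothesis_id>/` and one that archives a failed hypothesis together with a note. This is a sketch under stated assumptions: queued hypotheses are stored as `<hypothesis_id>.yaml`, and the `.failure.txt` note file is a hypothetical convention, not something the skill prescribes.

```python
import json
from pathlib import Path


def record_result(root, hypothesis_id, metrics):
    """Write metrics to results/<hypothesis_id>/metrics.json and return the path."""
    out_dir = Path(root) / "results" / hypothesis_id
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "metrics.json"
    path.write_text(json.dumps(metrics, indent=2))
    return path


def archive_failure(root, hypothesis_id, note):
    """Move a queued hypothesis file to archived/ and store a failure note beside it."""
    root = Path(root)
    archived = root / "archived"
    archived.mkdir(parents=True, exist_ok=True)
    src = root / "queue" / f"{hypothesis_id}.yaml"  # assumed file naming
    if src.exists():
        src.rename(archived / src.name)
    (archived / f"{hypothesis_id}.failure.txt").write_text(note)
```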
The automated runner:
- Archives configs that fail fast validation to archived/ with failure notes. Do not waste full compute on broken configs.
- Writes metrics to results/<hypothesis_id>/metrics.json.
- Moves each config to promoted/ or archived/ based on Phase 5 promotion logic.
- … (requirements.lock).

A new configuration is promoted to production only when all of the following are true:
| Criterion | Requirement |
|-----------|-------------|
| Primary metric improvement | ≥ promotion_threshold above current production score |
| Statistical significance | p < 0.05 on paired t-test vs production config |
| No regression on secondary metrics | Latency within 10%, memory within 15% |
| Reproducibility | Consistent across ≥ 3 seeds (std < 0.5% of mean) |
| Leaderboard projection | If promoted, would we reach or exceed #1? |
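The first four criteria in the table are mechanical and can be checked in one function. The sketch below assumes the paired t-test p-value is computed elsewhere and passed in, and that latency and memory are expressed as candidate/production ratios; the function and parameter names are hypothetical. The leaderboard projection row is phrased as a question, so it is left to the caller.

```python
from statistics import mean, stdev


def promotion_decision(candidate_scores, production_score, p_value,
                       latency_ratio, memory_ratio, promotion_threshold=0.005):
    """Return (approved, failed_criteria) for one candidate configuration.

    candidate_scores: primary-metric scores, one per random seed.
    latency_ratio / memory_ratio: candidate divided by production.
    """
    avg = mean(candidate_scores)
    failed = []
    if avg - production_score < promotion_threshold:
        failed.append("primary metric improvement")
    if p_value >= 0.05:
        failed.append("statistical significance")
    if latency_ratio > 1.10 or memory_ratio > 1.15:
        failed.append("secondary metric regression")
    # Need at least 3 seeds, with std below 0.5% of the mean.
    if len(candidate_scores) < 3 or stdev(candidate_scores) >= 0.005 * avg:
        failed.append("reproducibility")
    return (not failed, failed)
```

The returned list of failed criteria doubles as the raw material for a rejection note.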
If promotion is approved:
- Copy the config to promoted/ with a timestamp.
- Update competitive_registry.yaml → our_solution.current_scores.
- Record the promotion in CHANGELOG.md.

If rejected, write a clear rejection note explaining which criterion failed.
Once at #1, the system switches to defense mode:
Produce two report types automatically:
# Benchmark Digest — Week of YYYY-MM-DD
## Competitive Position
- MemoryBench: #1 ✅ (our score: 0.871 | gap to #2: +0.010)
- LongContextEval: #2 ⚠️ (our score: 0.812 | gap to #1: -0.006)
## What Changed This Week
- Promoted H-042 (hierarchical compression): +0.031 on MemoryBench
- CompetitorA improved LongContextEval to 0.818 — now ahead of us
## Next Actions (Automated)
- Experiment H-047 (synthetic data aug) queued for LongContextEval — est. gain +0.009
- Research ingest scheduled for Monday
## Experiments Run This Week
- 4 experiments completed | 3 failed fast validation | 1 promoted
Full structured log at TECHNICAL_LOG.md:
Engineers should be able to review the log in under 5 minutes and understand exactly what the system did and why.
One-time setup:
- competitive_registry.yaml populated with leaderboards and competitors
- research_config.yaml configured with paper sources and domain queries

Each automated cycle (verify the system does this):
Escalate to humans only when: