src/autoskillit/skills_extended/exp-lens-benchmark-representativeness/SKILL.md
Create Benchmark Representativeness experimental design diagram showing coverage matrix, generalization gaps, and untested regions. Generalizability lens answering "Does this generalize beyond the test bed?"
**Philosophical Mode:** Generalizability
**Primary Question:** "Does this generalize beyond the test bed?"
**Focus:** Task Distribution, Scenario Coverage, Missing Regions, Dataset Selection, Generalization Claims
/autoskillit:exp-lens-benchmark-representativeness [context_path] [experiment_plan_path]
/autoskillit:exp-lens-benchmark-representativeness or /autoskillit:make-experiment-diag benchmark

**NEVER:**
- [...] `{{AUTOSKILLIT_TEMP}}/exp-lens-benchmark-representativeness/`
- [...] (`run_in_background: true` is prohibited)

**ALWAYS:**
- Focus on the GENERALIZATION GAP between benchmark coverage and claimed scope
- Show which regions of the target space are untested
- Document the relationship between benchmark selection and generalization claims
- Include a coverage matrix mapping scenarios to metrics
- BEFORE creating any diagram, LOAD the `/autoskillit:mermaid` skill using the Skill tool - this is MANDATORY
- If the Skill tool cannot be used (disable-model-invocation) or refuses this invocation, do NOT proceed with diagram creation. Abort this step and omit the diagram from output.
- Write output to `{{AUTOSKILLIT_TEMP}}/exp-lens-benchmark-representativeness/exp_diag_benchmark_representativeness_{YYYY-MM-DD_HHMMSS}.md`
After writing the file, emit the structured output token as literal plain text with no markdown formatting on the token name (the adjudicator performs a regex match):
diagram_path = /absolute/path/to/{{AUTOSKILLIT_TEMP}}/exp-lens-benchmark-representativeness/exp_diag_benchmark_representativeness_{...}.md
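As a rough illustration of the naming and token rules above, the following Python sketch builds the timestamped file path and formats the token line. The function names, the `temp_root` argument (standing in for whatever `{{AUTOSKILLIT_TEMP}}` expands to), and `TOKEN_PATTERN` are assumptions made for this example; the adjudicator's actual regex is not given in this skill.

```python
import re
from datetime import datetime
from pathlib import Path

SKILL_NAME = "exp-lens-benchmark-representativeness"

# Hypothetical pattern for illustration only; the adjudicator's real regex
# is not specified in this skill.
TOKEN_PATTERN = re.compile(r"^diagram_path = (/\S+\.md)$")


def build_output_path(temp_root: Path, now: datetime) -> Path:
    """Return the absolute diagram path under the skill's temp directory."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")  # fills {YYYY-MM-DD_HHMMSS}
    name = f"exp_diag_benchmark_representativeness_{stamp}.md"
    return (temp_root / SKILL_NAME / name).resolve()


def format_token(path: Path) -> str:
    """Render the token as one plain-text line with no markdown around it."""
    return f"diagram_path = {path}"
```

Calling `format_token(build_output_path(temp_root, datetime.now()))` after the file is written yields a single line that a pattern of this shape would match on a POSIX system.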
If positional arg 1 (context_path) is provided and the file exists, read it to obtain IV/DV tables, H0/H1 hypotheses, controlled variables, and success criteria. If positional arg 2 (experiment_plan_path) is provided and exists, read the experiment plan for full methodology. Use this structured context as the foundation for Steps 1-5; skip the CWD exploration for these fields if the context file supplies them.
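The argument handling described above can be sketched roughly as follows. This is a simplified illustration: `read_if_present`, `load_inputs`, and the returned keys are hypothetical names, and the sketch skips CWD exploration as a whole whenever a context file is found, whereas the skill asks for that decision field by field.

```python
from pathlib import Path
from typing import Optional, Sequence


def read_if_present(path_arg: Optional[str]) -> Optional[str]:
    """Return the file's text when the argument is given and the file exists."""
    if not path_arg:
        return None
    path = Path(path_arg)
    return path.read_text(encoding="utf-8") if path.is_file() else None


def load_inputs(args: Sequence[str]) -> dict:
    """Read context_path (arg 1) and experiment_plan_path (arg 2) if usable."""
    context = read_if_present(args[0] if len(args) > 0 else None)
    plan = read_if_present(args[1] if len(args) > 1 else None)
    return {
        "context": context,
        "experiment_plan": plan,
        # Simplification: explore the working directory only when no
        # context file could be read at all.
        "needs_cwd_exploration": context is None,
    }
```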
Spawn Explore subagents to investigate:
- **Benchmark & Dataset Inventory:** benchmark, dataset, test_suite, eval, corpus, split, GLUE, ImageNet
- **Task & Scenario Coverage:** task, scenario, domain, category, difficulty, subset
- **Metric Coverage:** metric, accuracy, f1, bleu, rouge, latency, cost, fairness
- **Claimed Generalization Scope:** generalize, real-world, production, deploy, robust, transfer, domain
- **Distribution Characteristics:** distribution, balance, skew, size, demographics, diversity

Build the coverage matrix: rows = scenarios/domains tested, columns = metrics measured. Identify which cells are populated and which are gaps. Compare the coverage to the stated generalization claims.
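One possible way to represent the coverage matrix described above is sketched below in Python. The function names and the `(scenario, metric)` pair input are assumptions chosen for this example, not part of the skill; the Full/Partial/None labels and the ✓/✗ cells follow the output template's Coverage Matrix table.

```python
from typing import Dict, List, Sequence, Set, Tuple

Matrix = Dict[str, Dict[str, bool]]


def build_coverage_matrix(
    scenarios: Sequence[str],
    metrics: Sequence[str],
    measured: Set[Tuple[str, str]],
) -> Matrix:
    """Rows are scenarios, columns are metrics; a cell is True if measured."""
    return {s: {m: (s, m) in measured for m in metrics} for s in scenarios}


def coverage_label(row: Dict[str, bool]) -> str:
    """Summarize one row as Full, Partial, or None."""
    hits = sum(row.values())
    if hits == 0:
        return "None"
    return "Full" if hits == len(row) else "Partial"


def find_gaps(matrix: Matrix) -> List[Tuple[str, str]]:
    """List every (scenario, metric) cell that has no measurement."""
    return [(s, m) for s, row in matrix.items() for m, ok in row.items() if not ok]


def to_markdown(matrix: Matrix, metrics: Sequence[str]) -> str:
    """Render the matrix in the shape of the template's Coverage Matrix table."""
    header = "| Scenario / Domain | " + " | ".join(metrics) + " | Coverage |"
    divider = "|" + "---|" * (len(metrics) + 2)
    rows = [
        "| " + s + " | "
        + " | ".join("✓" if row[m] else "✗" for m in metrics)
        + " | " + coverage_label(row) + " |"
        for s, row in matrix.items()
    ]
    return "\n".join([header, divider, *rows])
```

The list returned by `find_gaps` is what would then be compared against each stated generalization claim: a claim that depends on an unmeasured cell is not supported by the benchmark evidence.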
For every generalization claim:
Distinguish clearly:
Use flowchart with:

Direction: TB (claims flow from benchmarks up to generalization)

Subgraphs:
- BENCHMARKS USED
- SCENARIOS TESTED
- METRICS MEASURED
- GENERALIZATION CLAIMS
- UNTESTED REGIONS

Node Styling:
- `stateNode` class: benchmarks/datasets
- `handler` class: tested scenarios
- `output` class: measured metrics
- `cli` class: generalization claims
- `gap` class: untested regions/missing coverage
- `detector` class: validation of generalization

Write the diagram to: `{{AUTOSKILLIT_TEMP}}/exp-lens-benchmark-representativeness/exp_diag_benchmark_representativeness_{YYYY-MM-DD_HHMMSS}.md` (relative to the current working directory)
# Benchmark Representativeness Diagram: {System Name}
**Lens:** Benchmark Representativeness (Generalizability)
**Question:** Does this generalize beyond the test bed?
**Date:** {YYYY-MM-DD}
**Scope:** {What was analyzed}
## Coverage Matrix
| Scenario / Domain | {Metric A} | {Metric B} | {Metric C} | Coverage |
|-------------------|-----------|-----------|-----------|----------|
| {Scenario 1} | ✓ | ✓ | ✗ | Partial |
| {Scenario 2} | ✗ | ✗ | ✗ | None |
| {Scenario 3} | ✓ | ✓ | ✓ | Full |
## Benchmark Representativeness Diagram
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'curve': 'basis'}}}%%
flowchart TB
%% CLASS DEFINITIONS %%
classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
classDef integration fill:#c62828,stroke:#ef9a9a,stroke-width:2px,color:#fff;
subgraph Benchmarks ["BENCHMARKS USED"]
direction TB
B1["{Benchmark 1}<br/>━━━━━━━━━━<br/>{size}, {domain}"]
B2["{Benchmark 2}<br/>━━━━━━━━━━<br/>{size}, {domain}"]
end
subgraph Scenarios ["SCENARIOS TESTED"]
direction TB
S1["{Scenario 1}<br/>━━━━━━━━━━<br/>{conditions}"]
S2["{Scenario 2}<br/>━━━━━━━━━━<br/>{conditions}"]
end
subgraph Metrics ["METRICS MEASURED"]
direction TB
M1["{Metric A}<br/>━━━━━━━━━━<br/>{what it captures}"]
M2["{Metric B}<br/>━━━━━━━━━━<br/>{what it captures}"]
end
subgraph Claims ["GENERALIZATION CLAIMS"]
direction TB
C1["{Claim 1}<br/>━━━━━━━━━━<br/>{source of claim}"]
C2["{Claim 2}<br/>━━━━━━━━━━<br/>{source of claim}"]
end
subgraph Gaps ["UNTESTED REGIONS"]
direction TB
G1["{Gap 1}<br/>━━━━━━━━━━<br/>{why it matters}"]
G2["{Gap 2}<br/>━━━━━━━━━━<br/>{why it matters}"]
end
VALIDATE["{Generalization Validity Check}<br/>━━━━━━━━━━<br/>Coverage vs. Claim scope"]
B1 --> S1
B2 --> S2
S1 --> M1
S2 --> M2
M1 --> C1
M2 --> C2
C1 --> VALIDATE
C2 --> VALIDATE
G1 -.->|missing| VALIDATE
G2 -.->|missing| VALIDATE
%% CLASS ASSIGNMENTS %%
class B1,B2 stateNode;
class S1,S2 handler;
class M1,M2 output;
class C1,C2 cli;
class G1,G2 gap;
class VALIDATE detector;
```

**Color Legend:**

| Color | Category | Description |
|-------|----------|-------------|
| Dark Teal | Benchmarks | Datasets and test suites used |
| Orange | Scenarios | Tested scenarios and conditions |
| Teal | Metrics | Measured evaluation metrics |
| Dark Blue | Claims | Generalization claims made |
| Yellow/Amber | Gaps | Untested regions of target space |
| Red | Validation | Generalization validity check |

| Claim | Evidence (Benchmarks) | Gap (Untested) | Risk |
|-------|----------------------|----------------|------|
| {Claim 1} | {What covers it} | {What is missing} | High/Med/Low |
| {Claim 2} | {What covers it} | {What is missing} | High/Med/Low |

| Dimension | Current Coverage | Required for Claim | Verdict |
|-----------|-----------------|-------------------|---------|
| Domain diversity | {count} domains | {needed} | ✓/✗ |
| Task variety | {count} tasks | {needed} | ✓/✗ |
| Scale range | {min}–{max} | {needed} | ✓/✗ |
| Distribution shift | {tested?} | {needed} | ✓/✗ |
---
## Pre-Diagram Checklist
Before creating the diagram, verify:
- [ ] LOADED `/autoskillit:mermaid` skill using the Skill tool
- [ ] Using ONLY classDef styles from the mermaid skill (no invented colors)
- [ ] Diagram will include a color legend table
---
## Related Skills
- `/autoskillit:make-experiment-diag` - Parent skill for lens selection
- `/autoskillit:mermaid` - MUST BE LOADED before creating diagram
- `/autoskillit:exp-lens-measurement-validity` - For metric quality analysis
- `/autoskillit:exp-lens-validity-threats` - For systematic threat inventory