src/autoskillit/skills_extended/review-design/SKILL.md
Validate an experiment plan before execution using triage-first, fail-fast dimensional analysis with an adversarial red-team. Emits verdict (GO/REVISE/STOP), experiment_type, evaluation_dashboard, and revision_guidance.
npx skillsauth add talont-org/autoskillit review-design
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Validate the quality and feasibility of an experiment plan before compute is spent. Runs a triage-first, fail-fast multi-level analysis hierarchy with parallel subagents and an adversarial red-team, then synthesizes a GO/REVISE/STOP verdict.
Use when the research recipe's review_design ingredient is true (the default). The
recipe calls this skill after plan_experiment to gate execution on a quality check.
This skill is bounded by retries: 2 — on exhaustion the recipe proceeds with the
best available plan.
/autoskillit:review-design {experiment_plan_path}
/autoskillit:plan-experiment. Scan tokens after the skill name for the first
path-like token (starts with /, ./, or .autoskillit/).

NEVER:
- {{AUTOSKILLIT_TEMP}}/review-design/
- run_in_background: true is prohibited)

ALWAYS:
- model: "sonnet" when spawning all subagents via the Task tool
- {{AUTOSKILLIT_TEMP}}/review-design/ (relative to the current working directory)
- revision_guidance is written and emitted ONLY when verdict = REVISE
- evaluation_dashboard is ALWAYS written and emitted
- requires_decision: true on all its findings

When context is exhausted mid-execution, output files may be partially written or
absent. The recipe routes to on_context_limit, abandoning the partial review.
Before emitting structured output tokens:
- If evaluation_dashboard was not fully written, emit verdict = STOP as a safe fallback
- revision_guidance if not written; the orchestrator handles the context-limit route

Create {{AUTOSKILLIT_TEMP}}/review-design/ if absent.
Extract experiment_plan_path from arguments (first path-like token starting with /,
./, or .autoskillit/).
Error handling: When no path-like token is present in the arguments, emit
verdict = STOP with message "No experiment_plan_path provided" and return (per
the NEVER exit-non-zero constraint).
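The argument scan described above can be sketched as follows. This is a minimal illustration, not the skill's own code: the function names `find_plan_path` and `resolve_plan_argument` are hypothetical, and the input is assumed to be the argument text that follows the skill name.

```python
from typing import Dict, Optional

# Prefixes that mark a token as path-like, as listed in the text above.
PATH_PREFIXES = ("/", "./", ".autoskillit/")


def find_plan_path(arguments: str) -> Optional[str]:
    """Return the first whitespace-separated token that starts with a path prefix."""
    for token in arguments.split():
        if token.startswith(PATH_PREFIXES):
            return token
    return None


def resolve_plan_argument(arguments: str) -> Dict[str, str]:
    """Return the plan path, or a STOP verdict instead of a non-zero exit."""
    path = find_plan_path(arguments)
    if path is None:
        return {"verdict": "STOP", "message": "No experiment_plan_path provided"}
    return {"experiment_plan_path": path}
```

Returning a STOP verdict as data, rather than raising, mirrors the constraint that the skill never exits non-zero.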
Read the plan file.
Error handling: If the file does not exist or is unreadable at the resolved path,
emit verdict = STOP with message "Plan file not found: {path}" and return.
Load the experiment type registry:
a. Locate bundled types dir: run
python -c "from autoskillit.core import pkg_root; print(pkg_root() / 'recipes' / 'experiment-types')"
to get the absolute bundled directory path.
b. Use Glob *.yaml in that directory, then Read each file. Parse YAML frontmatter to
extract name, classification_triggers, dimension_weights, applicable_lenses,
red_team_focus, and l1_severity fields from each.
c. Check .autoskillit/experiment-types/ in the current working directory. If it exists,
read all *.yaml files there. A user-defined type with the same name as a bundled
type replaces the bundled entry entirely — do not merge fields.
d. The resulting registry is a mapping of type name → spec. The set of valid
experiment_type values for this run is the set of keys in the registry.
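A rough sketch of the override rule in steps c and d, assuming both sets of specs have already been parsed into plain dicts keyed by type name. The helper name `build_registry` and the example type `ablation` are hypothetical. One further assumption: an overridden bundled type keeps its original position in the iteration order, which the text does not specify.

```python
from typing import Any, Dict

Spec = Dict[str, Any]


def build_registry(bundled: Dict[str, Spec], user_defined: Dict[str, Spec]) -> Dict[str, Spec]:
    """Combine bundled and user-defined specs; a user spec replaces a bundled one whole."""
    registry: Dict[str, Spec] = {}
    for name in sorted(bundled):
        registry[name] = bundled[name]
    for name in sorted(user_defined):
        # Whole-entry replacement: fields from the bundled spec are NOT merged in.
        registry[name] = user_defined[name]
    return registry


bundled = {
    "exploratory": {"name": "exploratory", "applicable_lenses": ["lens_a"]},
    "benchmark": {"name": "benchmark", "applicable_lenses": ["lens_b"]},
}
user = {
    "benchmark": {"name": "benchmark", "l1_severity": {"estimand_clarity": "critical"}},
    "ablation": {"name": "ablation"},  # hypothetical user-defined type
}
registry = build_registry(bundled, user)
valid_types = set(registry)  # the valid experiment_type values for this run
```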
Parse YAML frontmatter using the backward-compatible two-level fallback:
Level 1 (frontmatter): Read YAML frontmatter between --- delimiters directly
(zero LLM tokens). Return present fields and note which are missing.
Record source: frontmatter for each extracted field.
Error handling: If the YAML between --- delimiters is malformed, treat all
fields as missing and fall through to Level 2 for all fields (fallback handling).
Level 2 (LLM extraction): For each missing field, launch a targeted LLM
extraction subagent against the corresponding prose section. All extractions are
independent and run in parallel. Record source: extracted for each field from
this path.
Fields: experiment_type, hypothesis_h0/h1, estimand, metrics, baselines,
statistical_plan, success_criteria
Missing-field to prose-target mapping:
| Missing Field | Prose Target | Extraction Prompt |
|---|---|---|
| experiment_type | Full plan | "Classify using the loaded registry types: {', '.join(registry.keys())}" |
| hypothesis_h0/h1 | ## Hypothesis | "Extract the null/alternative hypothesis" |
| estimand | ## Hypothesis + ## Independent Variables | "Extract: treatment, outcome, population, contrast" |
| metrics | ## Dependent Variables table | "Extract each row as structured object" |
| baselines | ## Independent Variables | "Extract comparators: name, version, tuning" |
| statistical_plan | ## Analysis Plan | "Extract: test, alpha, power, correction, sample size" |
| success_criteria | ## Success Criteria | "Extract three criteria" |
This two-level approach ensures backward compatibility with plans that lack frontmatter:
the provenance (source: frontmatter or source: extracted) is tracked for each field
and included in the evaluation dashboard.
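The provenance tracking of the two-level fallback could look roughly like this. It is a sketch under assumptions: `resolve_fields` is a hypothetical name, the frontmatter is assumed to be already parsed into a dict (or `None` when it is absent or malformed), and `extract` stands in for the Level 2 extraction subagent. The sketch calls `extract` sequentially, whereas the skill runs the extractions in parallel.

```python
from typing import Any, Callable, Dict, Optional

# Field labels as listed in the text above.
FIELDS = [
    "experiment_type", "hypothesis_h0/h1", "estimand", "metrics",
    "baselines", "statistical_plan", "success_criteria",
]


def resolve_fields(
    frontmatter: Optional[Dict[str, Any]],
    extract: Callable[[str], Any],
) -> Dict[str, Dict[str, Any]]:
    """Take each field from frontmatter when present, else fall back to extraction."""
    present = frontmatter or {}  # None means malformed/absent: every field falls through
    resolved: Dict[str, Dict[str, Any]] = {}
    for field in FIELDS:
        if present.get(field) is not None:
            resolved[field] = {"value": present[field], "source": "frontmatter"}
        else:
            resolved[field] = {"value": extract(field), "source": "extracted"}
    return resolved


# Stand-in extractor for illustration only.
def fake_extract(field: str) -> str:
    return f"extracted:{field}"


partial = resolve_fields({"experiment_type": "benchmark"}, fake_extract)
malformed = resolve_fields(None, fake_extract)
```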
Launch one subagent. Receives full plan text plus parsed fields. Returns:
- experiment_type: one of the type names in the loaded registry (from Step 0)
- dimension_weights: the complete weight matrix for this plan (H/M/L/S per dimension)
- secondary_modifiers: list of active modifiers with their effects on weights

Schema validation: After the subagent returns, verify that experiment_type is a key
in the loaded registry (from Step 0). If the returned value is not in the registry, default
to exploratory and log a warning. Do not silently pass an invalid type into the weight
matrix lookup, as this would corrupt all subsequent spawning decisions.
Triage classification rules (first-match):
Use the classification_triggers list from each type in the loaded registry to classify
the experiment. Apply first-match: iterate types in registry insertion order (bundled types
sorted alphabetically, then user-defined types sorted alphabetically). The first type whose
trigger description matches the plan is selected. If no trigger matches, default to
exploratory.
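The first-match rule and the schema-validation fallback might be sketched as below. In the skill the trigger matching is a judgment made by the triage subagent; here it is replaced by a caller-supplied `trigger_matches` predicate, which is an assumption for illustration, as are the function names.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("review_design_sketch")
DEFAULT_TYPE = "exploratory"


def classify_first_match(
    registry: Dict[str, Dict[str, Any]],
    plan_text: str,
    trigger_matches: Callable[[str, str], bool],
) -> str:
    """Return the first registry type (in insertion order) with a matching trigger."""
    for name, spec in registry.items():
        for trigger in spec.get("classification_triggers", []):
            if trigger_matches(trigger, plan_text):
                return name
    return DEFAULT_TYPE


def validate_experiment_type(returned: str, registry: Dict[str, Dict[str, Any]]) -> str:
    """Fall back to the default type, with a warning, when the value is not a registry key."""
    if returned in registry:
        return returned
    logger.warning("Unknown experiment_type %r; defaulting to %s", returned, DEFAULT_TYPE)
    return DEFAULT_TYPE


# Toy registry and a naive substring matcher, for illustration only.
toy_registry = {
    "benchmark": {"classification_triggers": ["compare"]},
    "causal_inference": {"classification_triggers": ["causes"]},
    "exploratory": {"classification_triggers": []},
}


def substring_match(trigger: str, plan: str) -> bool:
    return trigger in plan
```

With this toy setup a plan that matches triggers of both `benchmark` and `causal_inference` is classified as `benchmark`, because iteration order decides.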
Secondary modifiers (additive, increase dimension weights):
- +causal: mechanism claim in non-causal type → causal_structure weight +1 tier
- +high_cost: resources > 4 GPU-hours → resource_proportionality L→M
- +deployment: motivation references production/users → ecological_validity floor = M
- +multi_metric: ≥3 DVs → statistical_corrections weight +1 tier

Dimension weights:
Use the dimension_weights dict from the matched type's registry entry (loaded in Step 0).
Each key is a dimension name; each value is one of weight=H (High), weight=M (Medium), weight=L (Low),
or weight=S (SILENT — dimension not spawned, not mentioned in output). Pass the full
dimension_weights dict to the triage subagent so it can return the complete weight
matrix for this plan.
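As an illustration of how the secondary modifiers could act on a weight dict, here is a small sketch. Two points are assumptions rather than statements from the text: the tier order S < L < M < H, and that a "+1 tier" bump on an S weight yields L. The helper names are hypothetical.

```python
from typing import Dict, Iterable

TIERS = ["S", "L", "M", "H"]  # assumed ascending order


def bump_tier(weight: str) -> str:
    """Raise a weight by one tier, saturating at H."""
    return TIERS[min(TIERS.index(weight) + 1, len(TIERS) - 1)]


def at_least(weight: str, floor: str) -> str:
    """Return the weight, raised to the floor if it is below it."""
    return weight if TIERS.index(weight) >= TIERS.index(floor) else floor


def apply_modifiers(weights: Dict[str, str], modifiers: Iterable[str]) -> Dict[str, str]:
    """Return a new weight dict with the active modifiers applied."""
    adjusted = dict(weights)  # do not mutate the registry entry
    active = set(modifiers)
    if "+causal" in active and "causal_structure" in adjusted:
        adjusted["causal_structure"] = bump_tier(adjusted["causal_structure"])
    if "+high_cost" in active and adjusted.get("resource_proportionality") == "L":
        adjusted["resource_proportionality"] = "M"
    if "+deployment" in active and "ecological_validity" in adjusted:
        adjusted["ecological_validity"] = at_least(adjusted["ecological_validity"], "M")
    if "+multi_metric" in active and "statistical_corrections" in adjusted:
        adjusted["statistical_corrections"] = bump_tier(adjusted["statistical_corrections"])
    return adjusted


base = {
    "causal_structure": "S",
    "resource_proportionality": "L",
    "ecological_validity": "L",
    "statistical_corrections": "H",
}
result = apply_modifiers(base, ["+causal", "+high_cost", "+deployment", "+multi_metric"])
```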
After Step 1 classification, check whether the experiment type is "silent" — an experiment type where quantitative-audit dimension scoring does not apply.
Detection rule: An experiment type is silent when >=6 of 8 dimension_weights
are equal to S. Use is_silent_type(spec) from experiment_type_registry (or count
the S values directly from the loaded registry entry).
Reference: docs/research/silent-type-convention.md defines the shared convention
consumed by both review-design (this skill) and vis-lens-methodology-norms (#846).
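The "count the S values directly" alternative can be written in a few lines. This is a stand-in for illustration and not the `is_silent_type` implementation from `experiment_type_registry`; the function name and the placeholder dimension names are hypothetical.

```python
from typing import Dict

SILENT_THRESHOLD = 6  # ">=6 of 8" per the detection rule above


def is_silent_by_count(dimension_weights: Dict[str, str]) -> bool:
    """True when at least SILENT_THRESHOLD dimension weights equal 'S'."""
    silent = sum(1 for weight in dimension_weights.values() if weight == "S")
    return silent >= SILENT_THRESHOLD


# Placeholder dimension names dim_1..dim_8, for illustration only.
mostly_silent = {f"dim_{i}": "S" for i in range(1, 7)}
mostly_silent.update({"dim_7": "L", "dim_8": "M"})
mostly_active = {f"dim_{i}": "S" for i in range(1, 6)}
mostly_active.update({"dim_6": "H", "dim_7": "L", "dim_8": "M"})
```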
When is_silent_type returns True:
Skip Steps 2-7 entirely — do not launch L1, L2, L3, L4, or red-team subagents.
Emit verdict = GO with requires_decision: false.
Write evaluation_dashboard with a "Scope Advisory" section containing:
verdict: GO
advisory_context:
  subject_kind: experiment_type
  subject_name: {experiment_type}
  reasoning: "Quantitative audit framework does not apply to {experiment_type}. Design rigor is assessed via domain-appropriate criteria (e.g., trustworthiness, transferability, dependability, confirmability for qualitative research)."
  reference_framework: "SRQR / COREQ"
requires_decision: false
Write machine-readable YAML summary in the dashboard:
# --- review-design machine summary ---
verdict: GO
experiment_type: {type}
critical_count: 0
warning_count: 0
blocking_count: 0
required_count: 0
advisory_count: 1
red_team_count: 0
active_dimensions: 0
warning_threshold: 0
Proceed to Step 8 (Emit Output Tokens) — the recipe routes to plan_visualization.
The advisory is appended to the evaluation dashboard file (which is later copied to
research/{slug}/audit/design-review-dashboard.md by create_worktree.sh).
When is_silent_type returns False: Continue with the standard path (Steps 2-7).
Include this instruction block in every dimension subagent prompt.
Every finding must describe WHAT is lacking or at risk in the experimental design. Never prescribe HOW to fix it — the fix is the plan author's responsibility.
GOOD: "The plan does not address how implementation correctness will be verified before measurement"
BAD: "The plan must contain apply_phase1_changes.sh with inline Python that greps for function ordering"
GOOD: "The step-timing instrumentation could introduce inter-iteration contamination if reset ordering is incorrect"
BAD: "step_timing::reset() uses Ordering::Relaxed — change to Ordering::Release"
Findings must never include:
Design scope boundary:
Evaluate the experimental DESIGN: hypotheses, metrics, statistical methodology, controlled variables, threats to validity, data acquisition strategy, reproducibility specification.
Do NOT evaluate:
If a code snippet in the plan reveals a design-level concern (e.g., the metric definition contradicts the hypothesis), flag the design concern, not the code bug.
Two subagents run in parallel. Both are always H-weight; severity thresholds are calibrated per experiment_type via the rubric below.
Each L1 subagent receives as explicit inputs:
- experiment_type (from Step 1 triage output)

Severity calibration rubric for L1 dimensions:
Use the l1_severity dict from the matched experiment type's registry entry (loaded in
Step 0). Keys are estimand_clarity and hypothesis_falsifiability; values are severity
levels (critical, warning, info). Calibration anchors: causal_inference → critical;
benchmark, configuration_study, robustness_audit → warning; exploratory → info.
- estimand_clarity agent: "Can the claim be written as a formal contrast (A vs B on Y in Z)?"
  Reference the exp-lens-estimand-clarity philosophical mode as guidance (do NOT invoke
  the skill; reference its lens question only in the subagent prompt).
  Use the l1_severity.estimand_clarity value from the registry to assign severity.
- hypothesis_falsifiability agent: "What result would cause the author to conclude H0?"
  Use the l1_severity.hypothesis_falsifiability value from the registry to assign severity.

Each subagent returns findings in the standard JSON structure (see Finding Format below).
FAIL-FAST GATE: After both Level 1 subagents complete, check for critical findings.
If ANY Level 1 subagent returns a finding with "severity": "critical":
- stop_triggers

Subagent parse failure: If a Level 1 subagent returns unparseable output (malformed
JSON, empty response, token-limit truncation), treat it as if it returned one critical
finding with message: "L1 subagent did not return parseable findings". This ensures
parse failures trigger the fail-fast gate rather than silently passing it.
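A minimal sketch of the parse-failure rule, assuming the subagent's raw response arrives as a string that should decode to a JSON list of finding objects (or a single object). The function name `parse_l1_findings` and the reduced field set of the synthetic finding are assumptions; a real finding carries the full structure from the Finding Format section.

```python
import json
from typing import Any, Dict, List

PARSE_FAILURE_MESSAGE = "L1 subagent did not return parseable findings"


def synthetic_critical(dimension: str) -> Dict[str, Any]:
    """Build the stand-in critical finding used when output cannot be parsed."""
    return {
        "dimension": dimension,
        "level": 1,
        "severity": "critical",
        "fixability": None,  # unclassified
        "message": PARSE_FAILURE_MESSAGE,
    }


def parse_l1_findings(raw: str, dimension: str) -> List[Dict[str, Any]]:
    """Decode findings; any malformed, empty, or truncated output becomes one critical."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return [synthetic_critical(dimension)]
    if isinstance(parsed, dict):
        parsed = [parsed]  # tolerate a single finding object
    if not isinstance(parsed, list) or not all(isinstance(f, dict) for f in parsed):
        return [synthetic_critical(dimension)]
    return parsed
```

An empty list (`[]`) is valid output meaning "no findings", which is different from an empty response.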
ADDRESSABLE vs STRUCTURAL classification: After collecting critical L1 findings, classify each one before applying the gate:
Classification scope limitation: Initially, only hypothesis_falsifiability findings
are eligible for ADDRESSABLE classification — hypothesis restructuring is the dimension
most likely to produce mechanically fixable defects. estimand_clarity findings default
to STRUCTURAL (absent estimands typically indicate deeper design flaws).
Gate behavior after classification:
"priority": "REQUIRED",
continue L2-L4 analysis. The verdict becomes REVISE (not STOP) with the ADDRESSABLE
findings at the top of the evaluation dashboard.

When the L1 gate passes (no STRUCTURAL critical L1 findings; the gate also passes when all L1 criticals are ADDRESSABLE), launch 2–3 Level 2 subagents AND the red-team agent concurrently, all at the same time without waiting for each other.
Level 2 subagents (parallel, weights from the matrix):
- baseline_fairness: "Are all compared systems given symmetric resources and tuning effort?"
- causal_structure: weight from matrix (S for benchmark/config_study, H for causal_inference).
  Only spawn when weight ≥ L.
- unit_interference: "Can treatments spill over between experimental units?"

Red-team agent (concurrent with L2 and L4; does NOT block L3):
- Receives: full plan text and experiment_type (from Step 1 triage output)
- red_team_focus.specific from the matched type's registry
  entry (loaded in Step 0).
- "requires_decision": true and "dimension": "red_team"

Red-team severity calibration rubric:

| Dimension | causal_inference | benchmark | configuration_study | robustness_audit | exploratory |
|-----------|-----------------|-----------|---------------------|------------------|-------------|
| red_team | critical | warning | warning | warning | info |
The red-team agent assigns severity based on the intrinsic quality of each finding.
After the red-team agent returns, cap each finding's severity to the maximum
allowed by the experiment type using this rubric — identical to how L1 severity
calibration works. For causal_inference: critical red-team findings remain critical
(STOP-eligible). For benchmark/configuration_study/robustness_audit: critical
findings are downgraded to warning (REVISE-eligible but never STOP). For
exploratory: all red-team findings are capped at info (informational only).
This cap is applied in Step 7 before the verdict logic evaluates stop_triggers.
Run after Level 2 completes. Do not wait for the red-team agent before starting Level 3.
Each L3 subagent receives:
- experiment_type (from Step 1 triage output): calibrates expected statistical rigor.
  exploratory plans do not require pre-registered correction procedures; causal_inference
  plans demand formal power analysis and correction pre-specification.

Three subagents run in parallel:
- error_budget: "Is power analysis present? Are error rates (Type I / Type II) acknowledged?"
- statistical_corrections: "Are multiple comparisons corrections pre-specified for all DVs?"
- variance_protocol: "Are seeds fixed? Is run-to-run variance addressed?"
  NOTE: absent seeds IS a valid finding for this dimension at H-weight; do not suppress
  via foothold validation.

2–4 subagents. Only spawn subagents for dimensions with weight ≥ L in the matrix. SILENT (S) dimensions are NOT spawned and NOT mentioned in output.
Each L4 subagent receives:
- experiment_type (from Step 1 triage output): calibrates rigor expectations per dimension.
  benchmark plans have lower ecological validity expectations than causal_inference plans
  by design; reproducibility_spec rigor scales with causal_inference > benchmark >
  exploratory.
- dimension_weights (from Step 1): provides context on why this dimension was spawned
  (e.g., H-weight dimensions warrant stricter thresholds than L-weight dimensions)

Level 4 dimensions (spawn when not SILENT):
- benchmark_representativeness: "Does this generalize beyond the specific test bed?"
- ecological_validity: "Do test conditions match the intended deployment context?"
- measurement_alignment: "Do the metrics actually measure what the research question claims?"
- reproducibility_spec: "Could an independent party reproduce this experiment?"
- data_acquisition: "Does the plan include a complete data acquisition strategy?"
- agent_implementability: "Is this plan executable by a code-generating agent without human intervention?"

data_acquisition: Data Acquisition Completeness

Validates that the experiment plan includes a complete data acquisition strategy:
- success_criteria has at least one
  data_manifest entry specifying its data source.
- source_type: external has an
  explicit acquisition command and a verification criterion.
- source_type: gitignored has an
  acquisition/generation step; gitignored paths are empty in fresh worktrees.
- depends_on references entry B's acquisition,
  B must be listed before A (or the dependency chain must be acyclic).
- data_manifest must include acquisition steps for that data.

Findings format:

agent_implementability: Agent Execution Feasibility

Validates that the experiment plan can be implemented by a code-generating agent without human intervention:

Findings format:
Level 3 and Level 4 may run concurrently with the red-team agent (do not block on red-team).
Three-layer silencing (prevents orphan warnings):
- variance_protocol at H-weight.

After Levels 3 and 4 complete, wait for the red-team agent if still running.
All red-team findings are merged into the finding pool with their
"requires_decision": true flag preserved.
One synthesis pass (no subagent — orchestrator synthesizes directly):
- (dimension, section, message): identical findings from parallel
  agents are collapsed into one entry.

# RT_MAX_SEVERITY is built from the registry loaded in Step 0 (dict-of-dicts from YAML parsing):
RT_MAX_SEVERITY = {name: spec["red_team_focus"]["severity_cap"] for name, spec in registry.items()}
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}
rt_cap = RT_MAX_SEVERITY[experiment_type]
for f in findings:
    if f.dimension == "red_team" and SEVERITY_RANK[f.severity] > SEVERITY_RANK[rt_cap]:
        f.severity = rt_cap  # downgrade before verdict

# Reclassify after cap
critical_findings = [f for f in findings if f.severity == "critical"]
warning_findings = [f for f in findings if f.severity == "warning"]
active_dimensions = count_of_spawned_dimensions  # tracked from Steps 2-6

# Proportional warning threshold: each active dimension gets a budget of 5
# warnings before the plan is flagged for revision.
WARNING_BUDGET_PER_DIM = 5
warning_threshold = active_dimensions * WARNING_BUDGET_PER_DIM

# L1 fail-fast path: only STRUCTURAL defects trigger STOP
l1_criticals = [f for f in critical_findings if f.dimension in {"estimand_clarity", "hypothesis_falsifiability"}]

# Tag ADDRESSABLE L1 criticals as REQUIRED (scope: hypothesis_falsifiability only)
for f in l1_criticals:
    if f.fixability == "ADDRESSABLE":
        f.priority = "REQUIRED"

# Scope guard: estimand_clarity always STRUCTURAL; None fixability defaults to STRUCTURAL
structural_stop_triggers = [
    f for f in l1_criticals
    if f.fixability == "STRUCTURAL" or f.fixability is None or f.dimension == "estimand_clarity"
]

# Red-team STOP path: adversarial critical findings after full analysis (L2-L4)
# These fire only when the L1 gate passed AND the severity cap still allows critical.
stop_triggers = structural_stop_triggers + [f for f in critical_findings if f.dimension == "red_team"]

if stop_triggers:
    verdict = "STOP"
elif critical_findings or len(warning_findings) >= warning_threshold:
    verdict = "REVISE"
else:
    verdict = "GO"
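The deduplication step of the synthesis pass, which collapses findings sharing the same (dimension, section, message) triple, could be sketched as below. Findings are assumed to be dicts in the Finding Format. Keeping the first occurrence is an assumption; the text does not say which copy survives when duplicates differ in other fields.

```python
from typing import Any, Dict, List, Set, Tuple


def deduplicate(findings: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse findings with identical (dimension, section, message); keep the first."""
    seen: Set[Tuple[Any, Any, Any]] = set()
    unique: List[Dict[str, Any]] = []
    for finding in findings:
        key = (finding.get("dimension"), finding.get("section"), finding.get("message"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(finding)
    return unique


pool = [
    {"dimension": "error_budget", "section": "## Analysis Plan", "message": "No power analysis", "severity": "warning"},
    {"dimension": "error_budget", "section": "## Analysis Plan", "message": "No power analysis", "severity": "warning"},
    {"dimension": "red_team", "section": "## Analysis Plan", "message": "No power analysis", "severity": "warning"},
]
deduped = deduplicate(pool)
```

The third entry survives because its dimension differs, even though section and message match.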
evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md — always written.
Must include:
spec.dimension_weight_rationale (loaded in Step 0 registry). Table format:
## Dimension Rationale (why these weights apply to {experiment_type})
| Dimension | Weight | Rationale |
|---|---|---|
| {dim} | {weight} | {rationale from spec.dimension_weight_rationale[dim]} |
Include ALL non-SILENT dimensions (H, M, L). For each dimension:
- dimension_weight_rationale[dim] exists and is non-empty: render the rationale string
- dimension_weight_rationale):
  render the Weight column only, leave Rationale blank

Omit SILENT (S) dimensions from the table (consistent with finding-count suppression).
If dimension_weight_rationale is entirely empty (no entries for any dimension),
omit the Dimension Rationale subsection entirely; do not render an empty table.

- requires_decision: true)

# --- review-design machine summary ---
verdict: GO|REVISE|STOP
experiment_type: {type}
critical_count: {n}
warning_count: {n}
blocking_count: {n}
required_count: {n}
advisory_count: {n}
red_team_count: {n}
active_dimensions: {n}
warning_threshold: {n}
revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md — written ONLY when
verdict = REVISE. Must include:
revision_guidance path is passed back
to plan-experiment in the recipe loop.

Emit these lines as your final output:
verdict = GO|REVISE|STOP
experiment_type = {experiment_type}
classification_timestamp = {ISO 8601 UTC timestamp}
evaluation_dashboard = /absolute/path/{{AUTOSKILLIT_TEMP}}/review-design/evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md
revision_guidance = /absolute/path/{{AUTOSKILLIT_TEMP}}/review-design/revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md
classification_timestamp is the UTC timestamp (e.g., 2026-04-13T15:32:00Z) at the moment the experiment-type classification is finalized, before writing the evaluation dashboard.
revision_guidance line is emitted ONLY when verdict = REVISE. When verdict is GO or STOP,
omit the revision_guidance line entirely.
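The emission rule above (revision_guidance only on REVISE) can be illustrated with a small formatter. The function names and the example paths are hypothetical; the token names and the timestamp format follow the text.

```python
from datetime import datetime, timezone
from typing import List, Optional

VALID_VERDICTS = {"GO", "REVISE", "STOP"}


def utc_timestamp() -> str:
    """ISO 8601 UTC timestamp with a trailing Z, e.g. 2026-04-13T15:32:00Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def format_output_tokens(
    verdict: str,
    experiment_type: str,
    classification_timestamp: str,
    dashboard_path: str,
    guidance_path: Optional[str] = None,
) -> List[str]:
    """Build the final output lines; revision_guidance appears only for REVISE."""
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    lines = [
        f"verdict = {verdict}",
        f"experiment_type = {experiment_type}",
        f"classification_timestamp = {classification_timestamp}",
        f"evaluation_dashboard = {dashboard_path}",
    ]
    if verdict == "REVISE" and guidance_path:
        lines.append(f"revision_guidance = {guidance_path}")
    return lines


go_lines = format_output_tokens(
    "GO", "benchmark", "2026-04-13T15:32:00Z", "/abs/dashboard.md", "/abs/guidance.md"
)
revise_lines = format_output_tokens(
    "REVISE", "benchmark", "2026-04-13T15:32:00Z", "/abs/dashboard.md", "/abs/guidance.md"
)
```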
All subagents must return findings in this JSON structure:
{
"section": "## Hypothesis",
"dimension": "estimand_clarity",
"level": 1,
"severity": "critical | warning | info",
"priority": "BLOCKING | REQUIRED | ADVISORY",
"fixability": "ADDRESSABLE | STRUCTURAL | null",
"message": "{describes what is lacking or at risk — never prescribes how to fix}",
"requires_decision": false
}
Fixability classification (L1 findings only — all other levels use JSON null):
See ADDRESSABLE vs STRUCTURAL classification in Step 2 for authoritative definitions
and scope limitations. null (JSON null literal, not the string "null") is used for all
non-L1 findings where fixability is not applicable.
Priority tiers (supplementing, not replacing, severity):
Priority assignment rules:
Red-team findings: always "requires_decision": true, "dimension": "red_team".
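A lightweight check of the finding structure might look like the following. It is a sketch: `validate_finding` is a hypothetical helper, and it encodes only the rules stated in this section (allowed enum values, null fixability outside Level 1, and `requires_decision: true` for red-team findings).

```python
from typing import Any, Dict, List

REQUIRED_KEYS = [
    "section", "dimension", "level", "severity",
    "priority", "fixability", "message", "requires_decision",
]
SEVERITIES = {"critical", "warning", "info"}
PRIORITIES = {"BLOCKING", "REQUIRED", "ADVISORY"}
FIXABILITY = {"ADDRESSABLE", "STRUCTURAL", None}  # None corresponds to JSON null


def validate_finding(finding: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the finding passes these checks."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in finding]
    if finding.get("severity") not in SEVERITIES:
        problems.append("severity must be critical, warning, or info")
    if finding.get("priority") not in PRIORITIES:
        problems.append("priority must be BLOCKING, REQUIRED, or ADVISORY")
    if finding.get("fixability") not in FIXABILITY:
        problems.append("fixability must be ADDRESSABLE, STRUCTURAL, or null")
    if finding.get("level") != 1 and finding.get("fixability") is not None:
        problems.append("fixability must be null for non-L1 findings")
    if finding.get("dimension") == "red_team" and finding.get("requires_decision") is not True:
        problems.append("red_team findings must set requires_decision to true")
    return problems


valid = {
    "section": "## Hypothesis", "dimension": "estimand_clarity", "level": 1,
    "severity": "critical", "priority": "BLOCKING", "fixability": "STRUCTURAL",
    "message": "The claim is not stated as a formal contrast", "requires_decision": False,
}
bad_red_team = dict(valid, dimension="red_team", level=2, fixability=None)
bad_fixability = dict(valid, dimension="error_budget", level=3, fixability="ADDRESSABLE")
```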
{{AUTOSKILLIT_TEMP}}/review-design/
├── evaluation_dashboard_{slug}_{YYYY-MM-DD_HHMMSS}.md (always written)
└── revision_guidance_{slug}_{YYYY-MM-DD_HHMMSS}.md (REVISE only)
Emit structured output tokens (absolute paths) as your final output.
- /autoskillit:plan-experiment: produces the plan this skill validates
- /autoskillit:scope: first step in the research recipe chain
- /autoskillit:implement-experiment: consumes this skill's GO output (via recipe routing)