src/autoskillit/skills_extended/stage-data/SKILL.md
Pre-flight resource gate for the research recipe. Reads the experiment plan's data_manifest, checks disk space and network connectivity for external/gitignored entries, creates data directory structure, and emits a PASS/WARN/FAIL feasibility verdict.
Pre-flight resource gate for the research recipe. Reads the experiment plan's
data_manifest frontmatter section, checks available disk space and network
connectivity for external and gitignored data source entries, creates the
required data directory structure in the worktree, and emits a PASS/WARN/FAIL
feasibility verdict. PASS and WARN proceed to implementation; FAIL escalates
immediately with a detailed resource feasibility report rather than wasting
compute on doomed downloads.
Runs as the stage_data step between create_worktree and decompose_phases.
Usage: /autoskillit:stage-data <experiment_plan_path>
experiment_plan_path — Absolute path to the experiment plan (positional).
Default: $AUTOSKILLIT_TEMP/experiment-plan.md in the current working directory.
NEVER:
- Write output outside {{AUTOSKILLIT_TEMP}}/stage-data/
- Run subagents in the background (run_in_background: true is prohibited)
ALWAYS:
- Read the data_manifest frontmatter section of the experiment plan
- Check only external and gitignored source_type entries (synthetic and fixture entries require no disk space or network access)
- Create data directories for every entry whose location field is non-null, using mkdir -p
- Use model: "sonnet" for all subagents
Read the experiment plan at the provided path (or default path). Parse the
data_manifest YAML frontmatter section. Identify all entries where
source_type is "external" or "gitignored".
If no external or gitignored entries exist, skip disk and network checks.
Create any data directories for entries with non-null location. Emit
verdict = PASS and exit.
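The selection logic in the two paragraphs above can be sketched in Python. This is a minimal sketch, assuming the data_manifest has already been parsed into a list of dictionaries; the function name, the `name` key, and the sample entries are hypothetical, while source_type and location are the fields named above.

```python
# Hypothetical helper: assumes data_manifest is already parsed into a list of dicts.
CHECKED_SOURCE_TYPES = {"external", "gitignored"}


def entries_needing_checks(manifest):
    """Return the entries that require disk and network checks."""
    return [e for e in manifest if e.get("source_type") in CHECKED_SOURCE_TYPES]


# Illustrative manifest; the entry names and locations are made up.
manifest = [
    {"name": "toy", "source_type": "synthetic", "location": None},
    {"name": "atlas", "source_type": "external", "location": "data/atlas"},
    {"name": "cache", "source_type": "gitignored", "location": "data/cache"},
]

to_check = entries_needing_checks(manifest)
# An empty list means the disk and network checks are skipped and the verdict is PASS.
skip_checks = not to_check
```

Filtering first keeps the early-exit path cheap: when `to_check` is empty, no subagents need to be launched at all.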
Launch parallel subagents — one per external/gitignored entry — each
performing:
a. DISK SPACE AGENT: Run df -k . to get the available space in the worktree.
df -k reports 1024-byte blocks, so multiply the Available column by 1024 to
obtain available_bytes. Estimate the storage need from the entry's description
field using LLM reasoning (e.g., "10-50GB h5ad files" → project 50GB worst
case). If available_bytes == 0 (filesystem completely full), emit FAIL
immediately and do not apply the formula below. Otherwise compute headroom:
headroom_pct = (available_bytes - projected_bytes) / available_bytes * 100
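As a rough illustration, the headroom formula and the disk space thresholds given for this check could be expressed as two small functions. This is a sketch under stated assumptions: the function names and sample sizes are hypothetical, and the handling of a headroom of exactly 0% is a guess, since the thresholds do not cover that case.

```python
KIB = 1024  # df -k reports sizes in 1024-byte blocks


def headroom_pct(available_bytes, projected_bytes):
    """Percent of free space left after the projected download, or None if the disk is full."""
    if available_bytes == 0:
        return None  # completely full: caller emits FAIL without using the formula
    return (available_bytes - projected_bytes) / available_bytes * 100


def disk_verdict(available_bytes, projected_bytes):
    """Map the headroom onto the PASS/WARN/FAIL thresholds."""
    pct = headroom_pct(available_bytes, projected_bytes)
    if pct is None or projected_bytes > available_bytes:
        return "FAIL"
    if pct >= 20:
        return "PASS"
    # Assumption: headroom of exactly 0% is not covered by the thresholds; treated as WARN here.
    return "WARN"


GB = 10**9
available = 97_656_250 * KIB  # hypothetical df -k "Available" value, equal to 100 GB
```

With these hypothetical numbers, a 50 GB projection against 100 GB available leaves 50% headroom, which falls in the PASS band.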
Disk space verdict thresholds:
- FAIL: projected_bytes > available_bytes (negative headroom)
- WARN: 0 < headroom_pct < 20 (less than 20% remaining)
- PASS: headroom_pct >= 20
b. NETWORK PROBE AGENT: Infer the API base URL from the acquisition
field. Known endpoints to probe:
- https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
  Rate limit: 3 req/s without API key, 10 req/s with NCBI API key. Single probe is well within limits.
- https://www.encodeproject.org/
  Rate limit: no published hard limit; courtesy throttle expected at high volume.
- https://rest.uniprot.org/uniprotkb/search?query=reviewed:true&size=1
  Rate limit: undocumented; size=1 keeps payload ~1KB.
- https://api.brain-map.org/api/v2/data/Gene/query.json?num_rows=1
  Rate limit: no published limit; public API. num_rows=1 keeps response ~200B.
- https://api.cellxgene.cziscience.com/dp/v1/collections
  Rate limit: CDN-backed, no published limit. HEAD avoids 53KB collection list.
- https://www.ebi.ac.uk/gxa/json/experiments
  Rate limit: no published hard limit (EBI courtesy policy). CRITICAL: response is ~2.6MB, so always use HEAD (curl -sI), never GET.
- https://www.proteinatlas.org/api/search_download.php?search=CD8A&columns=g&compress=no&format=json
  Rate limit: no published limit; public API. columns=g is required (API returns 400 without it). Response ~20B.
- https://string-db.org/api/json/version
  Rate limit: ~10 req/s for programmatic access. Version endpoint returns 84B, which makes it an ideal probe.
- https://jaspar.elixir.no/api/v1/matrix/?page_size=1
  Rate limit: no published limit (academic resource). Note: domain migrated from genereg.net to elixir.no. page_size=1 keeps response ~330B.
- Any other source: infer the endpoint from the entry's acquisition field.
Run: curl -sI --max-time 10 <endpoint> and inspect the HTTP status.
Network connectivity verdict: WARN means the endpoint is reachable but may require authentication. An unreachable endpoint produces FAIL.
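One way to turn the probe output into a verdict is sketched below. The exact status-code mapping is not given here, so the mapping is an assumption: 401 and 403 stand in for "authentication may be needed", 2xx and 3xx count as reachable, and everything else (including no response at all) is treated as FAIL. The function names are hypothetical.

```python
import re

# Matches the first line of `curl -sI` output, e.g. "HTTP/2 200" or "HTTP/1.1 403 Forbidden".
STATUS_LINE = re.compile(r"^HTTP/\d(?:\.\d)?\s+(\d{3})")


def parse_status(head_output):
    """Extract the status code from the first line of `curl -sI` output, or None."""
    match = STATUS_LINE.match(head_output)
    return int(match.group(1)) if match else None


def network_verdict(status):
    """Classify a probe result; the code-to-verdict mapping is an assumption."""
    if status is None:
        return "FAIL"  # no response: timeout, DNS failure, or refused connection
    if status in (401, 403):
        return "WARN"  # reachable, but authentication may be needed
    if 200 <= status < 400:
        return "PASS"  # assumption: 2xx and 3xx both count as reachable
    return "FAIL"  # assumption: any other status is treated as a failed probe
```

Keeping the parsing separate from the classification means the mapping can be adjusted without touching the curl invocation.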
For every entry whose location field is non-null, run:
mkdir -p <worktree_cwd>/<location>
This creates the data directory hierarchy required by the experiment implementation.
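For readers who prefer to see the directory step as code, a Python equivalent of the mkdir -p call might look like the following. It is only a sketch: the function name and the sample locations are hypothetical, and a temporary directory stands in for the worktree.

```python
import tempfile
from pathlib import Path


def create_data_dirs(worktree, entries):
    """Create <worktree>/<location> for each entry with a non-null location."""
    created = []
    for entry in entries:
        location = entry.get("location")
        if location is None:
            continue  # null location: nothing to create
        target = Path(worktree) / location
        target.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        created.append(target)
    return created


# A temporary directory stands in for the worktree in this example.
with tempfile.TemporaryDirectory() as worktree:
    made = create_data_dirs(worktree, [{"location": "data/raw/atlas"}, {"location": None}])
    all_exist = all(p.is_dir() for p in made)
    made_count = len(made)
```

Like mkdir -p, `exist_ok=True` makes the call safe to repeat if the skill is re-run in the same worktree.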
Aggregate results across all entries:
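The aggregation rule itself is not spelled out in this copy of the document. One plausible reading, assuming the most severe per-entry status determines the overall verdict, is sketched here; the function name and severity ordering are assumptions.

```python
# Assumed ordering: FAIL outranks WARN, which outranks PASS.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def aggregate(statuses):
    """Return the most severe status; an empty input yields PASS."""
    return max(statuses, key=SEVERITY.__getitem__, default="PASS")
```

The empty-input default matches the earlier rule that a manifest with no external or gitignored entries yields PASS.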
Write the resource feasibility report to:
{{AUTOSKILLIT_TEMP}}/stage-data/resource_feasibility_{YYYY-MM-DD_HHMMSS}.md
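The timestamped file name can be built with a standard strftime pattern. In this sketch the function name is hypothetical and /tmp/autoskillit is only a stand-in for whatever {{AUTOSKILLIT_TEMP}} resolves to.

```python
from datetime import datetime
from pathlib import PurePosixPath


def report_path(temp_dir, now):
    """Build the report path using the YYYY-MM-DD_HHMMSS timestamp format."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return PurePosixPath(temp_dir) / "stage-data" / f"resource_feasibility_{stamp}.md"


# "/tmp/autoskillit" is a placeholder for the resolved {{AUTOSKILLIT_TEMP}} value.
example = report_path("/tmp/autoskillit", datetime(2025, 1, 2, 3, 4, 5))
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the function, keeps the path reproducible in tests.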
Report structure:
## Resource Feasibility Report
**Date:** {timestamp}
**Verdict:** PASS | WARN | FAIL
### Disk Space Assessment
| Entry | Source Type | Projected Size | Available | Headroom | Status |
|-------|-------------|----------------|-----------|----------|--------|
...
### Network Connectivity Assessment
| Entry | Endpoint Probed | HTTP Status | Latency | Status |
|-------|-----------------|-------------|---------|--------|
...
### Data Directories Created
- {location}: created | skipped (null location)
...
### Recommended Actions (WARN/FAIL only)
- {specific actionable step to resolve each issue}
Emit structured output tokens as LITERAL PLAIN TEXT with NO markdown
formatting on the token names. Do not wrap token names in **bold**,
*italic*, or any other markdown. The adjudicator performs a regex match
on the exact token name — decorators cause match failure.
verdict = PASS
resource_report = /absolute/path/to/resource_feasibility_{YYYY-MM-DD_HHMMSS}.md
verdict = PASS|WARN|FAIL
resource_report = /absolute/path/to/{{AUTOSKILLIT_TEMP}}/stage-data/resource_feasibility_{YYYY-MM-DD_HHMMSS}.md
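To illustrate why markdown decoration breaks the match, here is a small sketch of a line-anchored parser. The patterns are hypothetical; the adjudicator's actual regex is not shown in this document, and the sample path is made up.

```python
import re

# Hypothetical patterns; the adjudicator's actual regex is not shown in this document.
VERDICT = re.compile(r"^verdict = (PASS|WARN|FAIL)$", re.MULTILINE)
REPORT = re.compile(r"^resource_report = (/\S+)$", re.MULTILINE)


def parse_tokens(output):
    """Return (verdict, report_path), or None if either token is missing."""
    verdict = VERDICT.search(output)
    report = REPORT.search(output)
    if not (verdict and report):
        return None
    return verdict.group(1), report.group(1)


plain = "verdict = WARN\nresource_report = /abs/stage-data/resource_feasibility_2025-01-02_030405.md\n"
# Wrapping the token name in bold markers changes the line so the anchored pattern no longer matches.
decorated = plain.replace("verdict", "**verdict**")
```

Under these assumed patterns the plain output parses cleanly, while the bolded variant yields no verdict at all.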