skills/audit-config/SKILL.md
Pre-launch experimental configuration audit. Use before committing API spend to a detection or verification run. Systematically checks every config parameter against the preregistered protocol, known failure modes, and the filesystem. Produces a READY TO LAUNCH / BLOCKED verdict. Trigger phrases: 'audit configs', 'check configs', 'ready to run', 'launch the evaluation'. Also trigger PROACTIVELY when the user is about to launch an experimental API run — especially after generating new configs.
This is a VERIFICATION task — not a review, not a suggestion engine. The goal is to produce a binary READY TO LAUNCH / BLOCKED verdict by systematically checking every configuration parameter against the preregistered protocol. This is NOT a code review or a design discussion. Do not suggest improvements to the experimental design; only verify that the implementation matches the intent.
You are auditing VLM detection experiment configurations before an API run that costs real money. Your job is to catch configuration errors, preregistration violations, and silent parameter mismatches BEFORE they waste budget. The H10/H12 text-only error ($33 wasted, half a day of invalid analysis and troubleshooting) was caused by exactly the kind of mistake this audit is designed to catch.
1. docs/methodology/preregistration/osf/preregistration.md: the canonical protocol
2. docs/methodology/preregistration/protocol-errata.md: documented deviations that override the preregistration
3. docs/methodology/preregistration/decisions-log.md: decisions that constrain or modify the protocol

When sources conflict, higher-numbered sources override lower (errata override preregistration; filesystem overrides everything claimed in config descriptions).
The user will provide:
IN SCOPE: Every parameter that reaches the API payload — model, temperature, thinking_level, include_example_images, instruction_file, examples list, tile size, tile manifest, evaluation bounds.
OUT OF SCOPE: Code quality, script architecture, analysis plan, cost optimisation. These are valid concerns but not this audit's job.
Read the relevant hypothesis section from
docs/methodology/preregistration/osf/preregistration.md. Also read
protocol-errata.md and decisions-log.md for modifications. Refer to other
preregistration documents as necessary to clarify the protocol.
Extract EVERY testable requirement into a numbered checklist. For each requirement, record:
Do not proceed to Step 2 until this checklist is complete. This checklist becomes the input for Step 4.
Load ALL condition configs. For EVERY field in EVERY config, check whether it is IDENTICAL across all conditions or DIFFERS. Do not skip fields that "obviously" should be the same — check them.
Report as a table:
| Field | Identical? | Value(s) | Classification |
|-------|-----------|----------|---------------|
| model | YES | gemini-3.5-flash | Controlled |
| examples | NO | [differs per pool] | MANIPULATED: expected |
| temperature | YES | 0.7 | Controlled |
| ... | ... | ... | ... |
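The field-by-field comparison can be sketched in a few lines of Python. This is a minimal illustration, not the project's tooling: it assumes each condition config is a flat JSON object on disk, and the helper names `load_configs` and `diff_configs` are hypothetical.

```python
import json
from pathlib import Path

ABSENT = "<absent>"  # hypothetical sentinel: keeps a missing key distinct from an explicit null


def load_configs(paths):
    """Parse each JSON config, keyed by file stem (assumed to name the condition)."""
    return {Path(p).stem: json.loads(Path(p).read_text(encoding="utf-8")) for p in paths}


def diff_configs(configs):
    """Compare every field across all conditions.

    Returns {field: {"identical": bool, "values": {condition: value}}}.
    """
    all_fields = set()
    for cfg in configs.values():
        all_fields.update(cfg)

    report = {}
    for field in sorted(all_fields):
        values = {name: cfg.get(field, ABSENT) for name, cfg in configs.items()}
        # Serialise so that lists and nested objects compare by content.
        distinct = {json.dumps(v, sort_keys=True) for v in values.values()}
        report[field] = {"identical": len(distinct) == 1, "values": values}
    return report
```

Each entry in the returned report corresponds to one row of the table; the Classification column still has to be filled in by comparing the differing fields against the intended manipulation.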
Check in BOTH directions:
For EVERY config, verify that the manipulated factor will physically reach the API. Check against these known failure modes:
| Error Mode | What to check | Source |
|---|---|---|
| Image flag off | include_example_images is explicitly true, not false, null, or absent | H10/H12 text-only error |
| Temperature shadowed | No CLI --temperature override contradicts config value | E43, E44 |
| Thinking level dropped | thinking_level matches intent; if model is Pro, level is ≥ MEDIUM | E34, E40 |
| Model version drift | model field is identical across all conditions and matches the study's model | Scratchpad rule |
| Tile size mismatch | Config tile_size matches actual tile dimensions (384 vs 512) | Dry-run error |
| Wrong tile set | Manifest points to correct tile set (calibration vs evaluation vs full) | Manual check |
| Wrong instruction file | instruction_file matches the intended track (image vs text-only) | H10/H12 base config error |
| Example paths broken | EVERY path in the examples list resolves to a file under inputs/examples/ | Filesystem check |
| Example image dimensions | Crop images are the expected size (e.g., 150×150 for hard examples, 384×384 for nulls) | Pipeline consistency |
For each config, report PASS or FAIL per error mode. A FAIL on any error mode is a BLOCKER.
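Three of the error modes above (image flag, temperature override, example paths) lend themselves to a mechanical check. The sketch below is an assumption-laden illustration: `check_transmission` is a hypothetical helper, paths are resolved relative to the current working directory, and the image-flag rule applies only to image-based experiments.

```python
from pathlib import Path


def check_transmission(cfg, examples_root="inputs/examples", cli_temperature=None):
    """Return {error_mode: "PASS" | "FAIL"} for one parsed config dict."""
    results = {}

    # Must be the literal boolean True; False, None, or a missing key all fail.
    flag_ok = cfg.get("include_example_images") is True
    results["image_flag"] = "PASS" if flag_ok else "FAIL"

    # A CLI override, if one is supplied, must not contradict the config value.
    temp_ok = cli_temperature is None or cli_temperature == cfg.get("temperature")
    results["temperature"] = "PASS" if temp_ok else "FAIL"

    # Every example path must be an existing file located under examples_root.
    root = Path(examples_root).resolve()

    def under_root(path):
        resolved = Path(path).resolve()
        return resolved.is_file() and root in resolved.parents

    paths_ok = all(under_root(p) for p in cfg.get("examples") or [])
    results["example_paths"] = "PASS" if paths_ok else "FAIL"
    return results
```

Any "FAIL" value in the returned mapping would be reported as a blocker for that config. The remaining error modes (tile size, tile set, image dimensions) need access to the tiles themselves and are not covered here.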
Using the requirements checklist from Step 1, check EVERY requirement against the configs:
| # | Requirement | Config value | Verdict | Notes |
|---|---|---|---|---|
| 1 | Pool sizes: 20, 40, 80, 160 | [check] | MATCHES / DEVIATION | |
| 2 | ... | ... | ... | |
Verdict categories:
protocol-errata.md. Cite the errata entry (e.g., "E49").

Also check the reverse direction: for every parameter in the configs, is it consistent with the preregistration? This catches parameters the preregistration doesn't mention but the config sets to non-default values.
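The two-directional check can be illustrated with a small Python sketch. It assumes the requirements checklist has been reduced to a mapping from parameter name to expected value, which is a simplification, and `check_alignment` is a hypothetical name. Deciding whether a deviation is deliberate still requires reading the errata.

```python
def check_alignment(requirements, cfg):
    """Compare a requirements mapping against one parsed config dict.

    requirements: {parameter: expected value} (assumed shape of the checklist).
    Returns (forward, unmentioned):
      forward      maps each required parameter to "MATCHES" or "DEVIATION"
      unmentioned  lists config parameters that no requirement covers
    """
    forward = {}
    for param, expected in requirements.items():
        # A required parameter that is absent from the config counts as a deviation.
        matches = param in cfg and cfg[param] == expected
        forward[param] = "MATCHES" if matches else "DEVIATION"

    # Reverse direction: parameters the config sets but the protocol never mentions.
    unmentioned = sorted(set(cfg) - set(requirements))
    return forward, unmentioned
```

For example, `check_alignment({"temperature": 0.7}, {"temperature": 0.7, "tile_size": 384})` reports a match for `temperature` and lists `tile_size` as unmentioned, so its value would then be reviewed by hand against the defaults.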
Run --dry-run of the detection script with ONE config and verify:
experiment_intent.md preview confirms the hypothesis, factor, and modality

Report PASS or FAIL. Any discrepancy is a BLOCKER.
Verify:
After completing Steps 1-6, review your own audit:
If any items remain unchecked, flag them as WARNINGS in the final report.
Present results as a structured audit report. EVERY check MUST have a verdict — "looks fine" is not a verdict.
=== PRE-LAUNCH AUDIT: [Hypothesis] ===
1. PREREGISTRATION REQUIREMENTS: [N requirements extracted]
[numbered list]
2. CONFIG DIFF: [N fields identical, M fields differ]
Controlled: [list with values]
Manipulated: [list — EXPECTED or UNEXPECTED per field]
Confounds: [NONE or list]
3. TRANSMISSION CHECK:
[table: config × error mode → PASS/FAIL]
Blockers: [NONE or list]
4. PREREGISTRATION ALIGNMENT:
Matches: [count]
Deliberate deviations: [count, with errata refs]
Undocumented deviations: [count — BLOCKER if >0]
5. DRY-RUN: PASS / FAIL
[details if FAIL]
6. EVALUATION SCOPE: PASS / FAIL
[details if FAIL]
7. COMPLETENESS: [items not checked, if any]
BLOCKERS: [list, or NONE]
WARNINGS: [list, or NONE]
OVERALL: READY TO LAUNCH / BLOCKED ([reasons])
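The last three lines of the report follow mechanically from the collected findings. A minimal Python sketch, assuming blockers and warnings have been gathered as lists of short strings and that warnings alone do not block a launch (an assumption of this sketch); `render_summary` is a hypothetical helper:

```python
def render_summary(blockers, warnings):
    """Render the BLOCKERS / WARNINGS / OVERALL lines of the audit report."""

    def fmt(items):
        return ", ".join(items) if items else "NONE"

    # Any blocker forces BLOCKED; warnings are reported but do not change the verdict.
    if blockers:
        overall = "BLOCKED (" + "; ".join(blockers) + ")"
    else:
        overall = "READY TO LAUNCH"

    return "\n".join([
        f"BLOCKERS: {fmt(blockers)}",
        f"WARNINGS: {fmt(warnings)}",
        f"OVERALL: {overall}",
    ])
```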
This audit is complete when:
DO NOT:
- Trust description fields: check the ACTUAL JSON values for every parameter
- Treat include_example_images: null or absent as equivalent to true; it must be explicitly true for image-based experiments

A config with include_example_images: false that claims to test an image-based factor is ALWAYS a blocker. If more parameters differ between conditions than the target factor + metadata fields, flag as a potential confound.
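The confound rule can be expressed as a set difference. The sketch below assumes flat JSON configs already parsed into dicts; `find_confounds` and the default metadata field names are hypothetical and would need to match the real config schema.

```python
import json


def find_confounds(configs, target_fields, metadata_fields=("name", "description")):
    """List fields that differ across conditions but are neither the target
    factor nor metadata. A non-empty result is a potential confound."""
    all_fields = set()
    for cfg in configs.values():
        all_fields.update(cfg)

    differing = set()
    for field in all_fields:
        # "<absent>" keeps a missing key distinct from an explicit null.
        values = {json.dumps(cfg.get(field, "<absent>"), sort_keys=True)
                  for cfg in configs.values()}
        if len(values) > 1:
            differing.add(field)

    return sorted(differing - set(target_fields) - set(metadata_fields))
```

With `examples` as the target factor, a pair of configs that also differ in `temperature` would return `["temperature"]`.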