.claude/skills/critical-analysis/SKILL.md
## Skill: Critical Analysis (v1.1)

Multi-role critique system for ML experiments and architecture decisions. Based on: Solo Performance Prompting (NAACL 2024), CrewAI role taxonomy, Bermingham 13-agent DA-as-gate, De Bono Six Thinking Hats (Black Hat gating).

---

### When to Load (Auto-Default Behavior)

**Before any of the following — run Quick Mode without waiting for user request:**

- Architecture or design decision (new integration, refactor, layer change)
- ML experiment launch (clusteri
Design doc check: for any architectural decision, verify that `design-doc.md` exists and guides this decision. If missing → [DA]: "No design doc — decisions are ungrounded."
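The mechanical half of this check (does a design doc exist at all?) can be sketched in Python as below. The helper name `design_doc_finding` and the `docs/` fallback in `DEFAULT_LOCATIONS` are hypothetical assumptions. Whether the doc actually *guides* the decision still takes human or model judgment, and this code does not attempt it.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed search locations (hypothetical); the skill itself only names "design-doc.md".
DEFAULT_LOCATIONS = ("design-doc.md", "docs/design-doc.md")


def design_doc_finding(repo_root: Path,
                       locations: Iterable[str] = DEFAULT_LOCATIONS) -> Optional[str]:
    """Return the [DA] finding when no design doc exists, else None."""
    for rel in locations:
        if (repo_root / rel).is_file():
            return None  # a design doc exists, so this check raises nothing
    # No candidate file found: emit the skill's [DA] message verbatim.
    return '[DA]: "No design doc — decisions are ungrounded."'
```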
Quick Mode: if all 8 roles return "clear" — proceed. If any role flags CRITICAL — surface it to the user before proceeding.
This is NOT optional. Critique runs before code, not after.
| Role | Source | Key Question |
|------|--------|--------------|
| [Security] Security Sentinel | CrewAI Code Security Auditor | Injection, auth, secrets, attack surface — what can be exploited? |
| [Perf] Performance Analyst | CrewAI Performance Optimizer | O(n²), wasted iterations, memory — where is 80% of cost wasted? |
| [DA] Devil's Advocate | Bermingham 13-agent + Black Hat | What is the strongest argument AGAINST this? What breaks in 90 days? |
| [Crutch] Crutch Identifier | Project history: defectoscopy + hub | Is this a reusable pattern or a workaround being institutionalized? |
| [Strategy] Strategic Horizon | DX/tech debt, 6-month view | New external dep / >3 files / new interface contract? Flag only if yes. |
| [ML] ML Experiment Auditor | arxiv 2603.15916, project history | Are we at a plateau? Is this experiment worth running at all? |
| [TestCov] Testing Coverage | Qodo/CodeRabbit/Anthropic tool | Does this change need new tests? Are edge cases covered? |
| [Obs] Observability Enforcer | Google SRE, OpenTelemetry practices | Can we debug this in prod? Are logs/metrics/traces in place? |
Run all 8 roles simultaneously in your reasoning. Output before proceeding:
QUICK CRITIQUE:
[Security]: <finding + what was evaluated, or "clear — evaluated: X">
[Perf]: <finding + what was evaluated, or "clear — evaluated: X">
[DA]: <top objection with 90-day scenario, or "no blockers — reason">
[Crutch]: <pattern or crutch + evidence, or "pattern — reusable: evidence">
[Strategy]: <flag only if new dep/3+ files/new contract; else "below threshold">
[ML]: <plateau/pivot signal, or "N/A — not an experiment">
[TestCov]: <missing test areas, or "clear — existing tests cover: X">
[Obs]: <observability gap, or "clear — logs/metrics in place">
VERDICT: PROCEED / BLOCKED by [Role] — [reason]
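To make the gating rule concrete, here is a minimal Python sketch of how the VERDICT line could be derived from the eight findings. It assumes, hypothetically, that a role "flags CRITICAL" when its finding text begins with `CRITICAL`; the skill defines no machine-readable marker. The names `verdict` and `ROLES` are illustrative, and the first flagged role in table order is the one reported.

```python
from typing import Dict

# Role tags from the table, in table order.
ROLES = ("Security", "Perf", "DA", "Crutch", "Strategy", "ML", "TestCov", "Obs")


def verdict(findings: Dict[str, str]) -> str:
    """Build the VERDICT line from per-role findings (hypothetical helper)."""
    missing = [r for r in ROLES if r not in findings]
    if missing:
        # Quick Mode requires all 8 roles; refuse to decide on a partial critique.
        raise ValueError(f"missing roles: {missing}")
    for role in ROLES:
        text = findings[role].strip()
        # Assumed convention: a CRITICAL finding starts with the word CRITICAL.
        if text.upper().startswith("CRITICAL"):
            reason = text[len("CRITICAL"):].lstrip(" :-")
            return f"VERDICT: BLOCKED by [{role}] - {reason}"
    return "VERDICT: PROCEED"
```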
Anti-collapse rule: Before writing "clear" for any role — cite what was evaluated.
WRONG: [Security]: clear
RIGHT: [Security]: clear — evaluated: no new endpoints, no user input, no new deps
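A critique can be linted for this rule mechanically. The Python sketch below flags any role line whose finding is a bare "clear" or "no blockers" with nothing cited after it. The phrase list `BARE_PHRASES` and the function name `collapsed_roles` are assumptions drawn from the QUICK CRITIQUE template, not part of the skill.

```python
import re

# Phrases that must be followed by a citation (assumed from the output template).
BARE_PHRASES = ("clear", "no blockers")
ROLE_LINE = re.compile(r"^\s*\[(?P<role>\w+)\]:\s*(?P<body>.*)$")
SEPARATORS = " -\u2014:,"  # space, hyphen, em dash, colon, comma


def collapsed_roles(critique: str) -> list:
    """Return roles whose finding is a bare 'clear'/'no blockers' with no citation."""
    bad = []
    for line in critique.splitlines():
        m = ROLE_LINE.match(line)
        if not m:
            continue  # skips headers such as "QUICK CRITIQUE:" and "VERDICT: ..."
        body = m.group("body").strip().lower()
        for phrase in BARE_PHRASES:
            # Violation: the phrase is followed only by separators (or nothing).
            if body.startswith(phrase) and not body[len(phrase):].strip(SEPARATORS):
                bad.append(m.group("role"))
                break
    return bad
```

For example, `collapsed_roles("[Security]: clear")` returns `["Security"]`, while the RIGHT line above passes.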
[Strategy] falsifiability threshold: Flag ONLY if ≥1 of:
below threshold — not flagging

If BLOCKED: surface the finding to the user. Do not write code around a CRITICAL finding.
Before any experiment launch, answer all 4 checks:
PLATEAU CHECK: Plateau = absolute delta < 0.01 AND relative delta < 5% (both conditions must hold).

- Example: metric=0.65, new=0.659 → 0.009 abs, 1.4% rel → PLATEAU
- Example: metric=0.65, new=0.670 → 0.020 abs, 3.1% rel → NOT plateau (the absolute delta exceeds 0.01, even though the relative delta is under 5%)
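The plateau rule translates directly into a small Python predicate, sketched below. The function name `is_plateau` is hypothetical. Two behaviors are assumptions the skill does not specify: deltas are taken as absolute values, so a small regression also counts as a plateau, and a zero baseline is treated as "not plateau" because the relative delta is undefined there.

```python
def is_plateau(baseline: float, new: float,
               abs_tol: float = 0.01, rel_tol: float = 0.05) -> bool:
    """Plateau = absolute delta < abs_tol AND relative delta < rel_tol."""
    abs_delta = abs(new - baseline)
    if baseline == 0:
        # Relative delta is undefined at a zero baseline (assumed: not a plateau).
        return False
    rel_delta = abs_delta / abs(baseline)
    # Both conditions must hold; failing either one means real movement.
    return abs_delta < abs_tol and rel_delta < rel_tol
```

With the examples above, `is_plateau(0.65, 0.659)` is `True` and `is_plateau(0.65, 0.670)` is `False`. A low baseline shows why both conditions matter: `is_plateau(0.05, 0.058)` is `False`, because a 0.008 gain is a 16% relative improvement.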
HYPOTHESIS QUALITY
OBSERVABILITY SLA
MACRO PRIORITY
Pre-flight template (required before any experiment > 30 min):
Hypothesis: [specific falsifiable claim]
Baseline: [prior result, or "NONE — run baseline first"]
Success: [metric >= X, or cost <= Y]
Pilot: [5% subset, ~5 min — validates I/O before full run]
SLA: [check interval, hung condition, artifact path]
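One way to enforce the template before a long run is sketched below in Python. The `PreFlight` dataclass and `preflight_problems` function are hypothetical; only the five field names and the 30-minute threshold come from the template. An empty result means the launch may proceed.

```python
from dataclasses import dataclass, fields


@dataclass
class PreFlight:
    """Mirrors the pre-flight template fields (hypothetical structure)."""
    hypothesis: str
    baseline: str
    success: str
    pilot: str
    sla: str


def preflight_problems(pf: PreFlight, expected_minutes: float) -> list:
    """List problems that should block launch; an empty list means go."""
    if expected_minutes <= 30:
        return []  # the template is only required for experiments > 30 min
    # Every template field must be filled in.
    problems = [f"{f.name} is empty" for f in fields(pf)
                if not getattr(pf, f.name).strip()]
    # "NONE" as baseline means the baseline run has to happen first.
    if pf.baseline.strip().upper().startswith("NONE"):
        problems.append("no baseline: run baseline first")
    return problems
```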
Crutch indicators:
Pattern indicators (safe):
[CRITIQUE] in user prompt)

Launch 8 subagents via the Agent tool in parallel, one per role.
Each subagent receives the context, the proposed decision, and its role system prompt from `resources/role-prompts.md`.
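The fan-out/fan-in shape of Deep Mode resembles a concurrent gather, sketched below with Python's `asyncio`. This is only an analogy. `critique_role` is a hypothetical placeholder, not the Agent tool's interface, and `deep_mode` simply collects one result per role in table order for the synthesis step.

```python
import asyncio
from typing import Dict

ROLES = ("Security", "Perf", "DA", "Crutch", "Strategy", "ML", "TestCov", "Obs")


async def critique_role(role: str, context: str, decision: str) -> str:
    """Hypothetical stand-in for one subagent call (not a real Agent tool API)."""
    await asyncio.sleep(0)  # placeholder for the actual model/tool round trip
    return f"[{role}]: pending review of {decision!r}"


async def deep_mode(context: str, decision: str) -> Dict[str, str]:
    # Start one task per role and wait for all of them concurrently.
    results = await asyncio.gather(
        *(critique_role(r, context, decision) for r in ROLES)
    )
    # gather preserves argument order, so zip maps each result to its role.
    return dict(zip(ROLES, results))
```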
Synthesize via D3:
From defectoscopy (ML pipeline):
From techcon_hub (infrastructure):
- `terraform apply -auto-approve` destroyed the controller node → [Security] + [DA]

Load only when explicitly needed (not auto-injected):
- `resources/role-prompts.md`: full SPP prompts for all 8 roles (Deep Mode)
- `resources/ml-audit-protocol.md`: plateau detection, pivot logic, experiment manifest
- `resources/failure-patterns.md`: annotated failure catalog (defectoscopy + hub)