skills/data-and-analytics/ml-failure-audit/SKILL.md
General workflow for auditing ML CI failures, experiment regressions, training run failures, golden metric failures, and telemetry-backed ML work-product claims from local repositories, logs, metrics, configs, and artifacts. Use when Codex needs to decide whether an ML failure is a model/convergence issue, correctness bug, data/config issue, infrastructure/runtime issue, evaluation/gating policy issue, or unsupported claim, and produce structured evidence-backed outputs.
npx skillsauth add eigent-ai/agent-skills ml-failure-audit
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Audit ML failures from supplied artifacts without assuming the headline explanation is true. Use this skill when the user provides a repo, logs, W&B/MLflow/TensorBoard exports, CI artifacts, config files, or reports and asks for a diagnosis, go/no-go decision, or structured output.
Locate evidence
Classify the failure
Recompute key facts
Trace code paths
Make a decision
Write outputs
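The last two steps (make a decision, write outputs) can be sketched as a small structured record. This is only an illustration: the class names, field names, and string values below are hypothetical and are not taken from the skill's reference files. The six categories mirror the failure types named in the skill description.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum


class FailureClass(Enum):
    """Hypothetical labels for the six failure types named in the description."""
    MODEL_CONVERGENCE = "model_convergence"
    CORRECTNESS_BUG = "correctness_bug"
    DATA_CONFIG = "data_config"
    INFRA_RUNTIME = "infra_runtime"
    EVAL_GATING_POLICY = "eval_gating_policy"
    UNSUPPORTED_CLAIM = "unsupported_claim"


@dataclass
class AuditFinding:
    """One audit conclusion plus the evidence that supports it (assumed shape)."""
    failure_class: FailureClass
    decision: str  # for example "go" or "no-go"
    evidence: list = field(default_factory=list)  # file/line references or quotes

    def to_json(self) -> str:
        payload = asdict(self)
        # Store the enum as its plain string value so the JSON is portable.
        payload["failure_class"] = self.failure_class.value
        return json.dumps(payload, indent=2)
```

Keeping the evidence list next to the classification makes it harder to record a conclusion that no artifact supports.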
Use scripts/collect_failure_evidence.py for a quick first pass over a repo and logs:
python3 <skill_dir>/scripts/collect_failure_evidence.py \
--repo <repo-root> \
--logs <log1> <log2> \
--out <output.json>
The script is intentionally generic. It extracts failure lines, pass lines, metric-looking lines, config/source candidates, and nearby context windows. Use it to accelerate evidence gathering, not as the final diagnosis.
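To make the idea of a generic first pass concrete, here is a minimal sketch of line-based evidence extraction with context windows. It is not the code of collect_failure_evidence.py: the regular expressions, the function name `collect_evidence`, and the output keys are all assumptions chosen for illustration.

```python
import re

# Assumed patterns; a real log format will likely need different ones.
FAIL_RE = re.compile(r"\b(error|fail(ed|ure)?|traceback|exception)\b", re.IGNORECASE)
PASS_RE = re.compile(r"\b(pass(ed)?|ok|success)\b", re.IGNORECASE)
METRIC_RE = re.compile(r"\b[A-Za-z_][\w./-]*\s*[=:]\s*-?\d+(\.\d+)?([eE][-+]?\d+)?\b")


def collect_evidence(lines, context=2):
    """Bucket log lines into failures, passes, and metrics, with nearby context.

    Each line lands in at most one bucket; failure patterns take precedence,
    then pass patterns, then metric-looking patterns.
    """
    evidence = {"failures": [], "passes": [], "metrics": []}
    for i, line in enumerate(lines):
        if FAIL_RE.search(line):
            kind = "failures"
        elif PASS_RE.search(line):
            kind = "passes"
        elif METRIC_RE.search(line):
            kind = "metrics"
        else:
            continue
        lo = max(0, i - context)
        hi = min(len(lines), i + context + 1)
        evidence[kind].append({
            "line_no": i + 1,  # 1-based, so it matches what an editor shows
            "text": line.rstrip("\n"),
            "context": [l.rstrip("\n") for l in lines[lo:hi]],
        })
    return evidence
```

Keyword matching of this kind overcollects by design, which is why the output is a starting point for reading the artifacts rather than a diagnosis.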
references/workflow.md for the detailed audit checklist and failure taxonomy.
references/output_guidance.md when the user asks for structured JSON or a file deliverable.
Use a short realistic task prompt like:
Use $ml-failure-audit to audit this ML CI failure from the provided repo and logs. Decide whether it is a real training regression or a gate/policy issue, and produce the requested output files.