Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

eigent-ai/ml-failure-audit

Name: ml-failure-audit
Author: eigent-ai

skills/data-and-analytics/ml-failure-audit/SKILL.md

npx skillsauth add eigent-ai/agent-skills ml-failure-audit

ML Failure Audit

Purpose

Audit ML failures from supplied artifacts without assuming the headline explanation is true. Use this skill when the user provides a repo, logs, W&B/MLflow/TensorBoard exports, CI artifacts, config files, or reports and asks for a diagnosis, go/no-go decision, or structured output.

Core Workflow

1. Locate evidence
   - Find the repo root, log files, metric exports, configs, test definitions, golden values, and any requested output schema.
   - Treat raw logs, raw telemetry, configs, and source code as higher-trust than reports, PR text, summaries, or generated JSONs.
2. Classify the failure
   - Separate model/convergence signals from correctness, data, config, runtime, infra, and metric-policy signals.
   - Do not label a failure as model/convergence regression when only a performance, timeout, logging, or tolerance gate failed and correctness/loss checks passed.
3. Recompute key facts
   - Extract the failing metric/test, passed checks, final run state, key training counters, and relevant metric values.
   - Recompute numeric claims directly from raw artifacts when possible.
   - Record formulas for derived values such as throughput, relative error, finish rate, loss deltas, or token counts.
4. Trace code paths
   - Identify how the repo selects metrics/tests and how comparisons are made.
   - Cite exact source files and, when useful, function names or line snippets.
5. Make a decision
   - State whether this is a true ML regression, system correctness bug, infra/runtime issue, data/config issue, metric-policy issue, or unsupported claim.
   - Recommend the minimal policy or engineering action that protects correctness without over-blocking valid runs.
6. Write outputs
   - Follow the task's requested schema exactly.
   - If no schema is given, write a concise report with: evidence, classification, calculations, recommendation, and validation checks.
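The classification and recomputation steps above can be sketched in a few lines of Python. This is only an illustration, not part of the skill: the function names, the `Check` record, the check-kind labels, and the result strings are all hypothetical, and a real audit would take its inputs from the raw logs and configs rather than from hand-written values.

```python
from dataclasses import dataclass

# Hypothetical check kinds, mirroring the wording of the workflow above.
CORE_KINDS = {"correctness", "loss"}
GATE_KINDS = {"performance", "timeout", "logging", "tolerance"}


def relative_error(observed: float, golden: float) -> float:
    """Relative error: |observed - golden| / |golden|."""
    if golden == 0:
        raise ValueError("golden value is zero; relative error is undefined")
    return abs(observed - golden) / abs(golden)


def throughput(tokens: int, seconds: float) -> float:
    """Tokens processed per second over a measured interval."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


@dataclass
class Check:
    name: str
    kind: str      # one of CORE_KINDS or GATE_KINDS (assumed labels)
    passed: bool


def classify(checks: list[Check]) -> str:
    """Return a coarse first-pass label; it never asserts a model regression."""
    failed = [c for c in checks if not c.passed]
    if not failed:
        return "no-failure"
    core = [c for c in checks if c.kind in CORE_KINDS]
    only_gates_failed = all(c.kind in GATE_KINDS for c in failed)
    # "gate-only" requires positive evidence that correctness/loss checks passed.
    if only_gates_failed and core and all(c.passed for c in core):
        return "gate-only"
    return "needs-investigation"
```

Recording the formula next to each derived value, as the docstrings do here, keeps the calculation reproducible by whoever reviews the report.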

Guardrails

- Do not run expensive GPU training unless the user explicitly requests it.
- Do not clone remote repos when a local checkout is supplied.
- Do not trust prose summaries until verified against raw artifacts.
- Do not infer OOM, convergence, stability, or success from missing logs alone.
- If the task asks for a file output, create the file; do not only answer in chat.
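As a rough illustration of checking a prose claim against raw artifacts, the sketch below compares a claimed final loss with the last loss value found in log lines. The `step=N loss=X` log format, the function names, and the tolerance are assumptions made for this example; real logs will need their own pattern. When no loss lines are found, the claim is reported as unsupported rather than treated as evidence either way.

```python
import re

# Assumed log format for illustration only, e.g. "step=20 loss=2.31".
LOSS_PATTERN = re.compile(r"step=(\d+)\s+loss=([0-9]*\.?[0-9]+(?:[eE][-+]?\d+)?)")


def final_loss_from_log(lines):
    """Return (step, loss) from the last matching line, or None if none match."""
    last = None
    for line in lines:
        match = LOSS_PATTERN.search(line)
        if match:
            last = (int(match.group(1)), float(match.group(2)))
    return last


def verify_claim(claimed_loss, lines, tol=1e-3):
    """Compare a claimed final loss with the raw log within an absolute tolerance."""
    found = final_loss_from_log(lines)
    if found is None:
        # Missing data is not evidence of success or failure.
        return "unsupported"
    _, loss = found
    return "verified" if abs(loss - claimed_loss) <= tol else "contradicted"
```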

Helper Script

Use scripts/collect_failure_evidence.py for a quick first pass over a repo and logs:

python3 <skill_dir>/scripts/collect_failure_evidence.py \
  --repo <repo-root> \
  --logs <log1> <log2> \
  --out <output.json>

The script is intentionally generic. It extracts failure lines, pass lines, metric-looking lines, config/source candidates, and nearby context windows. Use it to accelerate evidence gathering, not as the final diagnosis.
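The source of scripts/collect_failure_evidence.py is not shown here, so the following is not its implementation. It is a minimal sketch of the general idea of pulling failure lines together with nearby context windows, using a hypothetical pattern and a hypothetical `failure_windows` function.

```python
import re

# Hypothetical failure markers; a real pass would tune these to the project's logs.
FAILURE_PATTERN = re.compile(r"\b(FAIL(?:ED)?|ERROR|Traceback)\b")


def failure_windows(lines, radius=2):
    """Collect each failure-looking line with `radius` lines of context on each side."""
    windows = []
    for i, line in enumerate(lines):
        if FAILURE_PATTERN.search(line):
            start = max(0, i - radius)
            end = min(len(lines), i + radius + 1)
            windows.append({
                "line_no": i + 1,          # 1-based, as most editors show it
                "match": line,
                "context": lines[start:end],
            })
    return windows
```

Keeping the 1-based line number with each window makes it easier to cite the exact log location in the final report.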

References

- Read references/workflow.md for the detailed audit checklist and failure taxonomy.
- Read references/output_guidance.md when the user asks for structured JSON or a file deliverable.
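The contents of references/output_guidance.md are not reproduced here. As a hedged sketch of a file deliverable when no schema is given, the example below writes a JSON report containing the five sections named in the Core Workflow. The key names, the `write_report` function, and the validation behavior are assumptions for illustration; when a task supplies its own schema, that schema takes precedence.

```python
import json
from pathlib import Path

# Section names taken from the workflow's default report; key spelling is assumed.
REQUIRED_SECTIONS = (
    "evidence",
    "classification",
    "calculations",
    "recommendation",
    "validation_checks",
)


def write_report(report: dict, out_path: str) -> Path:
    """Validate that all default sections exist, then write the report as JSON."""
    missing = [key for key in REQUIRED_SECTIONS if key not in report]
    if missing:
        raise ValueError(f"report is missing sections: {missing}")
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path
```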

Short Prompt Pattern

Use a short realistic task prompt like:

Use $ml-failure-audit to audit this ML CI failure from the provided repo and logs. Decide whether it is a real training regression or a gate/policy issue, and produce the requested output files.

General workflow for auditing ML CI failures, experiment regressions, training run failures, golden metric failures, and telemetry-backed ML work-product claims from local repositories, logs, metrics, configs, and artifacts. Use when Codex needs to decide whether an ML failure is a model/convergence issue, correctness bug, data/config issue, infrastructure/runtime issue, evaluation/gating policy issue, or unsupported claim, and produce structured evidence-backed outputs.

Stars: 9
Category: development
Updated: May 20, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add eigent-ai/agent-skills ml-failure-audit

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: May 20, 2026, 2:02 AM (41.3s, 5 files scanned)

Install manually

# Clone the repo
git clone https://github.com/eigent-ai/agent-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the skills folder
cp -r agent-skills/skills/data-and-analytics/ml-failure-audit ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: eigent-ai/agent-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT