/SKILL.md
Publication-grade medical prediction workflow with 33 fail-closed gates covering data leakage, calibration, fairness, TRIPOD+AI/PROBAST+AI compliance, and full model lifecycle governance.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add furinaaa-cancan/medical-ml-leakage-guard ml-governance-guard
/mlgg → loads .claude/commands/mlgg.md (state machine + review loop, ~200 lines)
- references/protocols/phase-N.md (Research mode only)
- references/protocols/audit-mode.md

| User says | Command |
|---------|------|
| Modeling / training / "I have data" | `/mlgg` |
| Interactive walkthrough | `python3 scripts/orchestration/mlgg.py play` |
| View results | `python3 scripts/tools/quick_summary.py <dir>` |
| Download a dataset | `python3 examples/download_real_data.py <name>` (heart/breast/ckd/hepatitis/spect/dermatology/pima/mammographic/thyroid/eeg_eye/framingham/diabetes130/diabetes130_full/vitaldb/rhc/sepsis_survival) |
| Download CDC data | `python3 examples/download_cdc_data.py <name>` (brfss/nhis/covid/all) |
| Download NHANES | `python3 examples/download_nhanes.py --cycles both --output examples/nhanes_diabetes.csv` |
| Download NCI cancer data | `python3 examples/download_nci_gdc.py --output examples/nci_gdc_cancer_survival.csv` |
| Strict audit | `python3 scripts/orchestration/mlgg.py workflow --strict` |
| Check the environment | `python3 scripts/orchestration/mlgg.py doctor` |
| Initialize a project | `python3 scripts/orchestration/mlgg.py onboarding` |
| Compare two runs | `python3 scripts/tools/compare_runs.py --run-a <dir1> --run-b <dir2>` |
| Generate a remediation plan | `python3 scripts/tools/remediation_plan.py --evidence-dir <dir>` |
| Explain a gate failure | `python3 scripts/tools/explain_gate.py --report <gate_report.json>` |
| Check code for leakage | `python3 scripts/orchestration/mlgg.py lint check <file.py>` |
| SHAP interpretability | `python3 scripts/gates/shap_interpretability_gate.py --model-pool evidence/model_pool.pkl --train-data data/train.csv --test-data data/test.csv --target-col y --report evidence/shap_report.json` |
| Calibration metrics | `calibration_metrics()` in `_gate_utils.py` |
| NRI/IDI model comparison | `compute_nri_idi()` in `_gate_utils.py` |
| VIF collinearity | `compute_vif()` in `_gate_utils.py` |
| Nonlinearity test | `check_nonlinearity()` in `_gate_utils.py` |
| MNAR sensitivity | `mnar_sensitivity_analysis()` in `_gate_utils.py` |
| Temporal drift | `temporal_drift_analysis()` in `_gate_utils.py` |
| Model Card | `generate_model_card()` in `_gate_utils.py` |
| Imputation sensitivity | `imputation_sensitivity()` in `_gate_utils.py` |
| Subgroup DCA | `subgroup_dca()` in `_gate_utils.py` |
| Ablation study | `feature_ablation()` in `_gate_utils.py` |
| LaTeX tables | `python3 scripts/tools/export_latex.py --evaluation-report evidence/evaluation_report.json` |
| Compliance certificate | `python3 scripts/tools/generate_compliance_certificate.py --evidence-dir evidence/` |
| Look up peer-review cases | `python3 scripts/tools/peer_review_lookup.py --stats` |
| How would reviewers see this? | `python3 scripts/tools/peer_review_lookup.py --tags "<tags>"` |
| What has a gate caught before? | `python3 scripts/tools/peer_review_lookup.py --gate <gate_name>` |
| Review a paper's Methods section | `python3 scripts/tools/score_paper_metadata.py --metadata <metadata.json>` |
| Batch review | `python3 scripts/tools/batch_journal_review.py --manifest batch_manifest.json` |

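When reviewing code, the agent must consult references/case-studies/peer-review-kb.json (107 Nature Communications (NC) papers, 375 reviewer comments).
A reviewer's own words are more persuasive than a rule, for example: "NC Reviewer #2 raised exactly the same issue on a similar paper."
KB structure: concern_id, category, severity, mlgg_dimension, mlgg_gates, tags, concern_text, author_response.
Retrieval strategy:

| Scenario | Field to search |
|------|---------|
| Gate failure | mlgg_gates contains that gate |
| Specific issue found | tags match |
| Phase checkpoint | mlgg_dimension |
| Severity filter | severity |

Citation format: [PEER-REVIEW] PR-XXX-CYY (Nature Communications, 20XX) Reviewer: "..." Fix: "..."
Statistical citation: "Across 107 NC papers, 119/375 (31.7%) of reviewer comments asked for more complete evaluation metrics"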
python3 scripts/tools/peer_review_lookup.py --stats
python3 scripts/tools/peer_review_lookup.py --gate leakage_gate
python3 scripts/tools/peer_review_lookup.py --tags "missing_calibration,no_dca"
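
The retrieval strategy can also be sketched in a few lines of Python. This is a minimal sketch, not the behaviour of peer_review_lookup.py: it assumes the KB can be treated as a list of records carrying the documented field names, and every record, ID, gate name other than leakage_gate, and severity value below is made up for illustration.

```python
# Hypothetical records using the documented KB field names.
# IDs, texts, and severity values are invented for this sketch.
SAMPLE_KB = [
    {
        "concern_id": "PR-000-C01",
        "category": "evaluation",
        "severity": "major",
        "mlgg_dimension": "statistical_validity",
        "mlgg_gates": ["calibration_gate"],
        "tags": ["missing_calibration", "no_dca"],
        "concern_text": "(placeholder reviewer text)",
        "author_response": "(placeholder author response)",
    },
    {
        "concern_id": "PR-000-C02",
        "category": "leakage",
        "severity": "critical",
        "mlgg_dimension": "leakage_prevention",
        "mlgg_gates": ["leakage_gate"],
        "tags": ["definition_leakage"],
        "concern_text": "(placeholder reviewer text)",
        "author_response": "(placeholder author response)",
    },
]


def find_concerns(kb, gate=None, tags=None, dimension=None, severity=None):
    """Return the records that satisfy every filter that was supplied."""
    wanted_tags = set(tags or [])
    hits = []
    for concern in kb:
        if gate is not None and gate not in concern.get("mlgg_gates", []):
            continue  # gate failure: mlgg_gates must contain the gate
        if wanted_tags and not wanted_tags & set(concern.get("tags", [])):
            continue  # specific issue: at least one tag must match
        if dimension is not None and concern.get("mlgg_dimension") != dimension:
            continue  # phase checkpoint: filter by dimension
        if severity is not None and concern.get("severity") != severity:
            continue  # severity filter
        hits.append(concern)
    return hits


by_gate = find_concerns(SAMPLE_KB, gate="leakage_gate")
by_tags = find_concerns(SAMPLE_KB, tags=["no_dca"], severity="major")
print([c["concern_id"] for c in by_gate], [c["concern_id"] for c in by_tags])
```

Tag matching here is "any tag matches"; whether the real tool requires all tags is not stated in this document.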
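When reviewing or building a model, the agent must perform the following clinical checks (the automated gates cannot cover them):
For each feature, determine the time point at which it becomes available:

| Dataset | Commonly misused post-index features |
|--------|--------------------------|
| Diabetes 130 (UCI) | time_in_hospital, num_medications, discharge_disposition_id |
| MIMIC-III/IV | Procedures, ventilation hours, vasopressor doses |

If the user has not specified the prediction time point, ask: "Is the model meant to be used at admission, during the stay, or at discharge?"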
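
As an illustration of the timing check, the sketch below flags the Diabetes 130 features from the table when the model is meant to predict before discharge. The function name, the dataset key, and the rule that everything is permitted at discharge are assumptions made for this example, not part of the skill.

```python
# Post-index features listed in the table above for Diabetes 130 (UCI).
POST_INDEX_FEATURES = {
    "diabetes130": {"time_in_hospital", "num_medications", "discharge_disposition_id"},
}

VALID_TIMES = {"admission", "during_stay", "discharge"}


def flag_post_index(features, dataset, prediction_time):
    """Return features that are not yet known at the chosen prediction time.

    Assumption: the listed features only become final at discharge, so they
    are flagged for 'admission' and (conservatively) for 'during_stay'.
    """
    if prediction_time not in VALID_TIMES:
        raise ValueError(f"prediction_time must be one of {sorted(VALID_TIMES)}")
    if prediction_time == "discharge":
        return []
    blocked = POST_INDEX_FEATURES.get(dataset, set())
    return sorted(f for f in features if f in blocked)


candidate = ["age", "time_in_hospital", "num_medications"]
print(flag_post_index(candidate, "diabetes130", "admission"))
```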
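When the user defines the diabetes label with hba1c >= 6.5 or fasting_glucose >= 126, those variables must not appear in the feature list. The agent must check:
- the label definition logic (df["label"] = ...)
- whether the defining variables appear in features = [...] or X = df.drop(...)
- the leakage blocklist in references/methodology/disease-definition-knowledge-base.json

Users may assign the test set to an alias and then use it for tuning:

    holdout_X = X_test  # alias
    for params in grid:
        score = evaluate(holdout_X)  # in effect, tuning on the test set

R021 detects keywords such as holdout/held_out, but arbitrary names (e.g. eval_data = X_test) still require the agent to trace the assignment chain manually.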
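
Tracing an assignment chain can be partly mechanised with Python's standard ast module. The following is a rough sketch, not how R021 works: it only follows plain `name = name` assignments, ignores control flow and reassignment order, and the helper names are invented for this example.

```python
import ast


def aliases_of(source, root="X_test"):
    """Names assigned, directly or transitively, from `root` via `a = b`."""
    tree = ast.parse(source)
    aliases = {root}
    changed = True
    while changed:  # repeat until no new alias is discovered
        changed = False
        for node in ast.walk(tree):
            if isinstance(node, ast.Assign) and isinstance(node.value, ast.Name):
                if node.value.id in aliases:
                    for target in node.targets:
                        if isinstance(target, ast.Name) and target.id not in aliases:
                            aliases.add(target.id)
                            changed = True
    return aliases - {root}


def names_read_in_loops(source, names):
    """(name, line) pairs where any of `names` is read inside a for/while loop."""
    hits = set()
    for loop in ast.walk(ast.parse(source)):
        if isinstance(loop, (ast.For, ast.While)):
            for node in ast.walk(loop):
                if (isinstance(node, ast.Name)
                        and isinstance(node.ctx, ast.Load)
                        and node.id in names):
                    hits.add((node.id, node.lineno))
    return sorted(hits)


SAMPLE = """
eval_data = X_test
final_check = eval_data
for params in grid:
    score = evaluate(final_check)
"""

found = aliases_of(SAMPLE)
print(sorted(found))
print(names_read_in_loops(SAMPLE, found | {"X_test"}))
```

A read of a test-set alias inside a loop is only a hint that tuning may be happening on the test set; the agent still has to confirm what the loop does.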
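Every calibration report must include:

| # | Dimension | Weight | Scoring criteria |
|---|------|------|---------|
| 1 | Data integrity | 12 | Split isolation, no patient-level overlap, temporal ordering |
| 2 | Leakage prevention | 15 | No target/definition/lineage/future leakage |
| 3 | Pipeline isolation | 12 | Preprocessors fit on train only, imputation isolated |
| 4 | Model selection rigor | 10 | ≥3 candidates, one-SE rule, no peeking at the test set |
| 5 | Statistical validity | 12 | Bootstrap CI, permutation test, calibration, DCA |
| 6 | Generalization evidence | 10 | Train-test gap, external cohort, seed stability |
| 7 | Clinical completeness | 7 | Full metric panel, confusion matrix, threshold |
| 8 | Reporting standards | 7 | TRIPOD+AI, PROBAST+AI |
| 9 | Reproducibility | 6 | Seeds recorded, version tracking |
| 10 | Security and provenance | 3 | Model signature, artifact integrity |
| 11 | Fairness | 3 | Equalized odds, disparate impact ratio |
| 12 | Sample size | 3 | EPV≥10, shrinkage factor≥0.90 |

≥90 top-journal grade · 75-89 needs supplementation · 60-74 major deficiencies · <60 not publishable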
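
The weights in the table sum to 100, so a total score can be read as a weighted sum. The sketch below assumes each dimension is scored as a fraction between 0 and 1 and that the bands are cut at 90, 75, and 60; the document does not state how per-dimension scores are assigned, and the dictionary keys are illustrative names.

```python
# Weights copied from the 12-dimension table; keys are illustrative names.
WEIGHTS = {
    "data_integrity": 12,
    "leakage_prevention": 15,
    "pipeline_isolation": 12,
    "model_selection_rigor": 10,
    "statistical_validity": 12,
    "generalization_evidence": 10,
    "clinical_completeness": 7,
    "reporting_standards": 7,
    "reproducibility": 6,
    "security_provenance": 3,
    "fairness": 3,
    "sample_size": 3,
}
assert sum(WEIGHTS.values()) == 100


def total_score(fractions):
    """Weighted sum; `fractions` maps each dimension to a value in [0, 1] (assumed)."""
    for name, value in fractions.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


def grade(score):
    """Map a 0-100 score to the bands listed above (cut-offs assumed inclusive)."""
    if score >= 90:
        return "top-journal grade"
    if score >= 75:
        return "needs supplementation"
    if score >= 60:
        return "major deficiencies"
    return "not publishable"


example = {name: 1.0 for name in WEIGHTS}
example["leakage_prevention"] = 0.5  # half credit on leakage prevention
example["fairness"] = 0.0            # no fairness evidence
score = total_score(example)
print(score, grade(score))
```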
Journal standards mapping: references/standards/journal-rigor-standards.json (Nature Medicine, Lancet DH, JAMA, BMJ, npj DM)
- examples/download_real_data.py → URLS + prepare_<name>() + PREPARE dict
- patient_id, event_time, y, features...
- scripts/orchestration/mlgg_pixel.py → i18n + PLAY_DOWNLOAD_DATASETS

scripts/tools/train_select_evaluate.py, 5 places: SUPPORTED_MODEL_FAMILIES, _family_grid(), _build_estimator_for_family(), _family_base_complexity(), _family_friendly_name()

Unified CLI contract: --report, --strict, exit 0/2, build_report_envelope(), start_gate_timer(), and registration in _gate_registry.py.
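
A gate that follows this contract might look roughly like the sketch below. The two helper functions are hypothetical stand-ins that only share their names with the project's build_report_envelope() and start_gate_timer(); their real signatures and the envelope fields are not shown in this document. The reading of exit code 0 as pass and 2 as fail, and of --strict as promoting warnings to failures, is likewise an assumption.

```python
import argparse
import json
import time


def start_gate_timer():
    """Hypothetical stand-in for the project's start_gate_timer()."""
    return time.perf_counter()


def build_report_envelope(gate, status, failures, warnings, started):
    """Hypothetical stand-in; the real envelope fields are not documented here."""
    return {
        "gate": gate,
        "status": status,
        "failures": failures,
        "warnings": warnings,
        "duration_s": round(time.perf_counter() - started, 3),
    }


def example_check():
    """Placeholder check returning (failures, warnings)."""
    return [], ["example warning"]


def run_gate(argv):
    parser = argparse.ArgumentParser(description="Example gate (sketch)")
    parser.add_argument("--report", required=True, help="path of the JSON report to write")
    parser.add_argument("--strict", action="store_true", help="treat warnings as failures")
    args = parser.parse_args(argv)

    started = start_gate_timer()
    failures, warnings = example_check()
    if args.strict:
        failures = failures + warnings  # assumed meaning of --strict
        warnings = []
    status = "fail" if failures else "pass"

    envelope = build_report_envelope("example_gate", status, failures, warnings, started)
    with open(args.report, "w", encoding="utf-8") as fh:
        json.dump(envelope, fh, indent=2)
    return 2 if failures else 0  # assumed: 0 = pass, 2 = fail

# A script entry point would end with: sys.exit(run_gate(sys.argv[1:]))
```

Returning the exit code from run_gate() instead of calling sys.exit() inside it keeps the gate easy to call from tests.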
plugin/mlgg_lint/rules/r0xx_rule_name.py + plugin/tests/samples/r0xx_bad.py + r0xx_good.py
| Error | Fix |
|------|------|
| candidate_pool_too_small | Add model families or raise --max-trials-per-family |
| Training timeout (>20 min) | Reduce the number of models/trials |
| FileNotFoundError | Check the CSV files under data/ |
| Gate failure | python3 scripts/tools/explain_gate.py --report evidence/<gate>_report.json |
| Profile | Use case | EPV | Min. events |
|---------|---------|-----|---------|
| standard | N≥1000 | 10 | 100 |
| small_cohort | N=200-1000 | 7 | 50 |
| rare_disease | N<200 | 5 | 20 |

Specify the profile in request.json: "thresholds": {"profile": "rare_disease"}
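
To make the thresholds concrete: EPV (events per variable) is the number of outcome events divided by the number of candidate predictor parameters. The sketch below checks a cohort against a profile from the table; the function name and return shape are illustrative, and the skill's own sample-size gate may apply further criteria.

```python
# Thresholds copied from the profile table above.
PROFILES = {
    "standard": {"epv": 10, "min_events": 100},
    "small_cohort": {"epv": 7, "min_events": 50},
    "rare_disease": {"epv": 5, "min_events": 20},
}


def check_sample_size(n_events, n_predictors, profile):
    """Return (passed, epv, reasons) for the given profile."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    limits = PROFILES[profile]
    epv = n_events / n_predictors
    reasons = []
    if epv < limits["epv"]:
        reasons.append(f"EPV {epv:.1f} < {limits['epv']}")
    if n_events < limits["min_events"]:
        reasons.append(f"events {n_events} < {limits['min_events']}")
    return (not reasons, epv, reasons)


# 60 events and 10 candidate predictors give EPV = 6.0
print(check_sample_size(60, 10, "rare_disease"))
print(check_sample_size(60, 10, "small_cohort"))
```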
| Dataset | Rows | Download command |
|--------|------|---------|
| Sepsis Survival | 129K | download_real_data.py sepsis_survival |
| Diabetes 130 Full | 102K | download_real_data.py diabetes130_full |
| BRFSS 2022 | 100K | download_cdc_data.py brfss |
| NHANES | 16K | download_nhanes.py --cycles both |
| RHC | 5.7K | download_real_data.py rhc |
| Heart/Breast/Pima | <1K | download_real_data.py heart |
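Can do: tabular medical binary classification (EHR/clinical/registry), 20 sklearn model families + 4 optional backends, 33-gate full-lifecycle governance
Cannot do: images/text/time series, multiclass/regression, deep learning, model deployment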
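| Problem in user code | Severity | Fix |
|----------------|--------|------|
| train_test_split(X, y) without patient grouping | CRITICAL | Split by patient using groups=df["patient_id"] with a group-aware splitter such as GroupShuffleSplit (train_test_split itself has no groups argument) |
| scaler.fit(X) before the split | CRITICAL | Move it after the split: scaler.fit(X_train) |
| SMOTE applied to the full dataset | CRITICAL | Remove SMOTE and use class_weight="balanced" |
| Only AUROC reported | MAJOR | Add AUPRC, MCC, Brier, calibration, DCA |
| No CI | MAJOR | Add bootstrap 95% CI (≥1000 resamples) |
| Threshold chosen on test | CRITICAL | Choose it on validation (Youden's J) |
| Definition variables used as features | CRITICAL | Remove all definition variables |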
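
The first row of the table is about keeping every patient on one side of the split. The standard-library sketch below shows the idea; in a scikit-learn pipeline a group-aware splitter such as GroupShuffleSplit would normally do this job. The function name and the sample IDs are invented for illustration.

```python
import random


def patient_level_split(patient_ids, test_fraction=0.2, seed=0):
    """Split row indices so that no patient appears in both train and test."""
    unique = sorted(set(patient_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_patients = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# Rows 0-1 belong to p1, rows 3-4 to p3: repeated visits of the same patient.
ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5"]
train_idx, test_idx = patient_level_split(ids, test_fraction=0.2, seed=42)

train_patients = {ids[i] for i in train_idx}
test_patients = {ids[i] for i in test_idx}
print(sorted(train_patients), sorted(test_patients))
```

Any preprocessing (scaling, imputation) would then be fitted on the rows in train_idx only, which is the point of the second row of the table.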
<project>/
├── data/train.csv, valid.csv, test.csv
├── configs/request.json, *.json
├── evidence/*_report.json (×33), manifest.json, prediction_trace.csv.gz
├── models/model.pkl + model.pkl.sig
└── results/summary.md, tables.tex
references/protocols/
├── review-protocol.md # detailed review-loop protocol
├── phase-1.md ~ phase-9.md # detailed rules for each phase
└── audit-mode.md # quick audit mode
Disease definition knowledge base: references/methodology/disease-definition-knowledge-base.json (10 common diseases)
Error knowledge base: references/operations/error-knowledge-base.json
Literature knowledge base: references/methodology/literature-knowledge-base.json (30 entries from top journals)
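
As an example of how a disease-definition blocklist can be applied, the sketch below checks a feature list against the variables that define a label. The schema of disease-definition-knowledge-base.json is not shown in this document, so the inline dictionary is a hypothetical stand-in; only the two diabetes variables come from the text.

```python
# Hypothetical stand-in for the knowledge base; the real JSON schema may differ.
# hba1c and fasting_glucose are the diabetes-defining variables named earlier.
DEFINITION_VARIABLES = {
    "diabetes": {"hba1c", "fasting_glucose"},
}


def definition_leaks(features, disease):
    """Return features that are also used to define the label for `disease`."""
    blocked = DEFINITION_VARIABLES.get(disease, set())
    return sorted(f for f in features if f.lower() in blocked)


features = ["age", "bmi", "HbA1c", "systolic_bp"]
print(definition_leaks(features, "diabetes"))
```

Exact name matching misses derived columns (for example a binned or log-transformed copy of a defining variable), so this kind of check complements the manual review rather than replacing it.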
Agent Quick Reference:
Build a new project: mlgg.py onboarding --mode auto
Audit a project: audit_external_project.py
Remediation plan: remediation_plan.py --evidence-dir <dir>
Evidence comparison: evidence_comparator.py
LaTeX export: export_latex.py