ML Governance Guard

Architecture

/mlgg → loads .claude/commands/mlgg.md (state machine + review loop, ~200 lines)
Each Phase → reads references/protocols/phase-N.md on demand (Research mode only)
Audit mode → references/protocols/audit-mode.md

Quick Dispatch

| User says | Command |
|-----------|---------|
| Modeling / training / "I have data" | /mlgg |
| Interactive experience | python3 scripts/orchestration/mlgg.py play |
| View results | python3 scripts/tools/quick_summary.py <dir> |
| Download a dataset | python3 examples/download_real_data.py <name> (heart/breast/ckd/hepatitis/spect/dermatology/pima/mammographic/thyroid/eeg_eye/framingham/diabetes130/diabetes130_full/vitaldb/rhc/sepsis_survival) |
| Download CDC data | python3 examples/download_cdc_data.py <name> (brfss/nhis/covid/all) |
| Download NHANES | python3 examples/download_nhanes.py --cycles both --output examples/nhanes_diabetes.csv |
| Download NCI cancer data | python3 examples/download_nci_gdc.py --output examples/nci_gdc_cancer_survival.csv |
| Strict audit | python3 scripts/orchestration/mlgg.py workflow --strict |
| Check the environment | python3 scripts/orchestration/mlgg.py doctor |
| Initialize a project | python3 scripts/orchestration/mlgg.py onboarding |
| Compare two runs | python3 scripts/tools/compare_runs.py --run-a <dir1> --run-b <dir2> |
| Generate a remediation plan | python3 scripts/tools/remediation_plan.py --evidence-dir <dir> |
| Explain a gate failure | python3 scripts/tools/explain_gate.py --report <gate_report.json> |
| Check code for leakage | python3 scripts/orchestration/mlgg.py lint check <file.py> |
| SHAP interpretability | python3 scripts/gates/shap_interpretability_gate.py --model-pool evidence/model_pool.pkl --train-data data/train.csv --test-data data/test.csv --target-col y --report evidence/shap_report.json |
| Calibration metrics | calibration_metrics() in _gate_utils.py |
| NRI/IDI model comparison | compute_nri_idi() in _gate_utils.py |
| VIF collinearity | compute_vif() in _gate_utils.py |
| Nonlinearity check | check_nonlinearity() in _gate_utils.py |
| MNAR sensitivity | mnar_sensitivity_analysis() in _gate_utils.py |
| Temporal drift | temporal_drift_analysis() in _gate_utils.py |
| Model Card | generate_model_card() in _gate_utils.py |
| Imputation sensitivity | imputation_sensitivity() in _gate_utils.py |
| Subgroup DCA | subgroup_dca() in _gate_utils.py |
| Ablation study | feature_ablation() in _gate_utils.py |
| LaTeX tables | python3 scripts/tools/export_latex.py --evaluation-report evidence/evaluation_report.json |
| Compliance certificate | python3 scripts/tools/generate_compliance_certificate.py --evidence-dir evidence/ |
| Look up peer-review cases | python3 scripts/tools/peer_review_lookup.py --stats |
| How would reviewers see this? | python3 scripts/tools/peer_review_lookup.py --tags "<tags>" |
| What has this gate caught? | python3 scripts/tools/peer_review_lookup.py --gate <gate_name> |
| Review a paper's Methods | python3 scripts/tools/score_paper_metadata.py --metadata <metadata.json> |
| Batch review | python3 scripts/tools/batch_journal_review.py --manifest batch_manifest.json |

Peer Review Evidence Protocol

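When reviewing code, the agent must consult references/case-studies/peer-review-kb.json (107 NC papers, 375 reviewer comments).

A reviewer's own words are more persuasive than a rule, for example: "NC Reviewer #2 raised exactly the same issue on a similar paper."

KB structure: concern_id, category, severity, mlgg_dimension, mlgg_gates, tags, concern_text, author_response.

Retrieval strategy:

| Scenario | Field to search |
|----------|-----------------|
| Gate failure | mlgg_gates contains that gate |
| A specific issue is found | tags match |
| Phase checkpoint | mlgg_dimension |
| Severity filter | severity |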
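The retrieval strategy can be sketched as a simple filter over KB entries. This is a minimal illustration, not the logic of peer_review_lookup.py: it assumes the KB is a list of objects whose mlgg_gates and tags fields are lists of strings, and every sample value below (IDs, severities, dimension names, calibration_gate, target_leakage) is hypothetical.

```python
def find_concerns(kb, gate=None, tags=None, dimension=None, severity=None):
    """Return the KB entries that satisfy every filter that was supplied."""
    wanted_tags = set(tags or [])
    hits = []
    for entry in kb:
        # Gate failure: the failing gate must be listed in mlgg_gates.
        if gate is not None and gate not in entry.get("mlgg_gates", []):
            continue
        # Specific issue: at least one requested tag must match.
        if wanted_tags and not (wanted_tags & set(entry.get("tags", []))):
            continue
        # Phase checkpoint: exact match on the scoring dimension.
        if dimension is not None and entry.get("mlgg_dimension") != dimension:
            continue
        # Severity filter: exact match on severity.
        if severity is not None and entry.get("severity") != severity:
            continue
        hits.append(entry)
    return hits


# Hypothetical entries that only mimic the documented field names.
SAMPLE_KB = [
    {"concern_id": "PR-001-C01", "severity": "major",
     "mlgg_dimension": "statistical_validity",
     "mlgg_gates": ["calibration_gate"],
     "tags": ["missing_calibration", "no_dca"]},
    {"concern_id": "PR-002-C03", "severity": "critical",
     "mlgg_dimension": "leakage_prevention",
     "mlgg_gates": ["leakage_gate"],
     "tags": ["target_leakage"]},
]

leakage_hits = find_concerns(SAMPLE_KB, gate="leakage_gate")
```

Filters combine with AND, so passing both gate and severity narrows the result rather than widening it.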

Citation format: [PEER-REVIEW] PR-XXX-CYY (Nature Communications, 20XX) Reviewer: "..." Fix: "..."

Statistical citation: "Across 107 NC papers, 119/375 (31.7%) of reviewer comments asked for more complete evaluation metrics."

python3 scripts/tools/peer_review_lookup.py --stats
python3 scripts/tools/peer_review_lookup.py --gate leakage_gate
python3 scripts/tools/peer_review_lookup.py --tags "missing_calibration,no_dca"

Clinical Semantic Review Checklist

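When reviewing or building a model, the agent must run the following clinical checks, which the automated gates cannot cover:

Feature Timeline Audit

Determine the time point at which each feature is produced:

Pre-index (before admission: demographics, prior diagnoses): safe
Index-time (at admission: admission labs, chief complaint): safe (if the prediction is made at admission)
Post-index (known only at or after discharge: length of stay, discharge disposition): LEAKAGE

| Dataset | Commonly misused post-index features |
|---------|--------------------------------------|
| Diabetes 130 (UCI) | time_in_hospital, num_medications, discharge_disposition_id |
| MIMIC-III/IV | Procedures, ventilation hours, vasopressor doses |

If the user has not specified the prediction time point, ask: "Will the model be used at admission, during the stay, or at discharge?"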
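A minimal sketch of the timeline audit, assuming the reviewer has already labeled each feature with one of three timing classes. The class names, the function, and the timing assigned to each example column are illustrative assumptions, not part of the toolkit.

```python
# Ordering of the three timing classes used in the audit.
TIMELINE_ORDER = {"pre_index": 0, "index_time": 1, "post_index": 2}


def audit_feature_timeline(feature_timing, prediction_time="index_time"):
    """Return the features produced later than the prediction time point."""
    cutoff = TIMELINE_ORDER[prediction_time]
    return sorted(
        name for name, timing in feature_timing.items()
        if TIMELINE_ORDER[timing] > cutoff
    )


# Timing labels are a reviewer's judgment; these are hypothetical assignments.
diabetes130_timing = {
    "age": "pre_index",
    "admission_type_id": "index_time",
    "time_in_hospital": "post_index",
    "discharge_disposition_id": "post_index",
}

leaky = audit_feature_timeline(diabetes130_timing, prediction_time="index_time")
```

With a prediction made at admission, the two discharge-related columns are flagged; if the model is meant to run at discharge, the same call with prediction_time="post_index" flags nothing.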

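Definition Variable Leakage (not detectable by Lint)

When the user defines the diabetes label with hba1c >= 6.5 or fasting_glucose >= 126, those variables must not appear in the feature list. The agent must check:

How the label is constructed (find the logic that defines df["label"] = ...)
Whether the columns used in the definition appear in features = [...] or X = df.drop(...)
If the outcome is a disease diagnosis, read references/methodology/disease-definition-knowledge-base.json to get the leakage blacklist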
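The first two checks can be roughed out with a regular expression over the label statement. This is only a sketch under strong assumptions: the label is built in a single statement, columns are referenced as df["..."], and the helper names are hypothetical. Real code often needs the assignment chain traced by hand.

```python
import re

# Matches df["column"] or df['column'] references.
_COLUMN_REF = re.compile(r'df\[\s*["\'](\w+)["\']\s*\]')
# Matches a lone "=" (assignment), not ">=", "<=", "==" or "!=".
_ASSIGN = re.compile(r'(?<![<>=!])=(?!=)')


def definition_columns(label_statement):
    """Columns referenced on the right-hand side of a label assignment."""
    parts = _ASSIGN.split(label_statement, maxsplit=1)
    rhs = parts[1] if len(parts) == 2 else ""
    return set(_COLUMN_REF.findall(rhs))


def leaked_definition_features(label_statement, features):
    """Features that also appear in the label definition."""
    return sorted(definition_columns(label_statement) & set(features))


label_line = 'df["label"] = ((df["hba1c"] >= 6.5) | (df["fasting_glucose"] >= 126)).astype(int)'
leaks = leaked_definition_features(label_line, ["age", "bmi", "hba1c"])
```

Here hba1c is reported because it defines the label and is also listed as a feature.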

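Variable Aliasing (partially detectable by Lint R021)

The user may assign the test set to an alias and then tune on it:

holdout_X = X_test       # alias
for params in grid:
    score = evaluate(holdout_X)  # actually tuning on the test set

R021 detects keywords such as holdout/held_out, but arbitrary names (e.g. eval_data = X_test) still require the agent to trace the assignment chain manually.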
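Tracing the assignment chain can be partly automated with Python's ast module. The sketch below is not how R021 works; it is an illustrative helper that follows plain name-to-name assignments only, so slices, copies, and tuple unpacking still need manual review. The default root names X_test and y_test are assumptions.

```python
import ast


def find_test_aliases(source, roots=("X_test", "y_test")):
    """Return names that are direct or transitive aliases of the test-set variables."""
    aliases = set(roots)
    tree = ast.parse(source)
    changed = True
    while changed:  # repeat until no new alias is discovered
        changed = False
        for node in ast.walk(tree):
            is_alias_assignment = (
                isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Name)
                and node.value.id in aliases
            )
            if not is_alias_assignment:
                continue
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id not in aliases:
                    aliases.add(target.id)
                    changed = True
    return aliases - set(roots)


snippet = "eval_data = X_test\nfinal_check = eval_data\ntrain_copy = X_train\n"
found = find_test_aliases(snippet)
```

Any name in the result that later appears inside a tuning loop deserves a closer look.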

Calibration Standards (Van Calster 2019)

Every calibration report must include:

Calibration slope (target: 1.0)
Calibration intercept (target: 0)
O:E ratio (target: 1.0)
ECE (<0.05 good, <0.10 acceptable)
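Two of the four quantities can be computed in a few lines. The sketch below covers the O:E ratio and ECE with equal-width bins; the slope and intercept require fitting a logistic recalibration model and are not shown. It is not the calibration_metrics() implementation, and the "poor" label for ECE ≥ 0.10 is an assumption since the thresholds above do not name that band.

```python
def observed_expected_ratio(y_true, y_prob):
    """Observed events divided by the sum of predicted probabilities."""
    return sum(y_true) / sum(y_prob)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / total * abs(observed - predicted)
    return ece


def ece_grade(ece):
    """Map ECE to the bands listed above; 'poor' is an assumed label."""
    if ece < 0.05:
        return "good"
    if ece < 0.10:
        return "acceptable"
    return "poor"


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.8, 0.9]
oe = observed_expected_ratio(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
```

On this toy input the O:E ratio is 1.0 while the ECE is 0.15, which shows why the ratio alone is not enough: errors in opposite directions cancel in O:E but not in ECE.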

Interpretability Standards

Multi-model SHAP: ≥ 2 model families
Cross-model Spearman rank ρ ≥ 0.5
Top-5 features are clinically interpretable
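The cross-model agreement check can be illustrated with a plain-Python Spearman correlation over per-feature importance values. The importance numbers below are made up for illustration, and the function is a sketch rather than the code in shap_interpretability_gate.py. It assumes neither vector is constant.

```python
def _average_ranks(values):
    """Ascending ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = _average_ranks(a), _average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical mean |SHAP| values for the same five features in two model families.
family_a = [0.30, 0.25, 0.10, 0.08, 0.05]
family_b = [0.28, 0.31, 0.04, 0.09, 0.06]
rho = spearman_rho(family_a, family_b)
passes = rho >= 0.5
```

For these values ρ is 0.6, so the two rankings would clear the 0.5 threshold.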

Fairness Standards

95% Bootstrap CI for subgroup metrics
n < 200 subgroups flagged as unreliable
Equalized odds gap + disparate impact ratio
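As a rough illustration of the last two items, the sketch below computes an equalized odds gap (the larger of the TPR spread and FPR spread across groups) and a disparate impact ratio (lowest positive-prediction rate over the highest), and flags small subgroups. Definitions vary between papers, so treat these as one common choice rather than the toolkit's exact formulas. Bootstrap CIs are not covered, and each group is assumed to contain both outcome classes and at least one positive prediction overall.

```python
def group_rates(pairs):
    """pairs: list of (y_true, y_pred) 0/1 tuples for one subgroup."""
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return {
        "n": len(pairs),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "positive_rate": (tp + fp) / len(pairs),
    }


def fairness_summary(by_group, min_n=200):
    """Summarize gaps across subgroups and flag groups smaller than min_n."""
    rates = {group: group_rates(pairs) for group, pairs in by_group.items()}
    tprs = [r["tpr"] for r in rates.values()]
    fprs = [r["fpr"] for r in rates.values()]
    positive = [r["positive_rate"] for r in rates.values()]
    return {
        "equalized_odds_gap": max(max(tprs) - min(tprs), max(fprs) - min(fprs)),
        "disparate_impact_ratio": min(positive) / max(positive),
        "unreliable_groups": sorted(g for g, r in rates.items() if r["n"] < min_n),
    }


# Tiny hypothetical subgroups, far below the n < 200 reliability cutoff.
toy = {
    "A": [(1, 1), (1, 0), (0, 0), (0, 1)],
    "B": [(1, 1), (1, 1), (0, 0), (0, 0)],
}
summary = fairness_summary(toy)
```

The toy groups have identical positive-prediction rates (ratio 1.0) yet an equalized odds gap of 0.5, which is why both metrics are reported together.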

Model Comparison

≥ 3 models on the same test set → multiple-comparison correction required (Bonferroni-adjusted DeLong)
No correction → report as "empirical comparison", not "statistically superior"
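The Bonferroni step itself is simple arithmetic once pairwise p-values exist. The sketch below assumes DeLong p-values were computed elsewhere; the p-values and model names are hypothetical, and the helper is not part of the toolkit.

```python
def n_pairwise_comparisons(n_models):
    """Number of unordered model pairs compared on the same test set."""
    return n_models * (n_models - 1) // 2


def bonferroni_adjust(pairwise_p, alpha=0.05):
    """Multiply each p-value by the number of comparisons (capped at 1)."""
    m = len(pairwise_p)
    adjusted = {}
    for pair, p in pairwise_p.items():
        p_adj = min(1.0, p * m)
        adjusted[pair] = {"p_adjusted": p_adj, "significant": p_adj < alpha}
    return adjusted


# Hypothetical DeLong p-values for three models: 3 pairwise comparisons.
delong_p = {
    ("logistic", "random_forest"): 0.010,
    ("logistic", "gradient_boosting"): 0.030,
    ("random_forest", "gradient_boosting"): 0.200,
}
result = bonferroni_adjust(delong_p)
```

After adjustment only the first pair stays below 0.05; the second, nominally significant at 0.030, becomes 0.09 and would have to be reported as an empirical comparison.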

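12-Dimension Scoring (100-point scale)

| # | Dimension | Weight | Scoring criteria |
|---|-----------|--------|------------------|
| 1 | Data integrity | 12 | Split isolation, no patient-level overlap, temporal ordering |
| 2 | Leakage prevention | 15 | No target/definition/lineage/future leakage |
| 3 | Pipeline isolation | 12 | Preprocessors fit on train only, imputation isolated |
| 4 | Model selection rigor | 10 | ≥3 candidates, one-SE rule, no peeking at the test set |
| 5 | Statistical validity | 12 | Bootstrap CI, permutation test, calibration, DCA |
| 6 | Generalization evidence | 10 | Train-test gap, external cohort, seed stability |
| 7 | Clinical completeness | 7 | Full metric panel, confusion matrix, thresholds |
| 8 | Reporting standards | 7 | TRIPOD+AI, PROBAST+AI |
| 9 | Reproducibility | 6 | Seeds recorded, version tracking |
| 10 | Security and provenance | 3 | Model signing, artifact integrity |
| 11 | Fairness | 3 | Equalized odds, disparate impact ratio |
| 12 | Sample size | 3 | EPV≥10, shrinkage factor ≥0.90 |

≥90 top-journal grade · 75-89 needs supplementation · 60-74 major defects · <60 not publishable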
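A small sketch of how the weights and bands combine. The weights are the twelve listed for the scoring dimensions (they sum to 100); the dimension keys and the convention of rating each dimension as a fraction of its weight are assumptions made for illustration.

```python
# Weights of the 12 scoring dimensions; the keys are illustrative names.
WEIGHTS = {
    "data_integrity": 12, "leakage_prevention": 15, "pipeline_isolation": 12,
    "model_selection_rigor": 10, "statistical_validity": 12,
    "generalization_evidence": 10, "clinical_completeness": 7,
    "reporting_standards": 7, "reproducibility": 6,
    "security_provenance": 3, "fairness": 3, "sample_size": 3,
}


def total_score(fractions):
    """fractions: dimension -> share of that dimension's weight earned (0.0-1.0)."""
    return sum(weight * fractions.get(name, 0.0) for name, weight in WEIGHTS.items())


def tier(score):
    """Map a 0-100 score to the four bands."""
    if score >= 90:
        return "top-journal grade"
    if score >= 75:
        return "needs supplementation"
    if score >= 60:
        return "major defects"
    return "not publishable"


perfect = {name: 1.0 for name in WEIGHTS}
with_leakage = dict(perfect, leakage_prevention=0.0)
score = total_score(with_leakage)
```

Because leakage prevention carries the largest weight, losing it alone drops an otherwise perfect project from 100 to 85 and out of the top band.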

Journal standards mapping: references/standards/journal-rigor-standards.json (Nature Medicine, Lancet DH, JAMA, BMJ, npj DM)

Developer Reference

Adding a new dataset

examples/download_real_data.py → URLS + prepare_<name>() + PREPARE dict
Output: patient_id, event_time, y, features...
scripts/orchestration/mlgg_pixel.py → i18n + PLAY_DOWNLOAD_DATASETS

Adding a new model family

scripts/tools/train_select_evaluate.py, 5 places: SUPPORTED_MODEL_FAMILIES, _family_grid(), _build_estimator_for_family(), _family_base_complexity(), _family_friendly_name()

Adding a new Gate

Unified CLI contract: --report, --strict, exit 0/2, build_report_envelope(), start_gate_timer(), and registration in _gate_registry.py.
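The CLI side of that contract might look roughly like the sketch below. Several things here are assumptions: that exit 0 means pass and 2 means fail, that --strict promotes warnings to failures, and the report fields themselves. A real gate would build its report with build_report_envelope() and start_gate_timer(), whose signatures are not shown here, so this sketch uses plain json and time instead.

```python
import argparse
import json
import time


def gate_exit_code(failures, warnings, strict):
    """Assumed semantics: 0 = pass, 2 = fail; strict treats warnings as failures."""
    if failures or (strict and warnings):
        return 2
    return 0


def main(argv=None, failures=(), warnings=()):
    """Parse the two contract flags, write a JSON report, return the exit code."""
    parser = argparse.ArgumentParser(description="Sketch of a gate CLI")
    parser.add_argument("--report", required=True, help="path of the JSON report to write")
    parser.add_argument("--strict", action="store_true", help="treat warnings as failures")
    args = parser.parse_args(argv)

    started = time.monotonic()
    code = gate_exit_code(failures, warnings, args.strict)
    # Hypothetical report shape; the real envelope comes from build_report_envelope().
    report = {
        "gate": "example_gate",
        "status": "pass" if code == 0 else "fail",
        "strict": args.strict,
        "failures": list(failures),
        "warnings": list(warnings),
        "duration_seconds": round(time.monotonic() - started, 3),
    }
    with open(args.report, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return code

# In a script entry point this would end with: raise SystemExit(main())
```

Keeping the pass/fail decision in a pure function makes the exit-code behavior easy to unit test without touching the filesystem.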

Adding a new Lint rule

plugin/mlgg_lint/rules/r0xx_rule_name.py + plugin/tests/samples/r0xx_bad.py + r0xx_good.py

Common error recovery

| Error | Fix |
|-------|-----|
| candidate_pool_too_small | Add model families or raise --max-trials-per-family |
| Training timeout (>20 min) | Reduce the number of models/trials |
| FileNotFoundError | Check the CSV files under data/ |
| Gate failure | python3 scripts/tools/explain_gate.py --report evidence/<gate>_report.json |

Gate strictness profiles

| Profile | Use case | EPV | Minimum events |
|---------|----------|-----|----------------|
| standard | N≥1000 | 10 | 100 |
| small_cohort | N=200-1000 | 7 | 50 |
| rare_disease | N<200 | 5 | 20 |

Specify in request.json: "thresholds": {"profile": "rare_disease"}
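To make the profile thresholds concrete, the sketch below reads the profile from a request.json fragment and checks EPV (events divided by the number of candidate predictors) and the minimum event count. How the real gates consume request.json is not shown here, so the function, its return shape, and the fallback to "standard" when no profile is given are assumptions.

```python
import json

# Thresholds of the three gate strictness profiles.
PROFILES = {
    "standard": {"min_epv": 10, "min_events": 100},
    "small_cohort": {"min_epv": 7, "min_events": 50},
    "rare_disease": {"min_epv": 5, "min_events": 20},
}


def check_sample_size(request, n_events, n_predictors):
    """Compare EPV and event count against the selected profile."""
    profile = request.get("thresholds", {}).get("profile", "standard")  # assumed default
    limits = PROFILES[profile]
    epv = n_events / n_predictors
    return {
        "profile": profile,
        "epv": epv,
        "epv_ok": epv >= limits["min_epv"],
        "events_ok": n_events >= limits["min_events"],
    }


request = json.loads('{"thresholds": {"profile": "rare_disease"}}')
rare = check_sample_size(request, n_events=30, n_predictors=5)
default = check_sample_size({}, n_events=30, n_predictors=5)
```

The same cohort (30 events, 5 predictors, EPV 6) passes under rare_disease but fails both checks under standard.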

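Available datasets (16 total, 526K+ rows)

| Dataset | Rows | Download command |
|---------|------|------------------|
| Sepsis Survival | 129K | download_real_data.py sepsis_survival |
| Diabetes 130 Full | 102K | download_real_data.py diabetes130_full |
| BRFSS 2022 | 100K | download_cdc_data.py brfss |
| NHANES | 16K | download_nhanes.py --cycles both |
| RHC | 5.7K | download_real_data.py rhc |
| Heart/Breast/Pima | <1K | download_real_data.py heart |

Capability boundaries

Can do: tabular medical binary classification (EHR/clinical/registry), 20 sklearn model families + 4 optional backends, 33-gate full-lifecycle governance
Cannot do: images/text/time series, multiclass/regression, deep learning, model deployment

Common fixes in Research mode

| Problem in user code | Severity | Fix |
|----------------------|----------|-----|
| train_test_split(X, y) with no patient grouping | CRITICAL | Use a group-aware splitter such as GroupShuffleSplit with groups=df["patient_id"] (train_test_split has no groups parameter) |
| scaler.fit(X) before the split | CRITICAL | Move it after the split: scaler.fit(X_train) |
| SMOTE applied to the full dataset | CRITICAL | Remove SMOTE; use class_weight="balanced" |
| Only AUROC reported | MAJOR | Add AUPRC, MCC, Brier, calibration, DCA |
| No CI | MAJOR | Add bootstrap 95% CI (≥1000 resamples) |
| Threshold chosen on test | CRITICAL | Choose it on validation instead (Youden's J) |
| Definition variables used as features | CRITICAL | Remove all definition variables |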
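The first two fixes share one idea: split by patient first, then fit anything data-dependent on the training rows only. The dependency-free sketch below illustrates that order. In a scikit-learn pipeline the same effect is usually obtained with GroupShuffleSplit and a scaler fitted on X_train. The helper names are hypothetical.

```python
import random


def patient_level_split(patient_ids, test_fraction=0.2, seed=42):
    """Assign whole patients to train or test so no patient appears in both."""
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_patients = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


def fit_standardizer(train_values):
    """Mean and standard deviation estimated from training rows only."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant columns
    return mean, std


patient_ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5"]
values = [5.1, 5.3, 6.0, 7.2, 7.0, 4.8, 6.5]

train_idx, test_idx = patient_level_split(patient_ids)
mean, std = fit_standardizer([values[i] for i in train_idx])
# Test rows are scaled with the training statistics, never their own.
test_scaled = [(values[i] - mean) / std for i in test_idx]
```

Rows from the same patient always land on the same side of the split, which is the property a row-level random split cannot guarantee.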

Standardized deliverables

<project>/
├── data/train.csv, valid.csv, test.csv
├── configs/request.json, *.json
├── evidence/*_report.json (×33), manifest.json, prediction_trace.csv.gz
├── models/model.pkl + model.pkl.sig
└── results/summary.md, tables.tex

Phase file reference

references/protocols/
├── review-protocol.md    # detailed protocol for the review loop
├── phase-1.md ~ phase-9.md  # detailed rules for each phase
└── audit-mode.md         # quick audit mode

Disease-definition knowledge base: references/methodology/disease-definition-knowledge-base.json (10 common diseases)
Error knowledge base: references/operations/error-knowledge-base.json
Literature knowledge base: references/methodology/literature-knowledge-base.json (30 top-journal entries)

Agent Quick Reference:

Build a new project:  mlgg.py onboarding --mode auto
Audit a project:      audit_external_project.py
Remediation plan:     remediation_plan.py --evidence-dir <dir>
Evidence comparison:  evidence_comparator.py
LaTeX export:         export_latex.py

furinaaa-cancan/ml-governance-guard
