skills/msverl-daily-regression-triage/SKILL.md
Triage a daily msverl regression run by reading the baseline comparison log, stopping on success, extracting the most relevant training failure evidence from the daily training log when needed, collecting recent commits from verl main and MindSpeed master, and ranking the most likely culprit commits with concise fix-direction guidance.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

`npx skillsauth add Ascend/agent-skills msverl-daily-regression-triage`
Use this skill when a fixed daily verl + MindSpeed training job has run and Codex needs to decide whether the result is healthy, whether there is a training failure or an accuracy regression, and which recent commit is the most likely cause.
- `/home/st_daily_verl/msverl.log`
- `/home/st_daily_verl/logs/msverl_YYYYMMDD.log`
- verl repo: `https://github.com/verl-project/verl.git` on `main`
- MindSpeed repo: `https://gitcode.com/Ascend/MindSpeed.git` on `master`
- `/tmp/msverl-skill-cache`
- `00:00:00` to the task execution time

- If the baseline comparison log contains `mean abs diff:` and the parsed value is exactly `0`, stop and report success.
- If it contains `mean abs diff:` and the value is non-zero, classify as `accuracy_regression`.
- If it contains `error, please check log`, classify as `train_error`.
- Otherwise, classify as `unknown` and explain what evidence is missing before doing expensive work.
- … `pass`.
- For `train_error`, run `extract_failure_tail.py` against the daily training log and keep only the final high-signal error block.
- For `accuracy_regression`, use the parsed reward lists and `mean abs diff` as the primary evidence.
- … `ERROR` block.
- … `3` candidates per repository unless the evidence is still weak.

Return a compact report with:

- `status`: `pass`, `train_error`, `accuracy_regression`, or `unknown`
- `time_window`
- `evidence_summary`
- `candidate_repo`
- `candidate_commits`
- `confidence`: `high`, `medium`, or `low`
- `fix_direction`

When evidence is weak, say so clearly instead of forcing a single-commit claim.
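The classification rules and report fields above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the function names `classify_baseline_log` and `empty_report` are hypothetical, and the sketch assumes the number follows `mean abs diff:` on the same line and that the last reported value is the one that matters.

```python
import re
from typing import Optional, Tuple

# Assumption: the value is a plain decimal or scientific-notation number
# printed right after the marker. The real log format may differ.
_MEAN_ABS_DIFF = re.compile(
    r"mean abs diff:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)
_TRAIN_ERROR_MARKER = "error, please check log"


def classify_baseline_log(text: str) -> Tuple[str, Optional[float]]:
    """Map baseline comparison log text to (status, mean_abs_diff)."""
    matches = _MEAN_ABS_DIFF.findall(text)
    if matches:
        # Assumption: if several values are logged, use the last one.
        value = float(matches[-1])
        if value == 0:
            return "pass", value
        return "accuracy_regression", value
    if _TRAIN_ERROR_MARKER in text:
        return "train_error", None
    # Neither marker found: report what is missing instead of guessing.
    return "unknown", None


def empty_report(status: str) -> dict:
    """Build a report skeleton with the fields listed above (hypothetical helper)."""
    return {
        "status": status,
        "time_window": None,
        "evidence_summary": None,
        "candidate_repo": None,
        "candidate_commits": [],
        "confidence": "low",  # start low; raise only when evidence supports it
        "fix_direction": None,
    }
```

A caller would read the baseline comparison log, pass its text to `classify_baseline_log`, and continue to failure extraction or commit collection only when the status is not `pass`.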