skills/drivingsdk-ascend-model-migration/model-training/SKILL.md
Model training on Ascend NPU. Invoke when the user wants to launch a training script and monitor training progress.
Launch and monitor model training in an Ascend NPU environment.
Ask the user for the following:
1. Training mode preference:
- Performance training (FP32/FP16)
- Accuracy training (full epochs)
- Custom training
2. Number of GPUs/NPUs to use
3. Batch size (if custom)
4. Any specific training parameters
Present the user with the training options:
Training Options:
1. Performance Training (FP32, 8 GPUs) - Quick performance test
2. Performance Training (FP16, 8 GPUs) - Quick performance test with mixed precision
3. Accuracy Training (24 epochs) - Full accuracy training
4. Custom training configuration
Ask the user to select a training mode or provide a custom configuration.
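The mapping from a menu choice to a launch command can be sketched as a small shell function. This is only an illustration: `select_train_cmd` is a hypothetical helper name, and the default of 8 devices for the custom mode is an assumption borrowed from the 8p scripts. The function prints the command rather than running it, so the choice can be confirmed first.

```shell
# select_train_cmd is a hypothetical helper, not part of the repository.
# It prints the launch command for a menu choice (1-4) instead of running it.
select_train_cmd() {
    local choice="$1"
    local num_devices="${2:-8}"   # assumption: default to 8 devices, as in the 8p scripts
    case "$choice" in
        1) echo "bash test/train_performance_8p_base_fp32.sh --batch-size=1 --num-npu=8" ;;
        2) echo "bash test/train_performance_8p_base_fp16.sh --batch-size=1 --num-npu=8" ;;
        3) echo "bash test/train_full_8p.sh --batch-size=1" ;;
        4) echo "bash ./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base.py $num_devices" ;;
        *) echo "unknown training option: $choice" >&2; return 1 ;;
    esac
}
```

For example, `select_train_cmd 2` prints the FP16 performance command, and `select_train_cmd 4 2` prints the custom launch for two devices.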
Ensure environment variables are set:
export ASCEND_SLOG_PRINT_TO_STDOUT=0
export ASCEND_GLOBAL_LOG_LEVEL=3
export TASK_QUEUE_ENABLE=2
export COMBINED_ENABLE=1
export HCCL_WHITELIST_DISABLE=1
export HCCL_IF_IP=$(hostname -I | awk '{print $1}')
export HCCL_CONNECT_TIMEOUT=1200
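Before launching, it may help to confirm that these variables are actually exported to child processes. The sketch below assumes a Linux shell with `printenv` available; `check_ascend_env` is a hypothetical helper name, and it only checks that each variable is exported and non-empty, not that its value is appropriate.

```shell
# check_ascend_env is a hypothetical helper, not part of the repository.
# It reports every listed variable that is not exported or is empty.
check_ascend_env() {
    local missing=0 name
    for name in ASCEND_SLOG_PRINT_TO_STDOUT ASCEND_GLOBAL_LOG_LEVEL \
                TASK_QUEUE_ENABLE COMBINED_ENABLE HCCL_WHITELIST_DISABLE \
                HCCL_IF_IP HCCL_CONNECT_TIMEOUT; do
        # printenv prints nothing when the variable is not in the environment
        if [ -z "$(printenv "$name")" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical use would be `check_ascend_env || exit 1` at the top of a launch wrapper.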
Execute the training script for the selected mode.
Check the training log to verify that training started successfully.
# Performance training (FP32)
cd <working_directory>
bash test/train_performance_8p_base_fp32.sh --batch-size=1 --num-npu=8
# Performance training (FP16)
cd <working_directory>
bash test/train_performance_8p_base_fp16.sh --batch-size=1 --num-npu=8
# Accuracy training (24 epochs)
cd <working_directory>
bash test/train_full_8p.sh --batch-size=1
# Custom training
cd <model_directory>
bash ./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base.py <num_gpus>
# Check training process
ps aux | grep torchrun
# Check training log
tail -20 <working_directory>/test/output/train_performance_8p_base_fp32.log
Training is considered successfully started when:
Example successful log:
2026-03-12 09:47:40,838 - mmdet - INFO - Epoch [1][26/41] lr: 7.333e-05, eta: 0:36:35, time: 1.883, data_time: 0.023, memory: 25332, loss_cls: 1.2614, loss_bbox: 1.7827
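One way to automate the start-up check is to look for an iteration line shaped like the example log. This is a sketch under the assumption that the appearance of an `Epoch [e][i/n]` line is a sufficient signal that training has started; `training_started` is a hypothetical helper name.

```shell
# training_started is a hypothetical helper, not part of the repository.
# It succeeds if the log contains at least one iteration line such as
# "mmdet - INFO - Epoch [1][26/41] ...".
training_started() {
    local log_file="$1"
    # -E: extended regex, -q: quiet, only the exit status matters
    grep -Eq 'Epoch \[[0-9]+\]\[[0-9]+/[0-9]+\]' "$log_file"
}
```

It can be polled while the job warms up, for example `until training_started "$log"; do sleep 30; done`, where `$log` is the log file for the selected mode.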
tail -f <working_directory>/test/output/train_performance_8p_base_fp32.log
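When a one-line status is more convenient than following the log, the most recent iteration line can be condensed. The sketch below assumes log lines follow the layout of the example above (an `Epoch [e][i/n]` marker followed by an `eta:` field); `latest_progress` is a hypothetical helper name.

```shell
# latest_progress is a hypothetical helper, not part of the repository.
# It prints epoch, iteration and ETA from the last iteration line in the log,
# and prints nothing if no line has the expected layout.
latest_progress() {
    local log_file="$1"
    grep -E 'Epoch \[[0-9]+\]\[[0-9]+/[0-9]+\]' "$log_file" | tail -n 1 |
        sed -nE 's#.*Epoch \[([0-9]+)\]\[([0-9]+)/([0-9]+)\].*eta: ([^,]+),.*#epoch=\1 iter=\2/\3 eta=\4#p'
}
```

Applied to the example log line, this prints `epoch=1 iter=26/41 eta=0:36:35`.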
Training is complete when:
Check for completion:
grep "Saving checkpoint" <log_file> | tail -1
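The completion check can be turned into a pass/fail test by comparing the last saved checkpoint against the expected epoch count (24 for accuracy training). This sketch assumes the checkpoint message has the form `Saving checkpoint at N epochs`; if the log uses a different wording, the pattern needs adjusting. `training_complete` is a hypothetical helper name.

```shell
# training_complete is a hypothetical helper, not part of the repository.
# Assumption: checkpoint lines look like "Saving checkpoint at 24 epochs".
training_complete() {
    local log_file="$1" expected_epochs="$2" last
    # Epoch number from the most recent checkpoint line, empty if none exists
    last=$(grep -E 'Saving checkpoint at [0-9]+ epochs' "$log_file" | tail -n 1 |
           sed -E 's/.*Saving checkpoint at ([0-9]+) epochs.*/\1/')
    [ -n "$last" ] && [ "$last" -ge "$expected_epochs" ]
}
```

For accuracy training this would be called as `training_complete <log_file> 24`.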
npu-smi info
ps aux | grep python
# Kill existing training processes
pkill -f torchrun
pkill -f dist_train
| Training Mode | Log File |
|---------------|----------|
| FP32 Performance | test/output/train_performance_8p_base_fp32.log |
| FP16 Performance | test/output/train_performance_8p_base_fp16.log |
| Accuracy | test/output/train_full_8p.log |
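The table can also be expressed as a lookup function so that monitoring commands do not hard-code paths. This is a sketch only: `log_file_for` and the short mode names `fp32`, `fp16` and `accuracy` are hypothetical, chosen for illustration.

```shell
# log_file_for is a hypothetical helper, not part of the repository.
# It prints the log path for a training mode, relative to a working directory.
log_file_for() {
    local workdir="$1" mode="$2"
    case "$mode" in
        fp32)     echo "$workdir/test/output/train_performance_8p_base_fp32.log" ;;
        fp16)     echo "$workdir/test/output/train_performance_8p_base_fp16.log" ;;
        accuracy) echo "$workdir/test/output/train_full_8p.log" ;;
        *) echo "no known log file for mode: $mode" >&2; return 1 ;;
    esac
}
```

For example, `tail -f "$(log_file_for "$PWD" fp16)"` follows the FP16 performance log from the working directory.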
- <model_directory>/test/
- <model_directory>/projects/configs/
- DrivingSDK/model_examples/<model>/README.md