Ascend/model-training

Name: model-training
Author: Ascend

skills/drivingsdk-ascend-model-migration/model-training/SKILL.md

npx skillsauth add Ascend/agent-skills model-training

Model Training Skill

Launch and monitor model training in an Ascend NPU environment.

When to Invoke

User wants to start model training
User wants to run performance training
User wants to run accuracy training
User asks about training options

Information to Collect

Ask user for the following:

1. Training mode preference:
   - Performance training (FP32/FP16)
   - Accuracy training (full epochs)
   - Custom training
   
2. Number of GPUs/NPUs to use

3. Batch size (if custom)

4. Any specific training parameters

Training Options

Provide user with training options:

Training Options:
1. Performance Training (FP32, 8 NPUs) - Quick performance test
2. Performance Training (FP16, 8 NPUs) - Quick performance test with mixed precision
3. Accuracy Training (24 epochs) - Full accuracy training
4. Custom training configuration

Workflow

Step 1: Select Training Mode

Ask user to select training mode or provide custom configuration.

Step 2: Prepare Environment

Ensure environment variables are set:

export ASCEND_SLOG_PRINT_TO_STDOUT=0
export ASCEND_GLOBAL_LOG_LEVEL=3
export TASK_QUEUE_ENABLE=2
export COMBINED_ENABLE=1
export HCCL_WHITELIST_DISABLE=1
export HCCL_IF_IP=$(hostname -I | awk '{print $1}')
export HCCL_CONNECT_TIMEOUT=1200
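
Before launching, it can help to confirm that each variable above is actually present in the current shell. The following is a minimal sketch, assuming a POSIX-compatible shell; the function name `missing_ascend_vars` is hypothetical and not part of the skill.

```shell
# Hypothetical helper: print the name of every required variable that is unset or empty.
missing_ascend_vars() {
  for name in ASCEND_SLOG_PRINT_TO_STDOUT ASCEND_GLOBAL_LOG_LEVEL TASK_QUEUE_ENABLE \
              COMBINED_ENABLE HCCL_WHITELIST_DISABLE HCCL_IF_IP HCCL_CONNECT_TIMEOUT; do
    # The names are fixed constants, so eval is only used to look up each value.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

If the function prints nothing, all seven variables are set; otherwise each printed name still needs an export.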

Step 3: Launch Training

Execute training script based on selected mode.
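
One way to map the selected mode to a launch command is a simple case statement. This is only a sketch: the mode labels (`fp32`, `fp16`, `accuracy`) and the function name `training_command` are hypothetical, while the script names and flags are the ones listed under BEVFormer Training Commands.

```shell
# Hypothetical helper: print the documented launch command for a training mode.
# The mode labels are illustrative and not defined by the skill itself.
training_command() {
  case "$1" in
    fp32)     echo "bash test/train_performance_8p_base_fp32.sh --batch-size=1 --num-npu=8" ;;
    fp16)     echo "bash test/train_performance_8p_base_fp16.sh --batch-size=1 --num-npu=8" ;;
    accuracy) echo "bash test/train_full_8p.sh --batch-size=1" ;;
    *)        echo "unknown training mode: $1" >&2; return 1 ;;
  esac
}
```

Printing the command instead of running it lets the user review it first; the working directory still has to be set as shown in the command listings.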

Step 4: Monitor Progress

Check training log and verify training started successfully.

BEVFormer Training Commands

Performance Training (FP32, 8 NPUs)

cd <working_directory>
bash test/train_performance_8p_base_fp32.sh --batch-size=1 --num-npu=8

Performance Training (FP16, 8 NPUs)

cd <working_directory>
bash test/train_performance_8p_base_fp16.sh --batch-size=1 --num-npu=8

Accuracy Training (24 epochs)

cd <working_directory>
bash test/train_full_8p.sh --batch-size=1

Custom Training

cd <model_directory>
bash ./tools/dist_train.sh ./projects/configs/bevformer/bevformer_base.py <num_gpus>

Training Verification

Check Training Started

# Check training process
pgrep -af torchrun

# Check training log
tail -20 <working_directory>/test/output/train_performance_8p_base_fp32.log

Success Indicators

Training is considered successfully started when:

Log file shows training iterations
Loss values are being printed
No error messages in recent log
Process is running (torchrun)

Example successful log:

2026-03-12 09:47:40,838 - mmdet - INFO - Epoch [1][26/41]       lr: 7.333e-05, eta: 0:36:35, time: 1.883, data_time: 0.023, memory: 25332, loss_cls: 1.2614, loss_bbox: 1.7827
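
The success indicators above can be approximated with a small log check. This is a rough sketch: the function name `training_started` is hypothetical, and the error markers it searches for (`Traceback`, `Error`) are assumptions that may need adjusting for the actual logs.

```shell
# Hypothetical helper: succeed when the last 20 log lines show training iterations
# with loss values and contain no obvious error markers.
training_started() {
  log_file=$1
  recent=$(tail -n 20 "$log_file") || return 1
  # An iteration line looks like "Epoch [1][26/41] ... loss_cls: 1.2614".
  printf '%s\n' "$recent" | grep -Eq 'Epoch \[[0-9]+\]\[[0-9]+/[0-9]+\].*loss' || return 1
  # "Traceback" and "Error" are assumed error markers, not an exhaustive list.
  if printf '%s\n' "$recent" | grep -Eq 'Traceback|Error'; then
    return 1
  fi
  return 0
}
```

The check does not confirm that the process is still alive, so it complements the process check rather than replacing it.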

Monitoring Training

Check Training Log

tail -f <working_directory>/test/output/train_performance_8p_base_fp32.log

Key Metrics

Loss values: loss_cls, loss_bbox, total loss
Learning rate: lr
Time per iteration: time
Memory usage: memory
ETA: estimated time remaining
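
To pull a single metric out of a log line in the format shown in the example log, a small filter is usually enough. The sketch below assumes `key: value` fields separated by commas; the function name `extract_metric` is hypothetical.

```shell
# Hypothetical helper: read log lines on stdin and print the value of one field
# (for example lr, eta, time, memory, loss_cls) from the first match.
extract_metric() {
  key=$1
  # Require whitespace (or line start) before the key so "time" does not match "data_time".
  grep -oE "(^|[[:space:]])${key}: [^,[:space:]]+" | head -n 1 | sed 's/.*: //'
}
```

For example, `tail -n 1 <log_file> | extract_metric loss_cls` prints the most recent classification loss if the last line is an iteration line, and prints nothing otherwise.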

Validation Metrics (per epoch)

NDS: nuScenes Detection Score
mAP: mean Average Precision
mATE: mean Average Translation Error
mASE: mean Average Scale Error
mAOE: mean Average Orientation Error

Training Completion

Training is complete when:

All epochs finished
Final validation metrics printed
Checkpoint saved

Check for completion:

grep "Saving checkpoint" <log_file> | tail -1
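
To compare the last saved checkpoint against the expected epoch count (24 for accuracy training), the epoch number can be extracted from the checkpoint message. This sketch assumes the message has the form `Saving checkpoint at <N> epochs`, which should be verified against the real log; the function name `last_checkpoint_epoch` is hypothetical.

```shell
# Hypothetical helper: print the epoch number from the most recent checkpoint message.
# Assumes messages of the form "Saving checkpoint at <N> epochs".
last_checkpoint_epoch() {
  grep -oE 'Saving checkpoint at [0-9]+ epochs' "$1" | tail -n 1 | grep -oE '[0-9]+'
}
```

If the printed number equals the configured number of epochs, the final checkpoint has been written; an empty result means no checkpoint message was found in that format.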

Troubleshooting

Training Fails to Start

Check NPU devices: npu-smi info
Check that the required environment variables are set
Check that the conda environment is activated
Check that the dataset and weights are linked

Out of Memory

Reduce batch size
Reduce number of GPUs
Enable gradient checkpointing

Training Hangs

Check NPU status: npu-smi info
Check for dead processes: ps aux | grep python
Check log for last activity
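
A quick way to check the log for last activity is to test how long ago the file was modified. The sketch below is illustrative only: the function name `log_is_stale` is hypothetical, and a suitable threshold depends on how often the training script writes to its log.

```shell
# Hypothetical helper: succeed when the log was last modified more than N minutes ago,
# which suggests the run may be hung rather than merely slow.
log_is_stale() {
  log_file=$1
  minutes=$2
  # find prints the path only when the modification time is older than the threshold.
  [ -n "$(find "$log_file" -mmin +"$minutes" 2>/dev/null)" ]
}
```

For example, `log_is_stale <log_file> 10` succeeds if nothing has been written for more than ten minutes.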

Port Already in Use

# Kill existing training processes
pkill -f torchrun
pkill -f dist_train

Log File Locations

| Training Mode | Log File |
|---------------|----------|
| FP32 Performance | test/output/train_performance_8p_base_fp32.log |
| FP16 Performance | test/output/train_performance_8p_base_fp16.log |
| Accuracy | test/output/train_full_8p.log |

Reference

Training scripts: <model_directory>/test/
Model config: <model_directory>/projects/configs/
DrivingSDK README: DrivingSDK/model_examples/<model>/README.md

Model training on Ascend NPU. Invoke when user wants to launch training script and monitor training progress.

5 stars

data-ai

Updated Apr 22, 2026

npx skillsauth add Ascend/agent-skills model-training

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 22, 2026, 10:13 PM · 171.1s · 1 file scanned

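Related Skills

Ascend/k8s-check-fix

testing

Kubernetes cluster health checks and safe remediation: diagnoses problems and applies fixes after the user confirms.

13 · SKILL.md · Updated May 8, 2026

Ascend/cann-nnal-installer

tools

Installation and deployment skill for the Ascend NPU CANN Toolkit, Kernels, and NNAL. Supports two methods, installing from run packages downloaded from the official website and extracting from a Docker image, and covers the full flow of driver checks, package download, installation, environment variable configuration, and verification. Invoke when the user needs to install the full set of CANN components or a specific CANN version to a custom path.

13 · SKILL.md · Updated May 8, 2026

Ascend/atb-testframework-build

development

Builds the ATB (Ascend Transformer Boost) test framework. Invoke when the user needs to build the ATB test framework, run CSV tests, or build atb_test_framework. Supports two modes: a full build (including cloning third-party dependencies and source replacement) and an incremental build. Must be run inside a Docker container with a CANN environment.

13 · SKILL.md · Updated May 7, 2026

Ascend/atb-ops-to-aclnn-migration-workflow

databases

Master template for the standardized ATB OPS→ACLNN migration workflow. Integrates the full process of preliminary study, design document generation, CSV test case design, the migration itself, build verification, and test verification, with explicit stage gates and user confirmation.

13 · SKILL.md · Updated May 7, 2026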

Install manually

# Clone the repo
git clone https://github.com/Ascend/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/skills/drivingsdk-ascend-model-migration/model-training ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

Ascend/agent-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT