skills/nlp/onnx-mixed-precision-export/SKILL.md
Exports a HuggingFace transformer to ONNX with dynamic axes, then auto-converts to BF16 mixed precision for 30-200% GPU inference speedup with 2x memory reduction.
npx skillsauth add wenmin-wu/ds-skills nlp-onnx-mixed-precision-export
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
PyTorch inference is flexible but slow for production. ONNX Runtime with the CUDA execution provider gives a 30-50% speedup from graph optimizations alone. BF16 mixed-precision conversion adds to that: the ONNX Runtime auto-mixed-precision tool profiles each operator, converts the safe ones to BF16, and keeps precision-sensitive ops in FP32. Total speedup is 30-200% over PyTorch with <0.1% accuracy loss, which makes this the fastest path from a trained HuggingFace model to production inference.
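The speedup figures above are relative latencies, so they are worth re-measuring on your own model and hardware. The following is a minimal sketch of one way to do that. `median_latency` and `speedup_percent` are hypothetical helpers, not part of ONNX Runtime, and reading "X% speedup" as throughput gain over the baseline is an assumption.

```python
import statistics
import time


def median_latency(fn, warmup=3, runs=20):
    """Return the median wall-clock seconds per call of fn()."""
    # Warm-up calls absorb one-time costs (CUDA context, kernel selection).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_percent(baseline_s, candidate_s):
    """Throughput gain of the candidate over the baseline, in percent.

    Assumption: a candidate three times faster than the baseline
    is reported as a 200% speedup.
    """
    return (baseline_s / candidate_s - 1.0) * 100.0


# Hypothetical usage, with `session`, `feed`, `model` and `inputs` defined elsewhere:
#   ort_s = median_latency(lambda: session.run(None, feed))
#   pt_s = median_latency(lambda: model(**inputs))
#   print(f"{speedup_percent(pt_s, ort_s):.0f}% faster than PyTorch")
```

When timing a PyTorch model on GPU, call `torch.cuda.synchronize()` inside the timed function; CUDA kernels launch asynchronously, so without it the measured latency can be too low.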
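```python
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from onnxruntime.transformers import auto_mixed_precision as amp

model = AutoModelForTokenClassification.from_pretrained("model_path")
tokenizer = AutoTokenizer.from_pretrained("model_path")
model.eval()  # disable dropout before tracing

# Step 1: Export to ONNX with dynamic axes
dummy = tokenizer("sample text", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy['input_ids'], dummy['attention_mask']),
    "model.onnx",
    opset_version=14,
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    # Mark the batch and sequence dimensions as variable-length
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'seq'},
        'attention_mask': {0: 'batch', 1: 'seq'},
        'logits': {0: 'batch', 1: 'seq'},
    }
)

# Step 2: Convert to BF16 mixed precision
# ONNX Runtime feeds are NumPy arrays keyed by input name, not torch tensors.
feed = {name: dummy[name].numpy() for name in ('input_ids', 'attention_mask')}
# Import path and keyword names are kept as in the original snippet;
# check them against the installed onnxruntime version.
amp.auto_convert_mixed_precision_model_path(
    "model.onnx", input_data=feed,
    output_model_path="model_bf16.onnx",
    provider=['CUDAExecutionProvider'],
    keep_io_types=True
)

# Step 3: Run inference
session = ort.InferenceSession("model_bf16.onnx",
                               providers=['CUDAExecutionProvider'])
# Tokenize straight to NumPy; the exported graph expects int64 inputs.
enc = tokenizer("text to classify", return_tensors="np")
outputs = session.run(None, {
    'input_ids': enc['input_ids'].astype(np.int64),
    'attention_mask': enc['attention_mask'].astype(np.int64)
})
logits = outputs[0]  # shape: (batch, seq, num_labels)
```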
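Before relying on the converted model, it helps to confirm that its predictions still match the original. The sketch below compares two logit arrays of shape (batch, seq, num_labels), for example the PyTorch logits and the ONNX Runtime logits for the same input. `compare_logits` is a hypothetical helper, and measuring "accuracy loss" as token-level label agreement is an assumption about what that figure means.

```python
import numpy as np


def compare_logits(reference, candidate, mask=None):
    """Compare two (batch, seq, num_labels) logit arrays.

    mask is an optional (batch, seq) array; positions where it is 0
    (for example padding) are excluded from both statistics.
    """
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    if reference.shape != candidate.shape:
        raise ValueError("logit arrays must have the same shape")

    ref_labels = reference.argmax(axis=-1)
    cand_labels = candidate.argmax(axis=-1)
    if mask is None:
        mask = np.ones(ref_labels.shape, dtype=bool)
    else:
        mask = np.asarray(mask).astype(bool)

    # Fraction of kept tokens whose predicted label is unchanged.
    agreement = float((ref_labels == cand_labels)[mask].mean())
    # Largest absolute logit difference over the kept tokens.
    max_abs_diff = float(np.abs(reference - candidate)[mask].max())
    return {"label_agreement": agreement, "max_abs_diff": max_abs_diff}
```

Passing the attention mask as `mask` keeps padded positions from inflating or hiding disagreement. A label agreement noticeably below 1.0 suggests that some precision-sensitive operator was converted and should be kept in FP32.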