skills/onnx-webgpu-converter/SKILL.md
Convert HuggingFace transformer models to ONNX format for browser inference with Transformers.js and WebGPU. Use when given a HuggingFace model link to convert to ONNX, when setting up optimum-cli for ONNX export, when quantizing models (fp16, q8, q4) for web deployment, when configuring Transformers.js with WebGPU acceleration, or when troubleshooting ONNX conversion errors. Triggers on mentions of ONNX conversion, Transformers.js, WebGPU inference, optimum export, model quantization for browser, or running ML models in the browser.
Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.
Before converting, check if the model already has an ONNX version:
- onnx-community/<model-name> on HuggingFace Hub
- an onnx/ folder in the model repository

If found, skip to Step 6.
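
The check above can be sketched in Python. This is a minimal illustration, not part of the original skill: `has_onnx_weights` and `onnx_community_id` are hypothetical helpers. The file list is assumed to come from something like `huggingface_hub.list_repo_files`, which would have to be installed separately and is not called here.

```python
def has_onnx_weights(repo_files):
    """Return True if any repo-relative path is an .onnx file under onnx/.

    `repo_files` is a plain list of paths such as ["config.json", "onnx/model.onnx"].
    Hypothetical helper for illustration only.
    """
    return any(
        path.startswith("onnx/") and path.endswith(".onnx")
        for path in repo_files
    )


def onnx_community_id(model_id):
    """Build the candidate onnx-community repo id for an id like 'org/name'.

    The repo may not exist; this only constructs the name to search for.
    """
    name = model_id.split("/")[-1]
    return f"onnx-community/{name}"
```

If `has_onnx_weights` returns True for either the original repo or the candidate `onnx-community` repo, the export steps can likely be skipped.
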
# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate
# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime
# For GPU-accelerated export (optional)
pip install onnxruntime-gpu
Verify installation:
optimum-cli export onnx --help
optimum-cli export onnx --model <model_id_or_path> ./output_dir/
optimum-cli export onnx \
--model <model_id> \
--task <task> \
./output_dir/
Common tasks: text-generation, text-classification, feature-extraction, image-classification, automatic-speech-recognition, object-detection, image-segmentation, question-answering, token-classification, zero-shot-classification
For decoder models, append -with-past for KV cache reuse (default behavior):
text-generation-with-past, text2text-generation-with-past, automatic-speech-recognition-with-past
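
When the export is scripted, the command line can be assembled programmatically. The sketch below is an assumption-laden illustration: `build_export_command` and `DECODER_TASKS` are hypothetical names, and only the `--model` and `--task` flags shown above are used. It appends `-with-past` for the three decoder tasks listed above.

```python
import shlex

# Tasks from the list above that have a "-with-past" variant.
DECODER_TASKS = {
    "text-generation",
    "text2text-generation",
    "automatic-speech-recognition",
}


def build_export_command(model_id, output_dir, task=None, with_past=True):
    """Return the argv list for `optimum-cli export onnx` (hypothetical helper)."""
    cmd = ["optimum-cli", "export", "onnx", "--model", model_id]
    if task is not None:
        # Decoder tasks get the KV-cache variant unless the caller opts out.
        if with_past and task in DECODER_TASKS:
            task = f"{task}-with-past"
        cmd += ["--task", task]
    cmd.append(output_dir)
    return cmd


if __name__ == "__main__":
    argv = build_export_command(
        "Qwen/Qwen2.5-0.5B-Instruct", "./output_dir/", task="text-generation"
    )
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(argv))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting issues.
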
| Flag | Description |
|------|-------------|
| -m MODEL, --model MODEL | HuggingFace model ID or local path (required) |
| --task TASK | Export task (auto-detected if on Hub) |
| --opset OPSET | ONNX opset version (default: auto) |
| --device DEVICE | Export device, cpu (default) or cuda |
| --optimize {O1,O2,O3,O4} | ONNX Runtime optimization level |
| --monolith | Force single ONNX file (vs split encoder/decoder) |
| --no-post-process | Skip post-processing (e.g., decoder merging) |
| --trust-remote-code | Allow custom model code from Hub |
| --pad_token_id ID | Override pad token (needed for some models) |
| --cache_dir DIR | Cache directory for downloaded models |
| --batch_size N | Batch size for dummy inputs |
| --sequence_length N | Sequence length for dummy inputs |
| --framework {pt} | Source framework |
| --atol ATOL | Absolute tolerance for validation |
| Level | Description |
|-------|-------------|
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed precision fp16 (GPU only, requires --device cuda) |
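
The O4 constraint in the table can be checked before launching a long export. This is a minimal sketch under the assumption that only the `--optimize` and `--device` flags described above are involved; `optimize_flags` is a hypothetical helper, not part of optimum.

```python
VALID_LEVELS = ("O1", "O2", "O3", "O4")


def optimize_flags(level, device="cpu"):
    """Return CLI flags for an optimization level, rejecting O4 without cuda."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown optimization level: {level!r}")
    if level == "O4" and device != "cuda":
        # O4 adds fp16 mixed precision, which the table marks as GPU only.
        raise ValueError("O4 requires --device cuda")
    return ["--optimize", level, "--device", device]
```

Failing early here is cheaper than discovering the mismatch after the model has been downloaded and traced.
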
| dtype | Precision | Best For | Size Reduction |
|-------|-----------|----------|----------------|
| fp32 | Full 32-bit | Maximum accuracy | None (baseline) |
| fp16 | Half 16-bit | WebGPU default quality | ~50% |
| q8 / int8 | 8-bit | WASM default, good balance | ~75% |
| q4 / bnb4 | 4-bit | Maximum compression | ~87% |
| q4f16 | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
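
The size reductions in the table follow from the bytes stored per weight. The sketch below works through that arithmetic; it is a rough estimate that assumes weights dominate file size and ignores quantization scales, zero points, and any tensors left in higher precision, so real files will be somewhat larger. `BYTES_PER_WEIGHT`, `estimate_size_mb`, and `reduction_vs_fp32` are illustrative names.

```python
# Approximate storage cost per weight for each dtype in the table.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4": 0.5}


def estimate_size_mb(num_params, dtype):
    """Estimate weight storage in megabytes (10**6 bytes) for a parameter count."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e6


def reduction_vs_fp32(dtype):
    """Fractional size reduction relative to the fp32 baseline."""
    return 1.0 - BYTES_PER_WEIGHT[dtype] / BYTES_PER_WEIGHT["fp32"]


if __name__ == "__main__":
    # Hypothetical 0.5B-parameter model, used only to illustrate the arithmetic.
    for dtype in BYTES_PER_WEIGHT:
        size = estimate_size_mb(500_000_000, dtype)
        print(f"{dtype}: {size:.0f} MB, {reduction_vs_fp32(dtype):.1%} smaller")
```

For 4-bit weights the reduction works out to 87.5%, which the table rounds to ~87%.
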
# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
--onnx_model ./output_dir/ \
--avx512 \
-o ./quantized_dir/
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig
model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)
To provide fp32, fp16, q8, and q4 variants (like onnx-community models), organize output as:
model_onnx/
├── onnx/
│ ├── model.onnx # fp32
│ ├── model_fp16.onnx # fp16
│ ├── model_quantized.onnx # q8
│ └── model_q4.onnx # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
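
Before uploading, the layout above can be verified with a short script. This is a sketch that assumes exactly the seven files shown in the tree; `missing_files` is a hypothetical helper, and models with extra components (for example separate encoder and decoder files) would need a different expected list.

```python
from pathlib import Path

# dtype -> file name inside onnx/, taken from the tree above.
VARIANT_FILES = {
    "fp32": "model.onnx",
    "fp16": "model_fp16.onnx",
    "q8": "model_quantized.onnx",
    "q4": "model_q4.onnx",
}
ROOT_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json")


def missing_files(root):
    """Return the expected files (as POSIX-style relative paths) absent under root."""
    root = Path(root)
    expected = [Path("onnx") / name for name in VARIANT_FILES.values()]
    expected += [Path(name) for name in ROOT_FILES]
    return [p.as_posix() for p in expected if not (root / p).is_file()]


if __name__ == "__main__":
    for path in missing_files("./model_onnx"):
        print(f"missing: {path}")
```

An empty result means every variant and tokenizer file from the tree is present.
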
# Login
huggingface-cli login
# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/
# Add transformers.js tag to model card for discoverability
npm install @huggingface/transformers
import { pipeline } from "@huggingface/transformers";
const pipe = await pipeline("task-name", "model-id-or-path", {
device: "webgpu", // GPU acceleration
dtype: "q4", // Quantization level
});
const result = await pipe("input text");
Some models (Whisper, Florence-2) need different quantization per component:
const model = await Florence2ForConditionalGeneration.from_pretrained(
"onnx-community/Florence-2-base-ft",
{
dtype: {
embed_tokens: "fp16",
vision_encoder: "fp16",
encoder_model: "q4",
decoder_model_merged: "q4",
},
device: "webgpu",
},
);
For detailed Transformers.js WebGPU usage patterns: See references/webgpu-usage.md
For conversion errors and common issues: See references/conversion-guide.md
- Specify the --task flag explicitly. For decoder models try text-generation-with-past
- Add the --trust-remote-code flag for custom model architectures
- Use --device cpu and a smaller --batch_size
- Try --no-post-process or increase --atol

| Task | Transformers.js Pipeline | Example Model |
|------|-------------------------|---------------|
| text-classification | sentiment-analysis | distilbert-base-uncased-finetuned-sst-2 |
| text-generation | text-generation | Qwen2.5-0.5B-Instruct |
| feature-extraction | feature-extraction | mxbai-embed-xsmall-v1 |
| automatic-speech-recognition | automatic-speech-recognition | whisper-tiny.en |
| image-classification | image-classification | mobilenetv4_conv_small |
| object-detection | object-detection | detr-resnet-50 |
| image-segmentation | image-segmentation | segformer-b0 |
| zero-shot-image-classification | zero-shot-image-classification | clip-vit-base-patch32 |
| depth-estimation | depth-estimation | depth-anything-small |
| translation | translation | nllb-200-distilled-600M |
| summarization | summarization | bart-large-cnn |