skills/model-optimization/SKILL.md
Optimize ML models for edge deployment through quantization, pruning, format conversion (TensorRT/TFLite/ONNX), and accuracy/latency benchmarking. Use when preparing models for resource-constrained devices.
npx skillsauth add michaelalber/ai-toolkit model-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
"Quantization is not about making models worse. It is about finding the representation that preserves what matters while discarding what does not." -- Benoit Jacob, Google Quantization Team
This skill covers the complete model optimization pipeline: profiling baseline performance, applying quantization and pruning, converting between inference formats, and benchmarking the results. Every optimization decision is driven by measurement, not intuition.
Non-Negotiable Constraints:
| Principle | Description | Priority |
|-----------|-------------|----------|
| Measure Before Optimizing | Profile the unmodified model for latency, accuracy, size, and memory before applying any optimization | Critical |
| Accuracy Floor Enforcement | Define an acceptable accuracy degradation threshold and reject any optimization that violates it | Critical |
| Format-Device Alignment | Match the inference format to the target hardware: TensorRT for NVIDIA GPU, TFLite for ARM CPU, ONNX for portable | Critical |
| Calibration Data Quality | INT8 quantization is only as good as the calibration dataset; use representative, domain-specific data | High |
| Sequential Optimization | Apply one optimization at a time, benchmark, then decide whether to keep or revert | High |
| Reproducible Benchmarks | Lock clock speeds, set power modes, run warmup iterations, and report percentile latencies | High |
| Original Preservation | Never modify, move, or delete the original model file; all outputs use new descriptive filenames | High |
| Per-Layer Sensitivity | Not all layers respond equally to quantization; identify and protect sensitive layers | Medium |
| Mixed Precision | When uniform INT8 fails accuracy, use FP16 for sensitive layers and INT8 for the rest | Medium |
| Deployment Metadata | Package optimized models with benchmark results, preprocessing config, and provenance information | Medium |
Use search_knowledge (grounded-code-mcp) to ground decisions in authoritative references.
| Query | When to Call |
|-------|--------------|
| search_knowledge("quantization INT8 PTQ calibration dataset") | During OPTIMIZE — choosing post-training quantization strategy |
| search_knowledge("TensorRT FP16 INT8 engine conversion") | During OPTIMIZE — converting to TensorRT for Jetson targets |
| search_knowledge("TFLite quantization ARM Raspberry Pi") | During OPTIMIZE — converting to TFLite for ARM targets |
| search_knowledge("ONNX model export PyTorch TensorFlow") | During PROFILE/OPTIMIZE — exporting to ONNX as interchange format |
| search_knowledge("model pruning structured unstructured channels") | During OPTIMIZE — applying pruning to reduce model size |
| search_knowledge("inference benchmark latency percentile P95") | During BENCHMARK — measuring latency with statistical rigor |
| search_code_examples("TensorRT calibration INT8 Python") | Before INT8 calibration — find calibration dataset patterns |
| search_code_examples("ONNX export PyTorch torch.onnx") | Before ONNX export — find correct opset and export options |
Protocol: Search the edge_ai collection first for model optimization patterns, then the python collection for framework-specific code patterns. Always cite the source path from KB results. Report accuracy degradation before proceeding.
+--------+     +----------+     +-----------+     +----------+     +---------+
| PROFILE|---->| OPTIMIZE |---->| BENCHMARK |---->| VALIDATE |---->| PACKAGE |
+--------+     +----------+     +-----------+     +----------+     +---------+
    |               |                 |                |
    |               |                 |                |
    v               v                 v                v
 Baseline       Quantized/        Speedup/         Accuracy        Deploy-ready
 metrics        Converted         Compression      verified        artifact
                model             ratios           within tol.
Before starting any optimization workflow, verify:
PRE-FLIGHT VERIFICATION
+--------------------------------------------------------------+
| [ ] Source model file exists and is loadable                 |
| [ ] Model framework identified (PyTorch / TensorFlow / ONNX) |
| [ ] Target device identified (Jetson / RPi / CPU)            |
| [ ] Test/validation dataset available                        |
| [ ] Accuracy metric defined (mAP / top-1 / F1 / custom)      |
| [ ] Accuracy tolerance defined (default: 2% relative drop)   |
| [ ] Latency target defined (optional but recommended)        |
| [ ] Calibration dataset available (for INT8 quantization)    |
| [ ] Disk space sufficient for multiple model variants        |
+--------------------------------------------------------------+
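The checklist above can be sketched as a small validation helper. This is a minimal illustration, not part of the skill itself: `PreflightConfig` and `preflight_failures` are hypothetical names, and the sketch only checks that the inputs are present (it does not load the model, check the optional latency target, or measure disk space).

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class PreflightConfig:
    """Hypothetical container for the checklist inputs."""
    model_path: str
    framework: Optional[str] = None          # "pytorch", "tensorflow", or "onnx"
    target_device: Optional[str] = None      # e.g. "jetson", "rpi", "cpu"
    validation_data: Optional[str] = None
    accuracy_metric: Optional[str] = None    # e.g. "top-1", "mAP", "F1"
    accuracy_tolerance_pct: Optional[float] = 2.0
    calibration_data: Optional[str] = None
    int8_planned: bool = False


def preflight_failures(cfg: PreflightConfig) -> List[str]:
    """Return the unmet checklist items; an empty list means ready to start."""
    failures = []
    if not Path(cfg.model_path).is_file():
        failures.append("source model file not found")
    if cfg.framework not in ("pytorch", "tensorflow", "onnx"):
        failures.append("model framework not identified")
    if not cfg.target_device:
        failures.append("target device not identified")
    if not cfg.validation_data:
        failures.append("test/validation dataset missing")
    if not cfg.accuracy_metric:
        failures.append("accuracy metric not defined")
    if cfg.accuracy_tolerance_pct is None:
        failures.append("accuracy tolerance not defined")
    # Calibration data is only required when INT8 quantization is planned.
    if cfg.int8_planned and not cfg.calibration_data:
        failures.append("calibration dataset missing for INT8")
    return failures
```

Returning a list of failures rather than raising on the first one lets all blockers be reported to the user in a single pass.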
What is the target device?
+-- NVIDIA Jetson (GPU)
|   +-- Start with TensorRT FP16
|   +-- FP16 meets latency target?
|       +-- YES --> Ship FP16 (done)
|       +-- NO  --> Prepare calibration dataset (500-1000 images)
|           +-- Apply TensorRT INT8 with calibration
|           +-- Accuracy within tolerance?
|               +-- YES --> Ship INT8
|               +-- NO  --> Try mixed precision or smaller model
|
+-- Raspberry Pi / ARM CPU
|   +-- Start with TFLite float16 quantization
|   +-- Float16 meets latency target?
|       +-- YES --> Ship float16 (done)
|       +-- NO  --> Prepare calibration dataset (200-500 images)
|           +-- Apply TFLite full INT8 PTQ
|           +-- Accuracy within tolerance?
|               +-- YES --> Ship INT8
|               +-- NO  --> Try QAT or smaller model
|
+-- General CPU / Cloud
    +-- Start with ONNX Runtime graph optimizations
    +-- Apply dynamic range quantization
    +-- If needed, apply full INT8 with calibration
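The quantization decision tree can also be written as a function that returns the next action given what has been measured so far. This is only a sketch of the tree's logic: the function name, the target labels `"jetson"`, `"arm"`, and `"cpu"`, and the returned strings are illustrative assumptions.

```python
from typing import Optional


def next_quantization_step(target: str,
                           fp16_meets_latency: Optional[bool] = None,
                           int8_within_tolerance: Optional[bool] = None) -> str:
    """Return the next action for 'jetson', 'arm', or 'cpu'.

    A value of None means "not measured yet".
    """
    if target == "cpu":
        return ("ONNX Runtime graph optimizations, then dynamic range quantization, "
                "then full INT8 with calibration if needed")
    if target not in ("jetson", "arm"):
        raise ValueError(f"unknown target: {target!r}")

    # Per-branch details taken from the tree above.
    fmt = "TensorRT" if target == "jetson" else "TFLite"
    half = "FP16" if target == "jetson" else "float16"
    calib = "500-1000" if target == "jetson" else "200-500"
    fallback = "mixed precision" if target == "jetson" else "QAT"

    if fp16_meets_latency is None:
        return f"Convert to {fmt} {half} and benchmark"
    if fp16_meets_latency:
        return f"Ship {half}"
    if int8_within_tolerance is None:
        return f"Prepare {calib} calibration images and apply {fmt} INT8"
    if int8_within_tolerance:
        return "Ship INT8"
    return f"Try {fallback} or a smaller model"
```

For example, `next_quantization_step("arm", fp16_meets_latency=False)` asks for a 200-500 image calibration set before the INT8 attempt, matching the Raspberry Pi branch.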
Is the model overparameterized for the task?
+-- YES (accuracy is well above requirements)
|   +-- Try structured pruning (remove entire channels/filters)
|   +-- Start with 20% pruning ratio
|   +-- Fine-tune for 5-10 epochs
|   +-- Accuracy still within tolerance?
|       +-- YES --> Increase pruning ratio (30%, 40%, ...)
|       +-- NO  --> Reduce pruning ratio or switch to unstructured
|
+-- NO (accuracy is near the floor already)
    +-- Do NOT prune; focus on quantization and format conversion instead
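The "increase the ratio until accuracy breaks" loop in the pruning tree might look like the sketch below. `prune_and_finetune` is a hypothetical callback supplied by the caller; the actual pruning and fine-tuning code depends on the framework and is not shown. The 10-point step and 90% ceiling are assumptions.

```python
from typing import Callable, Optional


def search_pruning_ratio(prune_and_finetune: Callable[[float], float],
                         accuracy_floor: float,
                         start: float = 0.20,
                         step: float = 0.10,
                         max_ratio: float = 0.90) -> Optional[float]:
    """Return the largest tested ratio whose accuracy stays at or above the floor.

    prune_and_finetune(ratio) is a hypothetical callback: it should prune a fresh
    copy of the original model, fine-tune it, and return validation accuracy.
    Returns None if even the starting ratio violates the floor.
    """
    best = None
    ratio = start
    while ratio <= max_ratio + 1e-9:
        accuracy = prune_and_finetune(ratio)
        if accuracy < accuracy_floor:
            break                          # stop at the first violation
        best = ratio
        ratio = round(ratio + step, 10)    # avoid floating-point drift in the ratio
    return best
```

A `None` result corresponds to the tree's "reduce pruning ratio or switch to unstructured" branch. Pruning a fresh copy each time keeps the original model untouched.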
Maintain state across conversation turns using this block:
<model-opt-state>
phase: [PROFILE | OPTIMIZE | BENCHMARK | VALIDATE | PACKAGE]
model_name: [name of the model being optimized]
source_format: [pytorch | tensorflow | onnx | tflite | tensorrt]
target_device: [jetson-orin-nano | raspberry-pi-5 | raspberry-pi-4 | cpu-generic]
baseline_latency_ms: [number or "unmeasured"]
baseline_accuracy: [number or "unmeasured"]
accuracy_tolerance: [percentage, e.g., "2%"]
optimizations_applied: [comma-separated list or "none"]
current_best_latency_ms: [number or "unmeasured"]
current_best_accuracy: [number or "unmeasured"]
original_model_path: [absolute path to original model file]
last_action: [what was just done]
next_action: [what should happen next]
blockers: [any issues]
</model-opt-state>
## Model Optimization Report: [Model Name]
**Source**: [framework] [format] ([size] MB)
**Target Device**: [device]
**Optimization Pipeline**: [list of steps applied]
**Date**: [date]
### Baseline vs Optimized
| Metric | Baseline | Optimized | Change |
|--------|----------|-----------|--------|
| File Size | [MB] | [MB] | [ratio]x compression |
| Latency (mean) | [ms] | [ms] | [speedup]x faster |
| Latency (P95) | [ms] | [ms] | [speedup]x faster |
| Memory (peak) | [MB] | [MB] | [reduction]x smaller |
| Accuracy ([metric]) | [value] | [value] | [delta] ([status]) |
| Throughput | [fps] | [fps] | [improvement]x |
### Optimization Steps Applied
| Step | Input | Output | Size | Latency | Accuracy |
|------|-------|--------|------|---------|----------|
| 1. [step] | [file] | [file] | [MB] | [ms] | [value] |
| 2. [step] | [file] | [file] | [MB] | [ms] | [value] |
### Verdict
[PASS/FAIL]: Accuracy delta of [N]% is [within/outside] the [N]% tolerance.
Speedup: [N]x. Compression: [N]x.
### Deployment Artifact
- Model file: [path]
- Metadata: [path]
- Preprocessing config: [input_shape, dtype, normalization]
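One possible way to produce the metadata file listed under "Deployment Artifact" is sketched here. The JSON field names and the `.metadata.json` suffix are assumptions chosen for illustration; the skill does not prescribe a schema.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Iterable, Union


def write_metadata(model_path: Union[str, Path],
                   benchmark: Dict[str, float],
                   preprocessing: Dict[str, object],
                   optimizations: Iterable[str]) -> Path:
    """Write a JSON sidecar next to the model and return its path.

    The model file itself is only read, never modified.
    """
    model = Path(model_path)
    metadata = {
        "model_file": model.name,
        # The hash lets a deployment verify it received the benchmarked file.
        "sha256": hashlib.sha256(model.read_bytes()).hexdigest(),
        "size_mb": round(model.stat().st_size / 1e6, 3),
        "optimizations": list(optimizations),
        "benchmark": benchmark,
        "preprocessing": preprocessing,
    }
    out = model.with_name(model.name + ".metadata.json")
    out.write_text(json.dumps(metadata, indent=2))
    return out
```

Recording the hash alongside the benchmark numbers ties the reported results to one specific file, which helps when several variants with similar names sit in the same directory.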
## Optimization Tradeoff Analysis: [Model Name]
| Variant | Format | Precision | Size (MB) | Latency (ms) | Accuracy | Speedup | Acc. Delta |
|---------|--------|-----------|-----------|-------------|----------|---------|-----------|
| Baseline | [fmt] | FP32 | [size] | [lat] | [acc] | 1.0x | 0.0% |
| ONNX Simplified | ONNX | FP32 | [size] | [lat] | [acc] | [x] | [%] |
| TensorRT FP16 | TRT | FP16 | [size] | [lat] | [acc] | [x] | [%] |
| TensorRT INT8 | TRT | INT8 | [size] | [lat] | [acc] | [x] | [%] |
| TFLite Float16 | TFLite | FP16 | [size] | [lat] | [acc] | [x] | [%] |
| TFLite INT8 | TFLite | INT8 | [size] | [lat] | [acc] | [x] | [%] |
**Recommendation**: [variant] provides [speedup]x speedup with only [delta]% accuracy loss.
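The recommendation line can be derived mechanically from the measured variants. The sketch below picks the fastest variant whose relative accuracy drop stays within tolerance. `Variant` and `recommend` are hypothetical names, and treating the accuracy delta as a relative drop is an assumption based on the 2% relative default in the pre-flight checklist.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    name: str
    latency_ms: float
    accuracy: float


def recommend(baseline: Variant,
              candidates: List[Variant],
              tolerance_pct: float = 2.0) -> Optional[Variant]:
    """Return the fastest candidate within the accuracy tolerance, or None."""
    def drop_pct(v: Variant) -> float:
        # Relative drop versus the baseline, in percent.
        return (baseline.accuracy - v.accuracy) / baseline.accuracy * 100.0

    eligible = [v for v in candidates if drop_pct(v) <= tolerance_pct]
    return min(eligible, key=lambda v: v.latency_ms, default=None)
```

A `None` result means no variant met the accuracy floor, in which case the choice goes back to the user rather than being made silently.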
Before applying any optimization, measure the baseline on the actual model and device instead of quoting typical figures:
WRONG: "MobileNetV2 is typically about 30ms on this device, so let's quantize."
RIGHT: "Measured MobileNetV2 baseline: mean=34.2ms, P95=37.1ms, accuracy=71.8% top-1."
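A baseline measurement like the one in the RIGHT example can be collected with a small timing harness. This is a minimal sketch using only the standard library; `run_inference` is a placeholder for whatever zero-argument callable invokes the model, and the warmup and iteration counts are assumed defaults.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_inference: Callable[[], object],
              warmup: int = 10,
              iterations: int = 100) -> Dict[str, float]:
    """Time a zero-argument inference callable and summarize latency in ms."""
    if iterations < 2:
        raise ValueError("iterations must be at least 2 to compute percentiles")
    for _ in range(warmup):              # warmup runs are executed but not recorded
        run_inference()

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

For GPU runtimes that execute asynchronously, the callable needs to block until the result is ready, otherwise the timer only measures the launch.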
After every format conversion, verify that preprocessing produces correct inputs:
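# ALWAYS verify after conversion.
# Assumes `interpreter` is a loaded TFLite interpreter with tensors allocated
# and `input_data` is the preprocessed NumPy array about to be fed to it.
input_details = interpreter.get_input_details()[0]
expected_shape = tuple(input_details['shape'])
expected_dtype = input_details['dtype']
assert input_data.shape == expected_shape, (
    f"Shape mismatch: {input_data.shape} vs expected {expected_shape}"
)
assert input_data.dtype == expected_dtype, (
    f"Dtype mismatch: {input_data.dtype} vs expected {expected_dtype}"
)
# Full-integer (INT8) models: input is usually uint8 or int8, NOT float32 (0.0-1.0);
# the exact type depends on how the model was converted, so trust expected_dtype
# Float models: input is float32 (0.0-1.0) unless model-specific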
If accuracy drops beyond the stated tolerance:
1. STOP the optimization pipeline
2. Report the exact numbers to the user
3. Present alternatives (less aggressive quantization, mixed precision, QAT)
4. Let the user decide whether to accept the tradeoff
5. NEVER proceed silently past an accuracy violation
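The stop-and-report rule can be enforced in code so that a violation cannot pass unnoticed. In this sketch, `AccuracyViolation` and `enforce_accuracy_floor` are hypothetical names, and the tolerance is interpreted as a relative drop in percent, following the 2% relative default used elsewhere in this skill.

```python
class AccuracyViolation(Exception):
    """Raised when an optimized model drops below the agreed accuracy floor."""


def relative_drop_pct(baseline: float, optimized: float) -> float:
    """Relative accuracy drop in percent (negative if accuracy improved)."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline - optimized) / baseline * 100.0


def enforce_accuracy_floor(baseline: float,
                           optimized: float,
                           tolerance_pct: float = 2.0) -> float:
    """Return the relative drop, or raise with the exact numbers if it is too large."""
    drop = relative_drop_pct(baseline, optimized)
    if drop > tolerance_pct:
        raise AccuracyViolation(
            f"accuracy fell from {baseline} to {optimized} "
            f"({drop:.2f}% relative drop, tolerance {tolerance_pct}%)"
        )
    return drop
```

Raising an exception rather than returning a flag means the pipeline halts by default, and the message already carries the numbers to report to the user.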
Host machine numbers are estimates, NOT deployment metrics.
If target hardware is available:
- ALL latency benchmarks MUST run on the target device
- Set power mode explicitly (Jetson: nvpmodel, RPi: governor)
- Lock clock frequencies for reproducibility
- Run sustained load test (5+ minutes) to catch thermal throttling
If target hardware is unavailable:
- Label ALL results as "Host Estimate (not target hardware)"
- Accuracy measurements are still broadly valid (largely platform-independent, though different runtimes and kernels can introduce small numerical differences)
- Latency numbers may differ 2-10x on actual target hardware
| Anti-Pattern | Why It Fails | Correct Approach |
|--------------|-------------|------------------|
| Optimizing without baseline measurement | Cannot quantify improvement; may ship a regression | Always profile the original model first |
| Stacking multiple optimizations at once | Cannot attribute accuracy loss to a specific change | Apply one optimization, benchmark, then decide |
| Using random data for INT8 calibration | Quantization ranges will not match real data distribution | Use 500-1000 representative samples from the target domain |
| Reporting mean latency only | Hides tail latency spikes from thermal throttling | Report P50, P95, P99, and run sustained load tests |
| Assuming FP16 is lossless | Some models with large dynamic ranges lose accuracy at FP16 | Always validate accuracy after FP16 conversion |
| Deleting the original model after optimization | Cannot re-optimize or debug accuracy issues later | Keep original model; use descriptive names for variants |
| Building TensorRT engine on x86 for ARM target | TensorRT engines are architecture-specific | Build engines on the target device or matching architecture |
| Quantizing the entire model uniformly to INT8 | Some layers (attention, final classifier) are INT8-sensitive | Run per-layer sensitivity analysis; use mixed precision |
Problem: INT8 quantization produces inconsistent or degraded accuracy.
Actions:
1. Increase calibration dataset to 500-1000 samples
2. Ensure samples cover the full input distribution (all classes, lighting conditions, etc.)
3. Avoid using training data augmentations in calibration data
4. Try a different calibration algorithm (Entropy vs MinMax)
5. Compare INT8 outputs against FP32 outputs on calibration samples
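To see why the choice of calibration range matters, the toy example below compares a plain min/max range against a range that clips a small fraction of extreme samples. It is a pure-Python illustration of affine INT8 quantization, not the entropy calibrator of any particular toolkit. All function names are hypothetical and the data is synthetic.

```python
from typing import List, Tuple


def affine_params(lo: float, hi: float,
                  qmin: int = -128, qmax: int = 127) -> Tuple[float, int]:
    """Scale and zero point mapping the real range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)          # the range must contain zero
    scale = (hi - lo) / (qmax - qmin) or 1.0     # guard against an all-zero tensor
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def fake_quantize(x: float, scale: float, zero_point: int,
                  qmin: int = -128, qmax: int = 127) -> float:
    """Quantize to an integer level, clamp, and map back to a real value."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


def mean_abs_error(values: List[float], lo: float, hi: float) -> float:
    """Average round-trip error when quantizing values with the range [lo, hi]."""
    scale, zero_point = affine_params(lo, hi)
    return sum(abs(v - fake_quantize(v, scale, zero_point)) for v in values) / len(values)


def clipped_range(values: List[float], clip_fraction: float = 0.001) -> Tuple[float, float]:
    """Range that ignores the most extreme clip_fraction of samples in each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * clip_fraction)
    return ordered[k], ordered[len(ordered) - 1 - k]


# Synthetic activations in [-1, 1) plus a single outlier in the calibration set.
typical = [i / 500 - 1 for i in range(1000)]
calibration = typical + [100.0]

minmax_error = mean_abs_error(typical, min(calibration), max(calibration))
clipped_error = mean_abs_error(typical, *clipped_range(calibration))
print(f"min/max range error: {minmax_error:.4f}")
print(f"clipped range error: {clipped_error:.4f}")
```

With the outlier included, the min/max range spreads the 256 levels over a span most values never use, so typical values are reconstructed far less precisely than with the clipped range. The same effect is why unrepresentative calibration samples degrade INT8 accuracy.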
Problem: ONNX model fails validation or produces wrong outputs.
Actions:
1. Run onnx.checker.check_model() for structural validation
2. Compare ONNX output against original framework output on same input
3. Try a different opset version (lower is more compatible, higher has more ops)
4. Simplify with onnxsim before further conversion
5. Check for unsupported dynamic operations and replace with static alternatives
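Steps 1 and 2 above might be combined as in the following sketch. It assumes the third-party `onnx` and `onnxruntime` packages are installed and that the model has a single input; both are imported lazily so the comparison helper stays usable without them. The function names are illustrative, and any pass/fail threshold applied to the difference is left to the caller.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat output vectors."""
    if len(reference) != len(candidate):
        raise ValueError(
            f"output lengths differ: {len(reference)} vs {len(candidate)}")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def check_and_run_onnx(model_path: str, input_array):
    """Structurally validate an ONNX file, then run it once on CPU.

    input_array is expected to be a NumPy array matching the model's input.
    Returns the first output of the model.
    """
    import onnx                    # third-party; assumed to be installed
    import onnxruntime as ort      # third-party; assumed to be installed

    onnx.checker.check_model(onnx.load(model_path))
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: input_array})[0]
```

A typical use is to run the original framework and `check_and_run_onnx` on the same input, flatten both outputs to lists, and pass them to `max_abs_diff`. A large difference points at an export problem rather than a quantization problem.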
Problem: Model is still too slow after quantization and conversion.
Actions:
1. Review the profiling breakdown -- which stage is the bottleneck?
2. Reduce model input resolution (e.g., 640x640 to 320x320)
3. Switch to a smaller model architecture (e.g., YOLOv8n instead of YOLOv8s)
4. Apply structured pruning to reduce channel counts
5. Consider model distillation to a smaller student architecture
6. Accept a lower FPS target if accuracy requirements are non-negotiable
edge-cv-pipeline -- After optimizing a model, use edge-cv-pipeline to build the complete inference pipeline with camera capture, preprocessing, postprocessing, and result publishing. The optimized model from this skill becomes the inference engine in the CV pipeline.
jetson-deploy -- After optimizing a model for Jetson, use jetson-deploy to containerize the deployment, manage TensorRT engine building on-device, configure power modes, and set up monitoring with tegrastats and jtop.