JosiahSiegel/ml-inference-optimization

Name: ml-inference-optimization
Author: JosiahSiegel

plugins/ml-master/skills/ml-inference-optimization/SKILL.md

npx skillsauth add JosiahSiegel/claude-plugin-marketplace ml-inference-optimization

ML Inference Optimization

Overview

Use this skill for reducing inference latency, increasing throughput, shrinking memory, lowering cost, and deploying optimized models safely. Optimization must be benchmark-driven: define workload, input shapes, concurrency, SLOs, hardware, runtime, numerical tolerance, and quality metrics before changing the model.

Optimization Workflow

1. Establish a baseline with realistic data, preprocessing, postprocessing, network overhead, warmup, and concurrency.
2. Profile bottlenecks: CPU preprocessing, model compute, memory bandwidth, GPU utilization, serialization, queueing, retrieval, or downstream calls.
3. Apply the least risky optimization first: batching, compilation, precision, runtime tuning, then compression.
4. Revalidate accuracy, calibration, fairness slices, robustness, and numerical stability after every change.
5. Measure p50, p95, p99 latency, throughput, memory, cold start, error rate, and cost per 1,000 predictions.
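
The measurement step above can be sketched as a small summary function. This is a minimal illustration, not part of the original skill: the function names, the nearest-rank percentile definition, and the hourly cost input are assumptions made for the example.

```python
import math

def percentile(sorted_values, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def summarize(latencies_ms, wall_seconds, hourly_cost_usd):
    """Summarize one benchmark run (hypothetical helper, stdlib only)."""
    ordered = sorted(latencies_ms)
    throughput_rps = len(ordered) / wall_seconds
    # Cost of one second of hardware, divided by requests served per second, times 1,000.
    cost_per_1k = hourly_cost_usd / 3600 / throughput_rps * 1000
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "throughput_rps": throughput_rps,
        "cost_per_1k_usd": cost_per_1k,
    }
```

Feed it per-request latencies collected under production-like concurrency, after warmup; a summary built from single-request runs will understate tail latency.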

Export and Compilation

Export paths should be chosen by deployment target. Compare exported output with framework output on representative inputs. Define tolerances per output type. Include preprocessing and postprocessing in the benchmark; many failures occur outside the core model.
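
One way to express "tolerances per output type" is a table of absolute and relative tolerances keyed by output name. The sketch below is an assumption-laden illustration: the output names, the tolerance values, and the flat-list representation are hypothetical, and the pass rule (|ref - cand| <= atol + rtol * |ref|) is one common convention rather than the skill's prescribed check.

```python
# Hypothetical tolerances; tune them per model and per output type.
TOLERANCES = {
    "logits": {"atol": 1e-3, "rtol": 1e-3},
    "probabilities": {"atol": 1e-4, "rtol": 0.0},
}

def within_tolerance(reference, candidate, atol, rtol):
    """True when every element satisfies |ref - cand| <= atol + rtol * |ref|."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol + rtol * abs(r) for r, c in zip(reference, candidate))

def compare_outputs(framework_outputs, exported_outputs, tolerances):
    """Return a pass/fail flag per named output (flat lists of floats)."""
    return {
        name: within_tolerance(
            framework_outputs[name], exported_outputs[name], tol["atol"], tol["rtol"]
        )
        for name, tol in tolerances.items()
    }
```

Run the comparison over a set of representative inputs rather than a single dummy tensor, and record which outputs fail so the tolerance can be revisited deliberately.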

PyTorch to ONNX Export Blueprint

import torch
import torch.nn as nn

def export_to_onnx(model: nn.Module, dummy_input: torch.Tensor, onnx_path: str):
    model.eval()
    
    # Define input and output names, and dynamic axes for batch size and sequence length
    input_names = ["input_ids"]
    output_names = ["logits"]
    dynamic_axes = {
        "input_ids": {0: "batch_size", 1: "seq_len"},
        "logits": {0: "batch_size"}
    }
    
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        export_params=True,
        opset_version=17, # Modern stable opset version
        do_constant_folding=True,
        input_names=input_names,
        output_names=output_names,
        dynamic_axes=dynamic_axes
    )

PyTorch `torch.compile` Optimization Setup

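torch.compile uses TorchInductor as its default backend and can speed up inference on modern GPUs. Gains depend on the model, input shapes, and hardware, so benchmark before and after compiling.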

import torch

def compile_model(model: torch.nn.Module, mode: str = "default"):
    # Modes:
    # - "default": Balance compile time and inference speed
    # - "reduce-overhead": Reduce framework overhead (uses CUDA graphs, good for small batch sizes)
    # - "max-autotune": Search more kernel configurations (longest compile time, often the fastest runtime)
    
    optimized_model = torch.compile(
        model, 
        mode=mode, 
        fullgraph=False,
        dynamic=True # Handles variable sequence lengths/batch sizes efficiently
    )
    return optimized_model

Batching and Concurrency

Dynamic batching is often the biggest serving win for GPU endpoints. Tune max batch size and queue delay together: larger batches improve throughput but can harm tail latency. For LLMs, distinguish prefill and decode phases; sequence length, KV cache, and concurrent requests dominate memory. Use continuous batching or specialized LLM serving runtimes when appropriate.
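
The interaction between max batch size and queue delay can be explored with a toy simulation. This is a simplified sketch, not how any particular server implements batching: it assumes the model is never busy, and the function names and arrival times are hypothetical.

```python
def form_batches(arrival_times_ms, max_batch_size, max_queue_delay_ms):
    """Group sorted arrival times into batches.

    A batch is dispatched when it is full, or when its oldest request has
    waited max_queue_delay_ms, whichever comes first.
    """
    batches, current = [], []
    for t in arrival_times_ms:
        # The oldest queued request timed out before this one arrived.
        if current and t - current[0] > max_queue_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

def worst_queue_wait_ms(batches, max_batch_size, max_queue_delay_ms):
    """Longest time any request spends queued before its batch is dispatched."""
    worst = 0
    for batch in batches:
        full = len(batch) == max_batch_size
        dispatch = batch[-1] if full else batch[0] + max_queue_delay_ms
        worst = max(worst, dispatch - batch[0])
    return worst
```

Replaying a recorded arrival trace through a model like this gives a first estimate of how much queueing a given delay adds before the same settings are tested on the real server.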

For CPU services, concurrency may improve utilization until memory bandwidth, GIL/runtime constraints, or thread oversubscription dominate. Pin thread counts for BLAS/OpenMP/ONNX Runtime and benchmark rather than accepting defaults.
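
A minimal sketch of pinning thread counts through environment variables follows. The helper names and the cores-per-worker split are assumptions for illustration; the variables must be set before the numeric libraries are imported. Runtimes also expose their own settings (for example `torch.set_num_threads`, or `intra_op_num_threads` on ONNX Runtime `SessionOptions`), which are not shown here.

```python
import os

THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")

def threads_per_worker(cpu_cores, num_workers):
    """Split cores evenly across worker processes to avoid oversubscription."""
    return max(1, cpu_cores // num_workers)

def pin_threads(num_threads):
    """Set common threading variables; call this before importing numeric libraries."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(num_threads)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}
```

Treat the even split as a starting point only, and benchmark a few thread counts around it under realistic concurrency.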

Triton Inference Server Configuration

Triton Inference Server is a widely used option for high-performance, multi-framework serving.

Triton `config.pbtxt` Template

name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ] # Dynamic sequence length
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 5 ] # Number of classes
  }
]

# Enable Dynamic Batching
dynamic_batching {
  max_queue_delay_microseconds: 5000 # Wait up to 5ms for concurrent requests
  preferred_batch_size: [ 8, 16, 32 ]
}

# GPU Instance Groups
instance_group [
  {
    count: 2 # Host 2 model execution instances per GPU
    kind: KIND_GPU
  }
]

Precision and Quantization

| Method | Best for | Caveats |
|---|---|---|
| fp16/bf16 | GPU/accelerator throughput and memory reduction | Validate numerically sensitive ops |
| INT8 post-training quantization | CNNs, transformers, tabular neural nets with calibration data | Calibration set must represent production |
| Quantization-aware training | Accuracy-sensitive low precision deployments | Adds training complexity |
| INT4/GPTQ/AWQ/bitsandbytes | Large language model memory reduction | Can degrade reasoning, rare tokens, calibration, tool use |
| Weight-only quantization | LLM serving where activations remain higher precision | Speedup depends on kernels/hardware |

Quantization changes model behavior. Re-run task metrics, calibration, slice metrics, and safety checks. For RAG or embedding models, verify retrieval recall and embedding similarity distribution after quantization.
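
To make the behavior change concrete, here is a toy symmetric per-tensor INT8 round trip in plain Python. It is a sketch under simplifying assumptions (one scale for the whole tensor, no calibration data, hypothetical function names) and does not reflect how any specific quantization toolkit works internally.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map integer codes back to approximate float values."""
    return [q * scale for q in quantized]

def max_abs_error(original, restored):
    """Largest element-wise reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))
```

With a single large outlier such as `[0.01, 0.02, 10.0]`, the scale is dominated by the outlier and both small values quantize to zero, which is one reason the calibration set and the choice of granularity matter.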

Pruning, Sparsity, and Distillation

Pruning removes weights, channels, heads, or layers. Unstructured sparsity may not speed inference without hardware/runtime support; structured pruning is more likely to reduce latency. Knowledge distillation trains a smaller student to mimic a larger teacher.
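
The difference between unstructured and structured pruning can be seen on a small weight matrix. The sketch below is illustrative only: it uses magnitude and L1-norm criteria as assumed selection rules, and the function names are hypothetical.

```python
def prune_unstructured(matrix, sparsity):
    """Zero the smallest-magnitude fraction of entries; the shape is unchanged.

    Ties at the threshold are also zeroed, so slightly more than the
    requested fraction may be removed.
    """
    flat = sorted(abs(v) for row in matrix for v in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]

def prune_rows(matrix, keep):
    """Structured pruning: keep the `keep` rows with the largest L1 norm, in original order."""
    ranked = sorted(
        range(len(matrix)),
        key=lambda i: sum(abs(v) for v in matrix[i]),
        reverse=True,
    )
    return [matrix[i][:] for i in sorted(ranked[:keep])]
```

The unstructured result still has the original dimensions, so a dense kernel does the same amount of work; the structured result is a genuinely smaller matrix, which is why it is more likely to reduce latency without special sparse kernels.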

TensorRT Compilation Recipes (trtexec)

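TensorRT engines are compiled for a specific GPU and TensorRT version and often deliver the highest throughput on NVIDIA GPUs. Rebuild the engine when the hardware or TensorRT version changes.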

1. Build FP16 Engine with Dynamic Shapes

trtexec --onnx=model.onnx \
        --saveEngine=model.engine \
        --fp16 \
        --minShapes=input_ids:1x1 \
        --optShapes=input_ids:16x128 \
        --maxShapes=input_ids:32x512

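2. Build INT8 Engine with Calibration Data

The --calib flag reads an INT8 calibration cache file, so generate that cache from representative production data before running this command.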

trtexec --onnx=model.onnx \
        --saveEngine=model_int8.engine \
        --int8 \
        --calib=calibration_profile.cache \
        --minShapes=input_ids:1x1 \
        --optShapes=input_ids:16x128 \
        --maxShapes=input_ids:32x512

Serving Runtimes

| Runtime | Use when |
|---|---|
| Triton Inference Server | Multi-framework GPU/CPU serving, dynamic batching, ensembles, model repository, high throughput |
| TorchServe | PyTorch model serving with handlers and management APIs |
| TensorFlow Serving | TensorFlow SavedModel serving and versioned model management |
| ONNX Runtime Server/custom API | Portable ONNX inference with lightweight service control |
| BentoML | Packaging models as APIs, deployment flexibility, Python-centric workflows |
| Seldon/KServe | Kubernetes-native model serving, canaries, inference graphs, enterprise ops |
| vLLM / TGI | Continuous batching, PagedAttention, KV cache management, tensor parallel LLM serving |

Use model repository versioning and readiness checks. Separate health checks from expensive inference. For GPU serving, avoid loading multiple memory-heavy models per device unless capacity is proven.

Memory Optimization

For transformers, memory is driven by weights plus KV cache. Reduce memory with quantization, tensor parallelism, smaller max sequence length, paged attention, cache eviction, smaller batch/concurrency, or distilled models. For diffusion and vision models, use attention slicing, VAE tiling, fp16/bf16, memory-efficient attention, and model offload where latency allows.
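
A back-of-the-envelope KV cache estimate helps when choosing max sequence length and concurrency. The sketch assumes a standard decoder that stores one key and one value vector per layer, KV head, and token; the model dimensions in the example are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_element=2):
    """Estimate KV cache size: keys and values for every layer, KV head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# Hypothetical model dimensions, chosen only for illustration (fp16/bf16 cache).
one_sequence = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                              seq_len=4096, batch_size=1)
print(one_sequence / 2**30)  # 0.5 GiB for a single 4096-token sequence
```

Because the estimate scales linearly with both sequence length and batch size, halving the maximum sequence length or the number of concurrent sequences halves the cache budget.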

Edge Deployment

Edge constraints include binary size, RAM, thermal throttling, offline operation, intermittent updates, privacy, and hardware acceleration. Prefer TFLite/Core ML/OpenVINO/ONNX depending on target. Test on real devices.

Benchmarking Pitfalls

Measuring only warm single-request latency while production has concurrent traffic.
Ignoring tokenization, image decode, retrieval, serialization, or network time.
Reporting average latency instead of p95/p99.
Comparing quantized and baseline models on different data.
Building TensorRT engines for one shape then serving unsupported dynamic shapes.
Increasing batch size until throughput improves but SLO fails.
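
The last pitfall can be avoided by selecting the batch size from benchmark results with the SLO as a hard constraint. This is a minimal sketch: the result fields, the function name, and the numbers in the usage example are hypothetical.

```python
def pick_batch_size(results, p99_slo_ms):
    """Return the batch size with the highest throughput whose p99 meets the SLO.

    `results` is a list of dicts with keys batch_size, throughput_rps, p99_ms.
    Returns None when no configuration meets the SLO.
    """
    eligible = [r for r in results if r["p99_ms"] <= p99_slo_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["throughput_rps"])["batch_size"]

# Hypothetical benchmark sweep, for illustration only.
sweep = [
    {"batch_size": 1, "throughput_rps": 100, "p99_ms": 12},
    {"batch_size": 8, "throughput_rps": 600, "p99_ms": 35},
    {"batch_size": 16, "throughput_rps": 900, "p99_ms": 70},
    {"batch_size": 32, "throughput_rps": 1100, "p99_ms": 140},
]
```

With an 80 ms p99 SLO this sweep selects batch size 16, even though batch size 32 has higher raw throughput.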

Sources

NVIDIA Triton Inference Server documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/
NVIDIA TensorRT documentation: https://docs.nvidia.com/deeplearning/tensorrt/
ONNX Runtime documentation: https://onnxruntime.ai/docs/
PyTorch export and compilation docs: https://pytorch.org/docs/stable/
TensorFlow Lite documentation: https://www.tensorflow.org/lite
BentoML documentation: https://docs.bentoml.com/

This skill should be used when the user asks to optimize, export, serve, compress, or accelerate ML inference. PROACTIVELY activate for: (1) latency, throughput, p95/p99, batching, concurrency, KV cache, memory, or cost issues, (2) quantization INT8/INT4, GPTQ, AWQ, bitsandbytes, pruning, sparsity, distillation, (3) ONNX export, ONNX Runtime, TensorRT, TorchScript, torch.compile, XLA, OpenVINO, Core ML, TFLite, (4) Triton, TorchServe, TF Serving, BentoML, Seldon, KServe configuration, (5) edge deployment, CPU/GPU/TPU/Inferentia serving. Provides: hardware-aware inference optimization and safe benchmarking.

38 stars

development

Updated May 28, 2026

npx skillsauth add JosiahSiegel/claude-plugin-marketplace ml-inference-optimization

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Purpose | Status | Percentage shown |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 28, 2026, 7:35 AM (102.0s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/JosiahSiegel/claude-plugin-marketplace.git

# Copy into Claude Code skills folder (global)
cp -r claude-plugin-marketplace/plugins/ml-master/skills/ml-inference-optimization ~/.claude/skills/

Repository: JosiahSiegel/claude-plugin-marketplace (38 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT
