cli-tool/components/skills/ai-research/emerging-techniques-model-pruning/SKILL.md
Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.
npx skillsauth add davila7/claude-code-templates model-pruningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Use Model Pruning when you need to:
Key Techniques: Wanda (weights × activations), SparseGPT (second-order), structured pruning, N:M sparsity
Papers: Wanda ICLR 2024 (arXiv 2306.11695), SparseGPT (arXiv 2301.00774)
# Wanda implementation
git clone https://github.com/locuslab/wanda
cd wanda
pip install -r requirements.txt
# Optional: SparseGPT
git clone https://github.com/IST-DASLab/sparsegpt
cd sparsegpt
pip install -e .
# Dependencies
pip install torch transformers accelerate
Source: ICLR 2024 (arXiv 2306.11695)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Calibration data (small dataset for activation statistics)
calib_data = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming the world.",
"Artificial intelligence powers modern applications.",
]
# Wanda pruning function
def wanda_prune(model, calib_data, sparsity=0.5):
"""
Wanda: Prune by weight magnitude × input activation.
Args:
sparsity: Fraction of weights to prune (0.5 = 50%)
"""
# 1. Collect activation statistics
activations = {}
def hook_fn(name):
def hook(module, input, output):
# Store input activation norms
activations[name] = input[0].detach().abs().mean(dim=0)
return hook
# Register hooks for all linear layers
hooks = []
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
hooks.append(module.register_forward_hook(hook_fn(name)))
# Run calibration data
model.eval()
with torch.no_grad():
for text in calib_data:
inputs = tokenizer(text, return_tensors="pt").to(model.device)
model(**inputs)
# Remove hooks
for hook in hooks:
hook.remove()
# 2. Prune weights based on |weight| × activation
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear) and name in activations:
W = module.weight.data
act = activations[name]
# Compute importance: |weight| × activation
importance = W.abs() * act.unsqueeze(0)
# Flatten and find threshold
threshold = torch.quantile(importance.flatten(), sparsity)
# Create mask
mask = importance >= threshold
# Apply mask (prune)
W *= mask.float()
return model
# Apply Wanda pruning (50% sparsity, one-shot, no retraining)
pruned_model = wanda_prune(model, calib_data, sparsity=0.5)
# Save
pruned_model.save_pretrained("./llama-2-7b-wanda-50")
Source: arXiv 2301.00774
from sparsegpt import SparseGPT
# Load model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Initialize SparseGPT
pruner = SparseGPT(model)
# Calibration data
calib_data = load_calibration_data() # ~128 samples
# Prune (one-shot, layer-wise reconstruction)
pruned_model = pruner.prune(
calib_data=calib_data,
sparsity=0.5, # 50% sparsity
prunen=0, # Unstructured (0) or N:M structured
prunem=0,
percdamp=0.01, # Damping for Hessian inverse
)
# Results: Near-lossless pruning at 50% sparsity
def nm_prune(weight, n=2, m=4):
"""
N:M pruning: Keep N weights per M consecutive weights.
Example: 2:4 = keep 2 out of every 4 weights.
Compatible with NVIDIA sparse tensor cores (2:4, 4:8).
"""
# Reshape weight into groups of M
shape = weight.shape
weight_flat = weight.flatten()
# Pad to multiple of M
pad_size = (m - weight_flat.numel() % m) % m
weight_padded = F.pad(weight_flat, (0, pad_size))
# Reshape into (num_groups, m)
weight_grouped = weight_padded.reshape(-1, m)
# Find top-N in each group
_, indices = torch.topk(weight_grouped.abs(), n, dim=-1)
# Create mask
mask = torch.zeros_like(weight_grouped)
mask.scatter_(1, indices, 1.0)
# Apply mask
weight_pruned = weight_grouped * mask
# Reshape back
weight_pruned = weight_pruned.flatten()[:weight_flat.numel()]
return weight_pruned.reshape(shape)
# Apply 2:4 sparsity (NVIDIA hardware)
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
module.weight.data = nm_prune(module.weight.data, n=2, m=4)
# 50% sparsity, 2× speedup on A100 with sparse tensor cores
Magnitude Pruning (baseline):
# Prune weights with smallest absolute values
importance = weight.abs()
threshold = torch.quantile(importance, sparsity)
mask = importance >= threshold
Wanda (weights × activations):
# Importance = |weight| × input_activation
importance = weight.abs() * activation
# Better than magnitude alone (considers usage)
SparseGPT (second-order):
# Uses Hessian (second derivative) for importance
# More accurate but computationally expensive
importance = weight^2 / diag(Hessian)
Unstructured (fine-grained):
Structured (coarse-grained):
Semi-structured (N:M):
# Unstructured (random)
# [1, 0, 1, 0, 1, 1, 0, 0]
# Pros: Flexible, high quality
# Cons: No speedup
# Structured (block)
# [1, 1, 0, 0, 1, 1, 0, 0]
# Pros: Hardware friendly
# Cons: More accuracy loss
# N:M (semi-structured)
# [1, 0, 1, 0] [1, 1, 0, 0] (2:4 pattern)
# Pros: Hardware speedup + good quality
# Cons: Requires specific hardware (NVIDIA)
def gradual_prune(model, initial_sparsity=0.0, final_sparsity=0.5, num_steps=100):
"""Gradually increase sparsity during training."""
for step in range(num_steps):
# Current sparsity
current_sparsity = initial_sparsity + (final_sparsity - initial_sparsity) * (step / num_steps)
# Prune at current sparsity
for module in model.modules():
if isinstance(module, torch.nn.Linear):
weight = module.weight.data
threshold = torch.quantile(weight.abs().flatten(), current_sparsity)
mask = weight.abs() >= threshold
weight *= mask.float()
# Train one step
train_step(model)
return model
def layer_wise_prune(model, sparsity_per_layer):
"""Different sparsity for different layers."""
# Early layers: Less pruning (more important)
# Late layers: More pruning (less critical)
sparsity_schedule = {
"layer.0": 0.3, # 30% sparsity
"layer.1": 0.4,
"layer.2": 0.5,
"layer.3": 0.6, # 60% sparsity
}
for name, module in model.named_modules():
if isinstance(module, torch.nn.Linear):
# Find layer index
for layer_name, sparsity in sparsity_schedule.items():
if layer_name in name:
# Prune at layer-specific sparsity
prune_layer(module, sparsity)
break
return model
def iterative_prune_finetune(model, target_sparsity=0.5, iterations=5):
"""Prune gradually with fine-tuning between iterations."""
current_sparsity = 0.0
sparsity_increment = target_sparsity / iterations
for i in range(iterations):
# Increase sparsity
current_sparsity += sparsity_increment
# Prune
prune_model(model, sparsity=current_sparsity)
# Fine-tune (recover accuracy)
fine_tune(model, epochs=2, lr=1e-5)
return model
# Results: Better accuracy than one-shot at high sparsity
from transformers import Trainer, TrainingArguments
def production_pruning_pipeline(
model_name="meta-llama/Llama-2-7b-hf",
target_sparsity=0.5,
method="wanda", # or "sparsegpt"
):
# 1. Load model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 2. Load calibration data
calib_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1000]")
# 3. Apply pruning
if method == "wanda":
pruned_model = wanda_prune(model, calib_dataset, sparsity=target_sparsity)
elif method == "sparsegpt":
pruner = SparseGPT(model)
pruned_model = pruner.prune(calib_dataset, sparsity=target_sparsity)
# 4. (Optional) Fine-tune to recover accuracy
training_args = TrainingArguments(
output_dir="./pruned-model",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=1e-5,
bf16=True,
)
trainer = Trainer(
model=pruned_model,
args=training_args,
train_dataset=finetune_dataset,
)
trainer.train()
# 5. Save
pruned_model.save_pretrained("./pruned-llama-7b-50")
tokenizer.save_pretrained("./pruned-llama-7b-50")
return pruned_model
# Usage
pruned_model = production_pruning_pipeline(
model_name="meta-llama/Llama-2-7b-hf",
target_sparsity=0.5,
method="wanda"
)
from lm_eval import evaluator
# Evaluate pruned vs original model
original_results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=meta-llama/Llama-2-7b-hf",
tasks=["arc_easy", "hellaswag", "winogrande"],
)
pruned_results = evaluator.simple_evaluate(
model="hf",
model_args="pretrained=./pruned-llama-7b-50",
tasks=["arc_easy", "hellaswag", "winogrande"],
)
# Compare
print(f"Original: {original_results['results']['arc_easy']['acc']:.3f}")
print(f"Pruned: {pruned_results['results']['arc_easy']['acc']:.3f}")
print(f"Degradation: {(original_results - pruned_results):.3f}")
# Typical results at 50% sparsity:
# - Wanda: <1% accuracy loss
# - SparseGPT: <0.5% accuracy loss
# - Magnitude: 2-3% accuracy loss
# Conservative (safe)
sparsity = 0.3 # 30%, <0.5% loss
# Balanced (recommended)
sparsity = 0.5 # 50%, ~1% loss
# Aggressive (risky)
sparsity = 0.7 # 70%, 2-5% loss
# Extreme (model-dependent)
sparsity = 0.9 # 90%, significant degradation
# One-shot, no retraining → Wanda or SparseGPT
if no_retraining_budget:
use_method = "wanda" # Faster
# Best quality → SparseGPT
if need_best_quality:
use_method = "sparsegpt" # More accurate
# Hardware speedup → N:M structured
if need_speedup:
use_method = "nm_prune" # 2:4 or 4:8
# ❌ Bad: Pruning without calibration data
prune_random(model) # No activation statistics
# ✅ Good: Use calibration data
prune_wanda(model, calib_data)
# ❌ Bad: Too high sparsity in one shot
prune(model, sparsity=0.9) # Massive accuracy loss
# ✅ Good: Gradual or iterative
iterative_prune(model, target=0.9, steps=10)
Pruning methods at 50% sparsity (LLaMA-7B):
| Method | Accuracy Loss | Speed | Memory | Retraining Needed | |--------|---------------|-------|---------|-------------------| | Magnitude | -2.5% | 1.0× | -50% | No | | Wanda | -0.8% | 1.0× | -50% | No | | SparseGPT | -0.4% | 1.0× | -50% | No | | N:M (2:4) | -1.0% | 2.0× | -50% | No | | Structured | -3.0% | 2.0× | -50% | No |
Source: Wanda paper (ICLR 2024), SparseGPT paper
tools
No-code automation democratizes workflow building. Zapier and Make (formerly Integromat) let non-developers automate business processes without writing code. But no-code doesn't mean no-complexity - these platforms have their own patterns, pitfalls, and breaking points. This skill covers when to use which platform, how to build reliable automations, and when to graduate to code-based solutions. Key insight: Zapier optimizes for simplicity and integrations (7000+ apps), Make optimizes for power
tools
Use only when the user explicitly asks to stage, commit, push, and open a GitHub pull request in one flow using the GitHub CLI (`gh`).
tools
Workflow automation is the infrastructure that makes AI agents reliable. Without durable execution, a network hiccup during a 10-step payment flow means lost money and angry customers. With it, workflows resume exactly where they left off. This skill covers the platforms (n8n, Temporal, Inngest) and patterns (sequential, parallel, orchestrator-worker) that turn brittle scripts into production-grade automation. Key insight: The platforms make different tradeoffs. n8n optimizes for accessibility
development
Trigger.dev expert for background jobs, AI workflows, and reliable async execution with excellent developer experience and TypeScript-first design. Use when: trigger.dev, trigger dev, background task, ai background job, long running task.