cli-tool/components/skills/ai-research/optimization-hqq/SKILL.md
Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
npx skillsauth add davila7/claude-code-templates hqq-quantizationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.
Use HQQ when:
Key advantages:
Use alternatives instead:
pip install hqq
# With specific backend
pip install hqq[torch] # PyTorch backend
pip install hqq[torchao] # TorchAO int4 backend
pip install hqq[bitblas] # BitBlas backend
pip install hqq[marlin] # Marlin backend
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch.nn as nn
# Configure quantization
config = BaseQuantizeConfig(
nbits=4, # 4-bit quantization
group_size=64, # Group size for quantization
axis=1 # Quantize along output dimension
)
# Quantize a linear layer
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)
# Use normally
output = hqq_linear(input_tensor)
from transformers import AutoModelForCausalLM, HqqConfig
# Configure HQQ
quantization_config = HqqConfig(
nbits=4,
group_size=64,
axis=1
)
# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quantization_config,
device_map="auto"
)
# Model is quantized and ready to use
HQQ uses BaseQuantizeConfig to define quantization parameters:
from hqq.core.quantize import BaseQuantizeConfig
# Standard 4-bit config
config_4bit = BaseQuantizeConfig(
nbits=4, # Bits per weight (1-8)
group_size=64, # Weights per quantization group
axis=1 # 0=input dim, 1=output dim
)
# Aggressive 2-bit config
config_2bit = BaseQuantizeConfig(
nbits=2,
group_size=16, # Smaller groups for low-bit
axis=1
)
# Mixed precision per layer type
layer_configs = {
"self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
"self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
"self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
"mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
"mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
"mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
The core quantized layer that replaces nn.Linear:
from hqq.core.quantize import HQQLinear
import torch
# Create quantized layer
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)
# Access quantized weights
W_q = hqq_layer.W_q # Quantized weights
scale = hqq_layer.scale # Scale factors
zero = hqq_layer.zero # Zero points
# Dequantize for inspection
W_dequant = hqq_layer.dequantize()
HQQ supports multiple inference backends for different hardware:
from hqq.core.quantize import HQQLinear
# Available backends
backends = [
"pytorch", # Pure PyTorch (default)
"pytorch_compile", # torch.compile optimized
"aten", # Custom CUDA kernels
"torchao_int4", # TorchAO int4 matmul
"gemlite", # GemLite CUDA kernels
"bitblas", # BitBlas optimized
"marlin", # Marlin 4-bit kernels
]
# Set backend globally
HQQLinear.set_backend("torchao_int4")
# Or per layer
hqq_layer.set_backend("marlin")
Backend selection guide: | Backend | Best For | Requirements | |---------|----------|--------------| | pytorch | Compatibility | Any GPU | | pytorch_compile | Moderate speedup | torch>=2.0 | | aten | Good balance | CUDA GPU | | torchao_int4 | 4-bit inference | torchao installed | | marlin | Maximum 4-bit speed | Ampere+ GPU | | bitblas | Flexible bit-widths | bitblas installed |
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load HQQ-quantized model from Hub
model = AutoModelForCausalLM.from_pretrained(
"mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# Use normally
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
from transformers import AutoModelForCausalLM, HqqConfig
# Quantize
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
# Save quantized model
model.save_pretrained("./llama-8b-hqq-4bit")
# Push to Hub
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
from transformers import AutoModelForCausalLM, HqqConfig
# Different precision per layer type
config = HqqConfig(
nbits=4,
group_size=64,
# Attention layers: higher precision
# MLP layers: lower precision for memory savings
dynamic_config={
"attn": {"nbits": 4, "group_size": 64},
"mlp": {"nbits": 2, "group_size": 32}
}
)
from vllm import LLM, SamplingParams
# Load HQQ-quantized model
llm = LLM(
model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
quantization="hqq",
dtype="float16"
)
# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
from vllm import LLM
llm = LLM(
model="meta-llama/Llama-3.1-8B",
quantization="hqq",
quantization_config={
"nbits": 4,
"group_size": 64
}
)
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model
# Load quantized model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quant_config,
device_map="auto"
)
# Apply LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Train normally with Trainer or custom loop
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./hqq-lora-output",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=3,
fp16=True,
logging_steps=10,
save_strategy="epoch"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator
)
trainer.train()
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
# 1. Configure quantization
config = HqqConfig(nbits=4, group_size=64)
# 2. Load and quantize (no calibration needed!)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
# 3. Verify quality
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
# 4. Save
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
from hqq.core.quantize import HQQLinear
from transformers import AutoModelForCausalLM, HqqConfig
# 1. Quantize with optimal backend
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
# 2. Set fast backend
HQQLinear.set_backend("marlin") # or "torchao_int4"
# 3. Compile for additional speedup
import torch
model = torch.compile(model)
# 4. Benchmark
import time
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
Out of memory during quantization:
# Quantize layer-by-layer
from hqq.models.hf.base import AutoHQQHFModel
model = AutoHQQHFModel.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="sequential" # Load layers sequentially
)
Slow inference:
# Switch to optimized backend
from hqq.core.quantize import HQQLinear
HQQLinear.set_backend("marlin") # Requires Ampere+ GPU
# Or compile
model = torch.compile(model, mode="reduce-overhead")
Poor quality at 2-bit:
# Use smaller group size
config = BaseQuantizeConfig(
nbits=2,
group_size=16, # Smaller groups help at low bits
axis=1
)
tools
No-code automation democratizes workflow building. Zapier and Make (formerly Integromat) let non-developers automate business processes without writing code. But no-code doesn't mean no-complexity - these platforms have their own patterns, pitfalls, and breaking points. This skill covers when to use which platform, how to build reliable automations, and when to graduate to code-based solutions. Key insight: Zapier optimizes for simplicity and integrations (7000+ apps), Make optimizes for power
tools
Use only when the user explicitly asks to stage, commit, push, and open a GitHub pull request in one flow using the GitHub CLI (`gh`).
tools
Workflow automation is the infrastructure that makes AI agents reliable. Without durable execution, a network hiccup during a 10-step payment flow means lost money and angry customers. With it, workflows resume exactly where they left off. This skill covers the platforms (n8n, Temporal, Inngest) and patterns (sequential, parallel, orchestrator-worker) that turn brittle scripts into production-grade automation. Key insight: The platforms make different tradeoffs. n8n optimizes for accessibility
development
Trigger.dev expert for background jobs, AI workflows, and reliable async execution with excellent developer experience and TypeScript-first design. Use when: trigger.dev, trigger dev, background task, ai background job, long running task.