infrastructure/local-ai/llm-fine-tuning/SKILL.md
Set up infrastructure for fine-tuning LLMs with QLoRA, LoRA, and full fine-tuning using Hugging Face TRL, Axolotl, and distributed training with DeepSpeed or FSDP. Covers dataset prep, training runs, and model export.
Train and fine-tune open-source LLMs efficiently — from LoRA on a single GPU to distributed full fine-tuning across multi-node clusters.
Use this skill when:
Prerequisites:
- nvidia-smi working
- pip
- HF_TOKEN for gated models

pip install transformers datasets trl peft bitsandbytes accelerate
python - <<'EOF'
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA configuration
peft_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

dataset = load_dataset("your-org/your-dataset", split="train")

# SFTTrainer wraps the model with the LoRA adapter itself when peft_config is passed
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
        report_to="wandb",  # needs `pip install wandb`; set to "none" to disable
    ),
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model("./fine-tuned-model")
EOF
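As a rough sanity check on the settings above, the sketch below estimates how many trainable parameters the LoRA adapter adds and what effective batch size the run uses. Each adapted linear layer gains `r * (in_features + out_features)` parameters. The layer shapes are assumptions based on the commonly published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 1024-wide key/value projections); confirm them against the model's `config.json`. The helper names are hypothetical.

```python
# Assumed Llama 3.1 8B shapes; verify against the model's config.json.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) for each module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


if __name__ == "__main__":
    print(f"LoRA params at r=16: {lora_param_count(16):,}")
    print(f"Effective batch size: {effective_batch_size(2, 8)}")
```

Under these assumed shapes, doubling the rank doubles the adapter size, which is one way to compare the `r=16` setting here with a higher-rank run before committing GPU time.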
# config.yaml — Axolotl QLoRA config for Llama 3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: PreTrainedTokenizerFast
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
datasets:
  - path: your-org/your-dataset
    type: alpaca  # or sharegpt, chat_template, etc.
dataset_prepared_path: ./prepared-data
val_set_size: 0.05
output_dir: ./output
sequence_len: 4096
sample_packing: true # pack multiple short samples for efficiency
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
warmup_ratio: 0.05
bf16: true
flash_attention: true
logging_steps: 10
eval_steps: 100
save_steps: 200
wandb_project: my-fine-tune
# Run with Axolotl
pip install "axolotl[flash-attn,deepspeed]"  # quoted so shells such as zsh do not glob the brackets
accelerate launch -m axolotl.cli.train config.yaml
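To illustrate why `sample_packing: true` helps on short samples, here is a minimal first-fit-decreasing packing sketch. It is not Axolotl's actual packing implementation, only an illustration of the idea: several short tokenized samples share one `sequence_len` window instead of each being padded to it. The function names are hypothetical.

```python
def pack_first_fit(lengths, capacity=4096):
    """Greedily place sample lengths (in tokens) into bins of `capacity` tokens."""
    bins = []  # each bin is a list of sample lengths sharing one sequence
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"sample of {length} tokens exceeds capacity {capacity}")
        for current in bins:
            if sum(current) + length <= capacity:
                current.append(length)
                break
        else:
            # No existing bin has room, so start a new packed sequence.
            bins.append([length])
    return bins


def utilization(bins, capacity=4096):
    """Fraction of token slots holding real tokens rather than padding."""
    return sum(sum(b) for b in bins) / (len(bins) * capacity)


if __name__ == "__main__":
    sample_lengths = [3000, 1000, 2000, 2000, 500]
    packed = pack_first_fit(sample_lengths)
    unpacked = [[n] for n in sample_lengths]
    print(len(unpacked), "sequences unpacked,", len(packed), "packed")
    print(f"utilization: {utilization(unpacked):.2f} -> {utilization(packed):.2f}")
```

In this toy input, five padded sequences collapse into three packed ones, so each optimizer step processes fewer padding tokens.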
// deepspeed_zero3.json — ZeRO Stage 3 (split optimizer + gradients + params)
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "offload_param": {"device": "cpu", "pin_memory": true},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "gather_16bit_weights_on_model_save": true
  },
  "bf16": {"enabled": true},
  "gradient_clipping": 1.0,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
# Launch 4-GPU DeepSpeed training
deepspeed --num_gpus=4 train.py \
  --deepspeed deepspeed_zero3.json \
  --model_name meta-llama/Llama-3.1-70B-Instruct \
  --output_dir ./output
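The back-of-the-envelope estimate below shows why a 70B full fine-tune on 4 GPUs pairs ZeRO Stage 3 with CPU offload. It uses the usual mixed-precision Adam accounting of 16 bytes per parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 master weights plus the two Adam moments) and ignores activations, buffers, and fragmentation, so treat the numbers as a lower bound. The function name is hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU under a given ZeRO stage (0-3)."""
    params = 2 * num_params   # 16-bit weights
    grads = 2 * num_params    # 16-bit gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus     # Stage 1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus     # Stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus    # Stage 3 also shards the weights
    return params + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_bytes_per_gpu(70e9, num_gpus=4, stage=stage) / 1e9
        print(f"ZeRO stage {stage}: ~{gb:,.0f} GB of model state per GPU")
```

Even fully sharded, the estimate for 70B parameters on 4 GPUs is about 280 GB per GPU, well beyond an 80 GB accelerator, which is what `offload_optimizer` and `offload_param` are compensating for in the config above.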
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# Reuses model, tokenizer, and peft_config from the QLoRA example above
# Dataset format: {"prompt": ..., "chosen": ..., "rejected": ...}
dataset = load_dataset("your-org/preference-data")

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT adapter, the frozen base model serves as the reference
    args=DPOConfig(
        output_dir="./dpo-output",
        beta=0.1,  # KL divergence weight
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=5e-7,
        bf16=True,
    ),
    train_dataset=dataset["train"],
    peft_config=peft_config,
    processing_class=tokenizer,
)
trainer.train()
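To make the role of `beta` concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, written in plain Python. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the reference model. This only illustrates the objective; it is not TRL's implementation, which also supports other loss variants, and the function name is hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy raises the chosen response and lowers the rejected one.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

A larger `beta` scales the margin, so the same shift away from the reference model changes the loss more sharply; that is the sense in which `beta` controls how far the policy is allowed to drift.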
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"

# Load the unquantized base model in bf16 on CPU (do not merge into 4-bit weights)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")
merged_model = model.merge_and_unload()

# Save merged model (ready for vLLM serving)
merged_model.save_pretrained("./merged-model", safe_serialization=True)
tokenizer.save_pretrained("./merged-model")

# Push to Hugging Face Hub
merged_model.push_to_hub("your-org/your-fine-tuned-model")
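What the merge step does to each adapted layer can be sketched with tiny matrices: the low-rank update `B @ A`, scaled by `lora_alpha / r`, is added into the base weight so that no adapter is needed at inference time. The pure-Python version below is an illustration under that standard LoRA formulation, not PEFT's code; all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) for one linear layer."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def linear(weight, x):
    """Apply a weight matrix to an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


if __name__ == "__main__":
    W = [[1, 0], [0, 1]]   # base weight, d_out = d_in = 2
    A = [[1, 2]]           # lora_A, r x d_in with r = 1
    B = [[3], [4]]         # lora_B, d_out x r
    merged = merge_lora(W, A, B, alpha=2, rank=1)
    x = [1, 1]
    # Adapter applied separately: W x + scale * B (A x)
    separate = [base + 2 * up
                for base, up in zip(linear(W, x), linear(B, linear(A, x)))]
    print(merged, linear(merged, x), separate)
```

The merged layer produces the same output as the base layer plus the adapter path, which is why the merged checkpoint can be served without PEFT installed.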
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-fine-tune
spec:
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-80GB
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.05-py3
          command: ["accelerate", "launch", "-m", "axolotl.cli.train", "/config/config.yaml"]
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: "320Gi"
            requests:
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: config
              mountPath: /config
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: output
              mountPath: /output
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: wandb-token
                  key: key
      volumes:
        - name: config
          configMap:
            name: axolotl-config
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: output
          persistentVolumeClaim:
            claimName: training-output-pvc
| Issue | Cause | Fix |
|-------|-------|-----|
| CUDA out of memory | Batch too large | Reduce micro_batch_size; increase gradient_accumulation_steps |
| Training loss NaN | Learning rate too high | Lower LR to 1e-4 or 5e-5; add warmup |
| Slow training | No Flash Attention | Install flash-attn; enable flash_attention: true |
| Poor fine-tune quality | Bad data formatting | Validate dataset format; check sample_packing compatibility |
| Adapter merge errors | Mixed quantization | Merge in bf16 on CPU, not in 4-bit |
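For the "Validate dataset format" fix in the table, a small pre-flight check along these lines can catch malformed rows before a training run starts. It is a sketch that assumes plain string fields: `instruction` and `output` for Alpaca-style data (the optional `input` field is not checked), and `prompt`, `chosen`, `rejected` for preference data as in the DPO example. Conversational or chat-template datasets need a different check. The names `REQUIRED_FIELDS` and `find_bad_rows` are hypothetical.

```python
REQUIRED_FIELDS = {
    "alpaca": ("instruction", "output"),
    "preference": ("prompt", "chosen", "rejected"),
}


def find_bad_rows(rows, fmt):
    """Return (row_index, missing_fields) for rows lacking non-empty string fields."""
    required = REQUIRED_FIELDS[fmt]
    problems = []
    for index, row in enumerate(rows):
        missing = [
            key for key in required
            if not isinstance(row.get(key), str) or not row[key].strip()
        ]
        if missing:
            problems.append((index, missing))
    return problems


if __name__ == "__main__":
    rows = [
        {"instruction": "Summarize the text.", "input": "...", "output": "A summary."},
        {"instruction": "", "output": "Orphaned answer."},
        {"instruction": "Translate to French."},
    ]
    for index, missing in find_bad_rows(rows, "alpaca"):
        print(f"row {index}: missing or empty {missing}")
```

Running a check like this over the whole dataset before launching is cheap compared with discovering a formatting problem after several hours of training.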
Enable sample_packing in Axolotl to maximize GPU utilization on short sequences.