skills/AI/AI-llm-architecture/6.-pre-training-and-loading-models/SKILL.md
How to train LLMs from scratch using PyTorch, including model architecture setup, data preparation, training loops, loss monitoring, and model saving/loading. Use this skill whenever the user wants to train a language model from scratch, understand pre-training workflows, set up GPT architectures, configure training parameters, monitor loss/perplexity, or load/save model checkpoints. Make sure to use this skill when users mention training LLMs, pre-training, model checkpoints, GPT architectures, training loops, or want to build language models from the ground up.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add abelrguezr/hacktricks-skills llm-pretraining-helper
A skill for training language models from scratch using PyTorch, following best practices from the "LLMs from Scratch" methodology.
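This skill helps you set up a GPT architecture, prepare training data, run and monitor a training loop, and save or load model checkpoints.
Use this skill when you want to train a language model from scratch or work through a pre-training workflow.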
GPT_CONFIG = {
    "vocab_size": 50257,     # GPT-2 vocabulary size
    "context_length": 256,   # Context window (adjust based on data)
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Attention heads
    "n_layers": 12,          # Transformer layers
    "drop_rate": 0.1,        # Dropout rate
    "qkv_bias": False        # Query-key-value bias
}
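To get a feel for how large a model this configuration describes, the sketch below estimates the parameter count. It is only an estimate under stated assumptions: `GPTModel` is not defined in this document, so the layout used here (GPT-2 style blocks, a feed-forward hidden size of 4 x `emb_dim`, a biased attention output projection, and an output head that does not share weights with the token embedding) is assumed, and `count_gpt_params` is a hypothetical helper name.

```python
GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def count_gpt_params(cfg):
    """Estimate trainable parameters for an ASSUMED GPT-2 style layout."""
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d           # token embedding table
    pos_emb = cfg["context_length"] * d       # learned positional embeddings
    # Query, key and value projections; biases only if qkv_bias is set
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d                      # attention output projection (assumed biased)
    # Feed-forward network: d -> 4d -> d, both layers with bias (assumed)
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * (2 * d)                 # two layer norms per block, scale + shift each
    block = qkv + attn_out + feed_forward + layer_norms
    final_norm = 2 * d
    out_head = d * cfg["vocab_size"]          # untied output head without bias (assumed)
    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


total = count_gpt_params(GPT_CONFIG)
print(f"{total:,} parameters, about {total * 4 / 1024**2:.0f} MB in float32")
```

If your model ties the output head to the token embedding, subtract `vocab_size * emb_dim` from the total.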
# Load your text data
text_data = "your training text here"
# Split into train/validation (90/10 is common)
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]
# Create data loaders
# (create_dataloader_v1 is not defined in this document; it is assumed to be your own helper)
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG["context_length"],
    stride=GPT_CONFIG["context_length"],
    shuffle=True,
    drop_last=True
)
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG["context_length"],
    stride=GPT_CONFIG["context_length"],
    shuffle=False,
    drop_last=False
)
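The `max_length` and `stride` arguments above suggest a sliding-window dataset. Since `create_dataloader_v1` is not shown in this document, the following is a minimal sketch of what such a helper might do internally, assuming it slices the token stream into input windows and next-token target windows. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split a token stream into (input, target) windows.

    Each target window is the input window shifted one token to the right,
    which is what next-token prediction trains on.
    """
    inputs, targets = [], []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[start:start + max_length])
        targets.append(token_ids[start + 1:start + max_length + 1])
    return inputs, targets


tokens = list(range(10))  # stand-in for real token IDs
inputs, targets = sliding_windows(tokens, max_length=4, stride=4)
for x, y in zip(inputs, targets):
    print(x, "->", y)

# A split with no more than max_length tokens yields no windows at all
short_inputs, _ = sliding_windows(tokens[:4], max_length=4, stride=4)
print(len(short_inputs))
```

Setting `stride` equal to `max_length`, as in the loaders above, gives non-overlapping windows; a smaller stride produces overlapping windows and more training samples from the same text.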
import torch
# Set seed for reproducibility
torch.manual_seed(123)
# Initialize model
model = GPTModel(GPT_CONFIG)
# Select device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device)
# Setup optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004,
    weight_decay=0.1
)
# Train
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs,
    eval_freq=5,   # Evaluate every 5 steps
    eval_iter=5,   # Use 5 batches for evaluation
    start_context="Your starting phrase",
    tokenizer=tokenizer
)
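The GPT model consists of token and positional embeddings, a stack of transformer blocks (multi-head attention followed by a feed-forward network), a final layer normalization, and a linear output head that produces logits over the vocabulary.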
For each epoch:
    For each batch:
        1. Zero gradients
        2. Forward pass → get logits
        3. Calculate loss (cross-entropy)
        4. Backward pass → compute gradients
        5. Optimizer step → update weights
        6. (Optional) Evaluate and log metrics
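To make the six steps concrete without depending on `GPTModel`, here is a framework-free sketch on a toy bigram model (one row of logits per previous token). This is not how `train_model_simple` is implemented; it is an illustration with hand-written gradients standing in for what PyTorch does in `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`. The name `train_bigram` and all hyperparameters are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(token_ids, vocab_size, num_epochs=20, lr=0.5, batch_size=4, seed=123):
    rng = random.Random(seed)
    # weights[i][j] is the logit for "token j follows token i"
    weights = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(token_ids[:-1], token_ids[1:]))
    epoch_losses = []
    for _ in range(num_epochs):
        rng.shuffle(pairs)
        total_loss, n_batches = 0.0, 0
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            # 1. Zero gradients
            grads = [[0.0] * vocab_size for _ in range(vocab_size)]
            loss = 0.0
            for x, y in batch:
                # 2. Forward pass: logits for the next token, turned into probabilities
                probs = softmax(weights[x])
                # 3. Cross-entropy loss, averaged over the batch
                loss += -math.log(probs[y]) / len(batch)
                # 4. Backward pass: d(loss)/d(logit_j) = probs[j] - 1 if j == y else probs[j]
                for j in range(vocab_size):
                    grads[x][j] += (probs[j] - (1.0 if j == y else 0.0)) / len(batch)
            # 5. Optimizer step (plain gradient descent here, not AdamW)
            for i in range(vocab_size):
                for j in range(vocab_size):
                    weights[i][j] -= lr * grads[i][j]
            total_loss += loss
            n_batches += 1
        # 6. Log metrics once per epoch
        epoch_losses.append(total_loss / n_batches)
    return weights, epoch_losses


token_ids = [0, 1, 2] * 4 + [0]  # a perfectly predictable toy sequence
weights, epoch_losses = train_bigram(token_ids, vocab_size=3)
perplexity = math.exp(epoch_losses[-1])  # perplexity is exp(cross-entropy loss)
print(f"first loss {epoch_losses[0]:.3f}, last loss {epoch_losses[-1]:.3f}, perplexity {perplexity:.3f}")
```

Because the toy sequence is fully predictable, the loss should keep falling toward 0 and the perplexity toward 1; on real text both level off well above those values.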
Perplexity: `exp(loss)`, a measure of model uncertainty (lower is better)

| Strategy | Description | Use Case |
|----------|-------------|----------|
| Greedy | Always pick the highest-probability token | Deterministic output |
| Top-k | Sample from the k most probable tokens | Balanced diversity |
| Temperature | Scale logits before softmax | Control randomness |
| Top-p (nucleus) | Sample from the smallest set of tokens whose cumulative probability reaches p | Adaptive diversity |
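The four strategies in the table can be sketched on a plain list of logits. This is an illustrative, framework-free version; the function names are hypothetical, and a real generation loop would apply the same idea to the last position of the model's output logits.

```python
import math
import random


def softmax(logits, temperature=1.0):
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    # Always pick the highest-scoring token
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_filter(probs, k):
    # Keep the k most probable tokens and renormalize
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    probs = softmax(logits, temperature)
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [1.0, 3.0, 2.0]
print(greedy(logits))
print(sample(logits, temperature=0.7, top_k=2, rng=random.Random(123)))
```

The filters compose: here temperature is applied first, then top-k, then top-p, which is one common ordering rather than the only one.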
# Save a checkpoint (model and optimizer state) so training can be resumed
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": current_epoch,
    "loss": current_loss
}, "checkpoint.pth")
checkpoint = torch.load("checkpoint.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()
torch.save(model.state_dict(), "model.pth")
model = GPTModel(GPT_CONFIG)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()
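Common issues and fixes:
- Too little data for the data loader: reduce `context_length` or increase training data, so that `total_tokens * train_ratio >= context_length`.
- Make sure the model is in training mode while training (`model.train()`).
- Exploding gradients: clip them with `torch.nn.utils.clip_grad_norm_`, for example `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` called between `loss.backward()` and `optimizer.step()`. Note that `max_norm` is an argument of that function, not an optimizer option.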
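For intuition about what norm-based gradient clipping such as `torch.nn.utils.clip_grad_norm_` does, here is a small framework-free sketch of the idea: compute the global L2 norm over all gradients and rescale them only when that norm exceeds `max_norm`. The helper name `clip_by_global_norm` is hypothetical and this is an illustration of the technique, not PyTorch's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale never exceeds 1, so small gradients pass through unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


large = [3.0, 4.0]  # global norm 5.0
clipped, norm_before = clip_by_global_norm(large, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

small = [0.3, 0.4]  # global norm 0.5, below the threshold
unchanged, _ = clip_by_global_norm(small, max_norm=1.0)
print(unchanged)
```

Clipping keeps the direction of the update and only limits its size, which is why it tends to help with occasional loss spikes without changing ordinary steps.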