abelrguezr/llm-pretraining-helper

Name: llm-pretraining-helper
Author: abelrguezr

skills/AI/AI-llm-architecture/6.-pre-training-and-loading-models/SKILL.md

npx skillsauth add abelrguezr/hacktricks-skills llm-pretraining-helper

LLM Pre-training Helper

A skill for training language models from scratch using PyTorch, following best practices from the "LLMs from Scratch" methodology.

What this skill does

This skill helps you:

Set up GPT model architectures with proper configuration
Prepare training data with tokenization and data loaders
Configure training loops with loss monitoring and evaluation
Implement text generation with sampling strategies
Save and load model checkpoints
Visualize training progress (loss, perplexity)

When to use this skill

Use this skill when:

You want to train an LLM from scratch on your own dataset
You need to understand the pre-training workflow
You're setting up GPT model configurations
You want to monitor training metrics (loss, perplexity)
You need to save/load model checkpoints
You're implementing text generation with temperature/top-k sampling

Quick Start

1. Set up model configuration

GPT_CONFIG = {
    "vocab_size": 50257,      # GPT-2 vocabulary size
    "context_length": 256,    # Context window (adjust based on data)
    "emb_dim": 768,           # Embedding dimension
    "n_heads": 12,            # Attention heads
    "n_layers": 12,           # Transformer layers
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-key-value bias
}
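As a quick sanity check on the configuration above, the sketch below verifies that the embedding dimension splits evenly across attention heads and estimates the parameter count. The helper names `head_dim` and `estimate_params` are hypothetical. The count assumes the layout used by the GPTModel in the referenced "LLMs from Scratch" code (4x feed-forward expansion, LayerNorm with scale and shift, an untied output head without bias), which this skill does not spell out.

```python
GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def head_dim(cfg):
    """Per-head dimension; emb_dim must split evenly across heads."""
    if cfg["emb_dim"] % cfg["n_heads"] != 0:
        raise ValueError("emb_dim must be divisible by n_heads")
    return cfg["emb_dim"] // cfg["n_heads"]


def estimate_params(cfg):
    """Rough parameter count under the assumed GPT-2-style layout."""
    d, v, t = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * 2 * d          # two LayerNorms per block, scale + shift
    block = qkv + attn_out + feed_forward + layer_norms
    embeddings = v * d + t * d       # token + positional embeddings
    final_norm = 2 * d
    out_head = d * v                 # assumed untied, no bias
    return embeddings + cfg["n_layers"] * block + final_norm + out_head


print(head_dim(GPT_CONFIG))
print(f"{estimate_params(GPT_CONFIG):,}")
```

Most of the total sits in the token embedding and the output head, so shrinking `vocab_size` or `emb_dim` changes the model size far more than shortening `context_length`.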

2. Prepare your data

# Load your text data
text_data = "your training text here"

# Split into train/validation (90/10 is common)
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

# Create data loaders
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG["context_length"],
    stride=GPT_CONFIG["context_length"],
    shuffle=True,
    drop_last=True
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG["context_length"],
    stride=GPT_CONFIG["context_length"],
    shuffle=False,
    drop_last=False
)
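`create_dataloader_v1` is not defined in this skill; the name matches the helper in the referenced LLMs-from-scratch code, which tokenizes the text and slices it into fixed-length windows. The sketch below illustrates only that windowing idea on a plain list of token IDs. `sliding_windows` is a hypothetical name, and this is not the real loader.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start:start + max_length]
        targets = token_ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = list(range(10))  # stand-in for real tokenizer output
pairs = sliding_windows(token_ids, max_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Setting `stride` equal to `max_length`, as in the loaders above, gives non-overlapping windows, so each token is used as an input at most once per epoch.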

3. Initialize model and start training

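import tiktoken
import torch

# Set seed for reproducibility
torch.manual_seed(123)

# GPT-2 BPE tokenizer (matches vocab_size=50257); passed to train_model_simple below
tokenizer = tiktoken.get_encoding("gpt2")

# Initialize model (GPTModel as defined in the referenced LLMs-from-scratch code)
model = GPTModel(GPT_CONFIG)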

# Select device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device)

# Setup optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004,
    weight_decay=0.1
)

# Train
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs,
    eval_freq=5,           # Evaluate every 5 steps
    eval_iter=5,           # Use 5 batches for evaluation
    start_context="Your starting phrase",
    tokenizer=tokenizer
)

Core Components

Model Architecture

The GPT model consists of:

Token embeddings: Convert token IDs to vectors
Positional embeddings: Add position information
Transformer blocks: Multi-head attention + feed-forward
Output head: Maps embeddings back to vocabulary

Training Loop Structure

For each epoch:
  For each batch:
    1. Zero gradients
    2. Forward pass → get logits
    3. Calculate loss (cross-entropy)
    4. Backward pass → compute gradients
    5. Optimizer step → update weights
    6. (Optional) Evaluate and log metrics
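The loop structure above can be sketched in PyTorch as follows. This is a minimal illustration, not the `train_model_simple` implementation: the model is a toy embedding plus linear head standing in for a GPT model, and the batches are random token IDs rather than real data.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

vocab_size, emb_dim, context_length = 50, 16, 8

# Toy stand-in for a GPT model: embedding followed by a linear output head.
model = nn.Sequential(nn.Embedding(vocab_size, emb_dim), nn.Linear(emb_dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

# Random token batches: targets are the inputs shifted by one position.
tokens = torch.randint(0, vocab_size, (4, 2, context_length + 1))
batches = [(chunk[:, :-1], chunk[:, 1:]) for chunk in tokens]

losses = []
model.train()
for epoch in range(2):
    for input_batch, target_batch in batches:
        optimizer.zero_grad()                     # 1. zero gradients
        logits = model(input_batch)               # 2. forward pass -> logits
        loss = nn.functional.cross_entropy(       # 3. cross-entropy over all positions
            logits.flatten(0, 1), target_batch.flatten()
        )
        loss.backward()                           # 4. backward pass
        optimizer.step()                          # 5. update weights
        losses.append(loss.item())                # 6. log the metric

print(len(losses), losses[-1])
```

Flattening the batch and sequence dimensions before calling `cross_entropy` treats every position as an independent next-token prediction.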

Loss Functions

Cross-entropy loss: Measures difference between predicted and actual token distributions
Perplexity: exp(loss) - represents model uncertainty (lower is better)
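To make the relationship between the two quantities concrete, here is a small standard-library sketch that computes cross-entropy for a single prediction and converts it to perplexity. The function names are illustrative and are not part of the skill's code.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target token under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)


uniform_logits = [0.0, 0.0, 0.0, 0.0]
loss = cross_entropy(uniform_logits, target_index=2)
print(loss, perplexity(loss))
```

With uniform logits over four tokens the loss is ln(4) and the perplexity is 4: the model is as uncertain as a uniform guess among four options.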

Text Generation Strategies

| Strategy | Description | Use Case |
|----------|-------------|----------|
| Greedy | Always pick highest probability token | Deterministic output |
| Top-k | Sample from top k tokens | Balanced diversity |
| Temperature | Scale logits before softmax | Control randomness |
| Top-p (nucleus) | Sample until cumulative probability threshold | Adaptive diversity |
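The greedy, top-k, and temperature strategies can be combined in one sampling function. The sketch below works on a plain list of logits with the standard library only; `sample_next_token` is a hypothetical helper, and treating a non-positive temperature as greedy decoding is a convention chosen here for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index using greedy, temperature, and/or top-k sampling."""
    if temperature <= 0.0:
        # Treat non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]  # keep only the k most likely tokens

    scaled = [logits[i] / temperature for i in candidates]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]  # unnormalized softmax
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [2.0, 0.5, 4.0, 1.0]
print(sample_next_token(logits, temperature=0.0))           # greedy
print(sample_next_token(logits, temperature=0.8, top_k=2))  # one of the two best tokens
```

Temperatures below 1 sharpen the distribution toward the most likely token, while values above 1 flatten it and increase diversity.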

Training Parameters Guide

Learning Rate

Small (1e-5 to 1e-4): Precise convergence, slower training
Large (1e-3 to 1e-2): Faster training, risk of overshooting
Recommended: Start with 4e-4 for AdamW

Batch Size

Small (1-4): More frequent updates, noisier gradients
Large (8-32): Smoother gradients, more memory
Recommended: 2-4 for CPU, 8-16 for GPU

Context Length

Short (128-256): Faster training, less context
Long (512-1024): More context, slower training
Recommended: Match your use case, start with 256

Number of Epochs

Few (5-10): Quick iteration, may underfit
Many (20-50): Better convergence, risk of overfitting
Recommended: Monitor validation loss, stop when it plateaus

Monitoring Training

Key Metrics to Track

Training Loss: Should decrease over time
Validation Loss: Should decrease, watch for overfitting
Perplexity: exp(loss), lower is better
Tokens Seen: Track progress through dataset

Signs of Good Training

Training loss steadily decreases
Validation loss follows training loss
Generated text becomes more coherent
Perplexity drops significantly

Signs of Problems

Overfitting: Training loss ↓, Validation loss ↑
Underfitting: Both losses stay high
Exploding gradients: Loss becomes NaN or inf
Vanishing gradients: Loss stops decreasing

Saving and Loading Models

Save Full Checkpoint (for resuming training)

torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "epoch": current_epoch,
    "loss": current_loss
}, "checkpoint.pth")

Load Full Checkpoint

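# weights_only=True restricts unpickling to tensors and plain containers
checkpoint = torch.load("checkpoint.pth", map_location=device, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # resume after the last completed epoch
model.train()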

Save Model Only (for inference)

torch.save(model.state_dict(), "model.pth")

Load Model Only

model = GPTModel(GPT_CONFIG)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.to(device)  # move the model to the same device as its inputs
model.eval()      # disable dropout for inference

Common Issues and Solutions

"Not enough tokens for training"

Solution: Reduce context_length or increase training data
Check: total_tokens * train_ratio >= context_length

"CUDA out of memory"

Solution: Reduce batch size or context length
Alternative: Use gradient accumulation
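Gradient accumulation trades extra steps for memory: several small micro-batches contribute gradients before one optimizer update. The sketch below shows the pattern on a toy linear model with random data; the model, batch shapes, and `accumulation_steps` value are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

model = nn.Linear(8, 4)  # toy stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

accumulation_steps = 4   # effective batch size = micro-batch size * 4
micro_batches = [(torch.randn(2, 8), torch.randint(0, 4, (2,))) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(micro_batches, start=1):
    loss = nn.functional.cross_entropy(model(inputs), targets)
    # Scale so the accumulated gradient matches the mean over the larger batch.
    (loss / accumulation_steps).backward()
    if i % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)
```

Only one micro-batch of activations is held in memory at a time, so peak memory follows the micro-batch size rather than the effective batch size.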

"Loss not decreasing"

Check: Learning rate (try 1e-4 to 1e-3)
Check: Data quality and tokenization
Check: Model is in training mode (model.train())

"Validation loss increasing"

Solution: Early stopping, reduce epochs
Alternative: Add regularization (dropout, weight decay)
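Early stopping can be expressed as a small counter over validation checks. The class below is a hypothetical helper, not part of the base code, and the validation losses are made-up numbers chosen to show the stop condition.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


stopper = EarlyStopping(patience=2)
val_history = [3.2, 2.9, 2.7, 2.8, 2.9, 3.1]  # made-up validation losses
stopped_at = None
for step, val_loss in enumerate(val_history):
    if stopper.should_stop(val_loss):
        stopped_at = step
        break
print(stopped_at, stopper.best_loss)
```

In practice this pairs well with saving a checkpoint whenever `best_loss` improves, so the best weights can be restored after stopping.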

Advanced Techniques (Not in Base Code)

Learning Rate Scheduling

Linear Warmup: Start small, increase to max LR
Cosine Decay: Gradually reduce LR after warmup
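A warmup-then-cosine schedule can be written as a pure function of the step number. The sketch below is one common formulation; the function name and the `initial_lr` and `min_lr` defaults are assumptions for illustration, with `peak_lr` set to the 4e-4 used elsewhere in this skill.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr=4e-4, initial_lr=1e-5, min_lr=1e-6):
    """Linear warmup from initial_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s, total_steps=100, warmup_steps=10) for s in range(101)]
print(schedule[0], schedule[10], schedule[100])
```

To apply it in a PyTorch loop, assign the value to each entry of `optimizer.param_groups` (`group["lr"] = lr`) before calling `optimizer.step()`.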

Gradient Clipping

Prevents exploding gradients
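Call torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=...) after loss.backward() and before optimizer.step(); PyTorch optimizers do not take a max_norm argument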
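A minimal sketch of where clipping sits in a training step, using a toy linear model and deliberately large inputs to inflate the gradients. The model, data, and `max_norm=1.0` are illustrative choices, not values taken from the base code.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

model = nn.Linear(8, 4)  # toy stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

inputs = torch.randn(2, 8) * 100.0  # deliberately large inputs to inflate gradients
targets = torch.randint(0, 4, (2,))

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()

# Clip after backward() and before step(); the return value is the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.stack([p.grad.norm() for p in model.parameters()]).norm()
optimizer.step()

print(float(norm_before), float(norm_after))
```

Logging `norm_before` over time is a cheap way to spot instability before the loss itself turns into NaN or inf.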

Top-p Sampling (Nucleus)

More adaptive than top-k
Sums probabilities until threshold (e.g., 0.9)
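The nucleus idea can be sketched with the standard library: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches the threshold, renormalize, and sample. The function names are hypothetical, and the example probabilities are made up.

```python
import math
import random


def top_p_filter(logits, top_p=0.9):
    """Return (indices, probabilities) of the smallest set whose mass reaches top_p."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize over the kept tokens so they sum to 1.
    return kept, [probs[i] / cumulative for i in kept]


def sample_top_p(logits, top_p=0.9, rng=random):
    kept, kept_probs = top_p_filter(logits, top_p)
    return rng.choices(kept, weights=kept_probs, k=1)[0]


logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
kept, kept_probs = top_p_filter(logits, top_p=0.9)
print(kept, sample_top_p(logits, top_p=0.9))
```

Unlike top-k, the number of kept tokens varies with the shape of the distribution: a confident prediction keeps few candidates, a flat one keeps many.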

Beam Search

Explores multiple sequences simultaneously
Better quality than greedy, more expensive
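The difference from greedy decoding shows up in a toy example where the locally best first token leads to a worse overall sequence. The lookup table below stands in for a model's next-token probabilities; all numbers and names are made up for illustration.

```python
import math

# Toy next-token probabilities keyed by the sequence so far (made-up numbers).
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(next_probs, beam_width, max_new_tokens):
    """Keep the beam_width highest-scoring sequences at each step (sum of log-probs)."""
    beams = [((), 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for sequence, score in beams:
            for token, prob in next_probs[sequence].items():
                candidates.append((sequence + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(TOY_MODEL, beam_width=1, max_new_tokens=2)[0]
best = beam_search(TOY_MODEL, beam_width=2, max_new_tokens=2)[0]
print(greedy[0], math.exp(greedy[1]))
print(best[0], math.exp(best[1]))
```

With a beam width of 1 the search reduces to greedy decoding and commits to "a"; with a width of 2 it keeps "b" alive long enough to find the higher-probability sequence. The cost grows with the beam width, since every kept sequence needs its own forward pass per step.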

Next Steps

After training:

Evaluate: Test on held-out data
Fine-tune: Adapt to specific tasks
Deploy: Use for inference or as base model
Iterate: Adjust hyperparameters and retrain

References

LLMs from Scratch
rasbt/LLMs-from-scratch
GPT-2 Architecture

How to train LLMs from scratch using PyTorch, including model architecture setup, data preparation, training loops, loss monitoring, and model saving/loading. Use this skill whenever the user wants to train a language model from scratch, understand pre-training workflows, set up GPT architectures, configure training parameters, monitor loss/perplexity, or load/save model checkpoints. Make sure to use this skill when users mention training LLMs, pre-training, model checkpoints, GPT architectures, training loops, or want to build language models from the ground up.

5 stars

tools

Updated Apr 16, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 16, 2026, 2:05 AM (42.1s, 4 files scanned)

Install manually

# Clone the repo
git clone https://github.com/abelrguezr/hacktricks-skills.git

# Copy into Claude Code skills folder (global)
cp -r hacktricks-skills/skills/AI/AI-llm-architecture/6.-pre-training-and-loading-models ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

abelrguezr/hacktricks-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT