abelrguezr/llm-training-guide

Name: llm-training-guide
Author: abelrguezr

skills/AI/AI-llm-architecture/AI-llm-architecture/SKILL.md

npx skillsauth add abelrguezr/hacktricks-skills llm-training-guide


LLM Training Guide

A comprehensive guide for building and training large language models from scratch, based on the Manning book "Build a Large Language Model (From Scratch)" by Sebastian Raschka.

Overview

This skill covers the complete LLM training pipeline:

Tokenization - Converting text to token IDs
Data Sampling - Preparing training data
Token Embeddings - Vector representations
Attention Mechanisms - Capturing word relationships
LLM Architecture - Full model structure
Pre-training - Training from scratch
Fine-tuning - Adapting for specific tasks

Phase 1: Tokenization

Goal: Divide input text into tokens (IDs) in a meaningful way.

Key Concepts

Tokens: The basic units the model processes (can be characters, words, or subwords)
Vocabulary: The set of all unique tokens
Token IDs: Numeric identifiers for each token in the vocabulary

Implementation Steps

Build vocabulary from your training corpus
Create token-to-ID mapping (tokenizer)
Create ID-to-token mapping (for decoding)
Encode text → convert to token IDs
Decode IDs → convert back to text
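
The five steps above can be sketched with a deliberately simple word-level tokenizer. This is an illustration only: `SimpleTokenizer` and its regex splitting rule are hypothetical names and choices, and a real pipeline would more likely use a subword scheme such as BPE.

```python
import re


class SimpleTokenizer:
    """Word-level tokenizer; a stand-in for a real subword tokenizer such as BPE."""

    SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]

    def __init__(self, corpus):
        # Step 1: build the vocabulary from the training corpus.
        words = sorted(set(self._split(corpus)))
        # Step 2: token-to-ID mapping (special tokens take the first IDs).
        self.token_to_id = {tok: i for i, tok in enumerate(self.SPECIALS + words)}
        # Step 3: ID-to-token mapping, used for decoding.
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    @staticmethod
    def _split(text):
        # Split into words and individual punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)

    def encode(self, text):
        # Step 4: text -> token IDs; unknown tokens map to <unk>.
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(tok, unk) for tok in self._split(text)]

    def decode(self, ids):
        # Step 5: token IDs -> text; drop the space inserted before punctuation.
        text = " ".join(self.id_to_token[i] for i in ids)
        return re.sub(r"\s+([.,!?;:])", r"\1", text)


tokenizer = SimpleTokenizer("The cat sat on the mat.")
ids = tokenizer.encode("The cat sat.")
print(ids)                                                  # [5, 6, 9, 4]
print(tokenizer.decode(ids))                                # The cat sat.
print(tokenizer.decode(tokenizer.encode("The dog sat.")))   # The <unk> sat.
```

The `<unk>` in the last line shows why word-level vocabularies give poor coverage: any word missing from the corpus is lost, which is the problem subword tokenization addresses.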

Best Practices

Use subword tokenization (like BPE or WordPiece) for better coverage
Include special tokens: <pad>, <unk>, <bos>, <eos>
Keep vocabulary size reasonable (commonly tens of thousands to low hundreds of thousands of tokens; GPT-2, for example, uses 50,257)
Consider your domain when building vocabulary

Phase 2: Data Sampling

Goal: Sample input data and prepare it for training by splitting it into fixed-length sequences and generating the expected targets (each input shifted by one token).

Key Concepts

Sequence length: Fixed number of tokens per training example
Context window: How much history the model sees
Target generation: What the model should predict (next token)

Implementation Steps

Load and concatenate all training text
Tokenize the entire corpus
Split into sequences of fixed length (e.g., 1024 tokens)
Create input/target pairs:
- Input: tokens [0, 1, 2, ..., n-1]
- Target: tokens [1, 2, 3, ..., n]
Batch sequences for efficient training
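
As a rough sketch of the windowing and shifting described above, the following pure-Python helpers build input/target pairs and group them into batches. The names `make_pairs` and `batches` and the `stride` parameter are illustrative assumptions; a `stride` smaller than `seq_len` yields overlapping sequences.

```python
def make_pairs(token_ids, seq_len, stride):
    """Slide a window over the token stream; each target is its input shifted by one."""
    inputs, targets = [], []
    # Stop early enough that the shifted target window stays in range.
    for start in range(0, len(token_ids) - seq_len, stride):
        inputs.append(token_ids[start : start + seq_len])
        targets.append(token_ids[start + 1 : start + seq_len + 1])
    return inputs, targets


def batches(inputs, targets, batch_size):
    """Yield (input_batch, target_batch) tuples of at most batch_size sequences."""
    for i in range(0, len(inputs), batch_size):
        yield inputs[i : i + batch_size], targets[i : i + batch_size]


token_ids = list(range(10))  # stand-in for a tokenized corpus
inputs, targets = make_pairs(token_ids, seq_len=4, stride=4)
print(inputs)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(targets)  # [[1, 2, 3, 4], [5, 6, 7, 8]]

# A stride of 2 produces overlapping windows and therefore more examples.
overlap_inputs, _ = make_pairs(token_ids, seq_len=4, stride=2)
print(len(overlap_inputs))  # 3
```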

Best Practices

Use sequence lengths that fit your GPU memory
Shuffle sequences between epochs
Consider overlapping sequences for more training data
Balance dataset across domains if using mixed data

Phase 3: Token Embeddings

Goal: Assign each token a vector representation of the desired dimension. Each token becomes a point in a d-dimensional space, where d is the embedding dimension.

Key Concepts

Embedding dimension: Size of the vector (e.g., 512, 1024, 4096)
Learnable parameters: Embeddings are initialized randomly and trained
Position embeddings: Additional vectors encoding each token's position in the sequence

Implementation Steps

Initialize token embeddings randomly (vocab_size × embedding_dim)
Initialize position embeddings randomly (max_seq_len × embedding_dim)
Combine embeddings: token_embedding + position_embedding
Train embeddings alongside model parameters
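
A minimal sketch of the lookup-and-add step, using plain Python lists in place of tensors. The table sizes, the 0.02 standard deviation, and the helper names are assumptions chosen for illustration; in a real model both tables would be trainable parameters.

```python
import random

random.seed(0)
vocab_size, max_seq_len, embedding_dim = 11, 8, 4  # toy sizes


def random_table(rows, cols, std=0.02):
    """Randomly initialized lookup table of shape rows x cols."""
    return [[random.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


token_embedding = random_table(vocab_size, embedding_dim)
position_embedding = random_table(max_seq_len, embedding_dim)


def embed(token_ids):
    """Look up each token's vector and add the vector for its position."""
    return [
        [t + p for t, p in zip(token_embedding[tok], position_embedding[pos])]
        for pos, tok in enumerate(token_ids)
    ]


x = embed([5, 6, 9, 4])
print(len(x), len(x[0]))  # 4 4  (sequence length, embedding dimension)
```

Because the position vector is added rather than concatenated, the same token ID produces a different input vector at each position while the dimension stays fixed.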

Position Embedding Types

Absolute: One vector per position index, either learned or fixed sinusoidal (simple, effective)
Relative: Encodes distance between tokens
Rotary: Rotates embeddings based on position (RoPE)

Best Practices

Embedding dimension should match model hidden size
Use learned token embeddings rather than fixed ones
For position information, consider sinusoidal or rotary encodings if the model may need to handle sequences longer than those seen in training

Phase 4: Attention Mechanisms

Goal: Apply attention layers to capture relationships between tokens in the sequence.

Key Concepts

Self-attention: Each token attends to all tokens in the sequence
Query, Key, Value: Three projections for attention computation
Multi-head attention: Multiple attention heads in parallel
Causal masking: Prevents each token from attending to future tokens (applied in both training and generation for decoder-only models)

Implementation Steps

Project embeddings to Q, K, V matrices
Compute attention scores: Q × K^T / sqrt(d_k)
Apply causal mask (for decoder-only models)
Softmax to get attention weights
Weighted sum: attention_weights × V
Combine heads and project back

Attention Formula

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
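
To make the formula concrete, here is a single-head causal attention sketch in pure Python. It assumes Q, K, and V are already projected and given as lists of row vectors; the function names are illustrative, and a practical implementation would use batched tensor operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with future positions masked out."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))  # causal mask: no attention to the future
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(Q, K, V)
print(weights[0])  # [1.0, 0.0, 0.0]  (the first token can only attend to itself)
print(outputs[0])  # [1.0, 0.0]
```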

Best Practices

Use multi-head attention (8-16 heads typical)
Apply layer normalization before each attention and MLP sub-layer (pre-norm), and once more after the final block
Use residual connections around attention blocks
Consider flash attention for efficiency

Phase 5: LLM Architecture

Goal: Develop the full LLM architecture by combining all components.

Standard Transformer Decoder Architecture

Input → Token Embedding + Position Embedding → [N × (Norm → Attention → Norm → MLP, each sub-layer with a residual connection)] → Final Norm → Output Projection → Logits

Components

Embedding Layer: Token + Position embeddings
N Transformer Blocks (pre-norm order, each sub-layer wrapped in a residual connection):
- Layer normalization
- Multi-head self-attention
- Layer normalization
- Feed-forward MLP (inner dimension 2-4x hidden size)
Output Projection: Hidden size → vocabulary size
Loss Function: Cross-entropy on next token prediction
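
One way to see how these components add up is to count parameters. The sketch below assumes learned position embeddings, biases on all linear layers, two layer norms per block plus a final one, and an output projection that shares weights with the token embedding; `count_parameters` is a hypothetical helper. With the GPT-2 small configuration it reproduces the familiar figure of about 124M.

```python
def count_parameters(vocab_size, context_len, d_model, n_layers, mlp_ratio=4):
    """Parameter count for a decoder-only transformer with a tied output projection."""
    embeddings = vocab_size * d_model + context_len * d_model
    # Q, K, V, and output projections, each d_model x d_model plus a bias.
    attention = 4 * (d_model * d_model + d_model)
    d_ff = mlp_ratio * d_model
    # Two linear layers: d_model -> d_ff -> d_model, each with a bias.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    block = attention + mlp + layer_norms
    final_norm = 2 * d_model
    return embeddings + n_layers * block + final_norm


# GPT-2 small: 50,257-token vocabulary, 1,024-token context, 768 hidden size, 12 layers.
total = count_parameters(vocab_size=50257, context_len=1024, d_model=768, n_layers=12)
print(total)  # 124439808
```

Without weight tying, the output projection would add another vocab_size × d_model parameters.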

Implementation Steps

Define model class with all layers
Implement forward pass through all components
Implement training loop with loss computation
Implement generation (sampling, beam search, etc.)
Add saving/loading for model checkpoints
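
For the generation step, a minimal sketch of temperature and top-k sampling over one vector of logits is shown below. `sample_next_token` and its parameters are illustrative names; the logits would come from the model's output projection at the last position.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token ID from logits after temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank token IDs by logit, highest first, and keep only the top_k candidates.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [logits[i] / temperature for i in indices]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
rng = random.Random(0)
print(sample_next_token(logits, top_k=1, rng=rng))  # 1  (top_k=1 is greedy decoding)
samples = {sample_next_token(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(200)}
print(samples <= {1, 3})  # True  (only the two highest-logit tokens can be drawn)
```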

Best Practices

Use pre-norm architecture (norm before attention/MLP)
Initialize weights carefully (e.g., Xavier, He initialization)
Use gradient clipping to prevent exploding gradients
Implement mixed precision training for efficiency

Phase 6: Pre-training

Goal: Train the model from scratch using the defined architecture, loss functions, and optimizer.

Training Loop

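import torch.nn.functional as F

# Assumes model, optimizer, dataloader, and num_epochs are defined elsewhere,
# and that the dataloader yields (input_tokens, target_tokens) pairs.
for epoch in range(num_epochs):
    for input_tokens, target_tokens in dataloader:
        # Forward pass: logits has shape (batch, seq_len, vocab_size)
        logits = model(input_tokens)

        # Compute loss: flatten batch and sequence dimensions for cross-entropy
        loss = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())

        # Backward pass
        loss.backward()

        # Update weights, then clear gradients for the next batch
        optimizer.step()
        optimizer.zero_grad()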

Key Hyperparameters

Learning rate: 1e-4 to 3e-4 (with warmup)
Batch size: Depends on GPU memory (effective batch size 1024-4096)
Optimizer: AdamW with weight decay (0.01-0.1)
Learning rate schedule: Linear warmup followed by cosine or linear decay
Gradient accumulation: For larger effective batch sizes
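
The warmup-plus-decay schedule can be sketched as a pure function of the step index. The peak value, the floor `min_lr`, and the function name are assumptions for illustration rather than recommended settings.

```python
import math


def learning_rate(step, total_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, f"{learning_rate(step, total_steps=1000, warmup_steps=100):.2e}")
```

In a training loop, the returned value would be written into the optimizer's parameter groups before each update.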

Best Practices

Use learning rate warmup (a small fraction of total steps; 10% is on the high end)
Monitor training loss and perplexity (perplexity = exp of the mean cross-entropy loss)
Save checkpoints regularly
Use gradient checkpointing for memory efficiency
Consider distributed training for large models
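
As a small worked example of the loss and perplexity being monitored, the sketch below computes cross-entropy for a single prediction and converts an average loss into perplexity. The function names are illustrative. A model that assigns equal probability to every token has loss ln(V), so a freshly initialized model with a 50,257-token vocabulary should start near a loss of about 10.8.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def perplexity(losses):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(sum(losses) / len(losses))


# Uniform logits over 4 tokens: loss is ln(4), so perplexity is about 4.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
print(round(uniform_loss, 4))                # 1.3863
print(round(perplexity([uniform_loss]), 4))  # 4.0

# Expected starting loss for a 50,257-token vocabulary.
print(round(math.log(50257), 2))             # 10.82
```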

Phase 7: Fine-tuning

7.0 LoRA (Low-Rank Adaptation)

Goal: Reduce the memory and compute needed for fine-tuning by training only small low-rank adapter matrices while the base weights stay frozen.

How LoRA Works

Freeze pre-trained weights
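Add a pair of small rank-r matrices A (r × d_in) and B (d_out × r) beside each adapted attention weight W, so the effective weight is W + (alpha / r) × B × A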
Train only the LoRA parameters
Merge LoRA weights with base model for inference

Implementation Steps

Freeze base model parameters
Add LoRA adapters to attention Q, V, (optionally K, O)
Train only LoRA parameters
Merge weights for deployment
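
The adapter arithmetic can be checked on a toy example. The sketch below uses plain lists, a single 2 × 3 weight, and rank 1; the helper names are hypothetical. It shows that the adapter path and the merged weight give the same output, and how few parameters the adapter trains.

```python
def matvec(M, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def merge(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B A."""
    scale = alpha / r
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(len(W[0]))]
        for i in range(len(W))
    ]


def lora_param_count(d_in, d_out, r):
    return r * (d_in + d_out)


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight (d_out=2, d_in=3)
A = [[1.0, 1.0, 0.0]]                    # r x d_in
B = [[0.5], [1.0]]                       # d_out x r (zero-initialized in practice, so training starts from W)
x = [1.0, 2.0, 3.0]

print(lora_forward(x, W, A, B, alpha=2, r=1))     # [10.0, 8.0]
print(matvec(merge(W, A, B, alpha=2, r=1), x))    # [10.0, 8.0]
print(lora_param_count(4096, 4096, 8), 4096 * 4096)  # 65536 16777216
```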

Best Practices

Rank r: 8-64 (higher for more capacity)
Alpha: Scaling factor; the update is multiplied by alpha / r (alpha = 2× rank is a common starting point)
Apply to attention layers primarily
Tune the learning rate separately for the adapters; LoRA often tolerates a higher rate than full fine-tuning

7.1 Fine-tuning for Classification

Goal: Adapt pre-trained model to classify text into categories.

Implementation Steps

Load pre-trained model (frozen or partially unfrozen)
Add a classification head on top of the final hidden states (not the input embeddings)
Prepare labeled dataset with categories
Train with cross-entropy loss on labels
Evaluate with accuracy, F1, etc.
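
A minimal sketch of the classification head, assuming the backbone has already produced one hidden vector per token. The pooling helpers, the toy numbers, and the 2-class linear head are illustrative assumptions.

```python
def last_token_pool(hidden_states):
    """Use the final position, which has attended to the whole input under a causal mask."""
    return hidden_states[-1]


def mean_pool(hidden_states):
    """Average the hidden vectors over all positions."""
    n = len(hidden_states)
    return [sum(col) / n for col in zip(*hidden_states)]


def classify(pooled, weight, bias):
    """Linear head followed by argmax; returns the predicted class index."""
    logits = [sum(w * h for w, h in zip(row, pooled)) + b for row, b in zip(weight, bias)]
    return max(range(len(logits)), key=logits.__getitem__)


hidden_states = [[4.0, 0.0], [4.0, 0.0], [1.0, 2.0]]  # 3 tokens, hidden size 2
weight = [[1.0, 0.0], [0.0, 1.0]]                     # 2 classes x hidden size
bias = [0.0, 0.0]

print(classify(last_token_pool(hidden_states), weight, bias))  # 1
print(classify(mean_pool(hidden_states), weight, bias))        # 0
```

The two pooling choices can disagree, as they do here, so the choice is worth validating on held-out data.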

Best Practices

For a decoder-only model, classify from the last token's hidden state or from mean pooling; a [CLS] token applies to encoder models such as BERT
Fine-tune last 1-2 layers initially
Use smaller learning rate than pre-training
Consider few-shot learning for limited data

7.2 Fine-tuning for Instruction Following

Goal: Adapt pre-trained model to follow instructions (chat, tasks, etc.).

Implementation Steps

Prepare instruction dataset (instruction, input, output format)

Format examples with special tokens:

<instruction> {instruction} <input> {input} <output> {output}

Train on formatted data with next-token prediction
Evaluate on instruction following benchmarks
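
The formatting step might look like the sketch below, which fills the template shown above from a record with `instruction`, `input`, and `output` fields. The helper names and the sample record are made up for illustration.

```python
PROMPT_TEMPLATE = "<instruction> {instruction} <input> {input} <output>"


def format_prompt(example):
    """Prompt portion only; this is what the model sees at inference time."""
    return PROMPT_TEMPLATE.format(
        instruction=example["instruction"], input=example.get("input", "")
    )


def format_example(example):
    """Full training string: prompt followed by the expected output."""
    return f"{format_prompt(example)} {example['output']}"


example = {"instruction": "Translate to French", "input": "cat", "output": "chat"}
print(format_prompt(example))   # <instruction> Translate to French <input> cat <output>
print(format_example(example))  # <instruction> Translate to French <input> cat <output> chat
```

Keeping the prompt and the full string as separate functions makes it straightforward to find where the response begins, for example if the loss is later restricted to output tokens.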

Best Practices

Use diverse instruction templates
Include both simple and complex instructions
Run supervised fine-tuning (SFT) first; RLHF, if used, comes afterwards
Use quality datasets (e.g., Alpaca, Dolly)
Evaluate on held-out instructions to tell genuine instruction following apart from memorization

Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| Training loss not decreasing | Check learning rate, batch size, data quality |
| Model generates repetitive text | Adjust temperature, use top-k/top-p sampling |
| Out of memory | Reduce batch size, use gradient checkpointing |
| Slow training | Use mixed precision, flash attention |
| Poor generalization | More data, regularization, better architecture |

Next Steps

After completing these phases, you can:

Deploy your model for inference
Optimize with quantization, pruning
Scale to larger datasets and models
Experiment with different architectures
Fine-tune for your specific use case

References

Manning Book: Sebastian Raschka, "Build a Large Language Model (From Scratch)", 2024
Original Transformer Paper: Vaswani et al., "Attention Is All You Need", 2017
LoRA Paper: Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", 2021
Various implementation guides and tutorials


Guide for building and training large language models from scratch. Use this skill whenever the user wants to understand LLM training concepts, implement tokenization, data sampling, embeddings, attention mechanisms, model architecture, pre-training, or fine-tuning workflows. Trigger on mentions of LLM training, building models from scratch, tokenization, embeddings, attention, pre-training, fine-tuning, LoRA, or any LLM development task.

5 stars

development

Updated Apr 16, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add abelrguezr/hacktricks-skills llm-training-guide

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status.

Trivy: Container and dependency vulnerability scanner. Clean, 95%
Semgrep: Static code analysis for vulnerabilities. Clean, 95%
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%
Snyk (dep): Open source security scanning. Skipped, 50%
Socket.dev: Supply chain security analysis. Skipped, 50%
VirusTotal: Multi-engine malware detection. Skipped, 50%
CrowdStrike: Advanced threat intelligence. Skipped, 50%
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 16, 2026, 2:06 AM (32.8s, 1 file scanned)



Install manually


# Clone the repo
git clone https://github.com/abelrguezr/hacktricks-skills.git

# Copy into Claude Code skills folder (global)
cp -r hacktricks-skills/skills/AI/AI-llm-architecture/AI-llm-architecture ~/.claude/skills/


Repository: abelrguezr/hacktricks-skills (5 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT