abelrguezr/llm-data-sampling

Name: llm-data-sampling
Author: abelrguezr

skills/AI/AI-llm-architecture/2.-data-sampling/SKILL.md

npx skillsauth add abelrguezr/hacktricks-skills llm-data-sampling

LLM Data Sampling

A skill for preparing and sampling text data for training large language models (LLMs). This covers tokenization, sequence generation, sliding windows, and advanced sampling strategies.

When to Use This Skill

Use this skill when the user needs to:

Prepare text data for LLM training
Create input/target token sequences
Implement sliding window sampling
Apply advanced sampling strategies (temperature weighting, sequence packing, deduplication)
Optimize training data quality and security
Create PyTorch datasets and dataloaders for LLM training

Core Concepts

1. Tokenization

Breaking text into smaller units (tokens) that the model processes. Common approaches:

Word-level: Split by spaces
Subword-level: BPE (used by GPT-2), WordPiece (used by BERT)
Character-level: Individual characters
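To make the difference in granularity concrete, here is a minimal sketch of word-level and character-level tokenization in plain Python. The helper names (`word_tokenize`, `char_tokenize`, `build_vocab`) are hypothetical and are not part of the bundled scripts; subword tokenizers such as BPE need a trained merge table and are left out.

```python
def word_tokenize(text: str) -> list[str]:
    """Word-level: split on whitespace (punctuation stays attached to words)."""
    return text.split()


def char_tokenize(text: str) -> list[str]:
    """Character-level: every character, including spaces, is its own token."""
    return list(text)


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Assign an integer ID to each distinct token, in sorted order."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


text = "Lorem ipsum dolor sit amet"
words = word_tokenize(text)
chars = char_tokenize(text)

vocab = build_vocab(words)
ids = [vocab[w] for w in words]  # the model consumes IDs, not strings

print(len(words), "word tokens;", len(chars), "character tokens")
print(ids)
```

The same sentence yields far fewer word tokens than character tokens; subword schemes sit between the two.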

2. Sequence Length (max_length)

The number of tokens in each input sequence. Typical values:

Small models: 256-512 tokens
Medium models: 512-1024 tokens
Large models: 1024-4096+ tokens

3. Sliding Window

A method to create input sequences by moving a fixed-size window over tokenized text. Consecutive windows overlap whenever the stride is smaller than max_length.

4. Stride

The number of tokens the sliding window moves forward. Key tradeoffs:

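Stride = 1: Maximum overlap, so each token is seen in many contexts, but redundancy and overfitting risk are highest
Stride = max_length: No overlap and no redundancy, but dependencies that cross a window boundary are never seen together
Stride = 2-4×max_length: Leaves gaps between windows, so only a fraction of the tokens is sampled on each pass; trades coverage for less redundancy and lower memorization risk on large corpora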
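The effect of the stride can be checked numerically. The sketch below counts windows and measures how much of a token stream is covered for a few stride values. `window_starts` and `coverage` are illustrative helpers, and the last token is assumed to be reserved for the shifted target.

```python
def window_starts(n_tokens: int, max_length: int, stride: int) -> list[int]:
    """Start offsets of all full windows that still leave one token for the target."""
    return list(range(0, n_tokens - max_length, stride))


def coverage(n_tokens: int, max_length: int, stride: int) -> float:
    """Fraction of tokens that appear in at least one input window."""
    covered = set()
    for start in window_starts(n_tokens, max_length, stride):
        covered.update(range(start, start + max_length))
    return len(covered) / n_tokens


n_tokens, max_length = 100, 8
for stride in (1, max_length, 2 * max_length):
    n_windows = len(window_starts(n_tokens, max_length, stride))
    print(f"stride={stride:2d}  windows={n_windows:2d}  "
          f"coverage={coverage(n_tokens, max_length, stride):.2f}")
```

With these numbers, stride 1 produces 92 heavily overlapping windows, stride 8 produces 12 disjoint windows, and stride 16 produces 6 windows that together touch fewer than half of the tokens.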

Step-by-Step Data Sampling

Basic Workflow

Load and tokenize text
Apply sliding window to create sequences
Generate input/target pairs (target is input shifted by 1 token)
Create dataset and dataloader for training

Example: Creating Input/Target Sequences

Given text: "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

With max_length=4 and stride=1:

| Window | Input Sequence | Target Sequence |
|--------|----------------|-----------------|
| 1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
| 2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
| 3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
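The pairs in this example can be reproduced with a few lines of Python. This is a sketch that assumes whitespace tokenization, as in the example text; `make_pairs` is a hypothetical helper and may differ from what scripts/sample_data.py does internally.

```python
def make_pairs(tokens: list[str], max_length: int, stride: int):
    """Return (input, target) windows; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(tokens) - max_length, stride):
        inputs = tokens[start:start + max_length]
        targets = tokens[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
tokens = text.split()
pairs = make_pairs(tokens, max_length=4, stride=1)

for i, (inputs, targets) in enumerate(pairs, start=1):
    print(i, inputs, "->", targets)
```

The eight tokens yield four pairs; the first three are the windows listed in the example, and the fourth ends on "elit.".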

Implementation Guide

Using the Sampling Script

The bundled script scripts/sample_data.py handles the complete data sampling pipeline:

# Basic usage
python scripts/sample_data.py \
  --input "path/to/text.txt" \
  --output "path/to/output.jsonl" \
  --max-length 256 \
  --stride 128 \
  --batch-size 8

# With advanced options
python scripts/sample_data.py \
  --input "data/" \
  --output "processed/" \
  --max-length 512 \
  --stride 512 \
  --temperature 0.7 \
  --deduplicate \
  --shuffle

Key Parameters

| Parameter | Description | Recommended Value |
|-----------|-------------|-------------------|
| max_length | Sequence length in tokens | 256-1024 |
| stride | Window step size | ≥ max_length for most cases |
| batch_size | Samples per batch | 8-32 (depends on GPU) |
| temperature | Sampling temperature (α) | 0.7 for mixed corpora |
| shuffle | Randomize order | True for training |

Advanced Sampling Strategies

1. Temperature-Based Mixture Weighting

When training on multiple data sources, use temperature weighting to balance corpus proportions:

p(i) = w_i^α / Σ(w_j^α)

w_i: Raw token percentage of corpus i
α (temperature): Value in (0,1]. Lower α flattens distribution, giving more weight to smaller high-quality corpora
Llama 2 used α = 0.7 and showed improved evaluation scores

When to use: Training on heterogeneous data (code, web, academic papers, forums)
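As a worked instance of the formula p(i) = w_i^α / Σ(w_j^α), the sketch below rescales a made-up three-corpus mixture. The corpus names and raw shares are illustrative assumptions, not figures from any real training run.

```python
def mixture_weights(raw_shares: dict[str, float], alpha: float) -> dict[str, float]:
    """Apply p(i) = w_i**alpha / sum_j(w_j**alpha) to raw corpus shares."""
    scaled = {name: share ** alpha for name, share in raw_shares.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical raw token shares (must be positive; they need not sum to 1).
raw = {"web": 0.80, "code": 0.15, "papers": 0.05}

for alpha in (1.0, 0.7, 0.3):
    weights = mixture_weights(raw, alpha)
    print(alpha, {name: round(p, 3) for name, p in weights.items()})
```

With α = 1 the raw shares are returned unchanged. With α = 0.7 the mixture moves to roughly 0.69 / 0.21 / 0.10, so the smallest corpus is sampled about twice as often as its raw share would suggest.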

2. Sequence Packing / Dynamic Batching

Concatenate multiple shorter sequences until exact max_length is reached, with attention masks to prevent cross-segment attention.

Benefits:

20-40% throughput improvement
No gradient change
Reduces padding waste

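Implementation: Pack in a preprocessing step or a custom collate function that also emits per-token segment IDs for building the attention mask. Note that HuggingFace DataCollatorForLanguageModeling(pad_to_multiple_of=...) pads each batch to a multiple of the given length rather than packing sequences, so it does not remove padding waste.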
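A minimal, framework-free sketch of packing with segment-aware masking follows. It uses a simple greedy strategy and pads whatever space is left in a row; `pack_sequences` and `segment_mask` are hypothetical helpers, and segment ID 0 is assumed to mark padding.

```python
def pack_sequences(seqs: list[list[int]], max_length: int, pad_id: int = 0):
    """Greedily pack token-ID sequences into rows of max_length.

    Returns (rows, segment_ids). Segment IDs start at 1 in each row; 0 marks padding.
    """
    rows, segments = [], []
    cur, cur_seg, seg_no = [], [], 0

    def flush():
        pad = max_length - len(cur)
        rows.append(cur + [pad_id] * pad)
        segments.append(cur_seg + [0] * pad)

    for seq in seqs:
        if len(seq) > max_length:
            raise ValueError("sequence longer than max_length")
        if len(cur) + len(seq) > max_length:
            flush()  # row is full enough; start a new one
            cur, cur_seg, seg_no = [], [], 0
        seg_no += 1
        cur = cur + seq
        cur_seg = cur_seg + [seg_no] * len(seq)
    if cur:
        flush()
    return rows, segments


def segment_mask(segment_ids: list[int]) -> list[list[bool]]:
    """Causal mask: position i may attend to j only if j <= i and both share a non-pad segment."""
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] != 0 and segment_ids[i] == segment_ids[j]
         for j in range(n)]
        for i in range(n)
    ]


rows, segments = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_length=6)
mask = segment_mask(segments[0])
print(rows)
print(segments)
```

In a training pipeline the boolean mask would be converted to the tensor format the model's attention layer expects.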

3. Deduplication & Quality Filtering

Deduplication:

MinHash/FAISS near-duplicate detection at document and n-gram level
Llama 2 removed ~15% of CommonCrawl using 8-gram MinHash
Target duplicate ratio: ≤0.04
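The idea behind MinHash near-duplicate detection can be sketched with the standard library alone. This is an illustration rather than a production deduplicator: the shingle size is reduced to 3 words so the short demo strings produce enough shingles, the salted BLAKE2b hashes stand in for random permutations, and all helper names are hypothetical.

```python
import hashlib


def shingles(text: str, n: int = 8) -> set[str]:
    """Set of word n-grams; falls back to the whole text if it is shorter than n words."""
    words = text.split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "completely different sentence about training language models on large text corpora"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d, n=3)) for d in (doc_a, doc_b, doc_c))
near = estimated_jaccard(sig_a, sig_b)
far = estimated_jaccard(sig_a, sig_c)
print(f"near-duplicate estimate: {near:.2f}, unrelated estimate: {far:.2f}")
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as duplicates and all but one copy dropped. At corpus scale the signatures are usually bucketed with locality-sensitive hashing instead of compared pairwise.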

Quality Filtering:

Remove documents with perplexity > µ + 3σ (noisy OCR, garbled HTML)
Block PII and sensitive content using regex & NER
Filter by source quality scores
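The perplexity rule above amounts to a simple outlier cut. The sketch below assumes the per-document perplexities have already been computed by some reference language model; the document IDs and scores are invented for illustration.

```python
import statistics


def filter_by_perplexity(docs: list[tuple[str, float]], k: float = 3.0):
    """Keep documents whose perplexity is at most mean + k * standard deviation.

    docs is a list of (doc_id, perplexity) pairs. Returns (kept_docs, threshold).
    """
    values = [ppl for _, ppl in docs]
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    threshold = mu + k * sigma
    kept = [doc for doc in docs if doc[1] <= threshold]
    return kept, threshold


# Hypothetical scores: 19 ordinary documents and one garbled outlier.
docs = [(f"doc-{i:02d}", 30.0) for i in range(19)] + [("doc-garbled", 1000.0)]
kept, threshold = filter_by_perplexity(docs)
print(f"threshold={threshold:.1f}, kept {len(kept)} of {len(docs)}")
```

Because the outlier itself inflates the mean and standard deviation, this cut only works on reasonably large samples; with a handful of documents no value can exceed µ + 3σ.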

Security & Privacy Considerations

Data Poisoning / Backdoor Attacks

Risk: Inserting <1% backdoored sentences can create hidden triggers

Mitigations:

Shuffled mixing: Ensure adjacent examples come from different sources
Gradient similarity scoring: Remove outliers with high gradient divergence
Dataset versioning: Freeze immutable tarballs, verify SHA-256 hashes
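Hash verification for a frozen dataset shard can be sketched as follows. The file name and contents are placeholders written to a temporary directory; in practice the expected digest would come from a manifest stored alongside the dataset version.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """True if the file's current SHA-256 matches the recorded digest."""
    return sha256_file(path) == expected_hex.lower()


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-000.txt"  # placeholder shard name
    shard.write_bytes(b"abc")
    recorded = sha256_file(shard)        # store this when freezing the dataset
    ok_before = verify(shard, recorded)

    shard.write_bytes(b"abd")            # simulate tampering
    ok_after = verify(shard, recorded)

print(recorded, ok_before, ok_after)
```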

Membership Inference & Memorization

Risk: Long overlap between samples increases memorization of rare strings (phone numbers, keys)

Mitigations:

Use stride ≥ max_length (except for <1B parameter models with scarce data)
Random masking: Mask 1-3 tokens per window during training
OpenAI 2024 finding: Raising stride from 1× to 4× max_length reduces verbatim leakage by ~50%
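Random masking of a window might look like the sketch below. The mask token string, the choice to mask only the input side, and the helper name `mask_window` are assumptions made for illustration.

```python
import random


def mask_window(tokens, mask_token="<mask>", min_masks=1, max_masks=3, rng=None):
    """Replace between min_masks and max_masks randomly chosen positions with mask_token."""
    rng = rng or random.Random()
    k = min(rng.randint(min_masks, max_masks), len(tokens))
    positions = set(rng.sample(range(len(tokens)), k))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


window = ["call", "me", "at", "555", "0100", "after", "six", "pm"]
masked = mask_window(window, rng=random.Random(0))  # seeded for reproducibility
print(masked)
```

Drawing fresh positions each epoch means a rare string is seldom presented to the model in exactly the same form twice.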

Best Practices

For Training Data Preparation

Start with stride = max_length for most cases
Use stride = 1 only for small models (<1B params) with limited data
Apply deduplication before sampling (8-gram MinHash recommended)
Filter low-quality documents using perplexity thresholds
Version your datasets with SHA-256 hashes
Shuffle across sources to prevent gradient alignment attacks

For Production Pipelines

Use temperature weighting (α=0.7) for mixed corpora
Implement sequence packing for 20-40% throughput gains
Monitor duplicate ratios (target ≤0.04)
Apply PII filtering before training
Log sampling statistics for reproducibility

Common Issues & Solutions

| Issue | Solution |
|-------|----------|
| GPU memory wasted on padding | Use sequence packing with attention masks |
| Model overfitting to repeated patterns | Increase stride, apply deduplication |
| Slow training throughput | Use sequence packing, optimize batch size |
| Memorization of sensitive data | Increase stride, add random masking |
| Poor performance on knowledge tasks | Use temperature weighting (α=0.7) |

References

Build a Large Language Model from Scratch (Manning, 2024)
Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)
PoisonGPT: Assessing Backdoor Vulnerabilities (BlackHat EU 2023)
OpenAI Deduplicate Everything (2024)

Next Steps

After preparing your data:

Validate the sampled sequences with scripts/validate_sampling.py
Check for duplicates and quality issues
Create a training dataloader with appropriate batch size
Monitor for memorization during training
Adjust stride and temperature based on validation performance

How to prepare and sample text data for training large language models. Use this skill whenever the user mentions data preparation, tokenization, sliding windows, sequence generation, training data, LLM datasets, or needs to create input/target pairs for model training. This includes tasks like chunking text, creating dataloaders, applying sampling strategies, or optimizing training data quality.

5 stars

testing

Updated Apr 16, 2026

npx skillsauth add abelrguezr/hacktricks-skills llm-data-sampling

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 16, 2026, 2:07 AM · 230.4s · 3 files scanned

Install manually

# Clone the repo
git clone https://github.com/abelrguezr/hacktricks-skills.git

# Copy into Claude Code skills folder (global)
cp -r hacktricks-skills/skills/AI/AI-llm-architecture/2.-data-sampling ~/.claude/skills/

Repository

abelrguezr/hacktricks-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT