skills/AI/AI-llm-architecture/2.-data-sampling/SKILL.md
How to prepare and sample text data for training large language models. Use this skill whenever the user mentions data preparation, tokenization, sliding windows, sequence generation, training data, LLM datasets, or needs to create input/target pairs for model training. This includes tasks like chunking text, creating dataloaders, applying sampling strategies, or optimizing training data quality.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

    npx skillsauth add abelrguezr/hacktricks-skills llm-data-sampling
A skill for preparing and sampling text data for training large language models (LLMs). This covers tokenization, sequence generation, sliding windows, and advanced sampling strategies.
Use this skill when the user needs to:
Breaking text into smaller units (tokens) that the model processes. Common approaches:
The number of tokens in each input sequence. Typical values:
A method to create overlapping input sequences by moving a window over tokenized text.
The number of tokens the sliding window moves forward. Key tradeoffs:
Given text: "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
With max_length=4 and stride=1:
| Window | Input Sequence | Target Sequence |
|--------|----------------|------------------|
| 1 | ["Lorem", "ipsum", "dolor", "sit"] | ["ipsum", "dolor", "sit", "amet,"] |
| 2 | ["ipsum", "dolor", "sit", "amet,"] | ["dolor", "sit", "amet,", "consectetur"] |
| 3 | ["dolor", "sit", "amet,", "consectetur"] | ["sit", "amet,", "consectetur", "adipiscing"] |
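The windows in the table can be reproduced with a few lines of Python. This is a minimal sketch, not the bundled script: the function name `sliding_windows` is hypothetical, and a whitespace split stands in for a real tokenizer so that the tokens match the example text.

```python
def sliding_windows(tokens, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # A window needs max_length input tokens plus one extra token for the last target.
    for start in range(0, len(tokens) - max_length, stride):
        inputs = tokens[start:start + max_length]
        targets = tokens[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
tokens = text.split()  # assumption: whitespace split instead of a subword tokenizer
pairs = sliding_windows(tokens, max_length=4, stride=1)

for number, (inputs, targets) in enumerate(pairs, start=1):
    print(number, inputs, targets)
```

With these eight whitespace tokens the sketch yields four windows: the three shown in the table plus a final one whose target ends in "elit.". Raising `stride` to 4 leaves a single window, which illustrates how a larger stride trades sample count for less overlap.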
The bundled script scripts/sample_data.py handles the complete data sampling pipeline:
    # Basic usage
    python scripts/sample_data.py \
      --input "path/to/text.txt" \
      --output "path/to/output.jsonl" \
      --max-length 256 \
      --stride 128 \
      --batch-size 8

    # With advanced options
    python scripts/sample_data.py \
      --input "data/" \
      --output "processed/" \
      --max-length 512 \
      --stride 512 \
      --temperature 0.7 \
      --deduplicate \
      --shuffle
| Parameter | Description | Recommended Value |
|-----------|-------------|-------------------|
| max_length | Sequence length in tokens | 256-1024 |
| stride | Window step size | max_length (no overlap) for most cases; smaller values add overlap |
| batch_size | Samples per batch | 8-32 (depends on GPU) |
| temperature | Sampling temperature (α) | 0.7 for mixed corpora |
| shuffle | Randomize order | True for training |
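To make the `batch_size` and `shuffle` parameters concrete, here is a small framework-free sketch of how samples might be grouped into batches. The helper `make_batches` and its `seed` argument are assumptions for illustration; the bundled script or a PyTorch `DataLoader` may behave differently (for example, by dropping the last incomplete batch).

```python
import random


def make_batches(samples, batch_size, shuffle=True, seed=0):
    """Group samples into lists of at most batch_size items."""
    order = list(range(len(samples)))
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible between runs.
        random.Random(seed).shuffle(order)
    return [
        [samples[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]


samples = list(range(20))  # stand-ins for (input, target) pairs
batches = make_batches(samples, batch_size=8)
print([len(batch) for batch in batches])  # [8, 8, 4]
```

Shuffling happens once over sample indices before batching, so every sample still appears exactly once per pass.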
When training on multiple data sources, use temperature weighting to balance corpus proportions:
p(i) = w_i^α / Σ(w_j^α)
- w_i: Raw token percentage of corpus i
- α (temperature): Value in (0, 1]. Lower α flattens the distribution, giving more weight to smaller high-quality corpora

When to use: Training on heterogeneous data (code, web, academic papers, forums)
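The weighting formula p(i) = w_i^α / Σ(w_j^α) is easy to check numerically. The sketch below is illustrative only: the function name `temperature_weights` and the corpus shares are hypothetical, not values from the skill.

```python
def temperature_weights(raw_shares, alpha):
    """Return p(i) = w_i**alpha / sum_j(w_j**alpha) for each corpus."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha is expected in (0, 1]")
    scaled = {name: share ** alpha for name, share in raw_shares.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical corpus mix (raw token shares), for illustration only.
raw_shares = {"web": 0.80, "code": 0.15, "papers": 0.05}
probs = temperature_weights(raw_shares, alpha=0.7)

for name, p in probs.items():
    print(f"{name}: raw {raw_shares[name]:.2f} -> sampled {p:.3f}")
```

With α = 0.7 the largest corpus is sampled less often than its raw share and the smallest one more often; with α = 1 the sampling probabilities equal the raw shares.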
Concatenate multiple shorter sequences until exactly max_length tokens are reached, and use attention masks to prevent cross-segment attention.
Benefits:
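Implementation: Packing is usually done as a preprocessing step: concatenate tokenized documents (separated by an end-of-text token) and cut the stream into max_length blocks. Note that HuggingFace DataCollatorForLanguageModeling(pad_to_multiple_of=...) pads batches rather than packing them, and torchtext.experimental.agents.PackedBatch does not appear to be a published PyTorch API, so check your installed library version before relying on either.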
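As a rough illustration of packing with cross-segment masking, the following pure-Python sketch greedily fills rows and builds a causal, block-diagonal attention mask from segment IDs. The names `pack_sequences` and `attention_mask` are hypothetical, and two simplifications are assumed: over-long sequences are truncated, and leftover space in a row is padded instead of being filled by splitting the next sequence.

```python
def pack_sequences(sequences, max_length, pad_id=0):
    """Greedily pack token-id lists into rows of exactly max_length.

    Returns (rows, segment_ids); segment id 0 marks padding.
    """
    rows, segs = [], []
    cur_row, cur_seg, seg_no = [], [], 0

    def flush():
        pad = max_length - len(cur_row)
        rows.append(cur_row + [pad_id] * pad)
        segs.append(cur_seg + [0] * pad)

    for seq in sequences:
        seq = seq[:max_length]  # assumption: truncate sequences that cannot fit
        if len(cur_row) + len(seq) > max_length:
            flush()
            cur_row, cur_seg, seg_no = [], [], 0
        seg_no += 1
        cur_row = cur_row + seq
        cur_seg = cur_seg + [seg_no] * len(seq)
    if cur_row:
        flush()
    return rows, segs


def attention_mask(segment_ids):
    """Position i may attend to j only if j <= i and both share a non-padding segment."""
    n = len(segment_ids)
    return [
        [int(j <= i and segment_ids[i] == segment_ids[j] and segment_ids[i] != 0)
         for j in range(n)]
        for i in range(n)
    ]


seqs = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
rows, segs = pack_sequences(seqs, max_length=6)
mask = attention_mask(segs[0])
print(rows)
print(segs)
```

In the first packed row, the tokens of the second sequence cannot attend to the first sequence even though they share a row, which is the property the attention mask is meant to guarantee.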
Deduplication:
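Exact deduplication can be sketched with a hash over lightly normalized text. This is an assumption about what a `--deduplicate` style step might do, not a description of the bundled script; the helper name `deduplicate` is hypothetical, and near-duplicate detection would need a separate technique.

```python
import hashlib


def deduplicate(documents):
    """Drop exact duplicates after light normalization, keeping first occurrences in order."""
    seen = set()
    unique = []
    for doc in documents:
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["Hello  world", "hello world", "Another document", "Hello world\n"]
unique_docs = deduplicate(docs)
print(unique_docs)  # ['Hello  world', 'Another document']
```

Storing digests instead of full texts keeps memory use bounded per document, which matters once the corpus no longer fits comfortably in memory.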
Quality Filtering:
Risk: Backdooring fewer than 1% of training sentences can be enough to create hidden triggers
Mitigations:
Risk: Long overlap between samples increases memorization of rare strings (phone numbers, keys)
Mitigations:
| Issue | Solution |
|-------|----------|
| GPU memory wasted on padding | Use sequence packing with attention masks |
| Model overfitting to repeated patterns | Increase stride, apply deduplication |
| Slow training throughput | Use sequence packing, optimize batch size |
| Memorization of sensitive data | Increase stride, add random masking |
| Poor performance on knowledge tasks | Use temperature weighting (α=0.7) |
After preparing your data:
scripts/validate_sampling.py