skills/AI/AI-llm-architecture/1.-tokenizing/SKILL.md
How to tokenize text for LLMs and NLP models. Use this skill whenever the user needs to convert text into token IDs, understand tokenization methods (BPE, WordPiece, Unigram), work with vocabularies, or implement tokenization for machine learning. Make sure to use this skill when users mention tokenizing, token IDs, vocabulary creation, BPE, WordPiece, or any text preprocessing for ML models.
npx skillsauth add abelrguezr/hacktricks-skills text-tokenizer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
A skill for tokenizing text into numerical IDs for machine learning models.
Tokenizing is the process of breaking down text into smaller pieces called tokens, each assigned a unique numerical ID. This is fundamental for preparing text for ML models, especially in NLP.
Goal: Divide input into tokens (IDs) in a way that makes sense for the model.
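As a minimal sketch of this goal, the snippet below splits text into word and punctuation tokens, builds a small vocabulary, and maps tokens to IDs, with a reserved [UNK] entry for anything outside the vocabulary. The helper names (`split_text`, `build_vocab`, `encode`) are hypothetical and the resulting IDs are illustrative; real tokenizers assign different IDs.

```python
import re

def split_text(text):
    # Keep runs of word characters together and emit each punctuation mark separately.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens, special_tokens=("[UNK]",)):
    # Special tokens get the first IDs; remaining tokens follow in sorted order.
    vocab = {tok: i for i, tok in enumerate(special_tokens)}
    for tok in sorted(set(tokens)):
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in split_text(text)]

tokens = split_text("Hello, world!")
vocab = build_vocab(tokens)
print(tokens)                           # ['Hello', ',', 'world', '!']
print(encode("Hello, world!", vocab))   # [3, 2, 4, 1]
print(encode("Bye, world!", vocab))     # [0, 2, 4, 1]  ("Bye" is unknown)
```

A word-level vocabulary like this is easy to reason about, but every unseen word collapses to the same [UNK] ID, which is the limitation subword methods try to address.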
"Hello, world!" → ["Hello", ",", "world", "!"][BOS] (Beginning of Sequence): Marks text start[EOS] (End of Sequence): Marks text end[PAD] (Padding): Makes sequences same length in batches[UNK] (Unknown): Represents tokens not in vocabulary"Hello, world!" → [64, 455, 78, 467][UNK]"Bye, world!" → [987, 455, 78, 467] (assuming [UNK] = 987)[UNK] token"playing" → ["play", "ing"]"unhappiness" → ["un", "happy", "ness"]"internationalization" → ["international", "ization"]import tiktoken
# Load GPT-2 tokenizer
encoding = tiktoken.get_encoding("gpt2")
# Encode text to token IDs
token_ids = encoding.encode("Hello, world!")
print(token_ids) # [15496, 11, 995, 0]
# Decode token IDs back to text
text = encoding.decode(token_ids)
print(text) # "Hello, world!"
# Encode with special tokens allowed (GPT-2's end-of-text token is "<|endoftext|>", not "[EOS]")
token_ids = encoding.encode("Hello, world!<|endoftext|>", allowed_special={"<|endoftext|>"})
# Check token count
print(len(token_ids))
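The GPT-2 encoding loaded above is a byte pair encoding (BPE) tokenizer whose merges were learned in advance. As a toy illustration of how such merges can be learned, the sketch below repeatedly merges the most frequent adjacent symbol pair in a tiny corpus. It works on characters rather than bytes, breaks ties by first occurrence, and uses hypothetical helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`); it is not tiktoken's implementation.

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to how often that word occurs in the corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the adjacent pair with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from single characters and record each merge in the order it is learned.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = learn_bpe("play playing played player", num_merges=3)
print(merges)  # the first three merges build up the symbol "play"
print(words)
```

After these merges, "playing" is represented as "play" followed by the remaining characters, which is the intuition behind splits such as "playing" → ["play", "ing"].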
import urllib.request
# Download text file
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)
# Read and tokenize
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()
token_ids = tiktoken.get_encoding("gpt2").encode(raw_text, allowed_special={"<|endoftext|>"})
# Print first 50 tokens
print(token_ids[:50])
Add special tokens such as [BOS], [EOS] as needed for your use case.
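One way to apply special tokens is sketched below: each sequence of token IDs is wrapped with [BOS] and [EOS], then shorter sequences are padded with [PAD] so the batch has a uniform length. The ID values and the helper names (`add_special_tokens`, `pad_batch`) are assumptions for illustration; use the IDs defined by your own vocabulary.

```python
def add_special_tokens(ids, bos_id, eos_id):
    # Mark the start and end of a single sequence.
    return [bos_id] + list(ids) + [eos_id]

def pad_batch(sequences, pad_id):
    # Right-pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hypothetical special-token IDs; real vocabularies define their own.
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2

batch = [add_special_tokens(ids, BOS_ID, EOS_ID)
         for ids in ([64, 455, 78, 467], [78, 467])]
padded = pad_batch(batch, PAD_ID)
print(padded)  # [[0, 64, 455, 78, 467, 1], [0, 78, 467, 1, 2, 2]]
```

Whether a model expects these markers at all depends on how it was trained, so check the tokenizer that ships with the model before adding them.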