abelrguezr/text-tokenizer

Name: text-tokenizer
Author: abelrguezr

skills/AI/AI-llm-architecture/1.-tokenizing/SKILL.md

npx skillsauth add abelrguezr/hacktricks-skills text-tokenizer

Text Tokenizer

A skill for tokenizing text into numerical IDs for machine learning models.

What is Tokenizing?

Tokenizing is the process of breaking text into smaller pieces called tokens and mapping each distinct token to a numerical ID. It is a fundamental step in preparing text for ML models, especially in NLP.

Goal: Divide input into tokens (IDs) in a way that makes sense for the model.

Basic Tokenization

1. Splitting Text

A simple tokenizer splits text into words and punctuation marks.
Example: "Hello, world!" → ["Hello", ",", "world", "!"]
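The split described above can be sketched with a regular expression. This is a minimal illustration, not the tokenizer any particular model uses; the function name `simple_tokenize` and the choice of pattern are assumptions made for this example.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores; [^\w\s] matches
    # one character that is neither a word character nor whitespace.
    # Whitespace itself is dropped.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']
```

Dropping whitespace keeps the sketch short, but it also means the original spacing cannot be recovered when decoding.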

2. Creating a Vocabulary

A vocabulary maps each distinct token to a numerical ID.
Special tokens:
- [BOS] (Beginning of Sequence): Marks the start of a text
- [EOS] (End of Sequence): Marks the end of a text
- [PAD] (Padding): Pads sequences to the same length within a batch
- [UNK] (Unknown): Represents tokens not in the vocabulary
Example (illustrative IDs): "Hello, world!" → [64, 455, 78, 467]
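One way to build such a vocabulary is sketched below. The helper `build_vocab`, the ordering of the special tokens, and the decision to give them the lowest IDs are assumptions for illustration, so the resulting IDs differ from the illustrative numbers in the example above.

```python
import re

# Assumed ordering: special tokens are placed first and get the lowest IDs.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Map the special tokens, then every distinct corpus token, to an ID."""
    tokens = set()
    for text in corpus:
        tokens.update(re.findall(r"\w+|[^\w\s]", text))
    # Sorting makes the ID assignment deterministic across runs.
    ordered = SPECIAL_TOKENS + sorted(tokens)
    return {token: idx for idx, token in enumerate(ordered)}


vocab = build_vocab(["Hello, world!"])
print(vocab)
print([vocab[t] for t in ["Hello", ",", "world", "!"]])  # [6, 5, 7, 4]
```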

3. Handling Unknown Words

Words not in the vocabulary are replaced with [UNK].
Example: "Bye, world!" → [987, 455, 78, 467] (assuming [UNK] = 987 and "Bye" is not in the vocabulary)
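The fallback can be sketched with a dictionary lookup that defaults to the [UNK] ID. The toy vocabulary and its IDs below are hypothetical and chosen only to keep the output easy to read.

```python
import re

# Hypothetical toy vocabulary; the IDs are arbitrary.
vocab = {"[UNK]": 0, "Hello": 1, ",": 2, "world": 3, "!": 4}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text to IDs, mapping out-of-vocabulary tokens to [UNK]."""
    unk_id = vocab["[UNK]"]
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [vocab.get(token, unk_id) for token in tokens]


print(encode("Hello, world!", vocab))  # [1, 2, 3, 4]
print(encode("Bye, world!", vocab))    # [0, 2, 3, 4]
```

Every unknown word collapses to the same ID, so the original word cannot be recovered on decoding. That loss is the main motivation for the subword methods.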

Advanced Tokenization Methods

Byte Pair Encoding (BPE)

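Purpose: Keeps the vocabulary small while still handling rare and unknown words
How it works:
- Starts with individual characters (or bytes) as tokens
- Iteratively merges the most frequent pair of adjacent tokens into a new token
- Continues until a target vocabulary size or number of merges is reached
Benefits:
- Byte-level BPE (as used by GPT-2) eliminates the need for an [UNK] token, because any input can fall back to single bytes
- More efficient and flexible vocabulary
Example (illustrative): "playing" → ["play", "ing"]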
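The merge loop can be sketched in a few lines. This is a toy character-level version under simplifying assumptions (no byte fallback, no pre-tokenization rules, ties broken by first occurrence); the names `train_bpe`, `apply_merge`, and `bpe_encode` are made up for this example and are not part of any library.

```python
from collections import Counter


def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Fuse every adjacent occurrence of `pair` into a single symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words: list[str], num_merges: int) -> list[tuple]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges


def bpe_encode(word: str, merges: list[tuple]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


words = ["play", "playing", "played", "plays", "staying"]
merges = train_bpe(words, num_merges=10)
print(merges)
print(bpe_encode("playing", merges))
```

The stopping rule here is a fixed merge budget. Production tokenizers typically stop at a target vocabulary size instead, which amounts to the same thing.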

WordPiece

Used by: BERT and similar models
Purpose: Similar to BPE, breaks words into subword units
How it works:
- Begins with a base vocabulary of individual characters
- Iteratively merges the pair of units that most increases the likelihood of the training data, rather than simply the most frequent pair
- Uses a probabilistic model for merging decisions
Benefits:
- Balances vocabulary size with word representation
- Efficiently handles rare and compound words
Example (illustrative): "unhappiness" → ["un", "##happi", "##ness"], where "##" marks a piece that continues a word
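Once a WordPiece vocabulary exists, BERT-style tokenizers segment each word greedily, always taking the longest matching piece. The sketch below shows only that encoding step, not the likelihood-based training; the function name `wordpiece_encode` and the tiny vocabulary are hypothetical.

```python
def wordpiece_encode(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary chosen to reproduce the example above.
vocab = {"un", "##happi", "##ness", "play", "##ing"}
print(wordpiece_encode("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece_encode("playing", vocab))      # ['play', '##ing']
print(wordpiece_encode("xyz", vocab))          # ['[UNK]']
```

Unlike byte-level BPE, this scheme can still produce [UNK] when no sequence of pieces covers a word.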

Unigram Language Model

Used by: SentencePiece-based models (for example T5 and ALBERT); SentencePiece is the library that implements it
Purpose: Uses a probabilistic model to select the most likely subword segmentation
How it works:
- Starts with a large set of candidate tokens
- Iteratively removes the tokens whose removal reduces the training-data likelihood the least
- Stops at the target vocabulary size, keeping the most probable subword units
Benefits:
- Flexible, probabilistic modeling of how text splits into subwords
- Often results in more efficient tokenizations
Example (illustrative): "internationalization" → ["international", "ization"]
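Given a trained unigram vocabulary, encoding means finding the segmentation with the highest total probability, which can be done with a Viterbi-style dynamic program. The sketch below covers only that step; the function name `unigram_segment` and the token probabilities are invented for illustration and do not come from a trained model.

```python
import math


def unigram_segment(text: str, log_probs: dict[str, float]) -> list[str]:
    """Return the segmentation with the highest total log-probability."""
    n = len(text)
    # best[i] holds (score of the best segmentation of text[:i],
    #                start index of the last token in that segmentation).
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Hypothetical token probabilities.
probs = {"international": 0.01, "ization": 0.01, "inter": 0.02, "national": 0.02}
log_probs = {token: math.log(p) for token, p in probs.items()}
print(unigram_segment("internationalization", log_probs))
# ['international', 'ization']
```

With these numbers, "international" + "ization" scores 0.01 × 0.01 = 1e-4, which beats "inter" + "national" + "ization" at 0.02 × 0.02 × 0.01 = 4e-6.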

Implementation with tiktoken

Basic Usage

import tiktoken

# Load GPT-2 tokenizer
encoding = tiktoken.get_encoding("gpt2")

# Encode text to token IDs
token_ids = encoding.encode("Hello, world!")
print(token_ids)  # [15496, 11, 995, 0]

# Decode token IDs back to text
text = encoding.decode(token_ids)
print(text)  # Hello, world!

With Special Tokens

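# Reuses `encoding` from Basic Usage.
# Encode with a special token allowed. GPT-2's end-of-text marker is
# "<|endoftext|>"; "[EOS]" is not a special token in this encoding and
# would be tokenized as ordinary text.
text = "Hello, world!<|endoftext|>"
token_ids = encoding.encode(text, allowed_special={"<|endoftext|>"})
print(token_ids)  # [15496, 11, 995, 0, 50256]

# Check token count
print(len(token_ids))  # 5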

Processing Files

import urllib.request

import tiktoken

# Download text file
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Read and tokenize
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

# Allow GPT-2's "<|endoftext|>" marker in case the text contains it
encoding = tiktoken.get_encoding("gpt2")
token_ids = encoding.encode(raw_text, allowed_special={"<|endoftext|>"})

# Print first 50 tokens
print(token_ids[:50])

Common Use Cases

- Preprocessing text for training: convert training data to token IDs
- Understanding model input requirements: know what format your model expects
- Debugging tokenization issues: inspect how text is being tokenized
- Comparing different tokenization methods: evaluate BPE vs WordPiece vs Unigram

Best Practices

- Choose the tokenizer based on your model: GPT-2 uses BPE, BERT uses WordPiece
- Handle special tokens appropriately: include [BOS] and [EOS] as needed for your use case
- Consider the vocabulary size vs. sequence length tradeoff: larger vocabularies usually encode the same text in fewer tokens but enlarge the embedding table
- Test with edge cases: rare words, special characters, different languages
- Use the right encoding: load the exact encoding the model was trained with (for example "gpt2" for GPT-2)
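As an illustration of handling special tokens in a batch, the sketch below wraps each sequence in [BOS]/[EOS] and right-pads with [PAD]. The ID values and the helper `pad_batch` are assumptions for this example; real IDs come from your own vocabulary, and some models (GPT-2 among them) have no dedicated [BOS] or [PAD] token at all.

```python
# Hypothetical special-token IDs; real values come from your vocabulary.
PAD_ID, BOS_ID, EOS_ID = 0, 2, 3


def pad_batch(sequences: list[list[int]]) -> tuple[list[list[int]], list[list[int]]]:
    """Wrap each sequence in [BOS]/[EOS], then right-pad to a common length."""
    wrapped = [[BOS_ID] + seq + [EOS_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in wrapped]
    # The mask is 1 for real tokens and 0 for padding positions.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in wrapped]
    return padded, mask


padded, mask = pad_batch([[6, 5, 7, 4], [6, 4]])
print(padded)  # [[2, 6, 5, 7, 4, 3], [2, 6, 4, 3, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
```

The mask lets downstream code ignore padding positions, for example when computing a loss.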

Troubleshooting

Unknown tokens appearing

- Check if your vocabulary is large enough
- Consider a subword method instead of basic tokenization; byte-level BPE never needs [UNK], while WordPiece can still emit it for words it cannot segment
- Verify special tokens are properly configured

Token count seems too high

- Try a different tokenization method (subword methods such as BPE produce far fewer tokens than character-level splitting)
- Check if you're including unnecessary whitespace or special characters

Decoding produces unexpected output

- Ensure you're using the same encoding for encode/decode
- Check if special tokens are being handled correctly
- Verify the token IDs are valid for your vocabulary
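A quick way to check that token IDs are valid is to compare them against the vocabulary before decoding. The sketch below uses a hypothetical toy vocabulary and a made-up `decode` helper; it is not how tiktoken validates IDs internally.

```python
# Hypothetical toy vocabulary; the IDs are arbitrary.
id_to_token = {0: "[UNK]", 1: "Hello", 2: ",", 3: "world", 4: "!"}


def decode(ids: list[int], id_to_token: dict[int, str]) -> list[str]:
    """Map IDs back to tokens, failing loudly on IDs the vocabulary lacks."""
    invalid = [i for i in ids if i not in id_to_token]
    if invalid:
        raise ValueError(f"IDs not in vocabulary: {invalid}")
    return [id_to_token[i] for i in ids]


print(decode([1, 2, 3, 4], id_to_token))  # ['Hello', ',', 'world', '!']
```

An ID outside the vocabulary usually means the IDs were produced by a different tokenizer than the one used for decoding.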

References

Build a Large Language Model from Scratch
LLMs from Scratch - Tokenization

How to tokenize text for LLMs and NLP models. Use this skill whenever the user needs to convert text into token IDs, understand tokenization methods (BPE, WordPiece, Unigram), work with vocabularies, or implement tokenization for machine learning. Make sure to use this skill when users mention tokenizing, token IDs, vocabulary creation, BPE, WordPiece, or any text preprocessing for ML models.

5 stars

data-ai

Updated Apr 16, 2026

npx skillsauth add abelrguezr/hacktricks-skills text-tokenizer

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners in report:

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 16, 2026, 2:04 AM (66.1s, 2 files scanned)

Install manually

# Clone the repo
git clone https://github.com/abelrguezr/hacktricks-skills.git

# Create the global Claude Code skills folder if needed, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r hacktricks-skills/skills/AI/AI-llm-architecture/1.-tokenizing ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: abelrguezr/hacktricks-skills (5 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT