skills/adaptbpe-general-purpose-specialized/SKILL.md
Adapt general-purpose BPE tokenizers into domain- or language-specialized tokenizers using the AdaptBPE post-training strategy. Replaces low-utility tokens with high-frequency domain-specific tokens to improve tokenization efficiency without retraining from scratch. Trigger phrases: "adapt tokenizer to domain", "specialize BPE for medical text", "optimize tokenizer for French", "reduce token fertility for code", "adapt vocabulary for legal documents", "domain-specific tokenizer"
This skill enables Claude to guide users through adapting an existing BPE tokenizer (e.g., from GPT-2, LLaMA, Mistral) to a specific domain (medical, legal, code) or language by selectively replacing low-utility tokens with domain-relevant alternatives. The technique, from Liyanage & Yvon (EACL 2026), avoids retraining tokenizers from scratch. Instead, it scores every token in the existing vocabulary by its utility on an adaptation corpus, prunes the least useful tokens, and fills the freed slots with high-frequency subword units from the target domain -- all while keeping vocabulary size constant.
The Problem. Standard BPE tokenizers are trained on broad web-crawl data and allocate vocabulary slots to tokens that cover general English text well. When applied to specialized domains (e.g., biomedical literature with terms like "electroencephalography") or underrepresented languages, the tokenizer fragments words into many small pieces. This increases sequence length, wastes compute, and can degrade model quality.
The AdaptBPE Solution. Rather than training a new tokenizer from scratch (which would require retraining all model embeddings), AdaptBPE performs a vocabulary swap: (1) Score every token in the existing vocabulary by its utility on the target domain corpus -- utility combines token frequency with how much compression it provides. (2) Identify low-utility tokens -- those rarely used or providing little compression on domain text. (3) Run BPE merge learning on the adaptation corpus to discover candidate replacement tokens. (4) Replace the lowest-utility tokens with the highest-utility new candidates, keeping total vocabulary size fixed. The result is a tokenizer that shares most of its vocabulary with the original (preserving pretrained embeddings) but swaps out dead weight for domain-relevant subwords.
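The four steps above can be sketched on a toy vocabulary in plain Python. This is a minimal illustration, not the paper's implementation: utility is assumed to be frequency times token length, and `swap_vocabulary`, `base_counts`, and `candidate_counts` are hypothetical names chosen for this example.

```python
def swap_vocabulary(base_counts, candidate_counts, protected, k):
    """Replace the k lowest-utility base tokens with the k most frequent new candidates.

    base_counts: token -> frequency on the adaptation corpus (0 if unused)
    candidate_counts: candidate token -> frequency on the adaptation corpus
    protected: tokens that must never be removed (special and single-byte tokens)
    """
    # (1) Utility = frequency * characters covered per occurrence (assumed formula).
    utility = {tok: n * len(tok) for tok, n in base_counts.items()}

    # (2) Rank removable tokens from lowest to highest utility.
    removable = sorted(
        (tok for tok in base_counts if tok not in protected),
        key=lambda tok: (utility[tok], tok),
    )

    # (3) Keep only candidates that are genuinely new, most frequent first.
    new_tokens = sorted(
        (tok for tok in candidate_counts if tok not in base_counts),
        key=lambda tok: (-candidate_counts[tok], tok),
    )

    # (4) Swap, never removing more tokens than there are candidates.
    k = min(k, len(removable), len(new_tokens))
    removed = removable[:k]
    added = new_tokens[:k]
    adapted = (set(base_counts) - set(removed)) | set(added)
    return adapted, removed, added


# Toy counts (made up for illustration).
base_counts = {"<eos>": 0, "a": 40, "the": 120, "zebra": 0, "!!!": 1, "card": 30}
candidate_counts = {"cardio": 25, "emia": 18, "the": 120}
adapted, removed, added = swap_vocabulary(
    base_counts, candidate_counts, protected={"<eos>", "a"}, k=2
)
print(removed, added, len(adapted) == len(base_counts))
```

Capping `k` at the number of available candidates keeps the vocabulary size constant even when the adaptation corpus yields few new tokens.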
Why It Works. Most general-purpose tokenizers have a long tail of tokens that encode rare Unicode sequences, emoji combinations, or whitespace patterns unused in any given domain. These can be safely replaced. The adapted tokenizer then produces shorter token sequences on domain text (lower fertility), which means faster inference, longer effective context windows, and better alignment between token boundaries and meaningful linguistic units.
1. Select the base tokenizer. Load the pretrained tokenizer you want to adapt (e.g., AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")). Record its vocabulary size V and the full merge list.
2. Prepare the adaptation corpus. Collect 1--10 million tokens of representative text from the target domain or language. Clean it (remove boilerplate, deduplicate). This corpus drives both utility scoring and new token discovery.
3. Tokenize the adaptation corpus with the base tokenizer. Run the base tokenizer over the entire adaptation corpus and collect token frequency counts. As baseline metrics, compute fertility (average tokens per whitespace-delimited word) and the proportion of continued words (fraction of words split into 2+ tokens).
4. Score token utility. For each token t in the vocabulary, compute a utility score combining: (a) the frequency of t in the tokenized adaptation corpus, and (b) the compression contribution of t (how many characters it covers per occurrence). Tokens with zero or near-zero frequency on the domain corpus are prime candidates for replacement. Rank all tokens by utility, ascending.
5. Select tokens to replace. Choose the bottom K tokens by utility score for removal. A typical starting point is K = 5--15% of vocabulary size. Exclude special tokens (<bos>, <eos>, <pad>, <unk>) and single-byte tokens from removal -- these must be preserved for correctness.
6. Learn new merge candidates from the adaptation corpus. Run standard BPE training on the raw adaptation corpus to generate a fresh merge list. Extract candidate tokens that are NOT already in the base vocabulary. Rank these candidates by their frequency in the adaptation corpus.
7. Swap tokens. Replace the K lowest-utility tokens with the top K new candidates. For each replacement, update the vocabulary mapping (token-to-ID) and add the corresponding merge rule to the tokenizer's merge list.
8. Re-initialize embeddings for new tokens. For each newly added token, initialize its embedding as the mean of the embeddings of its constituent subtokens in the original model. This gives the model a reasonable starting point for the new token.
9. Validate the adapted tokenizer. Re-tokenize the adaptation corpus with the new tokenizer. Measure fertility and the proportion of continued words. Verify that fertility decreased and single-token word coverage increased. Spot-check domain-specific terms to confirm they are tokenized into fewer pieces.
10. Fine-tune the model (optional but recommended). Run a short continued pretraining or fine-tuning pass on domain data so the model learns proper contextual representations for newly added tokens. Even a few hundred steps can significantly improve downstream performance.
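The embedding re-initialization step above can be sketched as follows. Plain Python lists stand in for the embedding matrix, and the helper name `mean_embedding` and all values are hypothetical; with a real model the same averaging would presumably be applied to the input embedding tensor (and to the output embeddings if they are not tied).

```python
def mean_embedding(subtoken_ids, embeddings):
    """Average the embedding rows of the subtokens a new token used to be split into."""
    if not subtoken_ids:
        raise ValueError("a new token needs at least one constituent subtoken")
    dim = len(embeddings[subtoken_ids[0]])
    return [
        sum(embeddings[tid][d] for tid in subtoken_ids) / len(subtoken_ids)
        for d in range(dim)
    ]


# Toy embedding table: row index = token ID (values are made up).
embeddings = [
    [0.0, 2.0],  # 0: "card"
    [4.0, 0.0],  # 1: "io"
    [9.0, 9.0],  # 2: low-utility token whose slot is being reused
]

# The new token "cardio" was previously tokenized as ["card", "io"] -> IDs [0, 1].
# It takes over slot 2, so row 2 is overwritten with the mean of rows 0 and 1.
embeddings[2] = mean_embedding([0, 1], embeddings)
print(embeddings[2])
```

When many slots are replaced at once, it is probably safer to compute every mean from a copy of the original matrix, so that a row that has already been overwritten never feeds into another new token's average.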
Example 1: Adapting LLaMA tokenizer for biomedical text
User: "I'm fine-tuning LLaMA-3-8B on PubMed abstracts. The tokenizer splits medical terms into too many pieces. How can I adapt it?"
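Approach: follow the workflow above on the PubMed corpus. Likely replacement candidates are biomedical subwords such as cardio, pathy, neur, oscopy, pharma, glyc, emia, etc.

Output: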
```python
import collections

from transformers import AutoTokenizer

# Step 1: Load base tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
vocab = tokenizer.get_vocab()
V = len(vocab)

# Step 2: Load adaptation corpus
with open("pubmed_abstracts.txt", encoding="utf-8") as f:
    corpus = f.read()

# Step 3: Tokenize and count frequencies
# (no BOS/EOS, so the counts reflect corpus text only)
token_ids = tokenizer.encode(corpus, add_special_tokens=False)
freq = collections.Counter(token_ids)

# Step 4: Score utility (frequency * characters covered per occurrence)
utility = {}
for tid, count in freq.items():
    token_str = tokenizer.decode([tid])
    utility[tid] = count * len(token_str)

# Tokens with zero frequency on the domain corpus are immediate replacement candidates
zero_freq = [tid for tid in sorted(vocab.values()) if tid not in freq]

# Step 5: Select up to K zero-frequency tokens (excluding special + single-byte)
special_ids = set(tokenizer.all_special_ids)
removable = [
    tid for tid in zero_freq
    if tid not in special_ids
    # Assumes a byte-level BPE vocabulary, where each single byte is stored
    # as a one-character token string; those fallback tokens must be kept.
    and len(tokenizer.convert_ids_to_tokens(tid)) > 1
]
K = min(len(removable), int(0.10 * V))
to_remove = removable[:K]

print(f"Replacing {K} tokens ({100*K/V:.1f}% of vocabulary)")
print(f"Baseline fertility: {len(token_ids) / len(corpus.split()):.3f}")
```
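Example 1 stops once the tokens to remove have been chosen. The next step, learning merge candidates from the adaptation corpus, might look like the toy trainer below. It is a sketch under simplifying assumptions: it works on characters rather than bytes, splits on whitespace only, and breaks ties alphabetically. The function name `learn_bpe_candidates` and the toy corpus are hypothetical; a real pipeline would use a proper BPE trainer.

```python
import collections


def learn_bpe_candidates(corpus, base_vocab, num_merges):
    """Run toy character-level BPE and return (merges, new tokens ranked by frequency)."""
    # Each distinct word is a tuple of symbols with its corpus count.
    words = collections.Counter(tuple(w) for w in corpus.split())
    merges, candidates = [], {}
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = collections.Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        # Only tokens absent from the base vocabulary are replacement candidates.
        if merged not in base_vocab:
            candidates[merged] = pairs[best]
        # Apply the merge to every word.
        new_words = collections.Counter()
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += count
        words = new_words
    ranked = sorted(candidates, key=lambda t: (-candidates[t], t))
    return merges, ranked


corpus = "cardio cardio cardiac card anemia glycemia"
merges, ranked = learn_bpe_candidates(corpus, base_vocab={"ar"}, num_merges=3)
print(merges)
print(ranked)
```

The returned `merges` list keeps every learned rule, including those whose result already exists in the base vocabulary, because later merges may depend on them; only `ranked` is filtered to new tokens.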
Example 2: Specializing a tokenizer for French
User: "Our GPT-2 based model tokenizes French poorly -- words like 'aujourd'hui' and 'c'est' get fragmented badly. Can we fix the tokenizer?"
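Approach: follow the workflow above with a French adaptation corpus. Likely replacement candidates include aujourd, qu', c'est, ment, tion, and other common French affixes.

Output: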
```python
from transformers import AutoTokenizer

# Base tokenizer of the GPT-2 based model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Measure before/after fertility for French text
test_sentences = [
    "Aujourd'hui, c'est une journée magnifique.",
    "L'électroménager est en promotion.",
    "Qu'est-ce que vous en pensez?",
]


def report(tok, label):
    """Print token count, fertility, and tokens for each test sentence."""
    print(f"=== {label} ===")
    for sent in test_sentences:
        tokens = tok.tokenize(sent)
        words = sent.split()
        print(f"  '{sent}' -> {len(tokens)} tokens, "
              f"fertility={len(tokens)/len(words):.2f}")
        print(f"  Tokens: {tokens}")


report(tokenizer, "Before Adaptation")

# After adaptation: `adapted_tokenizer` is the tokenizer produced by the swap step
print()
report(adapted_tokenizer, "After Adaptation")
```
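The per-sentence numbers above extend naturally to the two corpus-level metrics used in the workflow: fertility and the proportion of continued words. The sketch below assumes `tokenize` is any callable that maps one word to a list of tokens. The `toy_tokenize` function is a hypothetical stand-in so the example runs without a model, and tokenizing word by word is a simplification, since real tokenizers usually attach the leading space to a token.

```python
def tokenization_metrics(text, tokenize):
    """Return (fertility, proportion of continued words) for whitespace-delimited words."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    pieces_per_word = [len(tokenize(word)) for word in words]
    # Fertility: average number of tokens per word.
    fertility = sum(pieces_per_word) / len(words)
    # Continued words: fraction of words split into two or more tokens.
    continued = sum(1 for n in pieces_per_word if n >= 2) / len(words)
    return fertility, continued


def toy_tokenize(word, vocab=frozenset({"c'est", "une", "jour", "née"})):
    """Hypothetical tokenizer: greedy longest match, falling back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


fertility, continued = tokenization_metrics("c'est une journée", toy_tokenize)
print(f"fertility={fertility:.2f}, continued={continued:.2f}")
```

With a Hugging Face tokenizer, passing `tok.tokenize` as the callable would give comparable numbers, subject to the word-by-word caveat above.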
Example 3: Adapting for source code tokenization
User: "I want to use a general LLM for code review, but the tokenizer wastes tokens on common programming patterns. How do I adapt it?"
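Approach: follow the workflow above with a source-code adaptation corpus. Frequent patterns to cover include def __init__, self., import, return, function, const, async/await. Likely replacement candidates include self., __init__, return, import, function, and common variable patterns like _id, _name.

Output: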
```
Before: def __init__(self, name): -> ['def', ' __', 'init', '__(', 'self', ',', ' name', '):'] (8 tokens)
After:  def __init__(self, name): -> ['def', ' __init__', '(self', ',', ' name', '):'] (6 tokens)

Before fertility on Python corpus: 1.72
After fertility on Python corpus: 1.41
Token savings: ~18% fewer tokens for equivalent code
```
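The ~18% figure follows directly from the two fertility values, because fertility is tokens per word and the word count of a fixed corpus does not change. A quick check:

```python
before, after = 1.72, 1.41  # fertility values reported above

# On a fixed corpus the token count scales with fertility,
# so the relative drop in fertility is the relative token saving.
savings = (before - after) / before
print(f"Token savings: {savings:.1%}")
```

The single `def __init__` line saves more than the corpus average (8 tokens down to 6, or 25%), which is expected for a pattern the adaptation targets directly.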
| Problem | Cause | Solution |
|---------|-------|----------|
| Tokenizer produces <unk> after adaptation | Single-byte fallback tokens were removed | Restore all 256 byte-level tokens; never remove them |
| Model quality degrades after adaptation | Too many tokens replaced or embeddings not initialized | Reduce K, use mean-subtoken embedding init, add fine-tuning steps |
| Fertility barely improves | Replacement candidates overlap with existing tokens | Verify new candidates are genuinely absent from original vocab; lower the merge threshold |
| Merge list inconsistencies | New merge rules conflict with existing ones | Insert new merges at the end of the merge list (lowest priority); let existing merges take precedence |
| OOV errors on general text | Over-specialized vocabulary lost general coverage | Keep K below 10% for models that must handle mixed-domain input |
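The advice on merge list inconsistencies rests on how BPE applies merges: the earliest applicable rule in the list always wins. The toy sketch below illustrates why appending new merges leaves existing behaviour intact. `apply_merges` is a hypothetical helper that works on characters, and the merge lists are made up for this example.

```python
def apply_merges(word, merges):
    """Tokenize one word by repeatedly applying the highest-priority (earliest) merge."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [
            (rank[p], i)
            for i, p in enumerate(zip(symbols, symbols[1:]))
            if p in rank
        ]
        if not pairs:
            break
        _, i = min(pairs)  # earliest merge in the list wins, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


base_merges = [("i", "n"), ("i", "t"), ("in", "it")]
# New domain merges are appended, so every base merge keeps its original priority.
adapted_merges = base_merges + [("_", "_"), ("__", "init"), ("__init", "__")]

print(apply_merges("__init__", base_merges))
print(apply_merges("__init__", adapted_merges))
print(apply_merges("init", base_merges) == apply_merges("init", adapted_merges))
```

Words that the base merges already handled are tokenized identically under both lists; only text that reaches the appended rules changes.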
Liyanage, V. & Yvon, F. (2026). AdaptBPE: From General Purpose to Specialized Tokenizers. EACL 2026. arXiv:2601.21665
Key sections to consult: the utility scoring formula (Section 3), the token replacement algorithm (Algorithm 1), and the multilingual evaluation (Section 5) showing fertility improvements across domains and languages.