skills/nlp/bpe-offset-char-alignment/SKILL.md
Reconstruct character-level offsets for BPE tokens by decoding each token individually and accumulating lengths for precise span mapping
When a tokenizer doesn't provide reliable character offsets (e.g. some Hugging Face tokenizers with added whitespace), reconstruct them by decoding each BPE token individually and tracking the cumulative character position. This assumes that concatenating the individually decoded tokens reproduces the input text exactly. Then map a character-level span label to token-level start and end indices by finding the first and last tokens that overlap the annotated span.
def compute_offsets(token_ids, tokenizer):
    """Reconstruct character offsets by decoding each token."""
    offsets = []
    idx = 0
    for tid in token_ids:
        word = tokenizer.decode([tid])
        offsets.append((idx, idx + len(word)))
        idx += len(word)
    return offsets

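def char_span_to_token_span(offsets, char_start, char_end):
    """Map a character span [char_start, char_end) to inclusive token indices.

    Returns (None, None) if no token overlaps the span.
    """
    token_start, token_end = None, None
    for i, (a, b) in enumerate(offsets):
        # Token [a, b) overlaps the span when each starts before the other ends.
        if a < char_end and b > char_start:
            if token_start is None:
                token_start = i
            token_end = i
    return token_start, token_end
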
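# Usage
# enc.ids assumes an Encoding object, as returned by the `tokenizers` library
enc = tokenizer.encode(text, add_special_tokens=False)
offsets = compute_offsets(enc.ids, tokenizer)

# Find where selected_text starts in the original text
char_start = text.find(selected_text)
if char_start == -1:  # str.find returns -1 when there is no match
    raise ValueError("selected_text not found in text")
char_end = char_start + len(selected_text)
tok_start, tok_end = char_span_to_token_span(offsets, char_start, char_end)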
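
To make the two steps concrete, here is a small self-contained walk-through. It is only a sketch: `ToyTokenizer` is a hypothetical stand-in that maps ids to fixed string pieces (it is not a real BPE tokenizer or any library's API), and the two functions are restated from above so the block runs on its own.

```python
from typing import List, Optional, Sequence, Tuple


class ToyTokenizer:
    """Hypothetical stand-in: each id maps to a fixed string piece."""

    def __init__(self, pieces: Sequence[str]) -> None:
        self.pieces = list(pieces)

    def decode(self, ids: Sequence[int]) -> str:
        return "".join(self.pieces[i] for i in ids)


def compute_offsets(token_ids: Sequence[int], tokenizer) -> List[Tuple[int, int]]:
    """Accumulate decoded token lengths into (start, end) character offsets."""
    offsets, idx = [], 0
    for tid in token_ids:
        piece = tokenizer.decode([tid])
        offsets.append((idx, idx + len(piece)))
        idx += len(piece)
    return offsets


def char_span_to_token_span(
    offsets: Sequence[Tuple[int, int]], char_start: int, char_end: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return the first and last (inclusive) token indices overlapping the span."""
    token_start, token_end = None, None
    for i, (a, b) in enumerate(offsets):
        if a < char_end and b > char_start:
            if token_start is None:
                token_start = i
            token_end = i
    return token_start, token_end


text = " so happy today"
toy = ToyTokenizer([" so", " happ", "y", " today"])
token_ids = [0, 1, 2, 3]

# The method only works if the decoded pieces concatenate back to the text.
assert toy.decode(token_ids) == text

offsets = compute_offsets(token_ids, toy)
print(offsets)  # [(0, 3), (3, 8), (8, 9), (9, 15)]

selected_text = "happy"
char_start = text.find(selected_text)        # 4
char_end = char_start + len(selected_text)   # 9
tok_start, tok_end = char_span_to_token_span(offsets, char_start, char_end)
print(tok_start, tok_end)  # 1 2
```

Note that the end index is inclusive, and that the recovered tokens decode to `" happy"` with a leading space: a token is selected as soon as any of its characters fall inside the span, so the token span can be slightly wider than the character span.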
Use this in place of the tokenizer's built-in `.offsets()`, which can be buggy.
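
Whichever source the offsets come from, one way to sanity-check them is to slice the text with each pair and compare the slice against the decoded token. The sketch below assumes a hypothetical helper, `find_offset_mismatches` (not part of any library), and a list `pieces` holding the per-token strings, for example from `tokenizer.decode([tid])`.

```python
from typing import List, Sequence, Tuple


def find_offset_mismatches(
    text: str,
    offsets: Sequence[Tuple[int, int]],
    pieces: Sequence[str],
) -> List[int]:
    """Return indices of tokens whose offset slice differs from the decoded piece."""
    return [
        i
        for i, ((a, b), piece) in enumerate(zip(offsets, pieces))
        if text[a:b] != piece
    ]


text = " so happy today"
pieces = [" so", " happ", "y", " today"]

good = [(0, 3), (3, 8), (8, 9), (9, 15)]
shifted = [(0, 2), (2, 7), (7, 8), (8, 14)]  # every slice is off by one character

print(find_offset_mismatches(text, good, pieces))     # []
print(find_offset_mismatches(text, shifted, pieces))  # [0, 1, 2, 3]
```

A non-empty result means the offsets and the decoded pieces do not line up with the text, so span labels mapped through those offsets should not be trusted without further inspection.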