skills/nlp/bio-tagging-sliding-window/SKILL.md
Splits long documents into overlapping fixed-length windows and assigns BIO NER tags, for BERT token classification on sequences that exceed the model's maximum length.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wenmin-wu/ds-skills nlp-bio-tagging-sliding-window
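BERT token classifiers accept a fixed maximum input length (typically 512 subword tokens). To tag longer documents, split any over-length sentence into overlapping fixed-length windows of words, tag each window independently with BIO labels, then merge the per-window results. Because consecutive windows share OVERLAP words, any entity no longer than the overlap appears whole in at least one window, even if a boundary cuts through it in the neighboring window. The limits below are counted in whitespace-separated words, not subword tokens; setting MAX_LENGTH below 512 leaves some headroom for words the tokenizer splits into several pieces, though it does not guarantee that a tokenized window fits.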
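The boundary arithmetic can be checked with a small sketch. The helper names `window_spans` and `covered_whole` are illustrative and not part of the skill; the sketch assumes windows advance by `max_length - overlap` words and stop once a window reaches the last word.

```python
def window_spans(n_words, max_length=400, overlap=100):
    """Return (start, end) word offsets of each overlapping window."""
    if overlap >= max_length:
        raise ValueError("overlap must be smaller than max_length")
    if n_words <= 0:
        return []
    stride = max_length - overlap
    spans = []
    for start in range(0, n_words, stride):
        end = min(start + max_length, n_words)
        spans.append((start, end))
        if end == n_words:  # this window already reaches the last word
            break
    return spans


def covered_whole(spans, ent_start, ent_len):
    """True if at least one window contains the whole entity span."""
    return any(s <= ent_start and ent_start + ent_len <= e for s, e in spans)


spans = window_spans(650)
print(spans)                          # [(0, 400), (300, 650)]
print(covered_whole(spans, 395, 10))  # True
```

In this example a 10-word entity starting at word 395 is cut by the first window but falls entirely inside the second.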
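MAX_LENGTH = 400  # maximum words per window
OVERLAP = 100     # words shared by consecutive windows

def shorten_sentences(sentences):
    """Split sentences longer than MAX_LENGTH words into overlapping windows."""
    short = []
    stride = MAX_LENGTH - OVERLAP
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for start in range(0, len(words), stride):
                short.append(" ".join(words[start:start + MAX_LENGTH]))
                # Stop once a window reaches the end; a further window
                # would only repeat words that are already covered.
                if start + MAX_LENGTH >= len(words):
                    break
        else:
            short.append(sentence)
    return short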
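
def tag_sentence(sentence, labels):
    """Tag each word B (entity start), I (inside) or O (outside) by exact match."""
    words = sentence.split()
    tags = ["O"] * len(words)
    for label in labels:
        label_words = label.split()
        if not label_words:
            continue  # an empty label would otherwise match at every position
        for i in range(len(words) - len(label_words) + 1):
            if words[i:i + len(label_words)] == label_words:
                tags[i] = "B"
                for j in range(i + 1, i + len(label_words)):
                    tags[j] = "I"
    return list(zip(words, tags))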
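The tags above are word-level, while BERT consumes subword tokens, so each window's tags usually need to be expanded to one label per token before training. The following is a minimal sketch, not part of the skill. It assumes a `word_ids` list of the kind Hugging Face fast tokenizers return from `BatchEncoding.word_ids()` when the input is passed with `is_split_into_words=True`. The label map `LABEL2ID` and the choice to label only the first piece of each word are assumptions.

```python
LABEL2ID = {"O": 0, "B": 1, "I": 2}  # hypothetical label map
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_tags_to_subwords(word_tags, word_ids):
    """Expand word-level BIO tags to one label id per subword token.

    word_ids maps each token to the index of its source word, with None
    for special tokens such as [CLS] and [SEP].
    """
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(IGNORE_INDEX)  # special token
        elif word_id != previous:
            labels.append(LABEL2ID[word_tags[word_id]])  # first piece of a word
        else:
            labels.append(IGNORE_INDEX)  # continuation piece, excluded from the loss
        previous = word_id
    return labels


word_tags = ["O", "B", "I"]
word_ids = [None, 0, 1, 1, 2, None]  # word 1 was split into two pieces
print(align_tags_to_subwords(word_tags, word_ids))
# [-100, 0, 1, -100, 2, -100]
```

Labeling only the first piece keeps one prediction per word, which makes it straightforward to map model outputs back to the word-level windows.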
Each window holds at most MAX_LENGTH words, and consecutive windows share OVERLAP words.
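The overview mentions merging results across windows, but the skill's code does not show that step. One possible sketch follows, with `merge_window_tags` as a hypothetical helper. It assumes window k starts at word `k * stride` and resolves each overlapping position by taking the tag from the window in which the word sits farthest from a window edge. That policy is an assumption, not the skill's documented method.

```python
def merge_window_tags(window_tags, stride):
    """Merge per-window tag lists into one document-level tag list.

    window_tags[k] holds the tags of the window starting at word k * stride.
    In overlapping regions the tag from the window where the word is
    farthest from an edge wins (assumed policy).
    """
    if not window_tags:
        return []
    total = max(k * stride + len(tags) for k, tags in enumerate(window_tags))
    merged = ["O"] * total
    best_margin = [-1] * total
    for k, tags in enumerate(window_tags):
        start = k * stride
        for i, tag in enumerate(tags):
            margin = min(i, len(tags) - 1 - i)  # distance to nearest window edge
            if margin > best_margin[start + i]:
                best_margin[start + i] = margin
                merged[start + i] = tag
    return merged


# Windows of 6 words with an overlap of 2, so the stride is 4.
window_0 = ["O", "O", "O", "O", "O", "O"]  # words 0-5, entity at 5-6 is cut off
window_1 = ["O", "B", "I", "O", "O", "O"]  # words 4-9, entity seen whole
print(merge_window_tags([window_0, window_1], stride=4))
# ['O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O']
```

Preferring the more central window means a word at the very edge of one window, where context is thinnest, defers to the neighboring window that sees it with context on both sides.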