skills/nlp/regex-hybrid-ner-fallback/SKILL.md
Supplements transformer NER predictions with regex-based detection for structured entities (email, phone, URL), aligning regex matches back to token indices via subsequence search.
npx skillsauth add wenmin-wu/ds-skills nlp-regex-hybrid-ner-fallback
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Transformer NER models excel at contextual entity recognition but often miss structured patterns that follow rigid formats, such as email addresses, phone numbers, and URLs. A regex fallback layer detects these patterns in the raw text, aligns each match back to token indices using subsequence search, converts the aligned tokens to BIO labels, and merges them with the model's predictions. This hybrid approach typically adds 1-3% F-score by catching entities the model missed, at no additional model inference cost.
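To make the alignment step concrete, here is a minimal, stdlib-only sketch of regex match, subsequence search, and BIO labeling on a single sentence. It assumes a whitespace tokenizer (`toy_tokenize`, a hypothetical stand-in for whatever tokenizer produced the document tokens), and `bio_labels` is an illustrative name rather than part of the skill.

```python
import re

PHONE = re.compile(r'\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}')

def toy_tokenize(text):
    # Assumption: whitespace splitting stands in for the real tokenizer.
    return text.split()

def bio_labels(text, pattern, label):
    """Label every token of `text` with O, B-<label>, or I-<label>."""
    tokens = toy_tokenize(text)
    labels = ['O'] * len(tokens)
    for match in pattern.finditer(text):
        # Tokenize the matched string on its own, then look for that
        # token sequence inside the document tokens.
        target = toy_tokenize(match.group())
        n = len(target)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                labels[i] = f'B-{label}'
                for j in range(i + 1, i + n):
                    labels[j] = f'I-{label}'
    return list(zip(tokens, labels))

print(bio_labels("Call (555) 123-4567 today", PHONE, 'PHONE_NUM'))
```

The alignment only succeeds when tokenizing the match in isolation yields the same tokens as the document. With this toy tokenizer, a trailing period glued to the number (`123-4567.`) would make the subsequence search miss the match.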
import re
import spacy

# A blank English pipeline is enough: only the rule-based tokenizer is used.
nlp = spacy.blank("en")

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}')

def find_span(target_tokens, doc_tokens):
    """Find every occurrence of a token subsequence in the document."""
    spans = []
    if not target_tokens:
        return spans
    for i in range(len(doc_tokens) - len(target_tokens) + 1):
        if doc_tokens[i:i+len(target_tokens)] == target_tokens:
            spans.append(list(range(i, i + len(target_tokens))))
    return spans

def regex_augment(tokens, full_text, doc_id):
    """Detect structured entities via regex and convert to BIO labels.

    `tokens` is the document as a list of token strings.
    """
    extra_preds = []
    seen = set()  # find_span already returns every occurrence of a string
    for pattern, label in [(EMAIL_RE, 'EMAIL'), (PHONE_RE, 'PHONE_NUM')]:
        for match in pattern.finditer(full_text):
            if (label, match.group()) in seen:
                continue  # avoid emitting the same spans twice
            seen.add((label, match.group()))
            # Tokenize the match the same way as the document, then align.
            target = [t.text for t in nlp.tokenizer(match.group())]
            for span in find_span(target, tokens):
                for i, token_idx in enumerate(span):
                    prefix = 'B' if i == 0 else 'I'
                    extra_preds.append({
                        'document': doc_id,
                        'token': token_idx,
                        'label': f'{prefix}-{label}',
                    })
    return extra_preds

# Merge with model predictions
# (model_preds, tokens, text, and doc_id come from the surrounding pipeline)
all_preds = model_preds + regex_augment(tokens, text, doc_id)
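Plain list concatenation keeps both rows when the model and the regex layer label the same token. The sketch below shows one possible merge policy, assuming the model's label should win on any overlap. `merge_predictions` is a hypothetical helper, not part of the original skill. It expects rows shaped like the output of `regex_augment`, with the regex rows listed span by span (a `B-` row followed by its `I-` rows).

```python
def merge_predictions(model_preds, regex_preds):
    """Add regex spans only where the model predicted nothing.

    Each prediction is a dict with 'document', 'token', and 'label' keys.
    """
    taken = {(p['document'], p['token']) for p in model_preds}

    # Group regex rows into spans: a B- row starts a span, I- rows extend it.
    spans = []
    for pred in regex_preds:
        if pred['label'].startswith('B-') or not spans:
            spans.append([pred])
        else:
            spans[-1].append(pred)

    merged = list(model_preds)
    for span in spans:
        keys = [(p['document'], p['token']) for p in span]
        # Drop the whole span on any overlap so no I- row is left
        # without its B- row.
        if not any(key in taken for key in keys):
            merged.extend(span)
            taken.update(keys)
    return sorted(merged, key=lambda p: (p['document'], p['token']))
```

Dropping whole spans instead of individual tokens keeps the BIO sequence well formed. The trade-off is that a regex span is lost even if the model labeled only part of it.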