skills/nlp/japanese-transliteration-normalization/SKILL.md
Converts Japanese scripts (Hiragana, Katakana, Kanji) to romanized ASCII using pykakasi for cross-script entity matching.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wenmin-wu/ds-skills nlp-japanese-transliteration-normalization
Multilingual entity matching fails when the same entity appears in different scripts: "東京タワー" and "Tokyo Tower" share no characters, so character-based string similarity scores them as unrelated. pykakasi converts Japanese Hiragana, Katakana, and Kanji to romanized ASCII (romaji), which lets standard string similarity metrics work across scripts. Apply the conversion only to Japanese records, before computing matching features.
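To make the mismatch concrete, the sketch below scores the two spellings with Python's standard-library `difflib`. The `similarity` helper is a hypothetical name introduced here, and the romaji string is written by hand for illustration; it is an assumption about what a romanizer might produce, not verified pykakasi output.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case and spaces."""
    norm_a = a.lower().replace(" ", "")
    norm_b = b.lower().replace(" ", "")
    return SequenceMatcher(None, norm_a, norm_b).ratio()


latin_name = "Tokyo Tower"
japanese_name = "東京タワー"
# Hand-written romaji, for illustration only (assumption, not library output).
romaji_name = "toukyou tawaa"

raw_score = similarity(japanese_name, latin_name)    # no shared characters
romaji_score = similarity(romaji_name, latin_name)   # overlapping Latin letters
print(f"raw: {raw_score:.2f}, romanized: {romaji_score:.2f}")
```

The raw comparison scores exactly zero because the two strings have no characters in common. The romanized form is not identical to the English spelling, so a fuzzy metric with a threshold is still needed rather than exact equality.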
```python
import pykakasi


def romanize_japanese(df, text_cols, country_col="country"):
    """Convert Japanese text fields to romaji for cross-script matching."""
    kks = pykakasi.kakasi()  # build the converter once and reuse it

    def convert_text(text):
        # Leave missing values (NaN, None) and other non-strings untouched
        if not isinstance(text, str):
            return text
        # convert() returns one dict per segment; "hepburn" holds its romaji
        result = kks.convert(text)
        return " ".join(item["hepburn"] for item in result)

    # Romanize only rows flagged as Japanese; note that df is modified in place
    jp_mask = df[country_col] == "JP"
    for col in text_cols:
        df.loc[jp_mask, col] = df.loc[jp_mask, col].apply(convert_text)
    return df
```

```python
# Romanize name and address fields for Japanese records
df = romanize_japanese(df, ["name", "address", "city", "state"])
```
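Selecting rows by country code assumes that column is populated and reliable. Where it is not, one possible alternative is to detect Japanese script in the text itself. The sketch below is an assumption layered on top of this skill, not part of it: `has_japanese_script` and its `kanji_counts` parameter are hypothetical names, and the check covers only the main Hiragana, Katakana, and CJK Unified Ideographs blocks (half-width Katakana, for example, is not covered).

```python
import re

# Hiragana (U+3040-309F) and Katakana (U+30A0-30FF) are specific to Japanese.
_KANA = re.compile(r"[\u3040-\u30ff]")
# CJK Unified Ideographs (U+4E00-9FFF) are shared with Chinese, so a
# Kanji-only string is ambiguous on its own.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def has_japanese_script(text, kanji_counts=True):
    """Return True if text contains kana, or (optionally) CJK ideographs."""
    if not isinstance(text, str):
        return False  # missing values such as None or NaN
    if _KANA.search(text):
        return True
    return kanji_counts and bool(_CJK.search(text))


print(has_japanese_script("東京タワー"))
print(has_japanese_script("Tokyo Tower"))
```

Setting `kanji_counts=False` restricts the check to kana, which avoids flagging Chinese-language records at the cost of missing Japanese names written entirely in Kanji.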
pip install pykakasi