skills/cv/structured-gt-serialization-tokens/SKILL.md
Serialize structured ground truth (chart data, table rows, form fields) as special-token-delimited sequences for generative vision-language model training
npx skillsauth add wenmin-wu/ds-skills cv-structured-gt-serialization-tokensInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
When training a generative vision model to extract structured data from images (charts, tables, forms), encode the ground truth as a flat token sequence using custom special tokens as delimiters. Add type tokens, separator tokens, and field boundary tokens to the tokenizer, then serialize each sample's annotations into a string the decoder can learn to produce.
from transformers import DonutProcessor
PROMPT_TOKEN = "<|PROMPT|>"
X_START, X_END = "<x_start>", "<x_end>"
Y_START, Y_END = "<y_start>", "<y_end>"
TYPE_TOKENS = ["<line>", "<bar>", "<scatter>", "<dot>"]
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
processor.tokenizer.add_tokens([PROMPT_TOKEN, X_START, X_END, Y_START, Y_END] + TYPE_TOKENS)
def serialize_gt(chart_type, x_values, y_values):
type_token = f"<{chart_type}>"
x_str = X_START + ";".join(map(str, x_values)) + X_END
y_str = Y_START + ";".join(map(str, y_values)) + Y_END
return PROMPT_TOKEN + type_token + x_str + y_str
gt_string = serialize_gt("line", ["2020", "2021", "2022"], [10.5, 15.2, 20.1])
# "<|PROMPT|><line><x_start>2020;2021;2022<x_end><y_start>10.5;15.2;20.1<y_end>"
model.decoder.resize_token_embeddings(len(tokenizer)); to delimit list items — avoids ambiguity with commas in numbersresize_token_embeddings after adding tokensdata-ai
Scaled Pinball Loss (SPL) metric for evaluating quantile forecasts, normalized by mean absolute successive differences of training data
data-ai
Walk backward through a time series and multiplicatively rescale segments when jumps exceed a fraction of the running mean to correct data collection anomalies
testing
Transform forecasting target to next/current ratio minus one so that optimizing MAE or squared error implicitly minimizes SMAPE
tools
Convert point forecasts to prediction intervals by scaling with logit-transformed quantile ratios passed through a Normal CDF