skills/llm/synthetic-data-augmentation/SKILL.md
Generates additional training examples using a stronger LLM (e.g., GPT-3.5) to augment small labeled datasets.
npx skillsauth add wenmin-wu/ds-skills llm-synthetic-data-augmentation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When the labeled training set is tiny (100-500 examples), use a stronger LLM to generate synthetic training data in the same format as the real examples. For multiple-choice QA, prompt GPT-3.5 or GPT-4 to write questions with answer options drawn from a knowledge source. This can improve fine-tuned model accuracy by 2-5% at minimal cost.
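import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_mcq(topic, n=10):
    # Ask the stronger model for n questions in a fixed, parseable format.
    prompt = f"""Generate {n} multiple-choice science questions about {topic}.
Format each as:
Question: ...
A) ... B) ... C) ... D) ... E) ...
Answer: X"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    # parse_mcq is not defined in this snippet; it is assumed to turn the raw
    # completion text into a list of dicts, one per question.
    return parse_mcq(response.choices[0].message.content)

# Combine with real data (real_train_df is the existing labeled DataFrame)
synthetic = generate_mcq("physics", n=100)
synthetic_df = pd.DataFrame(synthetic)
train_df = pd.concat([real_train_df, synthetic_df]).reset_index(drop=True)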
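The snippet above calls `parse_mcq` without defining it. A minimal sketch of such a helper follows, assuming the model sticks to the `Question: / A) ... E) / Answer: X` layout requested in the prompt. The function body, the regular expression, and the output keys (`question`, `A` to `E`, `answer`) are illustrative assumptions, not part of the original skill.

```python
import re

# Hypothetical helper: one item = question text, five options, one answer letter.
_ITEM_RE = re.compile(
    r"Question:\s*(?P<question>.+?)\s*"
    r"A\)\s*(?P<A>.+?)\s*B\)\s*(?P<B>.+?)\s*C\)\s*(?P<C>.+?)\s*"
    r"D\)\s*(?P<D>.+?)\s*E\)\s*(?P<E>.+?)\s*"
    r"Answer:\s*(?P<answer>[A-E])\b",
    re.DOTALL,
)

def parse_mcq(text):
    """Return a list of dicts parsed from raw model output."""
    rows = []
    # Split before each "Question:" so a malformed item cannot swallow the next one.
    for chunk in re.split(r"(?=Question:)", text):
        match = _ITEM_RE.search(chunk)
        if match is None:
            continue  # skip anything that does not follow the requested format
        # Collapse internal whitespace and newlines in every captured field.
        rows.append({k: " ".join(v.split()) for k, v in match.groupdict().items()})
    return rows
```

Passing the result to `pd.DataFrame(...)` gives one row per question. The column names will likely need renaming to match the real training set before concatenation, and items the model formats differently are silently dropped, so it is worth comparing the number of parsed rows against the number requested.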