skills/better-as-generators-than/SKILL.md
Generate synthetic labeled datasets with LLMs to train smaller, cheaper classifiers -- especially for low-resource languages and niche tasks. Use when: 'generate training data for my classifier', 'I need labeled data in [language]', 'distill this LLM into a smaller model', 'create synthetic examples for fine-tuning', 'bootstrap a text classifier without manual annotation', 'train a multilingual classifier with no labeled data'.
npx skillsauth add ndpvt-web/arxiv-claude-skills better-as-generators-than
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and execute a synthetic data generation pipeline where a large LLM acts as a teacher/generator producing labeled training examples, which are then used to train or prompt smaller, cheaper student models that consistently outperform the generator itself on classification tasks. The technique is most powerful for low-resource languages and underrepresented tasks where labeled data is scarce, achieving up to 40% improvement over direct LLM classification with as few as 50 synthetic samples per class.
The core insight is that LLMs are better used as data generators than as direct classifiers. When you prompt a large LLM to classify text zero-shot, its accuracy is reasonable but not strong -- especially in non-English languages. However, when you instead use that same LLM to generate synthetic labeled examples and then train a smaller model (like XLM-RoBERTa or an 8B-parameter LLM) on those examples, the smaller model consistently beats the large generator. This works because generation leverages the LLM's broad linguistic knowledge to produce diverse, natural-sounding examples, while the smaller model specializes entirely on the classification boundary.
The pipeline has three stages: (1) Generate synthetic labeled text using a large LLM with carefully structured prompts that include 10 human-labeled seed examples per class as in-context demonstrations, producing 6 diverse samples per inference call. (2) Filter the generated data using the generator itself as a quality judge, removing duplicates, language-contaminated samples (e.g., English leaking into Telugu output), and low-quality text. (3) Train a smaller model via fine-tuning (XLM-RoBERTa), LoRA instruction-tuning (8B LLMs), or use the synthetic samples as in-context examples for compact LLMs.
Critical finding on sample efficiency: performance gains plateau around 200 samples per class, and even 50 samples per class is enough to outperform the large generator across all language resource levels. Beyond 400 samples, synthetic data diversity drops and marginal returns diminish. This means the pipeline is cheap to run -- a few hundred generation calls per task are sufficient.
Define the classification schema. Identify the exact task (sentiment, topic, intent, etc.), enumerate all class labels, and specify the target language(s). For each label, write a clear, unambiguous definition that the generator LLM can follow.
Collect 5-10 seed examples per class. These are real human-written examples that anchor generation quality. If you have zero labeled data, manually write or translate 5-10 examples per label. These serve as in-context demonstrations in the generation prompt.
Construct the generation prompt. Use this template structure:
Please create 6 different {task_description} texts in the {language} language,
separated by "|||". Each text should express {label_definition}.
Here are examples of {label} texts in {language}:
1. {seed_example_1}
2. {seed_example_2}
...
10. {seed_example_10}
Output only the texts in {language} and nothing else. Do not number the texts.
Set generation parameters: temperature=0.7, top_p=0.9, repetition_penalty=1.2.
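Because the template asks for texts separated by "|||", each response has to be split back into individual samples before filtering. The following is a minimal sketch, assuming the raw model output is available as a plain string. `parse_generated_batch` is a hypothetical helper name, and stripping leading list numbers is a defensive assumption in case the model ignores the "do not number" instruction.

```python
import re

SEPARATOR = "|||"  # separator requested in the generation prompt


def parse_generated_batch(raw_output: str) -> list[str]:
    """Split one generation response into individual, cleaned samples."""
    samples = []
    for piece in raw_output.split(SEPARATOR):
        text = piece.strip()
        # Assumption: drop a leading "1." or "2)" if the model numbered the texts anyway.
        text = re.sub(r"^\d+[.)]\s+", "", text)
        if text:  # skip empty pieces produced by stray separators
            samples.append(text)
    return samples


if __name__ == "__main__":
    # Made-up response used only for illustration.
    demo = "Barang bagus sekali ||| 2. Pengiriman cepat |||  ||| Kualitas oke"
    print(parse_generated_batch(demo))
```

A call may return fewer than 6 usable texts, so it is safer to count parsed samples than to count calls.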
Generate 200 samples per class. Run the generation prompt in batches (6 samples per call), iterating across all labels. At 6 samples per call this is about 34 calls per class (200 / 6), so roughly 67 calls in total for a binary task and roughly 334 calls in total for a 10-class task. Track the label distribution to ensure balance.
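Tracking the label distribution can be as simple as counting accepted samples per label and reporting how many are still missing. This is a sketch under the assumption that samples are stored as `(text, label)` pairs; `remaining_per_label` is a hypothetical helper, and the target of 200 mirrors the per-class figure given above.

```python
from collections import Counter


def remaining_per_label(samples, labels, target=200):
    """Return how many more samples each label needs to reach the target.

    samples: iterable of (text, label) pairs (assumed storage format).
    labels:  every label in the schema, so labels with zero samples still appear.
    """
    counts = Counter(label for _, label in samples)
    return {label: max(0, target - counts[label]) for label in labels}


if __name__ == "__main__":
    # Made-up data used only for illustration.
    collected = [("teks a", "positive")] * 3 + [("teks b", "negative")]
    print(remaining_per_label(collected, ["positive", "negative"], target=3))
```

Running the check after each batch lets the generation loop stop early for labels that are already full and keep going for those that lag.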
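Filter generated samples for quality. Apply three filters sequentially:
- Deduplication: remove duplicate samples.
- Language purity: use a language-identification tool (e.g., langdetect, fasttext-langid) to flag samples containing unexpected language mixing.
- Quality: use the generator itself as a judge to remove low-quality text.
Choose the student model and training strategy based on deployment constraints:
- Fine-tune a compact encoder such as XLM-RoBERTa.
- LoRA instruction-tune an 8B-parameter LLM.
- Use the synthetic samples as in-context examples for a compact LLM, with no training.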
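The filtering step described above (duplicates, language contamination, low quality) can be sketched as a single pass over the generated samples. This is an illustrative outline, not the paper's implementation: `detect_language` and `judge` are hypothetical callables supplied by the caller (for example, a wrapper around a language-identification library and a wrapper around the generator LLM), and running the cheapest check first is an assumption about ordering.

```python
import re
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def filter_samples(
    samples: Iterable[str],
    expected_lang: str,
    detect_language: Callable[[str], str],  # hypothetical: returns a language code
    judge: Callable[[str], bool],           # hypothetical: True if the LLM judge accepts
) -> list[str]:
    seen = set()
    kept = []
    for text in samples:
        key = normalize(text)
        if not key or key in seen:  # filter 1: empty or duplicate
            continue
        seen.add(key)
        if detect_language(text) != expected_lang:  # filter 2: wrong or mixed language
            continue
        if not judge(text):  # filter 3: judged low quality
            continue
        kept.append(text)
    return kept


if __name__ == "__main__":
    # Stub detector and judge, for illustration only.
    def fake_detect(text):
        return "en" if "the" in text.lower().split() else "id"

    raw = ["Barang bagus", "barang  bagus", "I love the product", "ok"]
    print(filter_samples(raw, "id", fake_detect, lambda t: len(t) > 3))
```

Injecting the detector and judge keeps the sketch independent of any particular library or model API, and makes the filter easy to test with stubs.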
Evaluate on held-out real data. Always test on human-labeled examples, never on synthetic data. Compare the student model against the generator's zero-shot and few-shot classification performance as baselines. Report per-language and per-class metrics.
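To compare the student against the generator baseline on the same held-out human-labeled set, per-class F1 and macro-F1 are one reasonable choice of metric (the choice of F1 is an assumption here, not something the skill prescribes). The sketch below uses plain Python, and the labels and predictions in the demo are made up for illustration.

```python
def per_class_f1(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Compute F1 for every label that occurs in the gold data."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    gold = ["pos", "pos", "neg", "neg"]          # human-labeled test set
    student = ["pos", "pos", "neg", "pos"]       # student model predictions
    generator = ["pos", "neg", "neg", "pos"]     # generator zero-shot predictions
    print("student:", per_class_f1(gold, student), macro_f1(gold, student))
    print("generator:", per_class_f1(gold, generator), macro_f1(gold, generator))
```

Reporting the per-class dictionary alongside the macro average makes underperforming classes visible at a glance.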
Iterate on weak spots. If specific classes or languages underperform, generate additional targeted samples for those categories, re-filter, and retrain. Low-resource languages benefit most from even small increases in synthetic data volume.
Example 1: Bootstrapping a sentiment classifier for Indonesian product reviews
User: "I need a sentiment classifier for Indonesian customer reviews but I have no labeled data."
Approach:
- Labels: positive, negative.
- Prompt settings: language=Indonesian, task_description=product review sentiment.
Output pipeline code:
import json

# Step 3: Generation prompt for one class
def build_generation_prompt(label, seed_examples, language="Indonesian"):
    examples_text = "\n".join(f"{i+1}. {ex}" for i, ex in enumerate(seed_examples))
    return f"""Please create 6 different product review texts in the {language} language,
separated by "|||". Each review should express {label} sentiment about a product.
Here are examples of {label} reviews in {language}:
{examples_text}
Output only the review texts in {language} and nothing else. Do not number the texts."""

# Step 5: Quality filter prompt
def build_filter_prompt(text, label, language="Indonesian"):
    return f"""Is the following text a natural {label} product review written entirely in {language}?
Text: "{text}"
Answer only "yes" or "no"."""

# Generation parameters
gen_params = {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 4096,
    "repetition_penalty": 1.2,
}
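The filter prompt above asks the judge to answer only "yes" or "no", so the reply still has to be interpreted before a sample is kept or dropped. A minimal sketch follows, assuming a hypothetical `call_llm(prompt) -> str` function that wraps whatever model API is in use; `build_prompt` stands in for a prompt builder such as `build_filter_prompt`. Treating anything other than a leading "yes" as a rejection is a conservative assumption.

```python
from typing import Callable


def is_accepted(answer: str) -> bool:
    """Interpret the judge's reply; anything not starting with 'yes' counts as a rejection."""
    return answer.strip().lower().startswith("yes")


def judge_samples(
    texts: list[str],
    label: str,
    build_prompt: Callable[[str, str], str],  # e.g. a filter-prompt builder
    call_llm: Callable[[str], str],           # hypothetical model wrapper
) -> list[str]:
    """Keep only the texts the judge accepts for the given label."""
    return [t for t in texts if is_accepted(call_llm(build_prompt(t, label)))]


if __name__ == "__main__":
    # Stub prompt builder and model, for illustration only.
    stub_prompt = lambda text, label: text
    stub_llm = lambda prompt: "Yes." if "bagus" in prompt else "no"
    print(judge_samples(["Barang bagus", "asdf qwer"], "positive", stub_prompt, stub_llm))
```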
Example 2: Multi-class intent classifier for a Welsh-language voice assistant
User: "Build me an intent classifier that works in Welsh for a voice assistant. I have 10 intent categories."
Approach:
- Define the 10 intent labels (set_alarm, play_music, get_weather, send_message, etc.).
Key consideration for Welsh (low-resource): Generate an extra 50% buffer of samples because the language purity filter will reject more outputs than for high-resource languages.
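The effect of the 50% buffer on the generation budget can be worked out directly. The sketch below assumes the target of 200 filtered samples per class and 6 texts per call from the workflow; `generation_plan` is a hypothetical helper, and it assumes every call returns the full 6 texts.

```python
import math


def generation_plan(target_per_class, num_classes, buffer=0.5, samples_per_call=6):
    """Estimate raw samples and calls needed when a share of outputs will be filtered out."""
    raw_per_class = math.ceil(target_per_class * (1 + buffer))
    calls_per_class = math.ceil(raw_per_class / samples_per_call)
    return {
        "raw_per_class": raw_per_class,
        "calls_per_class": calls_per_class,
        "total_calls": calls_per_class * num_classes,
    }


if __name__ == "__main__":
    # 10 Welsh intents, 200 samples per intent after filtering, 50% buffer.
    print(generation_plan(target_per_class=200, num_classes=10))
```

With these assumptions the plan comes to 300 raw samples and 50 calls per intent, or 500 calls for all 10 intents.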
Example 3: Rapid prototyping with ICL instead of fine-tuning
User: "I need a quick topic classifier for Swahili news articles. No time to fine-tune."
Approach:
Classify the following Swahili text into one of these topics:
politics, sports, technology, health, business, entertainment, science.
Examples:
Text: "{synthetic_example_1}" -> politics
Text: "{synthetic_example_2}" -> sports
...
Text: "{user_input}" ->
- Use temperature=0 for deterministic classification. Limit output to 20 tokens.
- When generating, set repetition_penalty=1.2 to maximize diversity within each batch.

| Problem | Cause | Fix |
|---------|-------|-----|
| Generated text contains English phrases mixed in | LLM defaults to English for unfamiliar concepts | Add explicit instruction: "Do not use any English words." Re-filter with language detection. Generate extra buffer samples. |
| All generated samples sound similar | Temperature too low or repetition penalty too low | Increase temperature to 0.8, increase repetition_penalty to 1.3. Vary seed examples across batches. |
| Student model performs worse than generator | Too few samples or severe class imbalance | Verify at least 50 samples per class survived filtering. Check class distribution. Regenerate for underrepresented classes. |
| Fine-tuning overfits quickly | Synthetic data lacks diversity | Use dropout=0.3, reduce epochs, add early stopping. Consider mixing in any available real data. |
| Generator refuses to produce certain content | Safety filters triggered by task labels (e.g., toxicity detection) | Reframe the prompt: instead of "generate toxic text," use "generate examples of text that would be flagged as inappropriate by a content moderator." |
| Language detection rejects valid samples | Mixed-script languages or loanwords | Use a per-language threshold rather than a hard cutoff. For languages with heavy borrowing (e.g., Swahili with Arabic/English loanwords), relax the purity check. |
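For the in-context learning route in Example 3, the classification prompt can be assembled programmatically from filtered synthetic samples. This is a sketch that follows the prompt layout shown in that example; `build_icl_prompt` is a hypothetical helper, and the Swahili strings in the demo are made up for illustration.

```python
def build_icl_prompt(
    examples: list[tuple[str, str]],  # (synthetic text, label) pairs
    labels: list[str],
    user_input: str,
    language: str = "Swahili",
) -> str:
    """Build a few-shot classification prompt ending at the slot the model should fill."""
    lines = [
        f"Classify the following {language} text into one of these topics:",
        ", ".join(labels) + ".",
        "Examples:",
    ]
    for text, label in examples:
        lines.append(f'Text: "{text}" -> {label}')
    lines.append(f'Text: "{user_input}" ->')
    return "\n".join(lines)


if __name__ == "__main__":
    demo_examples = [("Habari ya siasa", "politics"), ("Timu imeshinda", "sports")]
    print(build_icl_prompt(demo_examples, ["politics", "sports"], "Bei ya mafuta imepanda"))
```

Ending the prompt on the open `->` slot pairs naturally with a short output limit, since the model only needs to emit a single label.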
Better as Generators Than Classifiers (Pecher et al., EACL 2026 Findings) -- Focus on Section 4 (experimental setup with prompt templates and generation parameters), Table 2 (performance by language resource level), and Section 5.2 (analysis of sample efficiency showing the 50-sample threshold).