skills/ai-generating-data/SKILL.md
Generate synthetic training data when you do not have enough real examples. Use when you are starting from scratch with no data, need a proof of concept fast, have too few examples for optimization, cannot use real customer data for privacy or compliance, need to fill gaps in edge cases, have unbalanced categories, added new categories, or changed your schema. Also used for create training data with AI, not enough examples to train, augment small dataset, generate labeled examples from scratch, cold start problem for AI, need data but cannot label manually, privacy-safe synthetic data, test data generation for ML, create diverse training examples, data augmentation for NLP, bootstrap dataset from nothing, DSPy synthetic data generation, quality filtering, bootstrapping from zero.
Guide the user through generating high-quality synthetic training data with DSPy. This solves the "I do not have data" problem that blocks every other AI workflow.
Ask the user:
Your generator's outputs should match your task's inputs and expected outputs.
```python
import dspy

# Your task: what the AI will do in production
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket into a category."""
    ticket_text: str = dspy.InputField()
    category: str = dspy.OutputField()

# Generator: produces examples for your task
class GenerateTicketExample(dspy.Signature):
    """Generate a realistic support ticket with its correct category."""
    category: str = dspy.InputField(desc="the target category to generate an example for")
    ticket_text: str = dspy.OutputField(desc="a realistic support ticket for this category")
```
The generator's output fields become inputs to your task. Think of it as: "given what I want the answer to be, generate a realistic input."
Start with 5-10 hand-written examples. These anchor the generator's understanding of what "realistic" means for your domain.
```python
seeds = [
    dspy.Example(
        ticket_text="I was charged twice for my subscription this month. Order #4521.",
        category="billing"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="The app crashes when I try to upload a profile photo on Android.",
        category="bug"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="How do I export my data to CSV? I cannot find the option anywhere.",
        category="how-to"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="I would love to see dark mode added. The white background hurts my eyes.",
        category="feature-request"
    ).with_inputs("ticket_text"),
    dspy.Example(
        ticket_text="My account got locked after too many login attempts. Please help.",
        category="account"
    ).with_inputs("ticket_text"),
]
```
Even 5 seeds dramatically improve generation quality over zero.
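Before generating, it can help to confirm that the seed set covers every category. The following is a minimal sketch, not part of the skill itself: it assumes seeds are represented as plain dicts with `ticket_text` and `category` keys (with `dspy.Example` objects you would read `ex.category` instead), and `seed_coverage` is a hypothetical helper name.

```python
from collections import Counter

def seed_coverage(seed_rows, categories):
    """Count seeds per category and list categories with no seed at all."""
    counts = Counter(row["category"] for row in seed_rows)
    missing = [c for c in categories if counts[c] == 0]
    return counts, missing

# Plain-dict stand-ins for the seed examples (assumption for illustration)
seed_rows = [
    {"ticket_text": "I was charged twice for my subscription this month.", "category": "billing"},
    {"ticket_text": "The app crashes when I upload a profile photo.", "category": "bug"},
    {"ticket_text": "How do I export my data to CSV?", "category": "how-to"},
]
categories = ["billing", "bug", "how-to", "feature-request", "account"]

counts, missing = seed_coverage(seed_rows, categories)
print(missing)  # categories that still need a hand-written seed
```

Any category reported as missing is a candidate for one more hand-written seed before generation starts.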
Pick the strategy that fits your gap:
| Strategy | When to use | Example |
|----------|------------|---------|
| Category-driven | Fix class imbalance, new categories | Generate N per category |
| Seed-and-vary | Augment existing examples with different tones | Vary each seed by tone, length, complexity |
| Scenario-driven | Target specific edge cases | Generate from failure scenario descriptions |
| Difficulty-driven | Build a balanced difficulty curve | Generate easy/medium/hard separately |
| Diversity trick (sindex) | Prevent repetitive outputs | Add random seed index to break LM patterns |
| Programmatic (Faker) | Structured fields with known formats | Names, addresses, dates, IDs cheaply at scale |
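The seed-and-vary row can be sketched as a small planning step that pairs each seed with a few distinct tone, length and complexity combinations. This is an illustrative sketch under assumptions: the axis values and the `variation_specs` helper are hypothetical, and each resulting spec would still need to be passed to a generator signature that accepts those fields.

```python
import itertools
import random

# Hypothetical variation axes; adjust to your domain
TONES = ["polite", "frustrated", "terse"]
LENGTHS = ["one sentence", "short paragraph"]
COMPLEXITIES = ["single issue", "two related issues"]

def variation_specs(seed_texts, per_seed, rng):
    """Pair each seed with `per_seed` distinct (tone, length, complexity) combos."""
    grid = list(itertools.product(TONES, LENGTHS, COMPLEXITIES))
    specs = []
    for text in seed_texts:
        # rng.sample draws without replacement, so combos do not repeat per seed
        for tone, length, complexity in rng.sample(grid, k=min(per_seed, len(grid))):
            specs.append({"seed_text": text, "tone": tone,
                          "length": length, "complexity": complexity})
    return specs

rng = random.Random(0)  # fixed seed so the plan is reproducible
specs = variation_specs(["I was charged twice this month."], per_seed=4, rng=rng)
print(len(specs))  # 4 variation requests for the single seed
```

Planning the variations up front makes the spread across tones and lengths explicit instead of leaving it to the LM.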
```python
lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

categories = ["billing", "bug", "how-to", "feature-request", "account"]
examples = []
generator = dspy.Predict(GenerateTicketExample)

for category in categories:
    for _ in range(50):
        result = generator(category=category)
        examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category)
            .with_inputs("ticket_text")
        )

print(f"Generated {len(examples)} examples")
```
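The loop above generates a fixed 50 examples per category. When some labeled data already exists and the goal is to fix class imbalance, one option is to compute a per-category quota first. A minimal sketch, assuming a flat list of existing labels; `generation_quotas` is a hypothetical helper, not a DSPy function.

```python
from collections import Counter

def generation_quotas(existing_labels, categories, target_per_category):
    """Return how many new examples each category needs to reach the target."""
    counts = Counter(existing_labels)
    return {c: max(0, target_per_category - counts[c]) for c in categories}

existing = ["billing"] * 40 + ["bug"] * 55 + ["how-to"] * 5
quotas = generation_quotas(existing, ["billing", "bug", "how-to", "account"], 50)
print(quotas)  # {'billing': 10, 'bug': 0, 'how-to': 45, 'account': 50}
```

The inner loop would then run `range(quotas[category])` instead of a fixed count.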
```python
class GenerateScenarioTicket(dspy.Signature):
    """Generate a support ticket matching a specific scenario."""
    category: str = dspy.InputField()
    scenario: str = dspy.InputField(desc="the specific scenario to generate")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateScenarioTicket)

scenarios = [
    ("billing", "customer charged in wrong currency"),
    ("billing", "refund for a cancelled subscription"),
    ("bug", "issue only happens on slow network connections"),
    ("how-to", "customer is non-technical and confused by jargon"),
]

for category, scenario in scenarios:
    result = gen(category=category, scenario=scenario)
    examples.append(
        dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text")
    )
```
For structured fields (names, addresses, dates, phone numbers), Faker generates hundreds of thousands of examples instantly with zero LM cost. In one production case, 500K synthetic name records were generated with Faker + custom cultural providers, then used to fine-tune models to 96% accuracy.
```python
from faker import Faker
from faker.providers import BaseProvider

fake = Faker()

# Custom provider for domain-specific data
class TicketProvider(BaseProvider):
    def order_id(self):
        return f"ORD-{self.random_int(1000, 99999)}"

    def product_name(self):
        return self.random_element(["Pro Plan", "Starter", "Enterprise", "Team"])

fake.add_provider(TicketProvider)

# Generate structured training records at scale
examples = []
for _ in range(10_000):
    examples.append(dspy.Example(
        ticket_text=f"Hi, I'm {fake.name()}. Order {fake.order_id()} for {fake.product_name()} "
                    f"was charged to {fake.email()} but I need it on a different card.",
        category="billing"
    ).with_inputs("ticket_text"))
```
When to use Faker vs LM generation:
Add a random sindex field to push the LM toward varied outputs:
```python
import random

class GenerateDiverse(dspy.Signature):
    """Generate a unique and realistic support ticket."""
    category: str = dspy.InputField()
    sindex: str = dspy.InputField(desc="a unique seed index for diversity")
    ticket_text: str = dspy.OutputField()

gen = dspy.Predict(GenerateDiverse)

for category in categories:
    for _ in range(50):
        result = gen(category=category, sindex=str(random.randint(0, 1_000_000)))
        examples.append(
            dspy.Example(ticket_text=result.ticket_text, category=category).with_inputs("ticket_text")
        )
```
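To check whether a diversity trick such as `sindex` is actually helping, a rough proxy is the share of word n-grams that are unique across the generated texts. The distinct-n measure below is an assumption brought in for illustration (the skill does not prescribe it), and `distinct_n` is a hypothetical helper.

```python
def distinct_n(texts, n=2):
    """Fraction of word n-grams that are unique across all texts (0 to 1)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

repetitive = ["my card was charged twice", "my card was charged twice today"]
varied = ["my card was charged twice", "refund has not arrived after ten days"]

# A lower value means more repeated phrasing across examples
print(distinct_n(repetitive), distinct_n(varied))
```

Comparing the value for a batch generated with and without `sindex` gives a quick, if crude, signal before reading examples by hand.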
Generated data always contains bad examples. Generate 2-3x what you need, keep ~50%.
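The over-generation rule is simple arithmetic: divide the number of examples you need by the fraction you expect to survive filtering. A small sketch, where `examples_to_generate` is a hypothetical helper and the keep rate is your own estimate:

```python
import math

def examples_to_generate(target_count, expected_keep_rate=0.5):
    """Number of raw examples to generate so about target_count survive filtering."""
    if not 0 < expected_keep_rate <= 1:
        raise ValueError("expected_keep_rate must be in (0, 1]")
    return math.ceil(target_count / expected_keep_rate)

print(examples_to_generate(200))        # 400
print(examples_to_generate(200, 0.25))  # 800
```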
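```python
# Assumed task metric: exact match on category (swap in your own)
def metric(example, prediction, trace=None):
    return prediction.category.strip().lower() == example.category.strip().lower()

program = dspy.ChainOfThought(ClassifyTicket)
filtered = []
for ex in examples:
    # Keep an example only if the task program reproduces its label
    pred = program(**ex.inputs())
    if metric(ex, pred):
        filtered.append(ex)
print(f"Kept {len(filtered)}/{len(examples)} ({100*len(filtered)//len(examples)}%)")
```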
```python
class AssessExample(dspy.Signature):
    """Is this a realistic and correctly labeled example?"""
    ticket_text: str = dspy.InputField()
    category: str = dspy.InputField()
    is_realistic: bool = dspy.OutputField(desc="true if this looks like a real support ticket")
    is_correctly_labeled: bool = dspy.OutputField(desc="true if the category matches the ticket")

assessor = dspy.Predict(AssessExample)
filtered = [
    ex for ex in examples
    if (r := assessor(ticket_text=ex.ticket_text, category=ex.category)).is_realistic
    and r.is_correctly_labeled
]
```
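```python
# Drop exact duplicates (ignoring case and surrounding whitespace), keeping the first occurrence
seen = set()
unique = [ex for ex in filtered if (k := ex.ticket_text.strip().lower()) not in seen and not seen.add(k)]
filtered = unique
```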
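Exact-match deduplication misses tickets that differ only by punctuation or a word or two. One possible extension, sketched here with the standard library's `difflib` (an assumption, not something the skill specifies), drops texts that are highly similar to one already kept. `drop_near_duplicates` and the 0.9 threshold are hypothetical choices, and the pairwise comparison is quadratic, so it suits datasets of a few thousand examples at most.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is less than `threshold` similar to every kept text."""
    kept, kept_normalized = [], []
    for text in texts:
        normalized = " ".join(text.lower().split())
        if all(SequenceMatcher(None, normalized, seen).ratio() < threshold
               for seen in kept_normalized):
            kept.append(text)
            kept_normalized.append(normalized)
    return kept

tickets = [
    "I was charged twice for my subscription.",
    "I was charged twice for my subscription!",
    "The app crashes when I upload a photo.",
]
print(len(drop_near_duplicates(tickets)))  # 2
```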
Optimizing the prompt used to generate data dramatically improves downstream quality. This is meta-optimization: better generator prompts produce better data.
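```python
class DataGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateTicketExample)

    def forward(self, category):
        return self.generate(category=category)

def generator_metric(example, prediction, trace=None):
    # A generated ticket counts as good if the task classifier recovers the intended category
    classifier = dspy.Predict(ClassifyTicket)
    task_example = dspy.Example(ticket_text=prediction.ticket_text, category=example.category).with_inputs("ticket_text")
    task_pred = classifier(**task_example.inputs())
    return task_pred.category.lower() == example.category.lower()

# The generator takes `category` as its input, so re-mark the seeds' input field
# (the seeds were marked with "ticket_text" for the classification task)
generator_trainset = [s.with_inputs("category") for s in seeds]

optimizer = dspy.BootstrapFewShot(metric=generator_metric)
optimized_generator = optimizer.compile(DataGenerator(), trainset=generator_trainset)
```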
```python
import random

from dspy.evaluate import Evaluate

random.shuffle(filtered)
split = int(len(filtered) * 0.8)
trainset, devset = filtered[:split], filtered[split:]

program = dspy.ChainOfThought(ClassifyTicket)
# `metric` is your task metric, for example an exact match on `category`
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
optimized = optimizer.compile(program, trainset=trainset)

evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
result = evaluator(optimized)
# Depending on the DSPy version, Evaluate returns a float or a result object with `.score`
score = getattr(result, "score", result)
print(f"Score on synthetic dev set: {score:.1f}%")
# Typical: 70-85% on synthetic, validate on real data when available

optimized.save("optimized_program.json")
```
If you have even a small number of real examples, use them as the dev set instead — real data gives more trustworthy evaluation.
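When the dev set does come from synthetic data, the plain shuffle-and-slice split used above can leave a rare category under-represented in the dev set. A stratified split is one way around that. The sketch below assumes rows are plain dicts with a label key (for `dspy.Example` objects you would group by `ex.category`), and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, train_fraction=0.8, seed=0):
    """Split rows so each label keeps roughly the same train/dev ratio."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, dev = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train.extend(group[:cut])
        dev.extend(group[cut:])
    return train, dev

rows = [{"ticket_text": f"ticket {i}", "category": c}
        for c in ("billing", "bug") for i in range(10)]
train, dev = stratified_split(rows, "category")
print(len(train), len(dev))  # 16 4
```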
- .with_inputs() on generated Examples. Every synthetic dspy.Example must call .with_inputs("field1", ...) to mark input fields. Without this, the optimizer passes all fields (including expected outputs) to the program, inflating scores.
- n=N batch parameter is not supported by all providers. Claude defaults to dspy.Predict(sig, n=20) for batch generation, but Anthropic and some other providers do not support the n parameter. Use the loop pattern as a reliable fallback for any provider.

Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /ai-improving-accuracy
- /ai-fine-tuning
- /ai-kickoff
- /ai-sorting
- /dspy-signatures
- /ai-do, if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do