skills/general/create-data/SKILL.md
Generate synthetic training data from a ruleset file. Use when the user wants to create training examples, generate augmented data, or produce samples following defined rules.
npx skillsauth add beam-ai-team/beam-next-skills create-data
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
AI agents need test data to evaluate whether they work correctly. But real production data is scarce, sensitive, or doesn't cover edge cases. This skill generates synthetic test data that mimics production reality — including the messy, broken, and deceptive inputs the agent will face.
The key insight: a ruleset defines WHAT to generate, this skill executes HOW. The ruleset is the blueprint (created by /create-ruleset). This skill reads the blueprint and produces concrete input/output pairs the agent can be tested against.
Why not just generate data directly? Without a ruleset, generated data lacks diversity (all happy path), realism (artificial inputs), and coverage (misses edge cases). The ruleset-first approach ensures every generated sample serves a specific testing purpose.
- /create-ruleset. Why: The ruleset is the contract between what to generate and what quality looks like. Without it, the skill has no spec to follow.
- schema_version: "2.0". If v1 or missing, warn and suggest /create-ruleset upgrade. Why: v2.0 introduced Test Scenarios with What/Why/Varies cards. Older formats lack the structure needed for generation briefs.
- scripts/file_renderer.py) plus its assigned visual archetype description, so it can write a custom renderer. If it needs to score, include the scoring formula. Don't assume shared context.

| Resource | Path | Purpose |
|----------|------|---------|
| Context Discovery | context-discovery.md | Auto-discovery of client/project files |
| Expert 1: The Practitioner | experts/expert-1-production-fidelity.md | Production fidelity critic (Taleb's antifragility). Gold standard + batch review. |
| Expert 2: The Calibrator | experts/expert-2-derivation-auditor.md | Statistical & derivation auditor (Kahneman's dual-process). Gold standard + batch review. |
| Batch Review Persona | experts/expert-persona.md | Generic batch review rubric (Phase 3b post-batch) |
| Review Protocol | experts/review-loop.md | Two-phase review, variation matrix, generation briefs |
| Generation Prompt | prompts/generation-prompt.md | Domain-agnostic generation prompt blueprint |
| Verification Prompt | prompts/verification-prompt.md | Domain-agnostic evaluation prompt blueprint |
| Coverage Report | experts/coverage-report-template.md | Coverage matrix format and classification logic |
| File Generation | scripts/file-generation.md | Render spec schema, layouts, section types |
| File Renderer | scripts/file_renderer.py | Universal renderer for all document types |
| Gold Standard Review | experts/gold-standard-review-template.html | HTML template for gold standard review reports |
| Post-Gen Feedback | ../create-ruleset/generation/feedback-loop-format.md | Feedback loop back to ruleset |
| Interface Contract | ../shared/interface-contract.md | What sections to read from ruleset, YAML blocks, feedback protocol |
Export scripts: scripts/export.py (dispatcher), scripts/csv_export.py, scripts/xlsx_export.py, scripts/pdf_export.py, scripts/docx_export.py.
Parsers: parsers/pdf_parser.py, parsers/pdf_to_images.py, parsers/csv_reader.py, parsers/pptx_parser.py. For scanned PDFs: convert to images with pdf_to_images.py, then Read visually.
| Setting | Default |
|---------|---------|
| Samples per batch | 10 |
| Min consensus score | 0.8 (80/100) |
| Require unanimous validity | true |
| Minimum evaluators | 3 |
| Schema pre-check | true (for JSON/structured output) |
The steps are a RULESET REVIEW with the user. The purpose is NOT to present information you already have — it's to walk the user through the ruleset's content so they can spot errors, correct assumptions, and refine the rules BEFORE any data is generated. Fixing a ruleset issue at Step 2 costs 1 edit; fixing it after generating 12 samples costs 12 regenerations.
CRITICAL: Never jump straight to generation, even if the user just created the ruleset in the same session. The user may have approved the ruleset at a high level during create-ruleset but will notice issues when they see the specifics presented back to them in a data-generation context. Every step is a correction window.
CRITICAL: Wait for user confirmation at EVERY step. Do not present Step 2 until the user confirms Step 1. Do not batch steps unless the user explicitly says "looks good, keep going" or "skip ahead." Silence is not confirmation.
When user feedback conflicts with the ruleset, flag it:
⚠️ CONFLICT: [dimension]
📄 Ruleset says: [what the ruleset defines]
🗣️ You said: [what the user requested]
💡 Recommendation: [update ruleset or treat as exception]
Ask: "Update the ruleset or keep as exception for this batch?"
When the user changes input format, output format, fields, or file types during Steps 1-4 → update the ruleset HTML file in-place. The ruleset is the source of truth. Briefly mention: "Updated the ruleset to reflect this."
Do NOT update for batch-only preferences (fewer samples, skip a dimension this run).
Each step has a what (what you present), a why (why this step exists), and a question (what you ask the user). Never skip the question.
Step 1: The Task
Step 2: Input
Step 3: Output
Step 4: Test Scenarios
Step 5: Gold Standard (3 examples)
- scripts/file_renderer.py for realistic files
- scripts/file_renderer.py) for fast iteration; visual archetype diversity is enforced in batch generation (Phase 2), not here.
- python scripts/file_renderer.py spec.json --output sample.pdf

Step 6: Sample Plan
Step 7: Generate & Review
Collapsing: You may batch Steps 2+3 ONLY if the user explicitly says "looks good, keep going" after Step 1. NEVER skip Step 5 (Gold Standards). NEVER batch all of Steps 1-4 into a single presentation: the whole point is the correction window.
Why this order matters: Each step narrows the scope. Step 1 confirms the agent's purpose. Step 2 confirms what goes in. Step 3 confirms what comes out. Step 4 confirms how we test it. Step 5 confirms the quality bar. By Step 7, there should be zero surprises — every decision was made and confirmed incrementally.
If $ARGUMENTS has context, use it. Otherwise ask: "What data do you want to generate?"
Then launch parallel subagents:
a) Ruleset + data scan:
- dataset/**/rulesets/*.html. If none → STOP, direct to /create-ruleset.
- /create-ruleset.
- _feedback.md: post-generation feedback from previous runs.

b) Context discovery (if client mentioned): Follow context-discovery.md.
Present ONE unified summary. Smart defaults:
- /create-ruleset.

Skip if no existing data. Map all existing I/O pairs against the ruleset's edge cases and variation dimensions. Only generate what's missing.
Steps:
- dataset/[client]/[use-case]/augmented/coverage-report.md.

Present summary: "Analyzed N samples against M edge cases. Coverage: X%. Missing: [list]. Generate only gaps?"
If --analyze-only: present report and STOP.
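The gap analysis above can be sketched roughly as follows. This is a minimal illustration, assuming each existing sample is already tagged with the edge cases it covers; the function name `coverage_gaps` and the edge-case labels are hypothetical, and the actual classification logic is defined in experts/coverage-report-template.md.

```python
def coverage_gaps(edge_cases, samples):
    """Return (coverage_percent, missing_edge_cases).

    edge_cases: ordered list of edge-case names taken from the ruleset.
    samples: iterable of collections, each naming the edge cases one
             existing sample covers (hypothetical tagging scheme).
    """
    covered = set()
    for labels in samples:
        covered.update(labels)
    # Preserve ruleset order so the "Missing" list is stable between runs.
    missing = [case for case in edge_cases if case not in covered]
    total = len(edge_cases)
    percent = 100.0 if total == 0 else 100.0 * (total - len(missing)) / total
    return percent, missing


edge_cases = ["empty_input", "scanned_pdf", "conflicting_dates", "duplicate_rows"]
samples = [{"empty_input"}, {"scanned_pdf", "empty_input"}]
percent, missing = coverage_gaps(edge_cases, samples)
print(
    f"Analyzed {len(samples)} samples against {len(edge_cases)} edge cases. "
    f"Coverage: {percent:.0f}%. Missing: {missing}"
)
```

Only the entries in `missing` would then be handed to generation.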
- schema_version: "2.0". See interface contract for which sections to extract.
- planning (from hidden PLANNING YAML comment) and test_scenarios (from TEST SCENARIOS section; each scenario's Varies chips provide the variation dimensions and distributions). These machine-readable blocks drive generation briefs.

Skip if approved gold standards exist and user agreed to reuse.
This follows the Step 5 mini-conversation from the protocol:
1. Generate one realistic input (targeting the most common scenario). Present to user. Iterate until input feels right.
2. Ask user to rate BEFORE showing output: "What rating would YOU give this?" This catches scoring calibration issues early.
3. Generate output using the ruleset's OUTPUT DERIVATION PROCEDURE. Compare with user's expected rating. Resolve disagreements now.
4. Iterate until locked. Run 2-expert review via parallel subagents (mandatory — both must pass). Save to [output_dir]/gold-standards/.
For 2+ gold standards: second targets a different category (edge case or adversarial).
Every gold standard MUST be reviewed by both expert personas before approval. Launch as 2 parallel subagents, each receiving the gold standard samples + full ruleset + their persona file:
| Expert | Persona File | Focus | Mental Model |
|--------|-------------|-------|--------------|
| Expert 1: The Practitioner | expert-1-production-fidelity.md | Production realism, structural fidelity, fragility | Taleb's antifragility: stress-tests against production messiness |
| Expert 2: The Calibrator | expert-2-derivation-auditor.md | Derivation correctness, scoring calibration, bias detection | Kahneman's dual-process: catches confabulation and anchoring |
Pass criteria: Both experts must return PASS (avg >= 7.5, no dimension < 6.0). If either fails:
Why two experts, not one: A single reviewer conflates production realism with derivation correctness. The Practitioner catches "this doesn't look like production data." The Calibrator catches "the output wasn't actually derived from the input." These are orthogonal failure modes — a sample can look realistic but have a confabulated output, or be correctly derived but unrealistically clean.
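The pass criteria above (average of at least 7.5, no dimension below 6.0, both experts passing) could be checked with something like the sketch below. The dimension names and the dict-of-scores shape are assumptions made for illustration; the real rubric dimensions live in the two persona files.

```python
def expert_passes(dimension_scores, min_avg=7.5, min_dimension=6.0):
    """True if one expert's review meets the stated thresholds.

    dimension_scores: dict mapping a rubric dimension to a 0-10 score
                      (dimension names here are hypothetical).
    """
    if not dimension_scores:
        return False  # an empty review cannot pass
    values = list(dimension_scores.values())
    average = sum(values) / len(values)
    return average >= min_avg and min(values) >= min_dimension


def gold_standard_approved(practitioner_scores, calibrator_scores):
    """Both experts must pass; one strong review cannot offset the other."""
    return expert_passes(practitioner_scores) and expert_passes(calibrator_scores)


practitioner = {"realism": 8.0, "structure": 7.5, "fragility": 8.5}
calibrator = {"derivation": 9.0, "calibration": 9.0, "bias": 5.5}

print(expert_passes(practitioner))
print(expert_passes(calibrator))  # high average, but one dimension is below 6.0
print(gold_standard_approved(practitioner, calibrator))
```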
Present full plan referencing the gold standard: "44 more like that, across these scenarios..."
Show as ASCII table. User confirms.
Do NOT proceed to generation until user approves both gold standard AND plan.
Follow experts/review-loop.md variation matrix spec.
If coverage report exists → incremental matrix targeting gaps only:
If no coverage report → full cross-product.
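As a rough sketch of the two cases above: the full cross-product enumerates every combination of variation dimensions, and the incremental matrix keeps only the combinations not yet covered. The dimension names and values are hypothetical; the authoritative matrix spec is in experts/review-loop.md.

```python
from itertools import product


def full_matrix(dimensions):
    """Every combination of dimension values, one dict per matrix cell."""
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def incremental_matrix(dimensions, covered_cells):
    """Cells from the full matrix that no existing sample covers yet."""
    seen = {tuple(sorted(cell.items())) for cell in covered_cells}
    return [
        cell
        for cell in full_matrix(dimensions)
        if tuple(sorted(cell.items())) not in seen
    ]


dimensions = {
    "scenario": ["happy_path", "edge_case"],
    "format": ["pdf", "csv", "docx"],
}
covered = [{"scenario": "happy_path", "format": "pdf"}]

print(len(full_matrix(dimensions)))
print(len(incremental_matrix(dimensions, covered)))
```

A full cross-product grows multiplicatively with each added dimension, which is one reason to prefer the incremental matrix when a coverage report exists.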
Also include:
- scripts/file-generation.md. Never assign the same archetype twice in one batch.

Each subagent receives THREE things, plus a fourth when files are rendered:
1. Gold Standard — quality bar (realism, specificity, depth). Defines "good," NOT content.
2. Generation Brief — specific assignment locking structural dimensions (scenario, industry, geography, format) and freeing content dimensions (names, companies, stories, skills, dates, wording). The brief maps directly to one TEST SCENARIO card — it locks the scenario's Varies dimensions and frees content details. MUST include the ruleset's content length constraints (e.g., "articles must be 400-1200 words per the INPUT SCHEMA") and content texture descriptions (e.g., "wire-service style: 400-700 words" vs "analytical: 700-1200 words"). Without explicit length requirements, subagents default to artificially short content that doesn't match production reality.
3. Anti-Repetition Blocklist — names/companies/titles from previous batches. "Do NOT reuse these."
4. Visual Archetype Assignment (if file rendering) — a visual archetype describing the target software system the document should appear to come from (e.g., "ATS Export," "Canva Modern," "LaTeX Academic"). Each subagent gets a DIFFERENT archetype from the list in scripts/file-generation.md. The subagent writes a custom renderer script based on the reference renderer (scripts/file_renderer.py), produces files with genuinely different visual DNA, and runs the renderer as part of generation — not as a separate post-processing step.
Blocklist accumulation protocol: After gold standards are locked, extract all person names, company names, university names, and job titles used. This forms the initial blocklist. After each batch of subagents completes, scan their outputs and append new names/companies to the blocklist before launching the next batch. For parallel subagents within the same batch, pre-assign distinct name pools (e.g., "subagent 1: use names starting A-F, subagent 2: G-L") to prevent collisions without blocking.
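The accumulation protocol might be implemented along these lines. This is a sketch under stated assumptions: the `names_used` key on each subagent output is hypothetical, and the letter-range split is only one way to pre-assign distinct name pools.

```python
import string


def assign_name_pools(num_subagents):
    """Split A-Z into contiguous letter ranges, one per parallel subagent."""
    letters = string.ascii_uppercase
    if not 1 <= num_subagents <= len(letters):
        raise ValueError("need between 1 and 26 subagents")
    base, extra = divmod(len(letters), num_subagents)
    pools, start = [], 0
    for index in range(num_subagents):
        size = base + (1 if index < extra else 0)  # spread the remainder
        pools.append(letters[start:start + size])
        start += size
    return pools


def accumulate_blocklist(blocklist, batch_outputs):
    """Append names from a finished batch before the next batch launches."""
    for output in batch_outputs:
        # "names_used" is an assumed field listing people, companies, titles.
        blocklist.update(name.strip().lower() for name in output["names_used"])
    return blocklist


pools = assign_name_pools(3)
blocklist = accumulate_blocklist(
    {"jane doe"},  # seeded from the locked gold standards
    [
        {"names_used": ["Acme Corp", " Jane Doe "]},
        {"names_used": ["Riverside University"]},
    ],
)
print(pools)
print(sorted(blocklist))
```

Normalizing to lowercase and stripping whitespace keeps trivially different spellings from slipping past the blocklist.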
Batch sizing:
| Total | Strategy |
|-------|----------|
| ≤10 | Inline (no subagents) |
| 11-30 | 3 subagents |
| 31-50 | 5 subagents |
| 51-100 | 10 subagents |
| 101+ | 10 subagents × multiple rounds with blocklist accumulation |
Small batches (≤10): Generate inline per plan.
Large batches (11+): Subagents with briefs write files to disk, return 1-line summaries.
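The batch sizing table maps onto a small lookup like the one below. The round count for totals above 100 is an assumption (one round per 100 samples), since the table only says "multiple rounds"; `batch_strategy` and `samples_per_round` are hypothetical names.

```python
import math


def batch_strategy(total, samples_per_round=100):
    """Pick inline vs. subagent generation from the requested sample count."""
    if total < 1:
        raise ValueError("total must be at least 1")
    if total <= 10:
        return {"mode": "inline", "subagents": 0, "rounds": 1}
    if total <= 30:
        subagents = 3
    elif total <= 50:
        subagents = 5
    else:
        subagents = 10
    # Assumption: above 100 samples, run one round per 100 samples.
    rounds = math.ceil(total / samples_per_round) if total > 100 else 1
    return {"mode": "subagents", "subagents": subagents, "rounds": rounds}


for total in (8, 44, 100, 250):
    print(total, batch_strategy(total))
```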
For each sample:
- scripts/file-generation.md#visual-diversity-protocol-critical for archetypes, rendering dimensions to vary, and anti-patterns.

Sample format:
{
  "transformation_rules": ["scenario: rich_batch_clean_data", "industry: tech"],
  "input": "agent input matching INPUT SCHEMA",
  "output": "expected output matching OUTPUT SCHEMA",
  "metadata": {
    "coverage_category": "happy_path",
    "variation_slots": {"industry": "tech", "format": "pdf"},
    "quality_score": 87,
    "consensus_scores": [85, 88, 88],
    "generated_at": "2026-03-31T10:30:00Z"
  }
}
For each sample, launch 3 parallel evaluator subagents using verification prompt.
Schema pre-check (for structured output): Before 3-LLM consensus, validate:
If pre-check fails → auto-reject (saves 3 LLM calls).
Consensus pass criteria:
- is_valid: true
- quality_score >= 80
- passes_ruleset: true

Launch review subagents (10 samples each) with rubric from experts/expert-persona.md.
- review-status.md with PASS/FAIL per sample

Save to: dataset/[client]/[use-case]/augmented/index.json
Fallback: dataset/augmented/[ruleset_name]/
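For reference, the consensus pass criteria from the verification step, combined with the configuration defaults (minimum 3 evaluators, unanimous validity, score of at least 80), could be expressed as the sketch below. Two points are assumptions: that `quality_score` is the mean of the evaluator scores (consistent with the sample format, where 85, 88, 88 yields 87), and that `passes_ruleset` must hold for every evaluator.

```python
def consensus_passes(
    evaluations,
    min_score=80,
    min_evaluators=3,
    require_unanimous_validity=True,
):
    """Apply the consensus criteria to one sample's evaluator results.

    evaluations: list of dicts with is_valid, quality_score, passes_ruleset.
    """
    if len(evaluations) < min_evaluators:
        return False
    valid_votes = sum(1 for e in evaluations if e["is_valid"])
    if require_unanimous_validity:
        if valid_votes != len(evaluations):
            return False
    elif valid_votes * 2 <= len(evaluations):
        return False  # assumption: fall back to a strict majority
    # Assumption: every evaluator must confirm ruleset compliance.
    if not all(e["passes_ruleset"] for e in evaluations):
        return False
    # Assumption: the consensus score is the mean of evaluator scores.
    mean_score = sum(e["quality_score"] for e in evaluations) / len(evaluations)
    return mean_score >= min_score


evaluations = [
    {"is_valid": True, "quality_score": 85, "passes_ruleset": True},
    {"is_valid": True, "quality_score": 88, "passes_ruleset": True},
    {"is_valid": True, "quality_score": 88, "passes_ruleset": True},
]
print(consensus_passes(evaluations))
```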
Phase 4a: File Generation (if requested) — for batch generation (11+ files), each subagent already wrote and ran its own custom renderer during Phase 2 (files are on disk). For gold standards and small inline batches (≤10), use the reference renderer scripts/file_renderer.py. For non-PDF formats (DOCX, XLSX, PPTX, HTML), the same principle applies: subagents write custom styling code, not just different parameters on shared scripts. See scripts/file-generation.md for visual archetypes, rendering dimensions to vary, and domain-specific guidance.
Phase 4b: Export (if requested):
python 01-skills/create-data/scripts/export.py dataset/[path]/index.json --format csv,pdf,docx,xlsx
Same principle: the existing export dispatcher and format scripts are reference implementations. For different tasks, rework them to fit the actual data schema.
After generation completes, scan for patterns and write feedback for the ruleset:
Save to: dataset/[client]/[use-case]/rulesets/[domain]_feedback.md
See post-generation-feedback.md for format.
Before writing, check if a _feedback.md already exists with an acknowledged block. If so, only report NEW issues — skip items that were already acknowledged by create-ruleset. See interface contract for the protocol.
This file is read by create-ruleset on the next run, closing the improvement loop.
=== Data Generation Summary ===
Ruleset: [name] (schema v2.0)
Generated: [total] | Verified: [passed] | Pass Rate: [X]%
Duplicates Removed: [count]
Diversity Distribution:
┌─────────────────┬──────────┬──────────┐
│ Dimension       │ Target   │ Actual   │
├─────────────────┼──────────┼──────────┤
│ [dimension]     │          │          │
└─────────────────┴──────────┴──────────┘
Quality: Avg [X] | Min [X] | Max [X]
Output: dataset/[path]/index.json
Feedback: dataset/[path]/rulesets/[domain]_feedback.md
| File Type | Tool | Fallback |
|-----------|------|----------|
| PDF (text) | parsers/pdf_parser.py | If garbled → parsers/pdf_to_images.py + Read visually |
| PDF (scanned) | parsers/pdf_to_images.py → Read each image | — |
| CSV | parsers/csv_reader.py | — |
| PPTX | parsers/pptx_parser.py | — |
| Images | Claude's native Read (visual) | — |
Dependencies: pdfplumber, python-pptx.
| Error | Action |
|-------|--------|
| No ruleset found | STOP. Direct to /create-ruleset. |
| Ruleset schema < v2.0 | Warn user, suggest /create-ruleset upgrade. Proceed only if user overrides. |
| No examples in ruleset | Ask user for 2-3 examples |
| High rejection (>50%) | Pause, ask user to review ruleset |
| Low diversity | Explicitly target underrepresented categories |
| Consensus failure | Log scores, use for prompt improvement |
| Subagent API error (500/timeout) | Auto-retry the failed subagent up to 2 times with a 10s delay. If still failing after 2 retries, log which sample IDs failed and continue with successful subagents. After the batch completes, relaunch failed samples as a recovery batch. Never silently drop samples — the user must see the full count. |
| Subagent produces invalid output | If output JSON is malformed or missing required fields, treat as a generation failure — re-run with the same brief. Do not attempt to fix malformed output manually. |
| Formula mismatch detected | If ai_rating ≠ round(weighted sum of sub-scores), auto-reject the sample. Log the discrepancy. Regenerate with explicit formula verification instruction. This is the #1 defect in scored datasets. |
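The formula-mismatch check in the last row of the table above lends itself to a mechanical test. The sketch below assumes sub-scores and weights are supplied as dicts and that halves round up; the ruleset's actual scoring formula and rounding rule take precedence, and the field names other than ai_rating are hypothetical. Note that Python's built-in round() rounds halves to the nearest even integer, which is why the sketch spells out its rounding.

```python
import math


def expected_rating(sub_scores, weights):
    """Weighted sum of sub-scores, rounded half up (assumed rounding rule)."""
    total = sum(sub_scores[name] * weights[name] for name in weights)
    return math.floor(total + 0.5)


def formula_matches(ai_rating, sub_scores, weights):
    """False means the sample should be auto-rejected and regenerated."""
    return ai_rating == expected_rating(sub_scores, weights)


sub_scores = {"relevance": 80, "accuracy": 90}  # hypothetical sub-score names
weights = {"relevance": 0.4, "accuracy": 0.6}   # hypothetical weights

print(expected_rating(sub_scores, weights))
print(formula_matches(86, sub_scores, weights))
print(formula_matches(88, sub_scores, weights))
```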
- _feedback.md so the ruleset improves