skills/general/data-baker/SKILL.md
Generate realistic synthetic datasets for AI agent training and testing across diverse domains. Use when user provides meeting transcripts, requirements documents, process flows, or requests synthetic data generation for identity documents, resumes, financial records, medical records, legal contracts, product catalogs, customer profiles, organizational knowledge, or email communications. Generates 10-100+ pages of realistic, cross-referenced, domain-appropriate synthetic data with proper formatting, metadata, and RAG-friendly structure. Keywords: synthetic data, dataset generation, dummy data, test data, mock documents, institutional knowledge, training data, sample documents, realistic data, batch generation, identity cards, resumes, invoices, bank statements, medical records, contracts, organizational documentation.
npx skillsauth add beam-ai-team/beam-next-skills data-baker
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Generate realistic synthetic datasets for AI agent training, testing, and RAG knowledge bases from meeting transcripts, requirements docs, or process flows.
Data Baker transforms input context (meeting transcripts, requirements documents, process flows) into comprehensive, realistic synthetic datasets suitable for AI agent training, testing, and RAG (Retrieval Augmented Generation) systems. It generates 10-100+ pages of domain-appropriate, cross-referenced documents with proper metadata, realistic formatting, and logical consistency.
Key Features:
Time Estimate: 15-45 minutes depending on dataset size and complexity
Create a TodoWrite list with all workflow steps:
- [ ] Analyze input context and determine data type
- [ ] Load appropriate reference guides
- [ ] Define dataset scope and structure
- [ ] Generate master entity registry (if applicable)
- [ ] Generate documents with cross-references
- [ ] Validate consistency and realism
- [ ] Package output with README
- [ ] Close session to save progress
This creates transparency and allows progress tracking.
Mark tasks complete as you finish each step.
Goal: Understand what type of synthetic data is needed
Actions:
Decision Logic:
Mark this todo complete before proceeding.
Goal: Load relevant reference files into context
Actions:
Always load: references/realism-guidelines.md
Load if batch generation: references/batch-generation.md
Load for strategy selection: references/generation-strategies.md
Mark this todo complete before proceeding.
Goal: Plan exactly what will be generated
Actions:
Define entities:
Define documents:
Define variation parameters:
Choose output structure:
Example Plan:
## Dataset: Customer Journey for E-commerce Platform
### Entities:
- 10 customers (varied demographics, ages 25-65)
- 5 products (price range $50-$1,500)
### Documents:
- 10 customer profiles (1 per customer)
- 30 invoices (3 per customer on average, range 1-5)
- 15 support tickets (not all customers have tickets)
- 20 emails (mix of marketing, support, transactional)
### Relationships:
- Invoices reference customers and products
- Support tickets reference invoices
- Emails reference customers and may reference tickets
### Output Structure: Entity-centric
customer_001/
profile.md
invoice_001.md
invoice_002.md
support_ticket_001.md
emails/
Mark this todo complete before proceeding.
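A plan like the one above fixes a total (30 invoices) and a per-customer range (1-5), so the per-customer counts have to be allocated before any documents are written. The sketch below shows one possible way to do that. `allocate_counts` is a hypothetical helper, not something shipped with this skill, and the seeded random allocation is only an assumption about how variation might be introduced.

```python
import random


def allocate_counts(n_entities: int, total: int, low: int, high: int, seed: int = 0) -> list[int]:
    """Split `total` documents across `n_entities`, each receiving low..high."""
    if not (n_entities * low <= total <= n_entities * high):
        raise ValueError("total is not reachable with the given range")
    rng = random.Random(seed)  # seeded so the same plan yields the same counts
    counts = [low] * n_entities  # everyone starts at the minimum
    remaining = total - sum(counts)
    while remaining > 0:
        i = rng.randrange(n_entities)
        if counts[i] < high:  # skip entities already at the maximum
            counts[i] += 1
            remaining -= 1
    return counts


# 30 invoices across 10 customers, each customer getting between 1 and 5.
invoice_counts = allocate_counts(n_entities=10, total=30, low=1, high=5)
for idx, count in enumerate(invoice_counts, start=1):
    print(f"customer_{idx:03d}: {count} invoice(s)")
```

Changing `seed` gives a different but still valid distribution, which can help when several dataset variants are needed.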
Goal: Create a single consistent reference for all entities (needed for batch generation or when entities reference one another)
Actions:
Generate core entities first:
Create registry file (YAML or JSON):
customers:
- id: cust_001
name: "Sarah Chen"
email: "[email protected]"
address: "123 Oak Street, Chicago, IL 60614"
customer_since: "2022-03-15"
tier: "gold"
- id: cust_002
name: "Michael Rodriguez"
email: "[email protected]"
...
products:
- id: prod_001
sku: "LAPTOP-001-XL"
name: "ProBook 15 Laptop"
price: 1299.99
category: "Electronics"
Validate uniqueness:
Mark this todo complete before proceeding.
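The uniqueness check on the registry can be sketched as follows. This is a minimal illustration under stated assumptions: the registry is shown as an in-memory dictionary mirroring the YAML layout above (loading the file would need a YAML parser, which is not shown), and `find_duplicates` and `validate_registry` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical in-memory registry mirroring the YAML layout shown earlier.
registry = {
    "customers": [
        {"id": "cust_001", "name": "Sarah Chen"},
        {"id": "cust_002", "name": "Michael Rodriguez"},
    ],
    "products": [
        {"id": "prod_001", "sku": "LAPTOP-001-XL"},
    ],
}


def find_duplicates(records: list[dict], field: str) -> list[str]:
    """Return values of `field` that appear in more than one record."""
    counts = Counter(r[field] for r in records if field in r)
    return sorted(value for value, n in counts.items() if n > 1)


def validate_registry(reg: dict) -> dict[str, list[str]]:
    """Map 'entity_type.field' to its duplicated values (empty dict if clean)."""
    # Which fields must be unique is an assumption made for this example.
    checks = {"customers": ["id"], "products": ["id", "sku"]}
    problems = {}
    for entity_type, fields in checks.items():
        for field in fields:
            dupes = find_duplicates(reg.get(entity_type, []), field)
            if dupes:
                problems[f"{entity_type}.{field}"] = dupes
    return problems


print(validate_registry(registry))  # {} for the registry above
```

Running the check right after the registry is generated catches collisions before any document references a duplicated ID.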
Goal: Generate all documents with proper relationships
Actions:
Follow generation order (for dependent documents):
For each document:
Maintain running state (for time-series data):
Validate as you go:
Example Document with Proper Structure:
---
document_id: INV-2024-001234
document_type: Invoice
date: 2024-11-15
customer_id: cust_001
customer_name: Sarah Chen
amount: 1413.74
status: paid
---
# Invoice #INV-2024-001234
**Date**: November 15, 2024
**Due Date**: December 15, 2024
## Bill To
Sarah Chen
123 Oak Street
Chicago, IL 60614
[email protected]
## Items
| Description | Quantity | Unit Price | Total |
|-------------|----------|------------|-------|
| ProBook 15 Laptop (SKU: LAPTOP-001-XL) | 1 | $1,299.99 | $1,299.99 |
**Subtotal**: $1,299.99
**Tax (8.75%)**: $113.75
**Total**: $1,413.74
## Payment Information
**Status**: PAID
**Payment Date**: November 18, 2024
**Payment Method**: Credit Card (****1234)
---
*For support inquiries, see Support Ticket #TICKET-001*
Mark this todo complete before proceeding.
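Amounts in a generated invoice should add up, and the metadata amount should match the total printed in the body. The sketch below recomputes the figures from the example invoice using decimal arithmetic. `invoice_totals` is a hypothetical helper, and rounding tax half-up to the cent is an assumption; real tax rules vary by jurisdiction.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def invoice_totals(line_items: list[tuple[int, str]], tax_rate: str) -> dict[str, Decimal]:
    """Compute subtotal, tax, and total from (quantity, unit_price) pairs."""
    subtotal = sum((qty * Decimal(price) for qty, price in line_items), Decimal("0"))
    # Round tax to whole cents; half-up rounding is an assumption for this sketch.
    tax = (subtotal * Decimal(tax_rate)).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}


# One ProBook 15 Laptop at $1,299.99 with 8.75% tax, as in the example invoice.
totals = invoice_totals([(1, "1299.99")], tax_rate="0.0875")
print(totals["subtotal"], totals["tax"], totals["total"])  # 1299.99 113.75 1413.74
```

Prices are passed as strings so that `Decimal` never sees binary floating-point values, which avoids off-by-a-cent totals.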
Goal: Ensure dataset is believable and error-free
Actions:
Run automated checks:
Check realism:
Spot-check samples:
Fix any issues before finalizing
Mark this todo complete before proceeding.
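One automated check that fits this step is confirming that every cross-reference points at something that exists. The sketch below assumes document metadata has already been parsed into dictionaries. The `invoice_id` field, the sample ticket records, and the `find_broken_references` helper are hypothetical and used only for illustration.

```python
# Hypothetical data: IDs taken from the registry and metadata parsed from documents.
known_customers = {"cust_001", "cust_002"}
known_invoices = {"INV-2024-001234"}

documents = [
    {"document_id": "INV-2024-001234", "document_type": "Invoice", "customer_id": "cust_001"},
    {"document_id": "TICKET-001", "document_type": "Support Ticket",
     "customer_id": "cust_001", "invoice_id": "INV-2024-001234"},
    # Deliberately broken record to show what the check reports.
    {"document_id": "TICKET-002", "document_type": "Support Ticket",
     "customer_id": "cust_009", "invoice_id": "INV-2024-009999"},
]


def find_broken_references(docs, customers, invoices):
    """Return (document_id, field, value) for every reference with no target."""
    targets = {"customer_id": customers, "invoice_id": invoices}
    broken = []
    for doc in docs:
        for field, valid_ids in targets.items():
            value = doc.get(field)
            if value is not None and value not in valid_ids:
                broken.append((doc["document_id"], field, value))
    return broken


for doc_id, field, value in find_broken_references(documents, known_customers, known_invoices):
    print(f"{doc_id}: {field} -> {value} not found")
```

An empty result means every reference resolves; anything reported should be fixed before the dataset is packaged.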
Goal: Deliver organized, documented dataset
Actions:
Create README.md:
Create QUICK-START.md (optional):
Create manifest file (optional):
Organize output:
output/
├── README.md
├── QUICK-START.md
├── manifest.yaml
├── entities/
│ ├── customers.yaml
│ └── products.yaml
├── documents/
│ ├── customer_001/
│ ├── customer_002/
│ └── ...
└── metadata/
└── cross-references.yaml
Mark this todo complete before proceeding.
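A manifest for a layout like the one above can be derived from the files on disk rather than written by hand. The following is a minimal sketch, assuming a manifest that simply lists each file with its size. The field names and the `build_manifest` helper are hypothetical, and the result is printed as JSON only to stay within the Python standard library; writing `manifest.yaml` would need a YAML library.

```python
import json
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """List every file under `root` with its relative path and size in bytes."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "file_count": len(files),
        "files": [
            {"path": p.relative_to(root).as_posix(), "bytes": p.stat().st_size}
            for p in files
        ],
    }


# Build a small throwaway tree shaped like the output layout, then summarise it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "entities").mkdir()
    (root / "documents" / "customer_001").mkdir(parents=True)
    (root / "README.md").write_text("# Dataset\n", encoding="utf-8")
    (root / "entities" / "customers.yaml").write_text("customers: []\n", encoding="utf-8")
    (root / "documents" / "customer_001" / "profile.md").write_text("# Profile\n", encoding="utf-8")
    manifest = build_manifest(root)

print(json.dumps(manifest, indent=2))
```

Pointing `build_manifest` at the real `output/` directory instead of the temporary tree produces the same structure for the full dataset.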
Once the workflow is complete, automatically trigger the close-session skill:
Auto-triggering close-session to save progress...
The close-session skill will:
This is the final mandatory step. Do not skip it: it ensures all progress is preserved.
references/generation-strategies.md - 9 data generation strategies:
references/realism-guidelines.md - Guidelines for realistic data:
references/batch-generation.md - Strategies for generating 10-500+ documents:
No scripts currently included. This skill relies on Claude's generation capabilities guided by the reference documentation.
No assets currently included. Generated documents are created from scratch based on input context and reference guidelines.
About Input Context:
About Realism Levels:
About RAG Optimization:
About Cultural Awareness:
Common Pitfalls to Avoid: