skills/general/create-ruleset/SKILL.md
Create a ruleset for synthetic data generation. Use when the user wants to define rules for generating training data, create a new dataset domain, or set up data augmentation guidelines.
npx skillsauth add beam-ai-team/beam-next-skills create-rulesetInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
AI agents need test data, but crafting test data by hand is slow, biased toward happy paths, and misses edge cases. A ruleset is the blueprint that tells create-data exactly what to generate — what the agent receives, what it should produce, and what makes each test scenario hard.
Why a ruleset instead of just "generate some test data"? Without a ruleset, generated data lacks three things: (1) coverage — it misses the scenarios that break agents in production, (2) realism — inputs are artificially short and clean, and (3) consistency — each generation run produces different quality levels. The ruleset solves all three by encoding what "production-realistic" means for this specific agent.
Why HTML? The ruleset is a single self-contained HTML file that is human-readable in a browser AND machine-readable by create-data. No external dependencies, no build step. A stakeholder can open it in Chrome and understand what data will be generated. The agent can parse its structured sections to build generation briefs.
Domain-agnostic — works for any use case: refund analysis, insurance claims, customer service, code generation, document creation, financial analysis, vehicle telemetry, etc.
Agent-first framing. The domain name alone (e.g., "CV generation") is ambiguous. Clarify the agent's task before proposing content. Think: agent input → agent processing → agent output. Why: The same domain can have completely different agents. "CV generation" could mean "generate a CV from a profile" or "screen a CV against a job description." The task determines what the ruleset covers.
Step-by-step presentation. Never dump everything at once. Walk through each concept incrementally and get alignment before moving on. See the Step-by-Step Protocol. Why: Users can't evaluate 5 dimensions at once. Presenting one at a time with a confirmation question catches errors early — a wrong assumption at Step 1 cascades through the entire ruleset if not caught.
Ground in production data. Read the actual agent spec, sample inputs/outputs, production scoring weights, and integration details BEFORE writing rules. Generic assumptions produce rulesets that don't match reality. Search the workspace for agent specifications, build trackers, sample data, and client-specific working documents — discover their locations rather than assuming fixed paths. Why: A ruleset built from generic domain knowledge produces generic test data. A ruleset built from real samples produces data that matches what the agent actually encounters. The difference shows up immediately in generation quality.
Be opinionated. For well-known domains, propose reasonable defaults. Only ask when you genuinely cannot infer the answer from domain knowledge or discovered data. Why: Asking too many questions slows the user down and makes them feel like they're doing the work. Propose a default with rationale — the user corrects if wrong, confirms if right. Either way, faster.
Hypothesis-driven. Explore before proposing. Ask when you lack information. Propose hypotheses when you have information. Never generate until you have "enough context" (see Generation Gate). Why: Flagging hypotheses (🔶) makes assumptions visible. The user can confirm or correct. Unflagged assumptions become invisible errors in the ruleset that surface as bad data during generation.
ASCII diagrams + emojis. Use the templates in resources/ascii-box-templates.md for all proposals. Add footnote citations and flag hypotheses (see Presentation Rules). Why: Visual structure makes complex information scannable. The user processes a box diagram in 5 seconds vs. 30 seconds for equivalent prose. Footnotes and hypothesis flags add traceability without cluttering the content.
| Resource | Path | Purpose | |----------|------|---------| | Context Discovery | context-discovery.md | Auto-discovery of client/project files | | Schema Inference | generation/schema-inference.md | Detect input/output structure from existing data | | Template | resources/ruleset-section-structure.md | Exact output format (schema v2.0) | | Examples | resources/few-shot-examples.md | Complete production-quality sample rulesets | | Best Practices | generation/sizing-and-coverage.md | Sizing, coverage, complexity classification | | Review Protocol | experts/review-protocol.md | 3-expert review system, iteration protocol, exit conditions | | Expert 1: Priya | experts/expert-1-production-realism.md | Data Quality Lead — spec-vs-reality audit, content length, file structure, texture, source fidelity | | Expert 2: James | experts/expert-2-internal-consistency.md | Solutions Architect — cross-reference audit, schema↔examples, derivation↔output, composition math | | Expert 3: Tomoko | experts/expert-3-generative-readiness.md | Data Generation Engineer — pipeline feasibility, variation axes, named slots, anti-patterns | | Presentation Templates | resources/ascii-box-templates.md | ASCII box diagram templates | | Post-Generation Feedback | generation/feedback-loop-format.md | Feedback loop from create-data | | HTML Review Template | resources/ruleset-review-template.html | Human-readable review document template | | Input Parsers | parsers/ | PDF, CSV, PPTX extraction scripts | | Interface Contract | ../shared/interface-contract.md | What create-data expects from the ruleset |
Parsers: parsers/pdf_parser.py (text), parsers/pdf_to_images.py (visual), parsers/csv_reader.py, parsers/pptx_parser.py. For scanned PDFs: convert to images then Read visually.
Start here. Always. Determine what you're building before searching.
Three entry states:
| User provides | Action | |---------------|--------| | Client + specific use case (e.g., "Booth, CV screening") | Skip to Phase 1 with both | | Client only, no use case (e.g., "create ruleset for Volkswagen") | → Phase 0b: Use Case Discovery | | No client, no use case | Ask: "What data do you want to create a ruleset for? Which client or domain?" |
Goal: Exhaust all available repo context, then present grounded use case options with a recommendation. Do NOT ask the user to pick before you've done the research.
Step 1: Silent deep-dive — Launch parallel subagents:
a) Account context — Explore subagent:
b) Existing data — Explore subagent:
*.html files with schema_version in their frontmatter, scoped to client-named directories.rulesets/, seed/, or augmented/ subdirectories under client-named paths.*.html files with schema_version frontmatter.c) Web research (if external company): Search for the company's products, AI initiatives, and public information relevant to agent use cases. Save as notes/{company-slug}-research.md.
Step 2: Read key docs — After subagents return, read the most important files yourself (proposals, agent specs, technical docs). Don't rely solely on subagent summaries for documents that describe the agent's architecture, input/output, or technical behavior. The subagent finds them; you read them.
Step 3: Synthesize and present use cases — Present a table of grounded use case options:
┌───┬──────────────────────────┬──────────────────────────────────────┬──────────┐
│ # │ Use Case │ What the Agent Does │ Fit │
├───┼──────────────────────────┼──────────────────────────────────────┼──────────┤
│ 1 │ [name] │ [receives X → produces Y] │ [rating] │
│ 2 │ [name] │ [receives X → produces Y] │ [rating] │
│ 3 │ [name] │ [receives X → produces Y] │ [rating] │
└───┴──────────────────────────┴──────────────────────────────────────┴──────────┘
Requirements for each use case:
Step 4: Recommend one — State which use case to build first and why. Structure the reasoning:
Step 5: Wait for user confirmation — User picks a use case (or accepts recommendation). THEN proceed to Phase 1 with the chosen use case.
CRITICAL: Do not present Step 1 (Agent Task) until the use case is chosen. Phase 0b resolves WHAT to build; Phase 1+ resolves HOW to build it.
Goal: Do all searching silently, then present ONE cohesive summary. By this point, the use case is already chosen (from Phase 0 or Phase 0b).
Launch these in parallel:
a) Repo scan — Explore subagent (scoped to client):
*.html files with schema_version frontmatter in client-scoped directories. Note: domain, agent task, has planning section?*.json files in directories named seed/ under the client's dataset path. Note: topics, example counts, read 1-2 representative entries.augmented/ under the client's dataset path. Note: domains, sample counts, gold standards, coverage from any index.json.*_feedback.md files alongside rulesets. Note: any post-generation feedback from previous runs.b) Context discovery (if client name mentioned): Follow context-discovery.md.
c) Web research (if company name mentioned and not already done in Phase 0b): Search and save as notes/{company-slug}-research.md. Skip if already done or no company name.
Note: If Phase 0b already ran, much of this context is already loaded. Don't re-search what you already have — just supplement with any use-case-specific files not yet read (e.g., agent specs, sample data for the chosen use case).
If Step 1 found existing samples, infer input/output schema per generation/schema-inference.md. Read 2-3 samples, detect structure, apply defaults.
First message covers:
Do NOT present input structure, output structure, complexity, or sizing yet. Those come in subsequent steps.
If critical gaps remain after the user responds, ask targeted questions. If you have enough context, proceed directly to Phase 2.
_feedback.md): Pre-apply recommended fixes. If an acknowledged block already exists, skip those items — they were applied in a previous cycle. After incorporating new feedback, append an acknowledgment block per ../shared/interface-contract.md.Continue through the Step-by-Step Protocol. Each step is a separate message with an ASCII diagram from resources/ascii-box-templates.md.
Step 2 (Input): Each entity with File Type (what format in production AND in the generated dataset — e.g., "PDF file rendered per sample" or "JSON object in index.json") + Content fields + Content Texture (what the data looks like in production — styles, lengths, structural characteristics). NO presentation/variation distributions here — those live exclusively in Test Scenarios. File type is mandatory per entity — without it, create-data cannot generate samples.
Step 3 (Output): File Type per output (same as input — specify exact format in the dataset) + fields, scoring dimensions, constraints. Include Output Derivation Procedure if scoring is involved. File type is mandatory per output — the generator needs to know whether to write a rendered file, a JSON field, or a separate document.
Step 4 (Test Scenarios & Sizing): Present ALL 3 purposes (flow_check/score/battle_test), then for the chosen purpose show the scenario cards: each with What (the setup), Why it's hard (the challenge), Varies (parameter distributions). This replaces the old separate Complexity, Sizing, Pairing Strategy, and Per-Category Composition steps.
After Step 4 approval → generate the ruleset. No further interaction needed until the review.
CRITICAL: Before presenting input/output, READ the actual agent spec and sample data. Use real field names, scoring weights, and format distributions. If spec and deployment differ, use deployment values and document the deviation.
If the user provides corrections:
Never present the full proposal in one message. Minimum 2 interaction rounds.
| Step | Present | Wait for | |------|---------|----------| | 1. Agent Task | What the agent receives, does, produces. One ASCII box. | Confirm or correct | | 2. Input Structure | Entities with File Type + Content fields + Content Texture. ASCII boxes per entity. | Confirm or add | | 3. Output Structure | File Type per output + fields, scoring, constraints, derivation procedure. ASCII box. | Confirm | | 4. Test Scenarios & Sizing | All 3 purposes first. Then for the chosen purpose: scenario cards with What/Why it's hard/Varies. | Pick purpose, confirm scenarios |
Collapsing: If user moves fast ("looks good"), batch Steps 2+3. Never batch all 4.
"Generate now" override: Proceed with smart defaults for remaining steps. Mention what you assumed.
Use ASCII box diagrams from resources/ascii-box-templates.md for ALL proposals.
Sources footer section inside the same box (separated by ├───┤)🔶 Hypotheses section below Sources (or as sole footer if no sources)Run internally after each user response. Three experts evaluate gathered context:
1. Domain Expert — Are business rules specific enough? (concrete thresholds, not vague principles) 2. Data Engineer — Can I build diverse data? (clear input/output patterns, 3-4 distinct scenarios) 3. Quality Evaluator — Can I judge good from bad? (unambiguous correctness criteria, known edge cases)
Decision logic:
| Experts satisfied | Action | |---|---| | 0-1 | Well-known domain: fill gaps with recommendations. Unfamiliar: ask about critical gap. | | 2 (one has concerns) | Present concerns as recommendations with defaults. Proceed unless user objects. | | All 3 | Proceed to generation. |
Frame concerns as recommendations, not questions. "I'd recommend these edge cases: [list]" not "What edge cases should I include?"
Hard gate — do NOT generate until ALL conditions are met:
| # | Condition | How to verify | |---|-----------|---------------| | 1 | Agent task is unambiguous | Confirmed with user in Step 1 | | 2 | Existing project data checked | Ran Explore subagent to find client's dataset directory | | 3 | Relationship to existing data clear | Asked user if replacement/extension/new | | 4 | Input structure concrete (fields + value ranges) | Have example or user described it | | 5 | File type declared for every input entity and output | Each has explicit format (PDF, JSON, text string, etc.) | | 6 | Output structure concrete | Have example or user confirmed | | 7 | Core business rules specific (thresholds, not platitudes) | Rules say "reject if < 60%" not "reject if low" | | 8 | At least 3 distinct scenarios defined | Discussed categories with user | | 9 | User-specific context captured | Asked about unique constraints or confirmed N/A |
Soft conditions (aim for, can proceed without): edge cases (3-5), distributions, anti-patterns, few-shot examples, quality criteria.
User override: "generate now" with 0-1 experts → use defaults + mention assumptions. With 2+ experts → proceed immediately.
Override keywords: "generate now" → proceed. "skip"/"defaults" → use defaults. "looks good" → accept.
| Field | Default | |-------|---------| | Context Levels | Minimal 20%, Standard 50%, Detailed 25%, Full 5% | | Complexity | Simple 40%, Standard 40%, Complex 20% | | Edge Cases | ~10% of dataset | | Tone | Formal 60%, Conversational 40% | | Code Styles | Compact 20%, Standard 30%, Detailed 30%, Error Handling 20% |
These principles govern the structure and content of every ruleset. They emerged from production use and iteration — follow them strictly.
Each piece of information appears in exactly ONE section. If it appears in two places, consolidate it.
| Information | Lives in | NOT in | |---|---|---| | File types (per entity and output) | INPUT SCHEMA + OUTPUT SCHEMA (File Type line) | Test Scenarios, Planning | | Field names, types, constraints | INPUT SCHEMA | Test Scenarios | | Content styles, lengths, structural characteristics | INPUT SCHEMA (Content Texture) | Test Scenarios | | Variation dimensions and distributions | TEST SCENARIOS (Varies chips) | Input Schema | | Sample counts per scenario | TEST SCENARIOS (card headers) | Separate Planning section | | Complexity/sizing metadata | Hidden YAML comment | Visible section |
Why: Duplication causes drift. When two sections define the same dimension with different distributions, the generator doesn't know which to trust.
Each input entity and each output must specify its file type — the exact format in production AND in the generated dataset. This is non-negotiable.
Examples:
Any file type is valid. create-data has pre-made reference scripts for PDF, DOCX, XLSX, CSV, JSON, and TXT. For other formats (PPTX, HTML, ODF, RTF, etc.), create-data will write a task-specific generation script using the PDF renderer pattern as a structural template. Do not limit file types to what the renderer currently supports — if the agent processes DOCX files in production, the ruleset should say DOCX. create-data handles the generation.
Why: Without an explicit file type, create-data cannot generate samples. It doesn't know whether to render a PDF, write a JSON field, or create a separate file. This was the #1 source of ambiguity in early rulesets — the skill described fields and texture but never said "this is a PDF." The user had to ask, and every answer changed the generation plan.
Where it lives: In the INPUT SCHEMA and OUTPUT SCHEMA sections, as a File Type: line immediately after each entity/output header — before the content fields table. See ruleset-section-structure.md for exact format.
When a file type is declared (PDF, DOCX, XLSX, etc.), the ruleset should describe what that file looks like in production — not just the content, but the visual characteristics that vary across real-world sources. This helps create-data generate files that look like they came from different templates, tools, and people — not a uniform batch.
Include in the INPUT SCHEMA's Content Texture (or OUTPUT SCHEMA for output files):
Why: create-data applies a visual diversity protocol where every generated file must differ on 3+ dimensions (layout, color, font, structure, spacing, etc.). The more production context the ruleset provides about what real files look like, the more realistic the generated diversity will be.
Input Schema answers: "What format is it, what does it contain, and what does it look like?"
Input Schema does NOT answer: "How does it vary across scenarios?" — that's Test Scenarios.
Why: Mixing structure and variation in one section forces the reader to mentally separate "what exists" from "how it shifts." Keeping them apart means each section has one job.
Each scenario card has three parts:
Why: "What" and "Why" give the human reader context. "Varies" gives the generator instructions. Three questions, three sections — no ambiguity about what each part does.
A non-technical person should understand the scenario from its title alone, without reading the description.
| Bad | Good | Why | |---|---|---| | "Broken Input" | "Missing & Corrupted Articles" | Names what's actually wrong | | "Sparse & Noisy Batch" | "Gaps, Paywalls & Stale Articles" | Names the specific challenges | | "Edge Case" | "Low-Visibility Executive, Vague Purpose" | Names the difficulty factors | | "Happy Path" | "Known Partner, Rich Data" | Names what makes it easy |
Why: Scenario names are the first thing a reader scans. Generic labels like "Edge Case #2" force them to read the description. Descriptive names make the scenario table self-documenting.
Every ruleset must include BOTH types of anti-patterns:
Domain-specific (unique to this agent): e.g., "generic implications that don't name AF divisions"
Generic LLM failures (apply to any generated data):
Why: Domain anti-patterns prevent wrong data. Generic anti-patterns prevent lazy data. Most generation failures are in the generic bucket — LLMs optimize for brevity and template-following unless explicitly told not to.
Complexity signals, surface area factors, and sizing calculations are planning artifacts used during the step-by-step protocol. They don't belong in the final ruleset as a visible section — they clutter the document.
Keep as: Hidden YAML comment in the HTML (machine-readable for create-data) Show instead: Sample counts in Test Scenario card headers (e.g., "7 samples")
Why: The reader of a ruleset cares about WHAT to generate (scenarios + distributions), not HOW the sizing was calculated.
After Step 4 approval, generate a single HTML ruleset file. This is the only output — no separate .md file.
Discover, don't assume. Before creating output directories, find where this client's data lives:
rulesets/, seed/, or augmented/ subdirectories).dataset/, datasets/, data/, training-data/). Create the client folder there.dataset/ at the workspace root as the default.Expected structure once resolved:
[datasets-root]/
└── [client-name]/ # e.g., booth-and-partners
└── [use-case]/ # e.g., cv-screening
├── rulesets/
│ └── [domain].html # The ruleset (ONLY OUTPUT)
├── seed/ # Populated later
└── augmented/ # Populated by create-data
If no client specified (generic domain), save to [datasets-root]/rulesets/[domain].html.
Resolve output path — Follow the folder structure discovery above, then mkdir -p [resolved-path]/{rulesets,seed,augmented}
Generate the ruleset HTML using resources/ruleset-review-template.html (schema_version: "2.0"). Follow the section structure from ruleset-section-structure.md for content guidance. Must include:
Few-Shot Example Rules (CRITICAL):
... followed by a bracketed note indicating the full article length and what the remaining content covers. Example: ...\n\n[Article continues with state incentives and union negotiations — ~450 words total]Run 3-expert review (MANDATORY — do not skip). Launch 3 parallel subagents per experts/review-protocol.md:
| Expert | Lens | Catches | |--------|------|---------| | 1. Priya (Data Quality Lead) | Does the ruleset match production reality? | Content lengths too short, bundled vs individual files, vague constraints, missing content texture | | 2. James (Solutions Architect) | Does the ruleset contradict itself? | Examples not matching schema, derivation producing invalid output, composition numbers not adding up | | 3. Tomoko (Generation Engineer) | Will this produce diverse data on first pass? | Not enough variation axes, unnamed slots, monotonous examples, missing anti-patterns |
Each expert scores 5 dimensions (0-10). Pass: avg >= 7.0, no dim < 5.0.
Protocol:
Each subagent receives a dedicated persona file that defines their identity, mental model, framework, and output format. The subagent MUST read the persona file FIRST, then the ruleset.
| Subagent | Persona file | Also receives |
|----------|-------------|---------------|
| Expert 1 | experts/expert-1-production-realism.md | Ruleset + source docs (production samples, agent specs, corporate profiles) |
| Expert 2 | experts/expert-2-internal-consistency.md | Ruleset only (checks internal consistency, no external sources needed) |
| Expert 3 | experts/expert-3-generative-readiness.md | Ruleset + index.json of generated samples (if any exist) |
Subagent prompt template: "Read the persona file at [path] FIRST — it defines your identity, mental model, and scoring framework. Adopt this persona. Then read the ruleset at [path]. Follow the framework exactly, score all 5 dimensions, and return your review in the specified output format."
Why mandatory: A flawed ruleset multiplies errors across every sample. Catching "article lengths too short" here costs 1 fix; catching it after generating 35 samples costs 35 regenerations.
Scoring formulas must be airtight (when applicable):
min(10, ...))Save the ruleset to the resolved output path: [datasets-root]/[client]/[use-case]/rulesets/[domain].html
HALT — Present the file path and ask the user to open it in their browser. Do NOT proceed until user responds. If feedback, update and re-present.
After approval — Conversationally mention what was used vs defaulted, suggest next step (/create-data).
User: "Booth, CV screening" Assistant: [Searches — finds agent spec, sample CVs, BUILD_TRACKER]
Step 1: Agent Task ASCII box → User: "yes" Step 2: Input entities (CV + JD) with Content fields + Content Texture → User: "yes" Step 3: Output structure + derivation → User: "love it" Step 4: Test Scenarios with What/Why/Varies cards → User: "great" → [Generates ruleset.html, runs expert review, presents for approval]
User: "insurance claims triage agent — generate now" [No existing data found] Assistant: "No existing data. Two things I can't assume: (1) Insurance type? (2) Triage only or approval too?" [After answer, collapses Steps 1-4 into one confirmation, generates with smart defaults]
User: "customer support classification" [Finds existing ruleset + 152 examples + feedback file] Assistant: "Found ruleset + 152 examples. Previous generation feedback suggests 3 ambiguous rules. Is this a replacement or update?" [Pre-applies feedback, proceeds through Steps 1-4 with pre-populated values]
testing
Audit registry.yaml against disk, validate SKILL.md frontmatter, find duplicates and orphans. Load when user says 'audit skills registry', 'validate beam-next-skills', 'registry drift', 'skills catalog audit', 'check registry yaml'.
tools
All Workable ATS operations — fetch JDs, search candidates, post assessments/reviews. Load when user says "fetch JD", "search workable", "push to workable", "post review", "rate candidate", "workable", "push assessment", "list jobs", or after interview-coach completes an evaluation. Replaces workable-fetch-jd and workable-push-assessment.
data-ai
Load when user mentions "tavily research", "market intelligence", "competitive research", "GTM research", or needs real-time market data for sales, marketing, or vertical strategy.
development
Shared resource library for Slack integration skills. DO NOT load directly - provides common references (setup, API docs, error handling, authentication) and scripts used by slack-connect and individual Slack skills.