skills/beam/beam-tools/beam-ape-optimizer/SKILL.md
Automated Prompt Engineering (APE) optimization loop for Beam agent nodes. Adds reasoning traces, runs test batches against a golden dataset, uses 3-agent credit assignment (Doer/Critic/Editor) to identify and fix failing prompt instructions, and redeploys improved prompts. Use when user says "optimize prompts", "APE loop", "improve beam agent accuracy", "prompt engineering", "test and fix prompts", "run APE", "credit assignment", or when agent accuracy needs systematic improvement rather than manual prompt editing.
npx skillsauth add beam-ai-team/beam-next-skills beam-ape-optimizer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Systematic prompt optimization for Beam agent nodes using Automated Prompt Engineering (APE). Runs test batches, identifies which prompt instructions cause failures, and surgically rewrites only the failing parts.
.env
For each node being optimized, add a reasoning output parameter:
| Param | Type | Description |
|-------|------|-------------|
| reasoning | string | Step-by-step chain-of-thought explaining each decision |
Append reasoning instructions to the node's prompt (before the Input section):
# Reasoning
Before producing your outputs, think through each step:
1. [Step specific to this node's task]
2. [Another step]
3. [Final decision step]
Write your full reasoning in the 'reasoning' output.
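A minimal sketch of appending the reasoning block before the Input section. It assumes the Input section starts with a line reading `# Input`; that heading text and the helper name `add_reasoning_section` are illustrative assumptions, not part of the skill.

```python
REASONING_BLOCK = """# Reasoning
Before producing your outputs, think through each step:
1. [Step specific to this node's task]
2. [Another step]
3. [Final decision step]
Write your full reasoning in the 'reasoning' output."""


def add_reasoning_section(prompt: str, input_heading: str = "# Input") -> str:
    """Insert the reasoning block before the Input section (hypothetical helper).

    Assumption: the Input section begins with a line equal to `input_heading`.
    If no such line exists, the block is appended at the end of the prompt.
    """
    lines = prompt.splitlines()
    # Do nothing if the prompt already has a reasoning section.
    if any(line.strip() == "# Reasoning" for line in lines):
        return prompt
    block = REASONING_BLOCK.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == input_heading:
            return "\n".join(lines[:i] + block + [""] + lines[i:])
    return "\n".join(lines + [""] + block)
```

The bracketed steps stay as placeholders; they still need to be replaced with steps specific to each node's task.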
Why reasoning only (no combined_output): APE analyzes each node's prompt independently. We need to see WHY the model made each decision, then compare individual outputs against ground truth. Aggregating outputs at the exit node adds complexity without value — each node is analyzed separately.
Deploy reasoning additions:
- reasoning output

Before running tests, analyze the dataset to ensure maximum coverage with minimum samples:
Output a coverage table:
| Category | Count | Selected Samples |
|----------|-------|-----------------|
| product_complaint | 12 | GD-44, GD-52 |
| spare_parts | 8 | GD-88, GD-91 |
| missing_info | 5 | GD-38 |
| ... | ... | ... |
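One way to build such a coverage table, sketched under assumptions: the dataset is a list of `(sample_id, category)` pairs, and a fixed cap per category stands in for whatever selection policy you actually use. The function names are hypothetical.

```python
from collections import defaultdict


def select_samples(dataset, per_category=2):
    """Group sample IDs by category and keep the first few from each.

    `dataset` is assumed to be an iterable of (sample_id, category) pairs.
    Rows are ordered by category size, largest first.
    """
    by_category = defaultdict(list)
    for sample_id, category in dataset:
        by_category[category].append(sample_id)
    ordered = sorted(by_category.items(), key=lambda kv: -len(kv[1]))
    return [(cat, len(ids), ids[:per_category]) for cat, ids in ordered]


def coverage_table(rows):
    """Render the selected rows as a markdown table."""
    lines = [
        "| Category | Count | Selected Samples |",
        "|----------|-------|-----------------|",
    ]
    for category, count, selected in rows:
        lines.append(f"| {category} | {count} | {', '.join(selected)} |")
    return "\n".join(lines)
```

Every category gets at least one sample, so rare categories are still covered while the total number of runs stays small.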
Run selected samples one at a time (concurrent tasks cause input corruption):
python3 04-workspace/scripts/trigger_beam_task.py \
--agent {AGENT_ID} \
--msg "path/to/sample.msg" \
--poll --poll-timeout 600
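To keep runs strictly sequential across several samples, a wrapper along these lines may help. It reuses only the script path and flags shown above; the helper names, the injectable `runner`, and the idea that the script's exit code is meaningful are assumptions.

```python
import subprocess

SCRIPT = "04-workspace/scripts/trigger_beam_task.py"


def build_command(agent_id, msg_path, poll_timeout=600):
    """Build the trigger command using the flags shown in this document."""
    return [
        "python3", SCRIPT,
        "--agent", agent_id,
        "--msg", msg_path,
        "--poll", "--poll-timeout", str(poll_timeout),
    ]


def run_sequentially(agent_id, msg_paths, runner=subprocess.run):
    """Run one task at a time; each call blocks until the task finishes.

    Nothing is retried here: failed or stuck tasks are only recorded,
    so they can be reported to the user before any rerun.
    """
    results = []
    for path in msg_paths:
        completed = runner(build_command(agent_id, path))
        results.append((path, completed.returncode))
    return results
```

Because `subprocess.run` blocks until the child process exits, the next sample is not triggered until polling for the previous one has ended.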
For each completed task, collect:
- reasoning output (the chain-of-thought trace)

CRITICAL: Never rerun tasks without explicit user approval. Present all results first, report stuck/failed tasks, and wait for the user to say "rerun".
For each test case, compare node outputs against ground truth:
Ground truth precedence (if using reviewed dataset):
Create a results table:
| Sample | Expected | Got | Match | Node | Reasoning Summary |
|--------|----------|-----|-------|------|-------------------|
| GD-44 | product_complaint | product_complaint | PASS | N4 | — |
| GD-38 | missing_info | spare_parts | FAIL | N4 | "Found product ref..." |
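The comparison behind each row can be sketched as follows. It assumes exact string equality is the right match criterion for the node output; the field names and the 80-character summary limit are illustrative choices.

```python
def score_sample(sample_id, node, expected, got, reasoning=""):
    """Compare one node output against ground truth and build a results row."""
    passed = expected == got
    return {
        "sample": sample_id,
        "expected": expected,
        "got": got,
        "match": "PASS" if passed else "FAIL",
        "node": node,
        # Keep the reasoning trace only for failures, where it is analyzed next.
        "reasoning_summary": "" if passed else reasoning[:80],
    }


def failures(rows):
    """Return only the rows that need credit assignment."""
    return [row for row in rows if row["match"] == "FAIL"]
```

For example, scoring the two rows from the table above and calling `failures` on them leaves only GD-38 for further analysis.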
For each mismatch (a FAIL row in the results table), run the Critic and Editor agents.
The Beam agent itself is the Doer. The reasoning output is the chain-of-thought trace.
A Claude prompt that takes the failing node's prompt + reasoning + output + ground truth, and assigns credit to each instruction:
Input:
Output: Per-instruction labels:
- KEEP: instruction followed correctly
- MODIFY: instruction contributed to the error (with explanation)
- NEUTRAL: instruction not relevant to this error

See references/critic-prompt.md for the full Critic prompt template.
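If the Critic is asked to return its labels as JSON, they can be validated before reaching the Editor. The JSON shape below (a list of objects with `instruction`, `label`, and `explanation`) is an assumption for illustration; the real format is whatever references/critic-prompt.md specifies.

```python
import json
from dataclasses import dataclass

VALID_LABELS = {"KEEP", "MODIFY", "NEUTRAL"}


@dataclass
class Credit:
    instruction: str
    label: str
    explanation: str = ""


def parse_critic_output(raw):
    """Parse and validate per-instruction labels (assumed JSON format)."""
    credits = []
    for item in json.loads(raw):
        label = item["label"].upper()
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label: {label}")
        explanation = item.get("explanation", "")
        # MODIFY must say how the instruction contributed to the error.
        if label == "MODIFY" and not explanation:
            raise ValueError("MODIFY requires an explanation")
        credits.append(Credit(item["instruction"], label, explanation))
    return credits
```

Rejecting a `MODIFY` label without an explanation keeps the Editor from rewriting an instruction with nothing to guide the change.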
Takes the Critic's output and rewrites ONLY the MODIFY instructions:
Rules:
- Leave KEEP instructions unchanged
- Rewrite MODIFY instructions using the failure reasoning as a guide

See references/editor-prompt.md for the full Editor prompt template.
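The "rewrite only what was flagged" rule can also be enforced in code when the Editor's output is merged back into the prompt. This sketch assumes the prompt has already been split into a list of instructions and that rewrites arrive keyed by instruction index; both are illustrative assumptions.

```python
def apply_edits(instructions, labels, rewrites):
    """Replace only instructions labeled MODIFY; everything else is untouched.

    instructions: list of instruction strings
    labels:       list of "KEEP" / "MODIFY" / "NEUTRAL", same length
    rewrites:     dict mapping instruction index to replacement text
    """
    if len(instructions) != len(labels):
        raise ValueError("instructions and labels must have the same length")
    result = []
    for index, (text, label) in enumerate(zip(instructions, labels)):
        if label == "MODIFY" and index in rewrites:
            result.append(rewrites[index])
        else:
            # A rewrite aimed at a KEEP or NEUTRAL instruction is ignored.
            result.append(text)
    return result
```

Ignoring rewrites for unflagged instructions guards against an Editor response that drifts beyond the failing parts.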
Convergence criteria:
Iteration order:
Prompt versioning: Save each version:
plan/prompt-versions/
n1-v1.txt, n1-v2.txt, ...
n4-v1.txt, n4-v2.txt, ...
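A small helper can pick the next version number automatically. This is a sketch that follows the `{node}-v{N}.txt` naming shown above; the function name is hypothetical.

```python
import re
from pathlib import Path


def save_prompt_version(node, prompt, root="plan/prompt-versions"):
    """Write the prompt as the next version file for this node.

    Files are named {node}-v{N}.txt, where N is one more than the
    highest version already present for that node.
    """
    directory = Path(root)
    directory.mkdir(parents=True, exist_ok=True)
    pattern = re.compile(rf"^{re.escape(node)}-v(\d+)\.txt$")
    versions = [
        int(match.group(1))
        for path in directory.iterdir()
        if (match := pattern.match(path.name))
    ]
    target = directory / f"{node}-v{max(versions, default=0) + 1}.txt"
    target.write_text(prompt, encoding="utf-8")
    return target
```

Earlier versions are never overwritten, so any iteration can be restored if a later rewrite causes regressions.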
Results tracking:
| Iteration | Node | Accuracy | Changes | Regressions |
|-----------|------|----------|---------|-------------|
| v1 (baseline) | N4 | 7/10 | — | — |
| v2 | N4 | 8/10 | Rule 2 rewrite | 0 |
| v3 | N4 | 9/10 | Added examples | 0 |
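The Accuracy and Regressions columns can be derived from per-sample pass/fail results. The sketch below assumes a regression means a sample that passed in the previous iteration and fails in the current one; that definition and the function name are assumptions.

```python
def summarize_iteration(previous, current):
    """Summarize one iteration against the one before it.

    previous, current: dicts mapping sample ID to True (pass) or False (fail).
    A regression is a sample that passed before and fails now.
    """
    passed = sum(1 for ok in current.values() if ok)
    regressions = sorted(
        sample
        for sample, ok in current.items()
        if not ok and previous.get(sample, False)
    )
    return {"accuracy": f"{passed}/{len(current)}", "regressions": regressions}
```

Tracking regressions by sample ID, and not only as a count, shows which previously passing case a rewrite broke.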
- beam-graph-creator: Create and deploy the agent graph (do this before APE)
- beam-agent-manager: API rules and PATCH/publish workflows