toolkit/packages/skills/cost-aware-pipeline/SKILL.md
Cost-aware LLM pipeline patterns for optimal model routing, narrow retry strategies, and prompt caching. Reduces API costs 40-70% through intelligent model selection, targeted retries, and cache-friendly prompt structures. Use when: (1) Building multi-model pipelines, (2) Optimizing API costs, (3) Designing retry strategies for LLM calls, (4) Implementing prompt caching, (5) Choosing between haiku/sonnet/opus for sub-tasks.
npx skillsauth add stevengonsalvez/agents-in-a-box cost-aware-pipelineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Route tasks to the cheapest model that can handle them reliably.
| Model | Input | Output | Relative Cost | |-------|-------|--------|---------------| | Haiku | $0.80 | $4.00 | 1x (baseline) | | Sonnet | $3.00 | $15.00 | ~4x | | Opus | $15.00 | $75.00 | ~19x |
| Task Complexity | Route To | Examples | |-----------------|----------|----------| | Simple (< 100 lines, clear pattern) | Haiku | File renaming, simple search, format conversion, status checks | | Moderate (100-500 lines, some judgment) | Sonnet | Code review, test writing, refactoring, documentation | | Complex (500+ lines, deep reasoning) | Opus | Architecture design, debugging subtle issues, multi-file refactoring | | Creative (open-ended, high quality bar) | Opus | System design, novel algorithms, critical security review |
# In agent orchestration, specify model based on task complexity
def route_to_model(task_description: str, estimated_complexity: str) -> str:
"""Return model parameter for Agent tool."""
routing = {
"simple": "haiku",
"moderate": "sonnet",
"complex": "opus",
"creative": "opus",
}
return routing.get(estimated_complexity, "sonnet")
When spawning sub-agents:
model: "haiku" for Explore agents doing simple file searchesmodel: "sonnet" (default) for most implementation workmodel: "opus" only for architecture reviews, complex debugging, distinguished-engineer tasks| Scenario | Before (all Opus) | After (routed) | Savings | |----------|-------------------|----------------|---------| | 10 file searches | $0.15 | $0.008 (haiku) | 95% | | 5 code reviews | $0.75 | $0.20 (sonnet) | 73% | | 1 architecture review | $0.15 | $0.15 (opus) | 0% | | Typical session | $1.05 | $0.36 | 66% |
When an LLM call fails, retry intelligently — not blindly.
Attempt 1: haiku (cheapest)
| fail
v
Attempt 2: sonnet + error context
| fail
v
Attempt 3: opus + error context + examples
| fail
v
Escalate to human
RETRY_LADDER = [
{"model": "haiku", "extra_context": None},
{"model": "sonnet", "extra_context": "Previous attempt failed: {error}"},
{"model": "opus", "extra_context": "Two attempts failed. Errors: {errors}. Please reason carefully."},
]
async def call_with_retry(prompt: str, task: str) -> str:
errors = []
for attempt in RETRY_LADDER:
try:
result = await call_llm(
model=attempt["model"],
prompt=prompt,
extra_context=attempt["extra_context"].format(
error=errors[-1] if errors else "",
errors="; ".join(errors)
) if attempt["extra_context"] and errors else prompt,
)
return result
except Exception as e:
errors.append(str(e))
raise EscalationNeeded(f"Failed after {len(RETRY_LADDER)} attempts: {errors}")
Structure prompts to maximize cache hits and reduce costs.
+-------------------------------------+
| STATIC SYSTEM PROMPT (cached) | <- Rarely changes, high cache hit rate
| - Agent instructions |
| - Tool definitions |
| - Project context (CLAUDE.md) |
+-------------------------------------+
| SEMI-STATIC CONTEXT (cached) | <- Changes per-task, not per-turn
| - File contents being worked on |
| - Relevant documentation |
| - Test results |
+-------------------------------------+
| DYNAMIC CONTENT (not cached) | <- Changes every turn
| - User message |
| - Latest tool results |
| - Conversation tail |
+-------------------------------------+
With Anthropic's prompt caching:
Strategy: Keep system prompt + project context stable across turns. Only the user message and tool results should change.
| Session Type | Typical Tokens | Estimated Cost | |-------------|----------------|----------------| | Quick fix (5 min) | 50K in / 10K out | $0.15 - $0.90 | | Feature implementation (30 min) | 500K in / 100K out | $1.50 - $9.00 | | Deep debugging (1 hr) | 1M in / 200K out | $3.00 - $18.00 | | Multi-agent swarm (2 hr) | 5M in / 1M out | $15.00 - $90.00 |
# Per-session cost cap (used by agent-ops skill)
export AGENT_COST_CAP=5.00
# Per-agent cost cap for sub-agents
export SUBAGENT_COST_CAP=1.00
# Warning threshold (fraction of cap)
export AGENT_COST_CAP_WARN=0.80
# View today's costs
grep "$(date +%Y-%m-%d)" "$HOME/{{TOOL_DIR}}/metrics/costs.jsonl" | \
python3 -c "import sys,json; rows=[json.loads(l) for l in sys.stdin]; \
print(f'Sessions: {len(set(r[\"session_id\"] for r in rows))}'); \
print(f'Total: \${sum(r[\"estimated_cost_usd\"] for r in rows):.4f}')"
documentation
Report reflect drain spend over a time window — tokens split by cached (cache_read), uncached writes (cache_creation), and io (input+output), with a $ estimate, grouped by day / outcome / model / transcript. Reads the drainer's cost log and surfaces outlier runs and cache-reuse health (the 41.5M-token failure mode = low cache reuse + high cache writes). Use to answer "what is reflection costing me" for the last day / week.
development
Show fleet status — every claude session running on the host, merged across ainb + claude-peers broker + background jobs. Use when you need to enumerate sessions before composing an action, see which sessions have a peer registered (broker-routable) vs tmux-only, check the `summary` of each session, or pipe the list into jq for filtering. Default output: text table. Pass --format json for LLM consumption.
testing
Ordered multi-step prompts to fleet targets, ack-gated between steps via JSONL assistant-turn-end detection. Use for cycles like disconnect→reconnect→verify, or any flow where step N+1 requires step N to have completed first. The skill BLOCKS until each target's transcript shows the next assistant turn finishing OR per-step timeout fires (default 300s).
development
Center control panel — enumerate every claude session that is blocked waiting on something: a user answer (AskUserQuestion fired), an API error retry, an idle assistant turn-end with no follow-up, or an explicit WAITING: marker. Returns rich JSON with signal kind + context per session. Use this when you've stepped away from the fleet and want one place to see everything that wants your attention and answer it.