skills/cost-verification-auditor/SKILL.md
Audit LLM token cost estimates against actual API usage. Activate on 'cost verification', 'token estimate accuracy', 'API cost audit', 'estimation variance'. NOT for pricing lookups, budget planning, or cost optimization strategies.
npx skillsauth add curiositech/windags-skills cost-verification-auditorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Verify that token cost estimates are within ±20% of actual Claude API usage.
✅ Use for:
❌ NOT for:
Has estimator? ──No──→ Build estimator first (see Calibration Guidelines)
│
Yes
↓
Define 3+ test cases (simple/medium/complex)
↓
Estimate BEFORE execution (no peeking!)
↓
Execute against real API
↓
Calculate variance: (actual - estimated) / estimated
↓
Variance ≤ ±20%? ──Yes──→ PASS ✓
│
No
↓
Apply fixes from Anti-Patterns section
↓
Re-run verification
const inputVariance = (actual.inputTokens - estimate.inputTokens) / estimate.inputTokens;
const outputVariance = (actual.outputTokens - estimate.outputTokens) / estimate.outputTokens;
const costVariance = (actual.totalCost - estimate.totalCost) / estimate.totalCost;
// PASS if both input AND output within ±20%
const passed = Math.abs(inputVariance) <= 0.20 && Math.abs(outputVariance) <= 0.20;
Novice thinking: "Claude Code adds ~500 tokens overhead, so add that to every estimate."
Reality: Direct API calls have ~10 token overhead. The 500+ overhead is ONLY when using Claude Code's full context (system prompts, tools, conversation history).
Timeline:
What to use instead: | Context | Overhead | |---------|----------| | Direct API call | ~10 tokens | | With system prompt | 50-200 tokens | | With tools/functions | 100-500 tokens | | Claude Code full context | 500-2000 tokens |
How to detect: Consistent 40-90% overestimation = overhead too high.
Novice thinking: "Every node must be within ±20% or the estimator is broken."
Reality: LLM output length is non-deterministic. Per-node output variance of 30-50% is normal. What matters is aggregate cost accuracy.
What to use instead:
How to detect: Input estimates accurate, output varies wildly = normal LLM behavior.
Novice thinking: "Let me run the API call first to see what tokens we get, then build the estimator."
Reality: This produces perfectly-fitted estimates that fail on new prompts. Estimation must happen BEFORE execution.
Correct approach:
// Calibrated 2026-01-30
const inputTokens = Math.ceil(prompt.length / CHARS_PER_TOKEN) + OVERHEAD;
| Text Type | CHARS_PER_TOKEN | Notes | |-----------|-----------------|-------| | English prose | 4.0 | Most consistent | | Code | 3.0-3.5 | Symbols tokenize differently | | Mixed | 3.5 | Balanced (recommended default) | | JSON/structured | 3.0 | Punctuation heavy |
| Prompt Constraint | Multiplier | Notes | |-------------------|------------|-------| | "List exactly N items" | 0.8x input | Highly constrained | | "Brief summary" | 1.0x input | Moderate | | "Explain in detail" | 2-3x input | Expansive | | Unconstrained | 1.5x input | Variable |
Always: Minimum 100 output tokens for any meaningful response.
| Model | Output Tendency | |-------|-----------------| | Claude Opus | Longer, more detailed | | Claude Sonnet | Balanced | | Claude Haiku | Concise, efficient |
| Symptom | Cause | Fix | |---------|-------|-----| | Overestimating by 40%+ | Overhead too high | Reduce from 500 → 10 | | Underestimating inputs | Chars/token too high | Reduce from 4.0 → 3.5 | | Output wildly varies | LLM non-determinism | Use constrained prompts | | Total cost accurate but per-node off | Normal aggregation | Accept it, focus on totals |
(actual - estimated) / estimatedSee /references/calibration-data.md for detailed calibration tables and historical data.
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.