skills/alphago-deep-rl/SKILL.md
Strategic patterns for solving intractable problems through cascading approximation, self-improvement, and heterogeneous evaluation from DeepMind's AlphaGo system
npx skillsauth add curiositech/windags-skills alphago-deep-rlInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Problem characteristics:
├── Search space tractable (b^d < 10^9)
│ └── Use single best evaluator → standard optimization
└── Search space intractable (b^d > 10^12)
├── Uniform evaluation budget available
│ └── Use cascading approximation:
│ 1. Fast pruning (policy network) → narrow beam
│ 2. Medium evaluation (value network) → estimate quality
│ 3. Cheap exploration (rollouts) → breadth coverage
└── Non-uniform evaluation budget
└── Adaptive allocation:
├── Critical positions → expensive evaluation
└── Routine positions → cheap heuristics
System requirements:
├── Latency critical (<1ms response)
│ ├── Accuracy tolerance high (>10% error acceptable)
│ │ └── Fast rollouts only (2µs/evaluation)
│ └── Accuracy tolerance low (<5% error)
│ └── Cached policy network + heuristic fallback
├── Latency moderate (1-100ms response)
│ ├── High accuracy needed
│ │ └── Value network (3ms) + policy network guidance
│ └── Balanced accuracy/speed
│ └── Mixed evaluation: λ=0.5 value + rollouts
└── Latency tolerant (>100ms)
└── Full cascade: policy → value → rollouts with MCTS
When choosing multiple evaluators:
├── All evaluators have similar failure modes
│ └── Use single best evaluator (no diversity benefit)
├── Evaluators have complementary errors
│ ├── Fast/noisy + slow/accurate pairing
│ │ └── Mix at λ=0.3-0.7 (bias toward accurate)
│ ├── Conservative + exploratory pairing
│ │ └── Mix at λ=0.5 (equal weighting)
│ └── Learned + handcrafted pairing
│ └── Dynamic mixing based on confidence scores
└── Unknown error correlation
└── Empirical testing: measure performance on holdout set
Symptom: Adding more approximation layers but seeing diminishing returns Root cause: Each layer adds overhead without proportional accuracy gain Fix: Profile computational cost vs. accuracy improvement; remove layers with poor ROI Detection rule: If adding layer N improves accuracy <5% but increases latency >20%, you have schema bloat
Symptom: CPU utilization drops while waiting for GPU evaluation Root cause: Tight coupling between fast search and slow neural network evaluation Fix: Implement asynchronous queuing with stale value tolerance Detection rule: If fast components (search, rollouts) have >50% idle time, you have synchronous coupling
Symptom: Move prediction accuracy improves but win rate decreases Root cause: Optimizing for human imitation instead of objective success Fix: Switch from supervised learning to reinforcement learning with true objective Detection rule: If proxy metric and outcome metric trend in opposite directions >10 evaluations, you have proxy fixation
Symptom: Discarding rollouts because value network has higher individual accuracy Root cause: Assuming best individual component makes best system component Fix: Test mixture performance empirically; diversity often beats individual quality Detection rule: If you're selecting components by individual performance without testing combinations, you have single evaluator fallacy
Symptom: Spending equal evaluation time on obvious moves and critical positions Root cause: Not adapting computational budget to decision importance Fix: Implement adaptive allocation based on position uncertainty and outcome sensitivity Detection rule: If evaluation time variance across positions <20% of mean, you have uniform allocation
Scenario: Design AI system for reviewing pull requests with 50-500 changed lines, need <10s response time, must catch 95% of bugs.
Step 1: Analyze computational constraints
Step 2: Apply cascading decision tree Following "search space intractable" path since exhaustive analysis exceeds budget:
Step 3: Budget allocation strategy
Step 4: Mixture evaluation implementation Instead of using "best" evaluator (semantic analysis), mix ML + semantic scores:
final_score = 0.6 * semantic + 0.4 * mlNovice approach: Run semantic analysis on everything → timeout after first few files Expert insight: The cascade allows spending expensive compute only where it matters most, while ensuring every file gets some evaluation
[ ] N evaluators tested: Minimum 3 evaluators with different speed/accuracy profiles implemented and benchmarked
[ ] Error correlation measured: Pairwise correlation matrix computed showing evaluators have <0.7 correlation on test set
[ ] Mixture vs. single evaluated: Empirical comparison showing mixture outperforms best individual component by >5%
[ ] Latency requirements met: 95th percentile response time under specified threshold with full cascade
[ ] Accuracy requirements met: Error rate on holdout set meets specified threshold (e.g., <5% false negative rate)
[ ] Budget allocation profiled: Computational cost breakdown shows budget distributed according to decision importance
[ ] Failure mode testing: System tested under adversarial conditions (distribution shift, resource constraints, edge cases)
[ ] Asynchronous coordination verified: Fast components maintain >80% utilization even when slow components are saturated
[ ] Scalability validated: Performance degrades gracefully when increasing load 10x beyond nominal capacity
[ ] Monitoring instrumentation: Metrics track proxy vs. outcome alignment, component utilization, and cascade effectiveness
Do NOT use this skill for:
Delegate to other skills when:
This skill focuses specifically on: Architectural patterns for coordinating multiple imperfect evaluators under computational constraints, not the implementation details of individual components.
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.