plugins/yzmir-deep-rl/skills/using-deep-rl/SKILL.md
Routes to appropriate deep-RL skills based on problem type and algorithm family
Invoke this meta-skill when you encounter:
This is the entry point for the deep-rl pack. It routes to 13 specialized skills based on problem characteristics.
IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.
When this skill is loaded from:
skills/using-deep-rl/SKILL.md
Reference sheets like rl-foundations.md are at:
skills/using-deep-rl/rl-foundations.md
NOT at:
skills/rl-foundations.md ← WRONG PATH
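The path rule above can be sketched in a few lines. This is a minimal illustration rather than part of the skill pack: the helper name `resolve_reference_sheet` is hypothetical, and the only assumption is that the loader knows the path of the SKILL.md file it read.

```python
from pathlib import Path


def resolve_reference_sheet(skill_file: str, sheet_name: str) -> Path:
    """Return the path of a reference sheet that sits next to SKILL.md.

    Hypothetical helper for illustration only.
    """
    # Reference sheets live in the SAME directory as SKILL.md, so resolve
    # against the file's parent directory, not against the skills/ root.
    return Path(skill_file).parent / sheet_name


sheet = resolve_reference_sheet("skills/using-deep-rl/SKILL.md", "rl-foundations.md")
print(sheet.as_posix())  # skills/using-deep-rl/rl-foundations.md
```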
Problem type determines algorithm family.
The correct approach depends on:
Always clarify the problem BEFORE suggesting algorithms.
Why foundations first: Cannot implement algorithms without understanding MDPs, Bellman equations, and exploration-exploitation tradeoffs.
Discrete action spaces:

| Condition | Route To | Why |
|-----------|----------|-----|
| Small action space (< 100) + online | value-based-methods (DQN) | Q-networks excel at discrete |
| Large action space OR need policy flexibility | policy-gradient-methods (PPO) | Scales to larger spaces |

Continuous action spaces:

| Condition | Route To | Why |
|-----------|----------|-----|
| Sample efficiency critical | actor-critic-methods (SAC) | Off-policy, automatic entropy |
| Stability critical | actor-critic-methods (TD3) | Deterministic, handles overestimation |
| Simplicity preferred | policy-gradient-methods (PPO) | On-policy, simpler |
CRITICAL: NEVER suggest DQN for continuous actions. DQN requires discrete actions.
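One way to make this rule hard to violate is a small compatibility check that runs before any algorithm is suggested. The sketch below is illustrative only: `check_algorithm` and `DISCRETE_ONLY` are hypothetical names, and the set contains only DQN because that is the one constraint stated here.

```python
# Assumption for illustration: only DQN is listed as discrete-only,
# because that is the one rule the text states explicitly.
DISCRETE_ONLY = {"DQN"}


def check_algorithm(algorithm: str, action_type: str) -> str:
    """Return "ok", or a redirect message when the pairing is invalid."""
    if action_type not in ("discrete", "continuous"):
        raise ValueError(f"unknown action type: {action_type!r}")
    if action_type == "continuous" and algorithm.upper() in DISCRETE_ONLY:
        return (
            f"{algorithm} requires discrete actions; route to "
            "actor-critic-methods (SAC, TD3) or policy-gradient-methods (PPO)"
        )
    return "ok"


print(check_algorithm("DQN", "discrete"))    # ok
print(check_algorithm("DQN", "continuous"))  # prints the redirect message
```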
→ offline-rl (CQL, IQL)
Red Flag: If the user has a fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume online interaction with the environment and will fail on a fixed dataset because of distribution shift.
| Problem | Route To | Key Consideration |
|---------|----------|-------------------|
| Multiple agents | multi-agent-rl | Non-stationarity, credit assignment |
| Sample efficiency extreme | model-based-rl (DreamerV3, TD-MPC2) or actor-critic (DroQ, CrossQ) | Learns env model OR high UTD |
| Counterfactual/causal | counterfactual-reasoning | HER, off-policy evaluation |
| Agentic / LLM tool-use RL | policy-gradient-methods (GRPO) → yzmir-llm-specialist | Outcome-supervised, sparse reward |
| Problem | Route To | Why |
|---------|----------|-----|
| "Not learning" / reward flat | rl-debugging FIRST | 80% of issues are bugs, not algorithms |
| Exploration problems | exploration-strategies | Curiosity, RND, intrinsic motivation |
| Reward design issues | reward-shaping | Potential-based shaping, inverse RL |
| Environment setup | rl-environments | Gym API, wrappers, vectorization |
| Evaluation questions | rl-evaluation | Deterministic vs stochastic, multiple seeds |
Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first.
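The "debugging first" ordering can be sketched as a lookup plus a reordering step. This is a hedged illustration: the symptom keys and the `route_support` function are hypothetical names, and only the skill names come from the table above.

```python
from typing import List

# Symptom keys are hypothetical labels; skill names are taken from the table.
SUPPORT_ROUTES = {
    "not_learning": "rl-debugging",
    "exploration": "exploration-strategies",
    "reward_design": "reward-shaping",
    "environment": "rl-environments",
    "evaluation": "rl-evaluation",
}


def route_support(symptoms: List[str]) -> List[str]:
    """Map symptoms to skills, always placing rl-debugging first if present."""
    routes: List[str] = []
    for symptom in symptoms:
        skill = SUPPORT_ROUTES[symptom]  # KeyError on an unknown symptom
        if skill not in routes:
            routes.append(skill)
    # Rule out bugs before looking at exploration, rewards, or algorithms.
    if "rl-debugging" in routes:
        routes.remove("rl-debugging")
        routes.insert(0, "rl-debugging")
    return routes


print(route_support(["exploration", "not_learning"]))
# ['rl-debugging', 'exploration-strategies']
```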
| Rationalization | Reality | Counter-Guidance |
|-----------------|---------|------------------|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | Clarify: discrete or continuous? Sample efficiency constraints? |
| "DQN for continuous actions" | DQN requires discrete actions | Use SAC or TD3 for continuous |
| "Offline RL is just RL on a dataset" | Offline has distribution shift, needs special algorithms | Route to offline-rl for CQL, IQL |
| "More data always helps" | Sample efficiency and distribution matter | Off-policy vs on-policy matters |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | Route to rl-debugging first |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | Use actor-critic-methods |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | Route to exploration-strategies |
| "I'll just increase the reward when it doesn't learn" | Reward scaling breaks learning | Route to rl-debugging |
| "I can reuse online RL code for offline data" | Offline needs conservative algorithms | Route to offline-rl |
| "Test reward lower than training = overfitting" | Exploration vs exploitation difference | Route to rl-evaluation |
Watch for these signs of incorrect routing:
If any red flag triggered → STOP → Ask diagnostic questions → Route correctly
START: RL problem
├─ Need foundations? → rl-foundations
│
├─ DISCRETE actions?
│ ├─ Small space + online → value-based-methods (DQN)
│ └─ Large space → policy-gradient-methods (PPO)
│
├─ CONTINUOUS actions?
│ ├─ Sample efficiency → actor-critic-methods (SAC)
│ ├─ Stability → actor-critic-methods (TD3)
│ └─ Simplicity → policy-gradient-methods (PPO)
│
├─ OFFLINE data? → offline-rl (CQL, IQL) [CRITICAL]
│
├─ MULTI-AGENT? → multi-agent-rl
│
├─ Sample efficiency EXTREME? → model-based-rl
│
├─ COUNTERFACTUAL? → counterfactual-reasoning
│
└─ DEBUGGING?
   ├─ Not learning → rl-debugging
   ├─ Exploration → exploration-strategies
   ├─ Reward design → reward-shaping
   ├─ Environment → rl-environments
   └─ Evaluation → rl-evaluation
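The problem-type branches of this tree can be sketched as a single routing function. This is a minimal sketch under stated assumptions, not the pack's implementation: the `Problem` dataclass, its field names, and `route` are hypothetical, and the check order (offline data and other special cases before action type) is an assumption that follows the red-flag rule about fixed datasets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    """Hypothetical description of an RL problem, for illustration only."""
    action_type: str                      # "discrete" or "continuous"
    num_actions: Optional[int] = None     # discrete problems only
    priority: str = "simplicity"          # continuous: see CONTINUOUS_ROUTES
    needs_foundations: bool = False
    offline_data: bool = False
    multi_agent: bool = False
    extreme_sample_efficiency: bool = False
    counterfactual: bool = False


CONTINUOUS_ROUTES = {
    "sample_efficiency": "actor-critic-methods",  # SAC
    "stability": "actor-critic-methods",          # TD3
    "simplicity": "policy-gradient-methods",      # PPO
}


def route(p: Problem) -> str:
    """Return the skill to load for a problem."""
    if p.needs_foundations:
        return "rl-foundations"
    # Assumed ordering: special problem types override action-type routing.
    if p.offline_data:
        return "offline-rl"
    if p.multi_agent:
        return "multi-agent-rl"
    if p.extreme_sample_efficiency:
        return "model-based-rl"
    if p.counterfactual:
        return "counterfactual-reasoning"
    if p.action_type == "discrete":
        # Offline data was handled above, so this branch is online.
        if p.num_actions is not None and p.num_actions < 100:
            return "value-based-methods"          # DQN
        return "policy-gradient-methods"          # PPO
    if p.action_type == "continuous":
        if p.priority not in CONTINUOUS_ROUTES:
            raise ValueError(f"unknown priority: {p.priority!r}")
        return CONTINUOUS_ROUTES[p.priority]
    raise ValueError(f"unknown action type: {p.action_type!r}")


print(route(Problem(action_type="discrete", num_actions=4)))      # value-based-methods
print(route(Problem(action_type="continuous", offline_data=True)))  # offline-rl
```

Checking `offline_data` before the action type means a fixed dataset can never fall through to DQN, PPO, or SAC, which is the failure the offline red flag guards against.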
| User Request | Correct Pack | Reason |
|--------------|--------------|--------|
| "Train classifier on labeled data" | training-optimization | Supervised learning |
| "Design transformer architecture" | neural-architectures | Architecture design |
| "Deploy model to production" | ml-production | Deployment |
| "Fine-tune LLM with RLHF / DPO / GRPO on prompts" | llm-specialist | LLM-specific tooling (TRL, reward models, KL schedules) |
| "Preference optimization (DPO/IPO/KTO/SimPO)" | llm-specialist | Not policy-gradient; route out |
Note on GRPO: The algorithm is covered in policy-gradient-methods.md because it is a general PG technique. The LLM-specific recipe (reward models, format rewards, length bias, trainer integration) lives in yzmir-llm-specialist.
See multi-skill-scenarios.md for detailed routing sequences:
After routing, load the appropriate specialist skill for detailed guidance: