Using Deep RL Meta-Skill

When to Use This Skill

Invoke this meta-skill when you encounter:

RL Implementation: Implementing reinforcement learning algorithms (Q-learning, DQN, PPO, SAC, etc.)
Agent Training: Training agents in environments (games, robotics, control systems)
Sequential Decision-Making: Problems requiring learning from trial and error
Policy Optimization: Learning policies that maximize cumulative rewards
RL Debugging: Debugging training issues, agents not learning, reward problems

This is the entry point for the deep-rl pack. It routes to 13 specialized skills based on problem characteristics.

How to Access Reference Sheets

IMPORTANT: All reference sheets are located in the SAME DIRECTORY as this SKILL.md file.

When this skill is loaded from: skills/using-deep-rl/SKILL.md

Reference sheets like rl-foundations.md are at: skills/using-deep-rl/rl-foundations.md

NOT at: skills/rl-foundations.md ← WRONG PATH
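
The path rule above amounts to resolving each sheet relative to the directory that holds SKILL.md. A minimal sketch, assuming POSIX-style paths; the helper name `reference_sheet_path` is hypothetical and not part of the pack:

```python
from pathlib import PurePosixPath


def reference_sheet_path(skill_md: str, sheet: str) -> str:
    """Resolve a reference sheet next to SKILL.md (hypothetical helper)."""
    # Sheets sit in the same directory as SKILL.md, not one level up.
    return str(PurePosixPath(skill_md).parent / sheet)


print(reference_sheet_path("skills/using-deep-rl/SKILL.md", "rl-foundations.md"))
# skills/using-deep-rl/rl-foundations.md
```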

Core Principle

Problem type determines algorithm family.

The correct approach depends on:

Action Space: Discrete (button presses) vs Continuous (joint angles)
Data Regime: Online (interact with environment) vs Offline (fixed dataset)
Experience Level: Need foundations vs ready to implement
Special Requirements: Multi-agent, model-based, exploration, reward design

Always clarify the problem BEFORE suggesting algorithms.

The 13 Deep RL Skills

rl-foundations - MDP formulation, Bellman equations, value vs policy basics
value-based-methods - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow, R2D2/Agent57/BBF
policy-gradient-methods - REINFORCE, PPO, TRPO, GRPO, policy optimization
actor-critic-methods - A2C, A3C, SAC, TD3, REDQ/DroQ/CrossQ, advantage functions
model-based-rl - World models, MBPO, Dreamer / DreamerV3, TD-MPC2, MuZero / EfficientZero
offline-rl - CQL, IQL, BCQ, TD3+BC, AWAC, Decision Transformer; D4RL → Minari
multi-agent-rl - QMIX, MADDPG, MAPPO/IPPO, PettingZoo, SMACv2
exploration-strategies - ε-greedy, UCB, ICM, RND, Go-Explore, NGU/Agent57, BYOL-Explore
reward-shaping - Reward design, potential-based shaping, inverse RL
counterfactual-reasoning - Causal inference, HER, off-policy evaluation, twin networks
rl-debugging - Common RL bugs, why not learning, systematic debugging
rl-environments - Gymnasium, MuJoCo, PettingZoo, Brax, Isaac Lab, EnvPool, Minari
rl-evaluation - Evaluation methodology, variance, sample efficiency metrics

Routing Decision Framework

Step 1: Assess Experience Level

If user asks "what is RL" or "how does RL work" → rl-foundations
If confused about value vs policy, on-policy vs off-policy → rl-foundations
If user has specific problem and RL background → Continue to Step 2

Why foundations first: Cannot implement algorithms without understanding MDPs, Bellman equations, and exploration-exploitation tradeoffs.

Step 2: Classify Action Space

Discrete Actions (buttons, menu selections, discrete signals)

| Condition | Route To | Why |
|-----------|----------|-----|
| Small action space (< 100) + online | value-based-methods (DQN) | Q-networks excel at discrete |
| Large action space OR need policy flexibility | policy-gradient-methods (PPO) | Scales to larger spaces |

Continuous Actions (joint angles, motor forces, steering)

| Condition | Route To | Why |
|-----------|----------|-----|
| Sample efficiency critical | actor-critic-methods (SAC) | Off-policy, automatic entropy |
| Stability critical | actor-critic-methods (TD3) | Deterministic, handles overestimation |
| Simplicity preferred | policy-gradient-methods (PPO) | On-policy, simpler |

CRITICAL: NEVER suggest DQN for continuous actions. DQN requires discrete actions.
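
The reason behind this rule is that DQN picks its greedy action by comparing one Q-value per action, which only works when the action set is finite. A minimal sketch with made-up Q-values; the numbers and the `greedy_action` name are illustrative assumptions, not output from a trained network:

```python
from typing import List


def greedy_action(q_values: List[float]) -> int:
    """Return the index of the largest Q-value: an argmax over a finite action set."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# One Q-value per discrete action (illustrative numbers only).
q_values = [0.1, 1.7, -0.3, 0.9]
print(greedy_action(q_values))  # 1
```

With continuous actions there is no finite list to enumerate, which is why those problems are routed to actor-critic-methods or policy-gradient-methods instead.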

Step 3: Identify Data Regime

Online Learning (Agent Interacts with Environment)

Discrete → value-based-methods OR policy-gradient-methods
Continuous → actor-critic-methods
Sample efficiency critical → Consider model-based-rl

Offline Learning (Fixed Dataset, No Interaction)

→ offline-rl (CQL, IQL)

Red Flag: If user has fixed dataset and suggests DQN/PPO/SAC, STOP and route to offline-rl. Standard algorithms assume they can collect fresh data from the environment; trained on a fixed dataset they typically fail because of distribution shift.
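
The red flag above can be written as a guard that overrides any online algorithm once the data is a fixed dataset. This is a sketch under assumptions: `route_data_regime` and `ONLINE_ALGORITHMS` are hypothetical names, and the set lists only the algorithms named in the red flag.

```python
from typing import Optional

# Algorithms named in the red flag; all assume the agent can collect new data.
ONLINE_ALGORITHMS = {"DQN", "PPO", "SAC"}


def route_data_regime(can_interact: bool, proposed_algorithm: Optional[str] = None) -> str:
    """Return a Step 3 routing note (hypothetical helper)."""
    if can_interact:
        return "online: continue with action-space routing"
    if proposed_algorithm in ONLINE_ALGORITHMS:
        # Fixed dataset plus a standard online algorithm: stop and reroute.
        return f"STOP: {proposed_algorithm} assumes online interaction; route to offline-rl (CQL, IQL)"
    return "route to offline-rl (CQL, IQL)"


print(route_data_regime(can_interact=False, proposed_algorithm="SAC"))
# STOP: SAC assumes online interaction; route to offline-rl (CQL, IQL)
```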

Step 4: Special Problem Types

| Problem | Route To | Key Consideration |
|---------|----------|-------------------|
| Multiple agents | multi-agent-rl | Non-stationarity, credit assignment |
| Sample efficiency extreme | model-based-rl (DreamerV3, TD-MPC2) or actor-critic (DroQ, CrossQ) | Learns env model OR high UTD |
| Counterfactual/causal | counterfactual-reasoning | HER, off-policy evaluation |
| Agentic / LLM tool-use RL | policy-gradient-methods (GRPO) → yzmir-llm-specialist | Outcome-supervised, sparse reward |

Step 5: Debugging and Infrastructure

| Problem | Route To | Why |
|---------|----------|-----|
| "Not learning" / reward flat | rl-debugging FIRST | 80% of issues are bugs, not algorithms |
| Exploration problems | exploration-strategies | Curiosity, RND, intrinsic motivation |
| Reward design issues | reward-shaping | Potential-based shaping, inverse RL |
| Environment setup | rl-environments | Gym API, wrappers, vectorization |
| Evaluation questions | rl-evaluation | Deterministic vs stochastic, multiple seeds |

Red Flag: If user immediately wants to change algorithms because "it's not learning," route to rl-debugging first.

Rationalization Resistance Table

| Rationalization | Reality | Counter-Guidance |
|-----------------|---------|------------------|
| "Just use PPO for everything" | PPO is general but not optimal for all cases | Clarify: discrete or continuous? Sample efficiency constraints? |
| "DQN for continuous actions" | DQN requires discrete actions | Use SAC or TD3 for continuous |
| "Offline RL is just RL on a dataset" | Offline has distribution shift, needs special algorithms | Route to offline-rl for CQL, IQL |
| "More data always helps" | Sample efficiency and distribution matter | Off-policy vs on-policy matters |
| "My algorithm isn't learning, I need a better one" | Usually bugs, not algorithm | Route to rl-debugging first |
| "I'll discretize continuous actions for DQN" | Discretization loses precision, explodes action space | Use actor-critic-methods |
| "Epsilon-greedy is enough for exploration" | Complex environments need sophisticated exploration | Route to exploration-strategies |
| "I'll just increase the reward when it doesn't learn" | Reward scaling breaks learning | Route to rl-debugging |
| "I can reuse online RL code for offline data" | Offline needs conservative algorithms | Route to offline-rl |
| "Test reward lower than training = overfitting" | Exploration vs exploitation difference | Route to rl-evaluation |
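
The discretization row can be checked with simple arithmetic: splitting each of d action dimensions into k bins gives k^d joint actions. The 7-joint arm with 11 bins per joint below is a made-up example, and `discretized_action_count` is a hypothetical helper name.

```python
def discretized_action_count(bins_per_dim: int, dims: int) -> int:
    """Number of joint actions when every dimension is split into the same number of bins."""
    return bins_per_dim ** dims


print(discretized_action_count(11, 1))  # 11
print(discretized_action_count(11, 7))  # 19487171
```

A single dimension stays well under the "small (< 100)" threshold used for DQN routing, while seven dimensions at the same resolution do not.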

Red Flags Checklist

Watch for these signs of incorrect routing:

[ ] Algorithm-First Thinking: Recommending algorithm before asking about action space, data regime
[ ] DQN for Continuous: Suggesting DQN/Q-learning for continuous action spaces
[ ] Offline Blindness: Not recognizing fixed dataset requires offline-rl
[ ] PPO Cargo-Culting: Defaulting to PPO without considering alternatives
[ ] No Problem Characterization: Not asking: discrete vs continuous? online vs offline?
[ ] Skipping Foundations: Implementing algorithms when user doesn't understand RL basics
[ ] Debug-Last: Suggesting algorithm changes before systematic debugging
[ ] Sample Efficiency Ignorance: Not asking about sample constraints

If any red flag triggered → STOP → Ask diagnostic questions → Route correctly
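
A few of the checklist items can be tested mechanically before a recommendation goes out. The sketch below covers only four of the eight flags, and the `RoutingAttempt` fields and the `triggered_red_flags` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoutingAttempt:
    """What is known when a recommendation is about to be made (hypothetical structure)."""
    proposed_algorithm: Optional[str] = None
    action_space: Optional[str] = None   # "discrete" or "continuous"
    data_regime: Optional[str] = None    # "online" or "offline"
    sample_budget_known: bool = False
    changing_algorithm: bool = False
    debugging_done: bool = False


def triggered_red_flags(a: RoutingAttempt) -> List[str]:
    """Return the names of the checklist items this attempt violates."""
    flags = []
    # An algorithm was named before the problem was characterized.
    if a.proposed_algorithm and (a.action_space is None or a.data_regime is None):
        flags.append("Algorithm-First Thinking")
    if a.proposed_algorithm in {"DQN", "Q-learning"} and a.action_space == "continuous":
        flags.append("DQN for Continuous")
    if a.changing_algorithm and not a.debugging_done:
        flags.append("Debug-Last")
    if not a.sample_budget_known:
        flags.append("Sample Efficiency Ignorance")
    return flags


attempt = RoutingAttempt(proposed_algorithm="DQN", action_space="continuous")
print(triggered_red_flags(attempt))
# ['Algorithm-First Thinking', 'DQN for Continuous', 'Sample Efficiency Ignorance']
```

A non-empty result corresponds to the STOP step: ask the diagnostic questions, fill in the missing fields, then route.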

Routing Decision Tree Summary

START: RL problem

├─ Need foundations? → rl-foundations
│
├─ DISCRETE actions?
│  ├─ Small space + online → value-based-methods (DQN)
│  └─ Large space → policy-gradient-methods (PPO)
│
├─ CONTINUOUS actions?
│  ├─ Sample efficiency → actor-critic-methods (SAC)
│  ├─ Stability → actor-critic-methods (TD3)
│  └─ Simplicity → policy-gradient-methods (PPO)
│
├─ OFFLINE data? → offline-rl (CQL, IQL) [CRITICAL]
│
├─ MULTI-AGENT? → multi-agent-rl
│
├─ Sample efficiency EXTREME? → model-based-rl
│
├─ COUNTERFACTUAL? → counterfactual-reasoning
│
└─ DEBUGGING?
   ├─ Not learning → rl-debugging
   ├─ Exploration → exploration-strategies
   ├─ Reward design → reward-shaping
   ├─ Environment → rl-environments
   └─ Evaluation → rl-evaluation
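
The tree above can be sketched as a single routing function. The tree itself does not fix an order between branches, so the order used here (foundations, then debugging, then the special cases, then action space) is an assumption, as are the function name `route` and all parameter names.

```python
from typing import Optional

DEBUG_ROUTES = {
    "not_learning": "rl-debugging",
    "exploration": "exploration-strategies",
    "reward_design": "reward-shaping",
    "environment": "rl-environments",
    "evaluation": "rl-evaluation",
}

CONTINUOUS_ROUTES = {
    "sample_efficiency": "actor-critic-methods (SAC)",
    "stability": "actor-critic-methods (TD3)",
    "simplicity": "policy-gradient-methods (PPO)",
}


def route(
    action_space: Optional[str] = None,  # "discrete" or "continuous"
    *,
    need_foundations: bool = False,
    debug_topic: Optional[str] = None,   # a key of DEBUG_ROUTES
    offline: bool = False,
    multi_agent: bool = False,
    counterfactual: bool = False,
    extreme_sample_efficiency: bool = False,
    small_action_space: bool = True,
    priority: str = "sample_efficiency",  # a key of CONTINUOUS_ROUTES
) -> str:
    """Map problem characteristics to a skill name (assumed branch order)."""
    if need_foundations:
        return "rl-foundations"
    if debug_topic is not None:
        return DEBUG_ROUTES[debug_topic]
    if offline:
        # Checked before action space: a fixed dataset overrides DQN/PPO/SAC.
        return "offline-rl (CQL, IQL)"
    if multi_agent:
        return "multi-agent-rl"
    if counterfactual:
        return "counterfactual-reasoning"
    if extreme_sample_efficiency:
        return "model-based-rl"
    if action_space == "discrete":
        if small_action_space:
            return "value-based-methods (DQN)"
        return "policy-gradient-methods (PPO)"
    if action_space == "continuous":
        return CONTINUOUS_ROUTES[priority]
    raise ValueError("action space unknown: ask the diagnostic questions first")


print(route("continuous", priority="stability"))  # actor-critic-methods (TD3)
print(route("continuous", offline=True))          # offline-rl (CQL, IQL)
```

Raising an error when the action space is unknown mirrors the rule that problem characterization comes before algorithm selection.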

Diagnostic Questions

Action Space

"Discrete choices or continuous values?"
"How many actions? Small (< 100), large, or infinite?"

Data Regime

"Can agent interact with environment, or fixed dataset?"
"Online learning or offline?"

Experience Level

"New to RL, or specific problem?"
"Understand MDPs, value functions, policy gradients?"

Special Requirements

"Multiple agents? Cooperate or compete?"
"Sample efficiency critical? How many episodes?"
"Sparse reward (only at goal) or dense (every step)?"

When NOT to Use This Pack

| User Request | Correct Pack | Reason |
|--------------|--------------|--------|
| "Train classifier on labeled data" | training-optimization | Supervised learning |
| "Design transformer architecture" | neural-architectures | Architecture design |
| "Deploy model to production" | ml-production | Deployment |
| "Fine-tune LLM with RLHF / DPO / GRPO on prompts" | llm-specialist | LLM-specific tooling (TRL, reward models, KL schedules) |
| "Preference optimization (DPO/IPO/KTO/SimPO)" | llm-specialist | Not policy-gradient; route out |

Note on GRPO: The algorithm is covered in policy-gradient-methods.md because it is a general PG technique. The LLM-specific recipe (reward models, format rewards, length bias, trainer integration) lives in yzmir-llm-specialist.

Multi-Skill Scenarios

See multi-skill-scenarios.md for detailed routing sequences:

Complete beginner to RL
Continuous control (robotics)
Offline RL from dataset
Multi-agent cooperative task
Sample-efficient learning
Sparse reward problem
RL-controlled neural architecture

Final Reminders

Problem characterization BEFORE algorithm selection
DQN for discrete ONLY (never continuous)
Offline data needs offline-rl (CQL, IQL)
PPO is not universal (good general-purpose, not optimal everywhere)
Debug before changing algorithms (route to rl-debugging)
Ask questions, don't assume (action space? data regime?)

Deep RL Specialist Skills

After routing, load the appropriate specialist skill for detailed guidance:

rl-foundations.md - MDP formulation, Bellman equations, value vs policy basics
value-based-methods.md - Q-learning, DQN, Double DQN, Dueling DQN, Rainbow
policy-gradient-methods.md - REINFORCE, PPO, TRPO, policy optimization
actor-critic-methods.md - A2C, A3C, SAC, TD3, advantage functions
model-based-rl.md - World models, Dyna, MBPO, planning with learned models
offline-rl.md - Batch RL, CQL, IQL, learning from fixed datasets
multi-agent-rl.md - MARL, cooperative/competitive, communication
exploration-strategies.md - ε-greedy, UCB, curiosity, RND, intrinsic motivation
reward-shaping-engineering.md - Reward design, potential-based shaping, inverse RL
counterfactual-reasoning.md - Causal inference, HER, off-policy evaluation, twin networks
rl-debugging.md - Common RL bugs, why not learning, systematic debugging
rl-environments.md - Gym, MuJoCo, custom envs, wrappers, vectorization
rl-evaluation.md - Evaluation methodology, variance, sample efficiency metrics
multi-skill-scenarios.md - Common problem routing sequences

