engineering/advanced-ml-engineering/skills/reinforcement-learning/SKILL.md
This skill should be used when the user asks about "reinforcement learning", "RL", "reward function", "policy gradient", "PPO", "SAC", "DQN", "Q-learning", "actor-critic", "MDP", "Markov decision process", "environment design", "reward shaping", "exploration strategy", "experience replay", "multi-agent RL", "RLHF", "reward hacking", or when training an agent to interact with an environment to maximize cumulative reward.
Provides the complete framework for designing, training, and evaluating reinforcement learning agents, from MDP formalization through policy optimization to deployment in production control systems.
Every RL problem is formalized as a Markov Decision Process (S, A, P, R, γ):
| Component | Description | Design Considerations |
|---|---|---|
| S (State Space) | Observation the agent receives | Ensure Markov property; normalize to [0,1] or z-score |
| A (Action Space) | Controls available to the agent | Discrete → DQN/PPO; Continuous → SAC/PPO |
| P (Transition) | Dynamics of the environment | Usually unknown, learned through interaction |
| R (Reward) | Signal encoding the objective | Most critical design decision; see Reward Shaping below |
| γ (Discount) | Future reward discount factor | 0.99 for long-horizon tasks; 0.95 for shorter episodes |
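The effect of the discount factor γ in the table above can be made concrete with a short sketch. The helper names `discounted_return` and `effective_horizon` are illustrative and not part of this skill; the horizon formula 1 / (1 − γ) is a common rule of thumb, not a guarantee.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum over t of gamma**t * r_t for one episode's rewards."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma):
    """Rule-of-thumb number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    # Three unit rewards discounted at gamma = 0.5: 1 + 0.5 + 0.25.
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
    # Horizons implied by the two discount values suggested in the table.
    print(round(effective_horizon(0.99)), round(effective_horizon(0.95)))
```

Under this rule of thumb, γ = 0.99 weights rewards over roughly 100 steps and γ = 0.95 over roughly 20, which is consistent with the long-horizon versus shorter-episode guidance in the table.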
| Scenario | Algorithm | Key Advantage |
|---|---|---|
| Discrete actions, off-policy, sample efficient | DQN (+ PER + Dueling) | Replay buffer enables high sample efficiency |
| Continuous control, off-policy | SAC (Soft Actor-Critic) | Maximum entropy RL; stable training, automatic α tuning |
| Discrete or continuous, on-policy, stable | PPO | Clipped objective prevents destructive policy updates |
| Partial observability, memory required | PPO + LSTM backbone | Recurrent policy handles non-Markovian observations |
| Human preference alignment | RLHF (PPO + reward model) | Reward from human comparisons, not engineering |
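To illustrate why PPO's clipped objective limits destructive updates, here is a minimal pure-Python sketch of the clipped surrogate for a batch of samples. The function name and the default `clip_eps=0.2` are assumptions for illustration; a real training loop would compute this on tensors with automatic differentiation.

```python
import math


def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s).
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum makes the objective a pessimistic bound.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # The ratio has doubled with a positive advantage: the gain is capped.
    print(ppo_clipped_objective([math.log(2.0)], [0.0], [1.0]))
```

With a positive advantage, pushing the ratio beyond 1 + ε yields no additional objective, so the gradient stops rewarding larger policy changes. With a negative advantage and a ratio above 1 + ε, the unclipped term is kept, so the penalty is not hidden.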
The reward function is the most consequential RL design decision:
Potential-based shaping (safe — guarantees policy invariance):
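A minimal sketch of potential-based shaping, using the standard form F(s, s') = γΦ(s') − Φ(s) added to the environment reward. The potential used here (negative distance to a goal on a 1-D line) and all function names are hypothetical, chosen only to make the example runnable.

```python
def potential(state, goal=10.0):
    """Hypothetical potential: negative distance to a goal on a 1-D line."""
    return -abs(goal - state)


def shaping_term(state, next_state, gamma):
    """F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * potential(next_state) - potential(state)


def shaped_reward(env_reward, state, next_state, gamma):
    """Environment reward plus the potential-based shaping term."""
    return env_reward + shaping_term(state, next_state, gamma)


def discounted_shaping_sum(states, gamma):
    """Discounted sum of shaping terms along one trajectory of states."""
    total = 0.0
    for t in range(len(states) - 1):
        total += (gamma ** t) * shaping_term(states[t], states[t + 1], gamma)
    return total


if __name__ == "__main__":
    trajectory = [0.0, 2.0, 5.0, 10.0]
    gamma = 0.9
    # The sum telescopes to gamma**T * phi(s_T) - phi(s_0).
    steps = len(trajectory) - 1
    expected = gamma ** steps * potential(trajectory[-1]) - potential(trajectory[0])
    print(discounted_shaping_sum(trajectory, gamma), expected)
```

Because the discounted shaping terms telescope, their total depends only on the start and end states of a trajectory and not on the path taken between them. This is the intuition behind the policy-invariance guarantee: the shaping can speed up learning without changing which policy is optimal.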
Common shaping components:
Red flags (reward hacking indicators):
See references/policy-gradient.md for PPO/SAC mathematical derivations and implementation details.
See references/value-based-methods.md for DQN, Double DQN, Dueling DQN, and Rainbow implementations.
See references/mdp-framework.md for environment design patterns and Gym interface specifications.
Apply all of the following to any RL training run: