
harsh040506/reinforcement-learning

Name: reinforcement-learning
Author: harsh040506

engineering/advanced-ml-engineering/skills/reinforcement-learning/SKILL.md

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library reinforcement-learning


Reinforcement Learning — Full-Stack RL Engine

Provides the complete framework for designing, training, and evaluating reinforcement learning agents, from MDP formalization through policy optimization to deployment in production control systems.

MDP Framework

Every RL problem is formalized as a Markov Decision Process (S, A, P, R, γ):

| Component | Description | Design Considerations |
|---|---|---|
| S (State Space) | Observation the agent receives | Ensure Markov property; normalize to [0,1] or z-score |
| A (Action Space) | Controls available to the agent | Discrete → DQN/PPO; Continuous → SAC/PPO |
| P (Transition) | Dynamics of the environment | Usually unknown, learned through interaction |
| R (Reward) | Signal encoding the objective | Most critical design decision; see Reward Shaping below |
| γ (Discount) | Future reward discount factor | 0.99 for long-horizon tasks; 0.95 for shorter episodes |
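To make the discount and normalization rows concrete, here is a minimal sketch in plain Python. The function names (`discounted_return`, `effective_horizon`, `zscore`) are illustrative and not part of the skill; the 1/(1 − γ) horizon is the usual rule of thumb rather than something the table states.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def zscore(observation, mean, std, eps=1e-8):
    """Z-score each feature of an observation using per-feature mean and std."""
    return [(x - m) / (s + eps) for x, m, s in zip(observation, mean, std)]


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(effective_horizon(0.99), effective_horizon(0.95))
    print(zscore([2.0, 10.0], mean=[1.0, 8.0], std=[1.0, 4.0]))
```

Under this rule of thumb, γ = 0.99 weighs roughly the next 100 steps and γ = 0.95 roughly the next 20, which is consistent with the long-horizon versus shorter-episode guidance in the table.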

Algorithm Selection Guide

| Scenario | Algorithm | Key Advantage |
|---|---|---|
| Discrete actions, off-policy, sample efficient | DQN (+ PER + Dueling) | Replay buffer enables high sample efficiency |
| Continuous control, off-policy | SAC (Soft Actor-Critic) | Maximum entropy RL; stable training, automatic α tuning |
| Discrete or continuous, on-policy, stable | PPO | Clipped objective prevents destructive policy updates |
| Partial observability, memory required | PPO + LSTM backbone | Recurrent policy handles non-Markovian observations |
| Human preference alignment | RLHF (PPO + reward model) | Reward from human comparisons, not engineering |
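The selection guide can be read as a small decision procedure. The sketch below encodes one possible reading; the function name, its parameters, and the order in which the rows are checked are assumptions made for illustration, since the table itself does not rank the scenarios.

```python
def select_algorithm(action_space, off_policy=True,
                     partially_observable=False, human_preferences=False):
    """Map a problem description to a row of the selection table.

    The precedence (preferences, then partial observability, then
    on-policy vs. off-policy) is an assumption, not part of the guide.
    """
    if action_space not in ("discrete", "continuous"):
        raise ValueError("action_space must be 'discrete' or 'continuous'")
    if human_preferences:
        return "RLHF (PPO + reward model)"
    if partially_observable:
        return "PPO + LSTM backbone"
    if not off_policy:
        return "PPO"
    if action_space == "discrete":
        return "DQN (+ PER + Dueling)"
    return "SAC"


if __name__ == "__main__":
    print(select_algorithm("discrete"))
    print(select_algorithm("continuous"))
    print(select_algorithm("continuous", off_policy=False))
    print(select_algorithm("discrete", partially_observable=True))
```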

Reward Shaping

The reward function is the most consequential RL design decision:

Potential-based shaping (safe — guarantees policy invariance):

- F(s, a, s') = γ·Φ(s') − Φ(s), where Φ is a potential function over states
- Does not change the optimal policy of the original MDP (Ng et al., 1999)
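A short sketch of potential-based shaping follows. The goal position, the choice of Φ as negative Euclidean distance to that goal, and all function names are hypothetical. The last helper checks the telescoping identity behind the invariance result: along a trajectory s_0 … s_T, the discounted shaping terms sum to γ^T·Φ(s_T) − Φ(s_0), which depends only on the endpoints.

```python
import math

GOAL = (3.0, 4.0)  # hypothetical goal position, for illustration only


def potential(state):
    """Phi(s): negative Euclidean distance to the goal (closer is higher)."""
    return -math.dist(state, GOAL)


def shaping_term(state, next_state, gamma):
    """F(s, a, s') = gamma * Phi(s') - Phi(s); independent of the action."""
    return gamma * potential(next_state) - potential(state)


def shaped_reward(reward, state, next_state, gamma):
    """Environment reward plus the potential-based shaping term."""
    return reward + shaping_term(state, next_state, gamma)


def discounted_shaping_sum(states, gamma):
    """Sum of gamma**t * F over consecutive state pairs of a trajectory."""
    return sum(
        gamma ** t * shaping_term(states[t], states[t + 1], gamma)
        for t in range(len(states) - 1)
    )


if __name__ == "__main__":
    trajectory = [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0), (3.0, 4.0)]
    gamma = 0.9
    steps = len(trajectory) - 1
    endpoints_only = gamma ** steps * potential(trajectory[-1]) - potential(trajectory[0])
    print(discounted_shaping_sum(trajectory, gamma), endpoints_only)
```

In this example Φ is zero at the goal, so the shaping contribution reduces to −Φ(s_0), the initial distance to the goal, regardless of the path taken.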

Common shaping components:

- Progress reward: proportional to movement toward the goal state
- Constraint penalty: large negative reward for safety violations
- Smoothness reward: penalize jerky or oscillatory actions (robotics)
- Curiosity bonus: intrinsic motivation for exploration in sparse-reward tasks
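The components above might be combined as in the sketch below. Every name and weight here (`w_progress`, `w_smooth`, `penalty`, `beta`) is a hypothetical placeholder to be tuned per task, and the count-based bonus is a simple stand-in for curiosity rather than a learned intrinsic-motivation model.

```python
import math


def composite_reward(prev_dist, dist, action, prev_action, violated,
                     w_progress=1.0, w_smooth=0.1, penalty=100.0):
    """Progress toward the goal, minus a smoothness cost and a safety penalty."""
    progress = w_progress * (prev_dist - dist)  # positive when moving closer
    smoothness = -w_smooth * sum((a - b) ** 2 for a, b in zip(action, prev_action))
    constraint = -penalty if violated else 0.0
    return progress + smoothness + constraint


def curiosity_bonus(visit_count, beta=0.1):
    """Count-based exploration bonus: beta / sqrt(N(s)), with N(s) >= 1."""
    return beta / math.sqrt(visit_count)


if __name__ == "__main__":
    print(composite_reward(5.0, 4.0, [0.2], [0.0], violated=False))
    print(composite_reward(5.0, 4.0, [0.2], [0.0], violated=True))
    print(curiosity_bonus(visit_count=4))
```

Unlike the potential-based term, these components are not guaranteed to preserve the optimal policy, so they are worth checking against the red flags listed next.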

Red flags (reward hacking indicators):

- Reward that can be maximized by a degenerate policy (e.g., agent finds a bug in the simulator)
- Reward achievable without learning the intended behavior

See references/policy-gradient.md for PPO/SAC mathematical derivations and implementation details. See references/value-based-methods.md for DQN, Double DQN, Dueling DQN, and Rainbow implementations. See references/mdp-framework.md for environment design patterns and Gym interface specifications.

Training Stability Guidelines

Apply all of the following to any RL training run:

- Gradient clipping (max_grad_norm = 0.5): prevents catastrophic policy updates
- Observation normalization: normalize with a running mean and variance maintained across episodes
- Reward normalization: divide rewards by their running standard deviation (do not subtract the mean)
- GAE (Generalized Advantage Estimation): λ = 0.95 for PPO; balances bias and variance in advantage estimates
- Entropy regularization: encourages exploration; adjust the coefficient during training
- Multiple seeds: always report results across 5+ random seeds as mean ± standard error
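Three of the guidelines above (running-statistics normalization, gradient clipping by global norm, and GAE) can be sketched in plain Python as follows. This is a framework-free illustration under stated assumptions: the class and function names are invented for the example, gradients are a flat list of floats, and a real training loop would use the equivalents provided by its deep learning library.

```python
import math


class RunningMeanStd:
    """Welford's online estimate of mean and variance for a scalar stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are enough samples to estimate spread.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 1.0


def normalize_reward(reward, stats, eps=1e-8):
    """Divide by the running std only; the mean is deliberately not subtracted."""
    return reward / (stats.std + eps)


def clip_by_global_norm(grads, max_grad_norm=0.5):
    """Rescale a flat gradient list so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)
    return [g * max_grad_norm / norm for g in grads]


def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout, computed backwards."""
    advantages = [0.0] * len(rewards)
    running, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    stats = RunningMeanStd()
    for r in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(r)
    print(stats.mean, stats.std, normalize_reward(4.0, stats))
    print(clip_by_global_norm([3.0, 4.0]))
    print(gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, [False, False, True]))
```

With λ = 1 the estimator reduces to the discounted return minus the value baseline (low bias, high variance), and with λ = 0 it reduces to the one-step TD error; λ = 0.95 sits between the two.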

Evaluation Protocol

1. Run 100 deterministic evaluation episodes (ε = 0 for DQN, deterministic policy for SAC/PPO).
2. Report: mean reward, std, min/max, success rate (if applicable), episode length.
3. Compare against baselines: random policy, hand-coded heuristic, previous best agent.
4. Plot the learning curve as episode reward vs. total environment steps (not wall-clock time).
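The reporting step of the protocol might look like the sketch below, which assumes the evaluation episodes have already been run and their returns, lengths, and success flags collected. The function names and the dictionary keys are illustrative choices, not a format the skill prescribes.

```python
import math
import statistics


def summarize_episodes(returns, lengths, successes=None):
    """Aggregate per-episode results into the metrics listed in the protocol."""
    report = {
        "mean_reward": statistics.fmean(returns),
        "std_reward": statistics.stdev(returns) if len(returns) > 1 else 0.0,
        "min_reward": min(returns),
        "max_reward": max(returns),
        "mean_length": statistics.fmean(lengths),
    }
    if successes is not None:  # only report when the task defines success
        report["success_rate"] = sum(successes) / len(successes)
    return report


def mean_and_std_error(per_seed_scores):
    """Mean and standard error of the mean across independent seeds."""
    mean = statistics.fmean(per_seed_scores)
    std_error = statistics.stdev(per_seed_scores) / math.sqrt(len(per_seed_scores))
    return mean, std_error


if __name__ == "__main__":
    print(summarize_episodes([1.0, 2.0, 3.0], [10, 20, 30], [True, False, True]))
    print(mean_and_std_error([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The same `summarize_episodes` call can be applied to each baseline (random policy, heuristic, previous best agent) so that all rows of the comparison use identical metrics.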


2 stars

development

Updated Apr 5, 2026


npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library reinforcement-learning

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 5, 2026, 5:10 PM · 4.6s · 4 files scanned

SKILL.md

name: reinforcement-learning
description: This skill should be used when the user asks about "reinforcement learning", "RL", "reward function", "policy gradient", "PPO", "SAC", "DQN", "Q-learning", "actor-critic", "MDP", "Markov decision process", "environment design", "reward shaping", "exploration strategy", "experience replay", "multi-agent RL", "RLHF", "reward hacking", or when training an agent to interact with an environment to maximize cumulative reward.
version: 1.0.0




Install manually


# Clone the repo
git clone https://github.com/harsh040506/claude-code-unified-skill-plugin-library.git

# Create the global Claude Code skills folder if it does not exist, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r claude-code-unified-skill-plugin-library/engineering/advanced-ml-engineering/skills/reinforcement-learning ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

harsh040506/claude-code-unified-skill-plugin-library

2 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT