Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

abelrguezr/reinforcement-learning-algorithms

Name: reinforcement-learning-algorithms
Author: abelrguezr

skills/AI/AI-Reinforcement-Learning-Algorithms/SKILL.md

npx skillsauth add abelrguezr/hacktricks-skills reinforcement-learning-algorithms

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Reinforcement Learning Algorithms

This skill helps you implement, understand, and secure reinforcement learning systems. It covers core algorithms (Q-Learning, SARSA), their differences, and critical security considerations for production RL systems.

Quick Start

# Implement Q-Learning from scratch
python scripts/q_learning.py --env grid-world --episodes 1000

# Implement SARSA with softmax exploration
python scripts/sarsa.py --env grid-world --exploration softmax --tau 0.5

# Check for backdoor patterns in trained policy
python scripts/backdoor_detector.py --policy model.pkl --canary-episodes 50

Core Concepts

What is Reinforcement Learning?

Reinforcement learning (RL) is where an agent learns to make decisions by interacting with an environment. The agent receives feedback as rewards or penalties, allowing it to learn optimal behaviors over time.

Key components:

Agent: The learner/decision maker
Environment: The world the agent interacts with
State (s): Current situation
Action (a): What the agent does
Reward (r): Feedback signal
Policy: Strategy for choosing actions

When to Use RL

RL excels at sequential decision-making problems:

Robotics and control systems
Game playing
Autonomous systems
Resource allocation
Recommendation systems with long-term goals

Q-Learning Algorithm

Q-Learning is a model-free, off-policy algorithm that learns the value of actions in each state using a Q-table.

How It Works

Initialize Q-table with zeros or small random values
Select action using exploration strategy (ε-greedy)
Execute action, observe next state and reward

Update Q-value using Bellman equation:

Q(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))

Repeat until convergence

Parameters Explained

| Parameter | Symbol | Range | Purpose | |-----------|--------|-------|---------| | Learning rate | α | 0 < α ≤ 1 | How much new info overrides old | | Discount factor | γ | 0 ≤ γ < 1 | Importance of future rewards | | Exploration rate | ε | 0 ≤ ε ≤ 1 | Probability of random action |

Key insight: Q-Learning uses max(Q(s', a')) - the best possible future value - regardless of what action the current policy would take. This makes it off-policy.

Implementation Tips

Start with ε = 1.0 and decay to 0.1 over training
Higher α = faster learning but potentially unstable
γ closer to 1 = agent values long-term rewards more
For large state spaces, use function approximation (neural networks)

SARSA Algorithm

SARSA (State-Action-Reward-State-Action) is a model-free, on-policy algorithm similar to Q-Learning but with a key difference in the update rule.

How It Works

Initialize Q-table
Select action using current policy (ε-greedy or softmax)
Execute action, observe next state and reward
Select next action a' using current policy

Update Q-value:

Q(s, a) = Q(s, a) + α * (r + γ * Q(s', a') - Q(s, a))

Repeat until convergence

Key Difference from Q-Learning

SARSA uses Q(s', a') - the value of the action actually taken in the next state, not the maximum. This makes it on-policy - it learns from the actions the current policy would actually take.

Action Selection Strategies

ε-Greedy

With probability ε: choose random action (explore)
With probability 1-ε: choose best-known action (exploit)

Softmax (Boltzmann)

Probability proportional to Q-value:

P(a|s) = exp(Q(s, a) / τ) / Σ(exp(Q(s, a') / τ))

τ (temperature) controls exploration:
- High τ = more uniform probabilities (explore)
- Low τ = favor high Q-values (exploit)

On-Policy vs Off-Policy

| Aspect | On-Policy (SARSA) | Off-Policy (Q-Learning) | |--------|-------------------|-------------------------| | Update uses | Actual next action | Best possible action | | Stability | More stable in some environments | Can be more aggressive | | Convergence | May be slower | Often faster | | Use case | When policy matters during learning | When you want optimal policy |

Security Considerations

RL systems are vulnerable to training-time attacks. Understanding these is critical for production deployments.

Training-Time Backdoors

How they work:

Attacker injects poisoned trajectories with trigger states
When trigger appears, agent performs attacker-chosen behavior
Clean performance remains normal, hiding the backdoor

Detection strategies:

Inspect reward deltas per state - abrupt local improvements are suspicious
Maintain a canary trigger set - hold-out episodes with rare states
Verify each policy independently before aggregation in multi-agent settings

Reward Model Poisoning (RLHF)

Attack vector:

Flip <5% of preference labels during reward model training
Add trigger tokens to prompts
Force preferences where attacker content is marked "better"
Downstream PPO learns to output attacker content when trigger appears

Defense:

Monitor preference label distribution
Use canary triggers in evaluation
Audit reward model outputs for trigger-dependent behavior

Red-Team Checklist

- [ ] Inspect reward deltas per state for anomalies
- [ ] Test with canary trigger set (synthetic rare states/tokens)
- [ ] Verify each shared policy via rollouts before aggregation
- [ ] Monitor for trigger-dependent behavior changes
- [ ] Keep training data provenance and audit logs
- [ ] Use ensemble methods to detect outlier policies

Scripts Reference

`scripts/q_learning.py`

Implements Q-Learning with configurable exploration and learning parameters.

python scripts/q_learning.py \
  --env grid-world \
  --episodes 1000 \
  --alpha 0.1 \
  --gamma 0.99 \
  --epsilon-start 1.0 \
  --epsilon-end 0.1 \
  --epsilon-decay 0.995

`scripts/sarsa.py`

Implements SARSA with ε-greedy or softmax exploration.

python scripts/sarsa.py \
  --env grid-world \
  --episodes 1000 \
  --exploration softmax \
  --tau 0.5 \
  --alpha 0.1 \
  --gamma 0.99

`scripts/backdoor_detector.py`

Detects potential backdoor patterns in trained RL policies.

python scripts/backdoor_detector.py \
  --policy model.pkl \
  --canary-episodes 50 \
  --trigger-patterns triggers.json \
  --output report.json

Common Pitfalls

ε not decaying: Agent keeps exploring randomly, never converges
γ too high: Agent overvalues distant rewards, learning becomes unstable
α too high: Q-values oscillate, never settle
Insufficient exploration: Agent gets stuck in local optima
Ignoring security: Production RL systems can be backdoored during training

Debugging Tips

Log Q-table changes to see learning progress
Track average reward per episode - should trend upward
Visualize policy to see if it makes sense
Test with known solvable environments first
Use deterministic seeds for reproducibility

When to Use Each Algorithm

| Scenario | Recommended Algorithm | |----------|----------------------| | Simple tabular problems | Q-Learning | | Need stable learning | SARSA | | Large state space | Deep Q-Network (DQN) | | Continuous actions | PPO, SAC | | Multi-agent coordination | MADDPG, QMIX | | Safety-critical | Constrained RL with monitoring |

Next Steps

Start with scripts/q_learning.py on a simple grid world
Compare Q-Learning vs SARSA on the same problem
Add security monitoring with scripts/backdoor_detector.py
Scale to neural network function approximation for complex problems
Implement proper evaluation with hold-out test environments

abelrguezr/reinforcement-learning-algorithms

skills/AI/AI-Reinforcement-Learning-Algorithms/SKILL.md

Implement and analyze reinforcement learning algorithms (Q-Learning, SARSA, PPO) and detect security vulnerabilities like backdoor attacks and reward poisoning. Use this skill whenever you need to build RL agents, understand RL algorithms, implement training loops, or audit RL systems for security issues. Make sure to use this skill when the user mentions reinforcement learning, Q-learning, SARSA, RL training, agent training, policy learning, or any ML system that learns from rewards and environment interaction.

5 stars

development

Updated Apr 16, 2026

$ install --global

skillsauth

npx skillsauth add abelrguezr/hacktricks-skills reinforcement-learning-algorithms

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 16, 2026, 2:04 AM146.2s2 files scanned

SKILL.md

name:: reinforcement-learning-algorithms
description:: Implement and analyze reinforcement learning algorithms (Q-Learning, SARSA, PPO) and detect security vulnerabilities like backdoor attacks and reward poisoning. Use this skill whenever you need to build RL agents, understand RL algorithms, implement training loops, or audit RL systems for security issues. Make sure to use this skill when the user mentions reinforcement learning, Q-learning, SARSA, RL training, agent training, policy learning, or any ML system that learns from rewards and environment interaction.

Reinforcement Learning Algorithms

Quick Start

# Implement Q-Learning from scratch
python scripts/q_learning.py --env grid-world --episodes 1000

# Implement SARSA with softmax exploration
python scripts/sarsa.py --env grid-world --exploration softmax --tau 0.5

# Check for backdoor patterns in trained policy
python scripts/backdoor_detector.py --policy model.pkl --canary-episodes 50

Core Concepts

What is Reinforcement Learning?

Key components:

Agent: The learner/decision maker
Environment: The world the agent interacts with
State (s): Current situation
Action (a): What the agent does
Reward (r): Feedback signal
Policy: Strategy for choosing actions

When to Use RL

RL excels at sequential decision-making problems:

Robotics and control systems
Game playing
Autonomous systems
Resource allocation
Recommendation systems with long-term goals

Q-Learning Algorithm

Q-Learning is a model-free, off-policy algorithm that learns the value of actions in each state using a Q-table.

How It Works

Initialize Q-table with zeros or small random values
Select action using exploration strategy (ε-greedy)
Execute action, observe next state and reward

Update Q-value using Bellman equation:

Q(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))

Repeat until convergence

Parameters Explained

Key insight: Q-Learning uses max(Q(s', a')) - the best possible future value - regardless of what action the current policy would take. This makes it off-policy.

Implementation Tips

Start with ε = 1.0 and decay to 0.1 over training
Higher α = faster learning but potentially unstable
γ closer to 1 = agent values long-term rewards more
For large state spaces, use function approximation (neural networks)

SARSA Algorithm

SARSA (State-Action-Reward-State-Action) is a model-free, on-policy algorithm similar to Q-Learning but with a key difference in the update rule.

How It Works

Initialize Q-table
Select action using current policy (ε-greedy or softmax)
Execute action, observe next state and reward
Select next action a' using current policy

Update Q-value:

Q(s, a) = Q(s, a) + α * (r + γ * Q(s', a') - Q(s, a))

Repeat until convergence

Key Difference from Q-Learning

Action Selection Strategies

ε-Greedy

With probability ε: choose random action (explore)
With probability 1-ε: choose best-known action (exploit)

Softmax (Boltzmann)

Probability proportional to Q-value:

P(a|s) = exp(Q(s, a) / τ) / Σ(exp(Q(s, a') / τ))

τ (temperature) controls exploration:
- High τ = more uniform probabilities (explore)
- Low τ = favor high Q-values (exploit)

On-Policy vs Off-Policy

Security Considerations

RL systems are vulnerable to training-time attacks. Understanding these is critical for production deployments.

Training-Time Backdoors

How they work:

Attacker injects poisoned trajectories with trigger states
When trigger appears, agent performs attacker-chosen behavior
Clean performance remains normal, hiding the backdoor

Detection strategies:

Inspect reward deltas per state - abrupt local improvements are suspicious
Maintain a canary trigger set - hold-out episodes with rare states
Verify each policy independently before aggregation in multi-agent settings

Reward Model Poisoning (RLHF)

Attack vector:

Flip <5% of preference labels during reward model training
Add trigger tokens to prompts
Force preferences where attacker content is marked "better"
Downstream PPO learns to output attacker content when trigger appears

Defense:

Monitor preference label distribution
Use canary triggers in evaluation
Audit reward model outputs for trigger-dependent behavior

Red-Team Checklist

- [ ] Inspect reward deltas per state for anomalies
- [ ] Test with canary trigger set (synthetic rare states/tokens)
- [ ] Verify each shared policy via rollouts before aggregation
- [ ] Monitor for trigger-dependent behavior changes
- [ ] Keep training data provenance and audit logs
- [ ] Use ensemble methods to detect outlier policies

Scripts Reference

`scripts/q_learning.py`

Implements Q-Learning with configurable exploration and learning parameters.

python scripts/q_learning.py \
  --env grid-world \
  --episodes 1000 \
  --alpha 0.1 \
  --gamma 0.99 \
  --epsilon-start 1.0 \
  --epsilon-end 0.1 \
  --epsilon-decay 0.995

`scripts/sarsa.py`

Implements SARSA with ε-greedy or softmax exploration.

python scripts/sarsa.py \
  --env grid-world \
  --episodes 1000 \
  --exploration softmax \
  --tau 0.5 \
  --alpha 0.1 \
  --gamma 0.99

`scripts/backdoor_detector.py`

Detects potential backdoor patterns in trained RL policies.

python scripts/backdoor_detector.py \
  --policy model.pkl \
  --canary-episodes 50 \
  --trigger-patterns triggers.json \
  --output report.json

Common Pitfalls

ε not decaying: Agent keeps exploring randomly, never converges
γ too high: Agent overvalues distant rewards, learning becomes unstable
α too high: Q-values oscillate, never settle
Insufficient exploration: Agent gets stuck in local optima
Ignoring security: Production RL systems can be backdoored during training

Debugging Tips

Log Q-table changes to see learning progress
Track average reward per episode - should trend upward
Visualize policy to see if it makes sense
Test with known solvable environments first
Use deterministic seeds for reproducibility

When to Use Each Algorithm

Next Steps

Start with scripts/q_learning.py on a simple grid world
Compare Q-Learning vs SARSA on the same problem
Add security monitoring with scripts/backdoor_detector.py
Scale to neural network function approximation for complex problems
Implement proper evaluation with hold-out test environments

Related Skills

abelrguezr/house-of-lore-exploit

testing

VerifiedTrustedCommunity

How to perform a House of Lore (small bin attack) heap exploitation. Use this skill whenever the user mentions heap exploitation, small bin attacks, fake chunks, glibc heap vulnerabilities, or needs to insert fake chunks into small bins for arbitrary read/write. Trigger for CTF challenges involving heap corruption, glibc 2.31+ exploitation, or when the user needs to bypass malloc sanity checks using fake chunk linking.

5SKILL.mdUpdated Apr 16, 2026

abelrguezr/house-of-lore-exploit

abelrguezr/house-of-force-exploit

testing

VerifiedTrustedCommunity

How to perform House of Force heap exploitation attacks. Use this skill whenever the user mentions heap exploitation, House of Force, top chunk manipulation, arbitrary memory allocation, malloc manipulation, or wants to allocate chunks at specific addresses. Also trigger for CTF challenges involving heap overflows, top chunk size overwrites, or when the user needs to calculate evil_size for heap attacks. Make sure to use this skill for any binary exploitation task involving glibc heap manipulation, even if they don't explicitly say "House of Force".

5SKILL.mdUpdated Apr 16, 2026

abelrguezr/house-of-force-exploit

abelrguezr/house-of-einherjar

tools

VerifiedTrustedCommunity

How to perform House of Einherjar heap exploitation to allocate memory at arbitrary addresses. Use this skill whenever the user mentions heap exploitation, glibc heap attacks, arbitrary memory allocation, off-by-one overflow exploitation, tcache poisoning, fast bin attacks, or any CTF challenge involving heap manipulation. This is essential for binary exploitation tasks where you need to control malloc() return addresses.

5SKILL.mdUpdated Apr 16, 2026

abelrguezr/house-of-einherjar

abelrguezr/heap-overflow-exploitation

testing

VerifiedTrustedCommunity

How to identify, analyze, and exploit heap overflow vulnerabilities in binary exploitation challenges and real-world scenarios. Use this skill whenever the user mentions heap overflows, memory corruption, heap grooming, tcache poisoning, fast-bin attacks, or any heap-related vulnerability in CTF challenges, binary analysis, or security research. This skill covers heap overflow fundamentals, exploitation techniques, heap grooming strategies, and real-world CVE analysis.

5SKILL.mdUpdated Apr 16, 2026

abelrguezr/heap-overflow-exploitation

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/abelrguezr/hacktricks-skills.git

# Copy into Claude Code skills folder (global)
cp -r hacktricks-skills/skills/AI/AI-Reinforcement-Learning-Algorithms ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

abelrguezr/hacktricks-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT