skills/AI/AI-Reinforcement-Learning-Algorithms/SKILL.md
Implement and analyze reinforcement learning algorithms (Q-Learning, SARSA, PPO) and detect security vulnerabilities like backdoor attacks and reward poisoning. Use this skill whenever you need to build RL agents, understand RL algorithms, implement training loops, or audit RL systems for security issues. Make sure to use this skill when the user mentions reinforcement learning, Q-learning, SARSA, RL training, agent training, policy learning, or any ML system that learns from rewards and environment interaction.
This skill helps you implement, understand, and secure reinforcement learning systems. It covers core algorithms (Q-Learning, SARSA), their differences, and critical security considerations for production RL systems.
# Implement Q-Learning from scratch
python scripts/q_learning.py --env grid-world --episodes 1000
# Implement SARSA with softmax exploration
python scripts/sarsa.py --env grid-world --exploration softmax --tau 0.5
# Check for backdoor patterns in trained policy
python scripts/backdoor_detector.py --policy model.pkl --canary-episodes 50
Reinforcement learning (RL) is a branch of machine learning in which an agent learns to make decisions by interacting with an environment. The agent receives feedback as rewards or penalties and uses it to improve its behavior over time.
Key components:
RL excels at sequential decision-making problems:
Q-Learning is a model-free, off-policy algorithm that learns the value of actions in each state using a Q-table.
Q(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))
| Parameter | Symbol | Range | Purpose |
|-----------|--------|-------|---------|
| Learning rate | α | 0 < α ≤ 1 | How much new info overrides old |
| Discount factor | γ | 0 ≤ γ < 1 | Importance of future rewards |
| Exploration rate | ε | 0 ≤ ε ≤ 1 | Probability of random action |
Key insight: Q-Learning bootstraps from max(Q(s', a')), the highest estimated value over all actions a' in the next state, regardless of which action the agent actually takes there. This makes it off-policy.
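The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of `scripts/q_learning.py`: the dictionary-based Q-table, the function names, the `done` flag, and the toy states and actions are all assumptions made for the example.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-Learning update in place and return the new Q(s, a)."""
    # Off-policy target: best estimated value in s_next, no matter which
    # action the agent ends up taking there. Terminal states contribute 0.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]


# Hypothetical two-action problem; unseen (state, action) pairs start at 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

new_value = q_learning_update(Q, "s0", "right", r=1.0,
                              s_next="s1", actions=actions)
greedy_action = epsilon_greedy(Q, "s1", actions, epsilon=0.0)
```

With these numbers the target is 1.0 + 0.99 * 2.0 = 2.98, so Q("s0", "right") moves from 0 to 0.1 * 2.98 = 0.298.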
SARSA (State-Action-Reward-State-Action) is a model-free, on-policy algorithm similar to Q-Learning but with a key difference in the update rule.
Q(s, a) = Q(s, a) + α * (r + γ * Q(s', a') - Q(s, a))
SARSA uses Q(s', a'), the value of the action actually taken in the next state, rather than the maximum. This makes it on-policy: it evaluates the same policy, exploration included, that generates the experience.
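A minimal sketch of the SARSA update follows, assuming a dictionary-based Q-table and hypothetical state and action names. It is not the contents of `scripts/sarsa.py`. The only difference from a Q-Learning update is that the caller passes in `a_next`, the action the policy really selected in `s_next`.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.99, done=False):
    """Apply one SARSA update in place and return the new Q(s, a)."""
    # On-policy target: value of the action actually chosen in s_next.
    next_value = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
Q[("s1", "left")] = 0.5
Q[("s1", "right")] = 2.0

# Suppose the exploring policy picked "left" in s1, not the greedy "right".
value = sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="left")
```

Here the target is 1.0 + 0.99 * 0.5 = 1.495, so the updated value is 0.1495. A max-based target on the same table would have used 2.0 instead of 0.5, which is why exploratory moves pull SARSA's estimates down near risky states.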
P(a|s) = exp(Q(s, a) / τ) / Σ(exp(Q(s, a') / τ))
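The softmax formula can be turned into an action sampler along these lines. This is a sketch under assumptions: the function names are hypothetical, Q-values for one state are passed as a plain list, and the maximum is subtracted before exponentiating, which leaves the probabilities unchanged but avoids overflow for large Q/τ.

```python
import math
import random


def softmax_probs(q_values, tau=0.5):
    """Return P(a|s) for each action given its Q-value and temperature tau."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [q / tau for q in q_values]
    shift = max(scaled)  # subtracting the max does not change the ratios
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, tau=0.5, rng=random):
    """Sample an action index according to the softmax distribution."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


cold = softmax_probs([1.0, 2.0], tau=0.5)  # low temperature: closer to greedy
warm = softmax_probs([1.0, 2.0], tau=5.0)  # high temperature: closer to uniform
action = sample_action([1.0, 2.0], tau=0.5)
```

Lower τ concentrates probability on the highest-valued action, while higher τ flattens the distribution toward uniform exploration.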
| Aspect | On-Policy (SARSA) | Off-Policy (Q-Learning) |
|--------|-------------------|-------------------------|
| Update uses | Actual next action | Best possible action |
| Stability | More stable in some environments | Can be more aggressive |
| Convergence | May be slower | Often faster |
| Use case | When policy matters during learning | When you want optimal policy |
RL systems are vulnerable to training-time attacks. Understanding these is critical for production deployments.
How they work:
Detection strategies:
Attack vector:
Defense:
- [ ] Inspect reward deltas per state for anomalies
- [ ] Test with canary trigger set (synthetic rare states/tokens)
- [ ] Verify each shared policy via rollouts before aggregation
- [ ] Monitor for trigger-dependent behavior changes
- [ ] Keep training data provenance and audit logs
- [ ] Use ensemble methods to detect outlier policies
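The first checklist item, inspecting reward deltas per state, could be approached as in the sketch below. It assumes you have a trusted baseline of expected rewards per state and a log of `(state, reward)` pairs from training; the function name, data layout, and threshold are hypothetical and would need tuning for a real environment.

```python
from collections import defaultdict
from statistics import mean


def flag_reward_anomalies(logged, baseline, threshold=0.5):
    """Return {state: mean_delta} for states whose logged rewards drift
    from the trusted baseline by more than `threshold`.

    States with no baseline entry are flagged with None so they get
    a manual review instead of being silently ignored.
    """
    per_state = defaultdict(list)
    for state, reward in logged:
        per_state[state].append(reward)

    flagged = {}
    for state, rewards in per_state.items():
        if state not in baseline:
            flagged[state] = None
            continue
        delta = mean(rewards) - baseline[state]
        if abs(delta) > threshold:
            flagged[state] = delta
    return flagged


# Hypothetical data: rewards for s1 were inflated in two logged transitions.
baseline = {"s0": 0.0, "s1": 1.0}
logged = [("s0", 0.0), ("s0", 0.1), ("s1", 1.0), ("s1", 5.0), ("s1", 5.0)]
flagged = flag_reward_anomalies(logged, baseline)
```

A mean-based delta is easy to reason about but can be masked by a small number of poisoned samples in a frequently visited state, so a per-sample or quantile-based check may be a better fit when logs are large.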
scripts/q_learning.py

Implements Q-Learning with configurable exploration and learning parameters.
python scripts/q_learning.py \
--env grid-world \
--episodes 1000 \
--alpha 0.1 \
--gamma 0.99 \
--epsilon-start 1.0 \
--epsilon-end 0.1 \
--epsilon-decay 0.995
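The three epsilon flags suggest a decaying exploration schedule. The source does not show how the script combines them, so the sketch below assumes one common convention: multiply epsilon by the decay factor once per episode and never let it fall below the end value. The function name is hypothetical.

```python
def epsilon_schedule(episode, start=1.0, end=0.1, decay=0.995):
    """Epsilon for a zero-based episode index under multiplicative decay,
    floored at `end`. Assumed convention, not confirmed by the script."""
    return max(end, start * decay ** episode)


# Find the first episode at which the floor is reached with these settings.
first_floor_episode = next(
    ep for ep in range(1000) if epsilon_schedule(ep) == 0.1
)
```

Under this assumption, the values above reach the 0.1 floor at episode 460, so a bit more than half of a 1000-episode run would be spent at the minimum exploration rate.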
scripts/sarsa.py

Implements SARSA with ε-greedy or softmax exploration.
python scripts/sarsa.py \
--env grid-world \
--episodes 1000 \
--exploration softmax \
--tau 0.5 \
--alpha 0.1 \
--gamma 0.99
scripts/backdoor_detector.py

Detects potential backdoor patterns in trained RL policies.
python scripts/backdoor_detector.py \
--policy model.pkl \
--canary-episodes 50 \
--trigger-patterns triggers.json \
--output report.json
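The idea behind a canary trigger test can be sketched as follows. This does not describe the internals of `scripts/backdoor_detector.py` or the format of `triggers.json`, neither of which is shown here. It assumes the policy is a callable from state to action and that a trigger is a function that stamps a pattern onto a clean state; all names and the toy policy are hypothetical.

```python
def trigger_flip_rate(policy, clean_states, apply_trigger):
    """Fraction of states where stamping the trigger changes the action."""
    if not clean_states:
        raise ValueError("clean_states must not be empty")
    flips = sum(policy(s) != policy(apply_trigger(s)) for s in clean_states)
    return flips / len(clean_states)


def toy_policy(state):
    """Toy backdoored policy: a last feature of 9 forces the action 'left'."""
    if state[-1] == 9:
        return "left"  # hidden trigger-dependent behavior
    return "right" if sum(state) > 0 else "left"


clean_states = [(1, 2, 0), (3, 1, 0), (-5, 1, 0), (2, 2, 0)]


def benign(state):
    """Harmless perturbation of the last feature."""
    return state[:-1] + (1,)


def suspicious(state):
    """Candidate trigger pattern under test."""
    return state[:-1] + (9,)


benign_rate = trigger_flip_rate(toy_policy, clean_states, benign)
suspicious_rate = trigger_flip_rate(toy_policy, clean_states, suspicious)
```

A perturbation that barely changes the state but flips a large share of actions, as the second pattern does here, is a signal worth investigating. A high flip rate alone is not proof of a backdoor, since some features legitimately dominate the decision.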
| Scenario | Recommended Algorithm |
|----------|----------------------|
| Simple tabular problems | Q-Learning |
| Need stable learning | SARSA |
| Large state space | Deep Q-Network (DQN) |
| Continuous actions | PPO, SAC |
| Multi-agent coordination | MADDPG, QMIX |
| Safety-critical | Constrained RL with monitoring |
- scripts/q_learning.py on a simple grid world
- scripts/backdoor_detector.py