abelrguezr/attention-mechanisms

Name: attention-mechanisms
Author: abelrguezr

skills/AI/AI-llm-architecture/4.-attention-mechanisms/SKILL.md

npx skillsauth add abelrguezr/hacktricks-skills attention-mechanisms

Attention Mechanisms Skill

This skill helps you implement and understand attention mechanisms used in neural networks and large language models (LLMs).

What This Skill Covers

Self-attention: Computing attention weights between tokens in a sequence
Scaled dot-product attention: Using Q/K/V matrices with proper scaling
Causal attention: Masking future tokens for autoregressive generation
Multi-head attention: Running multiple attention heads in parallel
Manual calculations: Step-by-step attention weight computation

When to Use This Skill

Use this skill when you need to:

Implement attention layers from scratch in PyTorch or similar frameworks
Debug or visualize attention patterns in a model
Understand how attention weights are calculated
Build transformer components (encoder/decoder layers)
Explain attention mechanisms to others
Convert between manual calculations and code implementations

Core Concepts

Attention Mechanism Overview

Attention allows a model to focus on specific parts of the input when generating each output. It assigns different weights to different inputs based on their relevance.

Key components:

Query (Q): What we're looking for
Key (K): What each position contains
Value (V): What each position contributes
Attention weights: How much to attend to each position

Step-by-Step Attention Calculation

Step 1: Compute Attention Scores

Calculate the dot product between the query and each key:

attention_score[i] = query · key[i]

For embeddings, this is the sum of element-wise products.

Step 2: Scale the Scores

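Divide each score by the square root of the key dimension d_k. The magnitude of dot products tends to grow with d_k, and large scores push softmax toward a near one-hot output with very small gradients:

scaled_score[i] = attention_score[i] / sqrt(d_k)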
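
A common justification for the sqrt(d_k) factor, sketched under an assumption this skill does not state: the components of the query q and key k are independent with zero mean and unit variance. Each product q_j k_j then has variance 1, so:

```latex
\[
\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\Big(\sum_{j=1}^{d_k} q_j k_j\Big)
  = \sum_{j=1}^{d_k} \operatorname{Var}(q_j k_j)
  = d_k
\quad\Longrightarrow\quad
\operatorname{Var}\Big(\frac{q \cdot k}{\sqrt{d_k}}\Big) = 1
\]
```

Under that assumption, dividing by sqrt(d_k) keeps the spread of the scores roughly constant as the key dimension changes.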

Step 3: Apply Softmax

Normalize scores to get weights that sum to 1:

attention_weight[i] = exp(scaled_score[i]) / sum_j exp(scaled_score[j])
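
A minimal sketch of this step in plain Python. Subtracting the maximum score is an addition not mentioned above: it leaves the result mathematically unchanged but avoids overflow for large scores. The function name `softmax` is illustrative.

```python
import math

def softmax(scores):
    """Return softmax weights for a list of scores, shifted by the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A naive math.exp(1000.0) would raise OverflowError; the shifted version is fine.
weights = softmax([1000.0, 1001.0, 1002.0])
print(weights, sum(weights))
```

Because only the differences between scores matter, the shifted and unshifted versions give the same weights whenever the unshifted one can be computed.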

Step 4: Compute Context Vector

Take the weighted sum of the values, using the attention weights:

context_vector = sum(attention_weight[i] * value[i])
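
Stacking the queries, keys, and values of all positions as the rows of matrices Q, K, and V (a notation assumed here, following the Q/K/V naming above), the four steps collapse into the usual single expression:

```latex
\[
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

Softmax is applied row by row, so row i of the result is the context vector for position i.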

Implementation Patterns

Basic Self-Attention (PyTorch)

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        # x shape: (batch, seq_len, d_in)
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Attention scores: (batch, seq_len, seq_len)
        attn_scores = queries @ keys.transpose(-2, -1)
        
        # Scale and softmax
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, 
            dim=-1
        )

        # Context vector: (batch, seq_len, d_out)
        context_vec = attn_weights @ values
        return context_vec

Causal Attention (Masked)

For autoregressive LLMs, mask the attention scores so that each token attends only to itself and earlier tokens:

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        
        # Causal mask: 1s above the diagonal mark future positions (filled with -inf in forward)
        self.register_buffer(
            'mask', 
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(-2, -1)
        
        # Apply causal mask
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], 
            -torch.inf
        )
        
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, 
            dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec
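
To make the masking concrete without PyTorch, here is a small pure-Python sketch that applies the same idea to a hand-written 3×3 score matrix. The helper `causal_weights` and the score values are hypothetical; scaling and dropout are left out to keep the focus on the mask.

```python
import math

def causal_weights(scores):
    """Row-wise softmax over a square score matrix, ignoring future positions (j > i)."""
    weights = []
    for i, row in enumerate(scores):
        # Same effect as masked_fill_ with -inf: future columns get exp(-inf) = 0.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [
    [0.5, 0.1, 0.3],
    [0.2, 0.9, 0.4],
    [0.7, 0.6, 0.8],
]
for row in causal_weights(scores):
    print([round(w, 3) for w in row])
```

The printed matrix is lower triangular: the first token can only attend to itself, and every row still sums to 1.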

Multi-Head Attention

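Run multiple attention heads in parallel. The d_out features are split into num_heads heads of size head_dim = d_out // num_heads, each head attends independently, and the results are concatenated and passed through an output projection: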

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split into heads: (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Scaled dot-product attention
        attn_scores = queries @ keys.transpose(-2, -1)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Combine heads
        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)

        return context_vec
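
The view and transpose calls can be hard to picture. The sketch below shows, for a single token and in plain Python, what happens to the feature dimension: it is cut into contiguous chunks of head_dim and later concatenated back in the same order. `split_heads` and `merge_heads` are hypothetical helpers, not part of the class above.

```python
def split_heads(vector, num_heads):
    """Split one token's d_out-dimensional vector into num_heads chunks of head_dim."""
    assert len(vector) % num_heads == 0, "d_out must be divisible by num_heads"
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate per-head vectors back into a single d_out-dimensional vector."""
    return [value for head in heads for value in head]

token = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # d_out = 6
heads = split_heads(token, num_heads=2)  # head_dim = 3
print(heads)               # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
print(merge_heads(heads))  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```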

Manual Calculation Example

For the sentence "Hello shiny sun!" with 3D embeddings:

| Word | Embedding |
|------|-----------|
| Hello | [0.34, 0.22, 0.54] |
| shiny | [0.53, 0.34, 0.98] |
| sun | [0.29, 0.54, 0.93] |

Compute attention for "shiny". To keep the arithmetic short, this example uses the raw embeddings as queries, keys, and values (no learned projections) and skips the sqrt(d_k) scaling.

Attention scores (dot products with "shiny" as query):
- Hello: 0.34×0.53 + 0.22×0.34 + 0.54×0.98 = 0.7842
- shiny: 0.53×0.53 + 0.34×0.34 + 0.98×0.98 = 1.3569
- sun: 0.29×0.53 + 0.54×0.34 + 0.93×0.98 = 1.2487

Apply softmax to get weights:
- exp(0.7842) = 2.1907
- exp(1.3569) = 3.8841
- exp(1.2487) = 3.4858
- Sum = 9.5606
- Weights: [0.2291, 0.4063, 0.3646]

Context vector (weighted sum):
- = 0.2291×[0.34, 0.22, 0.54] + 0.4063×[0.53, 0.34, 0.98] + 0.3646×[0.29, 0.54, 0.93]
- = [0.399, 0.385, 0.861]
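
Hand arithmetic like this is easy to get wrong, so here is a short script that recomputes the example directly from the embeddings in the table. It assumes the same simplifications as the example: raw embeddings serve as queries, keys, and values, and no scaling is applied. The helper `dot` is illustrative.

```python
import math

embeddings = {
    "Hello": [0.34, 0.22, 0.54],
    "shiny": [0.53, 0.34, 0.98],
    "sun":   [0.29, 0.54, 0.93],
}

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query = embeddings["shiny"]
scores = [dot(query, vec) for vec in embeddings.values()]

# Unscaled softmax, matching the simplified example above.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Weighted sum of the embeddings, one dimension at a time.
context = [
    sum(w * vec[d] for w, vec in zip(weights, embeddings.values()))
    for d in range(3)
]

print("scores: ", [round(s, 4) for s in scores])
print("weights:", [round(w, 4) for w in weights])
print("context:", [round(c, 3) for c in context])
```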

Common Issues and Solutions

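Issue: Attention weights collapse onto one position

Solution: Check that you're scaling by sqrt(d_k). Without scaling, large dot products saturate softmax: one weight approaches 1, the rest approach 0, and gradients become very small. Weights that are all nearly equal point to the opposite problem, scores that are too small (for example, dividing by d_k instead of sqrt(d_k)).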
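
A quick numerical check of what the scaling changes. The raw scores and d_k = 64 are made-up illustrative values, not taken from a real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 16.0, 24.0]  # illustrative dot products (assumption)

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print([round(w, 4) for w in unscaled])  # nearly one-hot
print([round(w, 4) for w in scaled])    # noticeably smoother
```

With the same inputs, the unscaled weights put almost everything on the last position, while the scaled weights keep meaningful mass on all three.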

Issue: Model can't learn

Solution: Ensure Q/K/V matrices are trainable parameters (use nn.Linear or nn.Parameter).

Issue: Future tokens leaking in

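Solution: Verify the causal mask is applied BEFORE softmax, by setting future scores to -inf. Zeroing weights after softmax without renormalizing leaves rows that no longer sum to 1 and that still depend on future scores through the softmax denominator. Also check that the mask is sliced to the current sequence length ([:num_tokens, :num_tokens]).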
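
The difference between the two orderings shows up even on a single row. This is a plain-Python sketch with made-up scores; `visible` is a hypothetical per-position flag standing in for the mask.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row = [0.2, 0.9, 0.4]            # scores for token 0; positions 1 and 2 are in the future
visible = [True, False, False]

# Mask BEFORE softmax: future scores become -inf and get exactly zero weight.
before = softmax([s if keep else -math.inf for s, keep in zip(row, visible)])

# Mask AFTER softmax without renormalizing: the row no longer sums to 1, and the
# surviving weight still depends on the future scores through the denominator.
after = [w if keep else 0.0 for w, keep in zip(softmax(row), visible)]

print(before, sum(before))
print(after, sum(after))
```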

Issue: Shape mismatches

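Solution: Track the shape after each step. For a single head:

After Q @ K.T: (batch, seq_len, seq_len)
After softmax: (batch, seq_len, seq_len)
After weights @ V: (batch, seq_len, d_out)

For multi-head attention, scores and weights are (batch, num_heads, seq_len, seq_len), and weights @ V is (batch, num_heads, seq_len, head_dim) until the heads are merged back to (batch, seq_len, d_out).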

Testing Your Implementation

Use the scripts/verify_attention.py script to:

Verify attention weights sum to 1
Check causal masking works correctly
Validate multi-head attention shapes
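
The contents of scripts/verify_attention.py are not shown here, so the following is only a sketch of what the first two checks could look like. The helpers `rows_sum_to_one` and `is_causal` are hypothetical and operate on a plain nested list of weights for one sequence.

```python
def rows_sum_to_one(weights, tol=1e-6):
    """True if every row of a 2-D weight matrix sums to 1 within tol."""
    return all(abs(sum(row) - 1.0) <= tol for row in weights)

def is_causal(weights, tol=0.0):
    """True if no position puts weight on a later position (upper triangle is zero)."""
    return all(
        abs(w) <= tol
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if j > i
    )

good = [[1.0, 0.0, 0.0], [0.4, 0.6, 0.0], [0.2, 0.3, 0.5]]
leaky = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.2, 0.3, 0.5]]

print(rows_sum_to_one(good), is_causal(good))    # True True
print(rows_sum_to_one(leaky), is_causal(leaky))  # True False
```

The classes above return only the context vector, so using checks like these against them would require exposing the attention weights first. Dropout should also be disabled (dropout=0.0 or eval mode), because active dropout zeroes and rescales weights so that rows no longer sum to 1.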

References

Build a Large Language Model from Scratch
LLMs from Scratch (GitHub)
PyTorch MultiheadAttention

How to implement and understand attention mechanisms in neural networks and LLMs. Use this skill whenever the user needs to build self-attention layers, causal attention, multi-head attention, or understand how attention weights are calculated. Trigger this skill for any task involving attention scores, Q/K/V matrices, attention masking, or transformer architecture components.

5 stars

development

Updated Apr 16, 2026

npx skillsauth add abelrguezr/hacktricks-skills attention-mechanisms

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

- Clean (95%): Trivy, container and dependency vulnerability scanner
- Clean (95%): Semgrep, static code analysis for vulnerabilities
- Clean (95%): mcp-scan (Snyk), Model Context Protocol security validation
- Skipped (50%): Snyk (dep), open source security scanning
- Skipped (50%): Socket.dev, supply chain security analysis
- Skipped (50%): VirusTotal, multi-engine malware detection
- Skipped (50%): CrowdStrike, advanced threat intelligence
- Skipped (50%): OSV-Scanner, Open Source Vulnerability database check
- Skipped (50%): OWASP Dep-Check

Last scanned: Apr 16, 2026, 2:06 AM (132.2s, 2 files scanned)

Install manually

# Clone the repo
git clone https://github.com/abelrguezr/hacktricks-skills.git

# Copy into Claude Code skills folder (global)
cp -r hacktricks-skills/skills/AI/AI-llm-architecture/4.-attention-mechanisms ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

abelrguezr/hacktricks-skills

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT