skills/AI/AI-llm-architecture/4.-attention-mechanisms/SKILL.md
How to implement and understand attention mechanisms in neural networks and LLMs. Use this skill whenever the user needs to build self-attention layers, causal attention, multi-head attention, or understand how attention weights are calculated. Trigger this skill for any task involving attention scores, Q/K/V matrices, attention masking, or transformer architecture components.
This skill helps you implement and understand attention mechanisms used in neural networks and large language models (LLMs).
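Use this skill when you need to build self-attention, causal attention, or multi-head attention layers, or to understand how attention weights are calculated.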
Attention allows a model to focus on specific parts of the input when generating each output. It assigns different weights to different inputs based on their relevance.
Key components: the query, key, and value vectors (Q/K/V) used in the calculation steps that follow.
1. Calculate the dot product between the query and each key: `attention_score[i] = query · key[i]`. For embeddings, this is the sum of element-wise products.
2. Divide by the square root of the key dimension so the scores do not grow with the dimension and push softmax into saturation: `scaled_score = attention_score / sqrt(d_k)`
3. Normalize the scores with softmax to get weights that sum to 1: `attention_weight[i] = exp(scaled_score[i]) / sum(exp(scaled_scores))`
4. Take the weighted sum of the values using the attention weights: `context_vector = sum(attention_weight[i] * value[i])`
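The steps above can be sketched in plain Python for a single query. This is a minimal illustration rather than the skill's implementation: the helper name `attend` and the toy vectors are assumptions chosen for the example.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query (illustrative helper)."""
    d_k = len(query)
    # Step 1: dot product of the query with each key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: scale by sqrt(d_k)
    scaled = [s / math.sqrt(d_k) for s in scores]
    # Step 3: softmax (subtracting the max for numerical stability)
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 4: weighted sum of the value vectors
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy inputs (hypothetical values, not taken from the skill)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, context = attend([1.0, 0.0], keys, values)
print(weights, context)
```

The first and third keys have the same dot product with the query, so they receive equal weight, and the second key receives less.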
```python
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        # x shape: (batch, seq_len, d_in)
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Attention scores: (batch, seq_len, seq_len)
        attn_scores = queries @ keys.transpose(-2, -1)

        # Scale and softmax
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5,
            dim=-1
        )

        # Context vector: (batch, seq_len, d_out)
        context_vec = attn_weights @ values
        return context_vec
```
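To see why `forward` divides the scores by `keys.shape[-1]**0.5`, the following plain-Python sketch compares softmax over raw and scaled scores. The scores are made-up values of roughly the size one might see with d_k = 64, and `softmax` is a local helper written for this example, not a PyTorch call.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 8.0, 0.0]  # hypothetical unscaled dot products

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print("largest weight without scaling:", max(unscaled))
print("largest weight with scaling:   ", max(scaled))
```

Without scaling, nearly all of the weight lands on a single position, which leaves very small gradients for the others. With scaling, the distribution stays spread out.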
For autoregressive LLMs, mask the scores so that each token attends only to itself and earlier tokens, never to future ones:
```python
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)

        # Causal mask: 1s above the diagonal mark future positions.
        # They are filled with -inf in forward(), before softmax.
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(-2, -1)

        # Apply causal mask (truncated to the actual sequence length)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens],
            -torch.inf
        )

        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5,
            dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec
```
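The effect of masking before softmax can be checked on a small example without PyTorch. The sketch below uses a made-up 3x3 score matrix and a local `softmax` helper (both are assumptions for illustration) and contrasts masking before softmax with zeroing weights afterwards.

```python
import math

def softmax(xs):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores: row i is the query, column j is the key
scores = [[0.5, 1.0, 0.2],
          [0.3, 0.8, 0.6],
          [0.9, 0.1, 0.4]]

# Mask BEFORE softmax: future positions (j > i) become -inf
masked = [[s if j <= i else -math.inf for j, s in enumerate(row)]
          for i, row in enumerate(scores)]
weights_before = [softmax(row) for row in masked]

# Mask AFTER softmax: zero out future positions without renormalizing
weights_after = [[w if j <= i else 0.0 for j, w in enumerate(softmax(row))]
                 for i, row in enumerate(scores)]

for row in weights_before:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Masking before softmax keeps every row summing to 1 with zero weight on future tokens. Zeroing weights after softmax leaves rows that sum to less than 1 unless they are renormalized.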
Run multiple attention heads in parallel:
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)

        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split into heads: (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Scaled dot-product attention
        attn_scores = queries @ keys.transpose(-2, -1)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Combine heads
        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)

        return context_vec
```
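The split-and-combine reshaping in `forward` can be pictured on nested lists for a single sequence, with no batch dimension. This is a rough sketch: `split_heads` and `merge_heads` are hypothetical helpers that mimic what `view` plus `transpose(1, 2)` do to the data layout. They are not part of PyTorch or of the class above.

```python
def split_heads(x, num_heads):
    """(num_tokens, d_out) -> (num_heads, num_tokens, head_dim) on nested lists."""
    d_out = len(x[0])
    assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
    head_dim = d_out // num_heads
    # Head h receives columns [h * head_dim, (h + 1) * head_dim) of every token
    return [[tok[h * head_dim:(h + 1) * head_dim] for tok in x]
            for h in range(num_heads)]

def merge_heads(heads):
    """(num_heads, num_tokens, head_dim) -> (num_tokens, d_out)."""
    num_tokens = len(heads[0])
    # Concatenate each token's per-head slices back into one vector
    return [[v for head in heads for v in head[t]] for t in range(num_tokens)]

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # 3 tokens, d_out = 4
heads = split_heads(x, num_heads=2)
restored = merge_heads(heads)
print(heads)
print(restored == x)
```

Each head sees a contiguous `head_dim`-wide slice of every token's projected vector, and merging the heads restores the original `(num_tokens, d_out)` layout before `out_proj` is applied.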
For the sentence "Hello shiny sun!" with 3D embeddings:
| Word | Embedding |
|------|-----------|
| Hello | [0.34, 0.22, 0.54] |
| shiny | [0.53, 0.34, 0.98] |
| sun | [0.29, 0.54, 0.93] |
Compute attention for "shiny":
Attention scores (dot products with "shiny" as query):
Apply softmax to get weights:
Context vector (weighted sum):
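The calculation for "shiny" can be reproduced with a short plain-Python sketch. It assumes the raw embeddings serve directly as queries, keys, and values (no learned projections) and omits the sqrt(d_k) scaling. Both are simplifying assumptions for illustration, so the numbers change if scaling or projections are applied.

```python
import math

embeddings = {
    "Hello": [0.34, 0.22, 0.54],
    "shiny": [0.53, 0.34, 0.98],
    "sun":   [0.29, 0.54, 0.93],
}
query = embeddings["shiny"]

# Attention scores: dot product of the "shiny" query with every word
scores = {w: sum(q * k for q, k in zip(query, e)) for w, e in embeddings.items()}

# Softmax over the (unscaled) scores
m = max(scores.values())
exps = {w: math.exp(s - m) for w, s in scores.items()}
total = sum(exps.values())
weights = {w: e / total for w, e in exps.items()}

# Context vector: weighted sum of the embeddings used as values
context = [sum(weights[w] * embeddings[w][i] for w in embeddings) for i in range(3)]

for w in embeddings:
    print(w, round(scores[w], 4), round(weights[w], 4))
print([round(c, 4) for c in context])
```

Under these assumptions "shiny" attends most strongly to itself, then to "sun", then to "Hello".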
Solution: Check that you're scaling by sqrt(d_k). Without scaling, softmax saturates.
Solution: Ensure Q/K/V matrices are trainable parameters (use nn.Linear or nn.Parameter).
Solution: Verify causal mask is applied BEFORE softmax, not after.
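Solution: Remember the transpose pattern: `view` to (b, num_tokens, num_heads, head_dim), `transpose(1, 2)` to (b, num_heads, num_tokens, head_dim), compute attention, then `transpose(1, 2)` back and `view` to (b, num_tokens, d_out).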
Use the scripts/verify_attention.py script to: