skills/nlp/attention-head-pooling/SKILL.md
Learns attention weights over token positions to compute a weighted average of hidden states for sequence representation.
npx skillsauth add wenmin-wu/ds-skills nlp-attention-head-pooling
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Instead of using the CLS token or mean pooling, add a learnable attention layer that scores each token position and computes a weighted sum of the hidden states, with weights given by a softmax over the scores. This lets the model focus on the most informative tokens in the sequence, e.g., complex vocabulary in readability tasks or key entities in classification.
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Small MLP that maps each token's hidden state to a scalar score
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states, attention_mask=None):
        # hidden_states: (B, S, H); attention_mask: (B, S), 0 marks padding
        weights = self.attention(hidden_states).squeeze(-1)  # (B, S)
        if attention_mask is not None:
            # Most negative finite value for the dtype; a literal -1e9 overflows in float16
            weights = weights.masked_fill(attention_mask == 0, torch.finfo(weights.dtype).min)
        weights = torch.softmax(weights, dim=1)  # (B, S)
        return torch.sum(hidden_states * weights.unsqueeze(-1), dim=1)  # (B, H)
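To see what the masking and softmax steps do, here is a minimal sketch of the same pooling arithmetic written as a standalone function. The function name `masked_attention_pool`, the toy shapes, and the hand-supplied scores are assumptions made for illustration; in the module above the scores come from the learned MLP instead.

```python
import torch


def masked_attention_pool(hidden_states, scores, attention_mask):
    """Illustrative helper (not part of the skill): pool with given scores.

    hidden_states: (B, S, H), scores: (B, S), attention_mask: (B, S), 1 = real token.
    """
    # Push padded positions to the most negative finite value before softmax
    scores = scores.masked_fill(attention_mask == 0, torch.finfo(scores.dtype).min)
    weights = torch.softmax(scores, dim=1)  # (B, S), each row sums to 1
    pooled = torch.sum(hidden_states * weights.unsqueeze(-1), dim=1)  # (B, H)
    return pooled, weights


torch.manual_seed(0)
hidden_states = torch.randn(2, 4, 8)  # toy batch: B=2, S=4, H=8
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

# Random scores: padded positions receive zero weight
scores = torch.randn(2, 4)
pooled, weights = masked_attention_pool(hidden_states, scores, attention_mask)

# Equal scores: the weighted sum reduces to masked mean pooling
uniform_pooled, _ = masked_attention_pool(hidden_states, torch.zeros(2, 4), attention_mask)
mask = attention_mask.unsqueeze(-1).float()
masked_mean = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
```

With equal scores the result matches masked mean pooling, so mean pooling can be read as the special case in which the attention layer has learned nothing; training moves the weights away from uniform toward the tokens that help the task.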