skills/nlp/learned-attention-pooling/SKILL.md
Replace mean pooling with a trainable attention network (Linear-Tanh-Linear-Softmax) that learns token importance weights over transformer hidden states
npx skillsauth add wenmin-wu/ds-skills nlp-learned-attention-pooling
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
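Mean pooling weights every token equally, which can dilute the signal from the few tokens that matter most for the task. A lightweight trainable attention layer learns those weights instead: it computes a scalar score per token via Linear → Tanh → Linear, normalizes the scores with a softmax over the sequence, and returns the weighted sum of the hidden states. With attn_size=512 the layer adds 512 × hidden_size + 1,025 parameters (about 394K for hidden_size=768, about 525K for hidden_size=1024), which is small next to the encoder, and it can improve results on regression and classification tasks compared with mean pooling.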
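The arithmetic of the pooling step can be sketched without any framework. This is a minimal illustration, not the skill's implementation: the function name `attention_pool` is hypothetical, the raw `scores` stand in for the output of the learned Linear → Tanh → Linear network, and the sketch assumes at least one unmasked token per sequence.

```python
import math


def attention_pool(hidden_states, scores, mask=None):
    """Pool one sequence: softmax over per-token scores, then a weighted sum.

    hidden_states: list of token vectors, shape (seq_len, hidden_size)
    scores: one raw scalar per token (stand-in for the learned network's output)
    mask: optional list of 1 (real token) / 0 (padding)
    """
    if mask is None:
        mask = [1] * len(scores)
    # Numerically stable softmax restricted to unmasked tokens.
    top = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of token vectors, one output value per hidden dimension.
    hidden_size = len(hidden_states[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, hidden_states))
        for d in range(hidden_size)
    ]
    return pooled, weights


# Third token is padding: its large score must not influence the result.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pooled, weights = attention_pool(tokens, scores=[2.0, 0.0, 9.0], mask=[1, 1, 0])

# Equal scores reduce to plain mean pooling.
mean_like, _ = attention_pool([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
```

When every score is equal the weights are uniform, so mean pooling is the special case the learned layer starts to depart from as training separates the scores.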
import torch.nn as nn
from transformers import AutoModel


class AttentionPooling(nn.Module):
    def __init__(self, hidden_size, attn_size=512):
        super().__init__()
        # Scores each token with a scalar, then normalizes over the sequence axis.
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, attn_size),
            nn.Tanh(),
            nn.Linear(attn_size, 1),
            nn.Softmax(dim=1),
        )

    def forward(self, hidden_states, attention_mask=None):
        # hidden_states: (batch, seq_len, hidden_size)
        weights = self.attention(hidden_states)  # (batch, seq_len, 1)
        if attention_mask is not None:
            # Zero out padding, then renormalize so the weights sum to 1 per sequence.
            mask = attention_mask.unsqueeze(-1).to(weights.dtype)
            weights = weights * mask
            weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)
        return (weights * hidden_states).sum(dim=1)  # (batch, hidden_size)


class Model(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        h = self.encoder.config.hidden_size
        self.pool = AttentionPooling(h)
        self.head = nn.Linear(h, 1)  # single output, e.g. a regression target

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state
        pooled = self.pool(hidden, inputs.get('attention_mask'))
        return self.head(pooled)
- AttentionPooling module
- nn.init.xavier_uniform_ on linear layers for stable training
- nlp-attention-head-pooling for multi-head variant
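The Xavier initialization tip could be applied roughly as follows. This is a sketch under stated assumptions: the helper name `init_attention_weights` is hypothetical, zeroing the biases is an assumption the source does not specify, and the compact `AttentionPooling` copy is included only so the snippet runs on its own with random tensors in place of encoder output.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Compact copy of the pooling module, repeated here for self-containment."""

    def __init__(self, hidden_size, attn_size=512):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, attn_size),
            nn.Tanh(),
            nn.Linear(attn_size, 1),
            nn.Softmax(dim=1),
        )

    def forward(self, hidden_states, attention_mask=None):
        weights = self.attention(hidden_states)  # (batch, seq_len, 1)
        if attention_mask is not None:
            weights = weights * attention_mask.unsqueeze(-1).to(weights.dtype)
            weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)
        return (weights * hidden_states).sum(dim=1)


def init_attention_weights(module):  # hypothetical helper name
    # Xavier-uniform weights on every linear layer; zero biases (assumption).
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


torch.manual_seed(0)
pool = AttentionPooling(hidden_size=768)
pool.apply(init_attention_weights)  # Module.apply visits every submodule

# Random stand-in for encoder output: 2 sequences, 5 tokens, hidden size 768.
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = pool(hidden, mask)  # (2, 768)
```

Only the pooling layer is re-initialized here; the pretrained encoder weights would normally be left untouched.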