skills/nlp/last-hidden-states-concat/SKILL.md
Concatenate the last two transformer hidden states along the feature dimension before the task head for richer token representations
npx skillsauth add wenmin-wu/ds-skills nlp-last-hidden-states-concat
Instead of using only the final hidden state from a transformer encoder, concatenate the last two (or more) hidden layers along the feature dimension. The second-to-last layer often captures different linguistic patterns than the final layer. Concatenating two layers doubles the representation size (n layers multiply it by n), but the encoder computes every layer anyway, so the only extra cost is a wider task head. The wider representation tends to improve span extraction and token classification tasks.
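To make the shapes concrete, here is a minimal sketch of the concatenation on dummy tensors. The tuple `hidden_states` is a stand-in for what an encoder returns, and the sizes (`batch`, `seq_len`, `hidden`) are arbitrary values chosen for illustration.

```python
import torch

batch, seq_len, hidden = 2, 5, 8

# Stand-in for an encoder's per-layer outputs (hypothetical: 5 random tensors).
hidden_states = tuple(torch.randn(batch, seq_len, hidden) for _ in range(5))

last_only = hidden_states[-1]                      # final layer only
last_two = torch.cat(hidden_states[-2:], dim=-1)   # last two layers, side by side

print(last_only.shape)  # torch.Size([2, 5, 8])
print(last_two.shape)   # torch.Size([2, 5, 16])

# The first half of each token vector is the second-to-last layer,
# the second half is the final layer.
assert torch.equal(last_two[..., :hidden], hidden_states[-2])
assert torch.equal(last_two[..., hidden:], hidden_states[-1])
```

The sequence length is unchanged; only the per-token feature width grows, so any head that consumes the result needs a matching input size.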
import torch
import torch.nn as nn
from transformers import AutoModel, AutoConfig


class SpanExtractor(nn.Module):
    def __init__(self, model_name, n_layers=2):
        super().__init__()
        config = AutoConfig.from_pretrained(model_name, output_hidden_states=True)
        self.encoder = AutoModel.from_pretrained(model_name, config=config)
        hidden_size = config.hidden_size * n_layers
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_size, 2)  # start + end logits
        nn.init.normal_(self.head.weight, std=0.02)
        self.n_layers = n_layers

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.encoder(
            input_ids, attention_mask=attention_mask,
            token_type_ids=token_type_ids)
        hidden_states = outputs.hidden_states
        # Concatenate last N layers
        cat = torch.cat(hidden_states[-self.n_layers:], dim=-1)
        logits = self.head(self.dropout(cat))
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
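The same forward pass can be checked end to end without downloading weights. The sketch below is an illustration under assumptions: it swaps the pretrained model for a tiny, randomly initialised `BertModel`, and the config values (`vocab_size=100`, `hidden_size=32`, four layers) are made up for the example. It requests hidden states at call time rather than through the config.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

# Tiny random encoder (assumed sizes, for illustration only; no pretrained weights).
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=4,
    num_attention_heads=4, intermediate_size=64)
encoder = BertModel(config).eval()

n_layers = 2
head = nn.Linear(config.hidden_size * n_layers, 2)  # start + end logits

input_ids = torch.randint(0, config.vocab_size, (3, 10))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = encoder(
        input_ids, attention_mask=attention_mask, output_hidden_states=True)
    # Tuple of num_hidden_layers + 1 tensors: embedding output, then each layer.
    hidden_states = outputs.hidden_states
    cat = torch.cat(hidden_states[-n_layers:], dim=-1)
    start_logits, end_logits = head(cat).split(1, dim=-1)
    start_logits = start_logits.squeeze(-1)
    end_logits = end_logits.squeeze(-1)

print(len(hidden_states))  # 5
print(cat.shape)           # torch.Size([3, 10, 64])
print(start_logits.shape)  # torch.Size([3, 10])
```

Because the tuple also contains the embedding output, `n_layers` can be at most `num_hidden_layers + 1`, and the largest setting would mix the raw embeddings into the representation.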
The head's input size is hidden_size * n_layers. See nlp-weighted-layer-pooling for learned weights instead of concat.