skills/nlp/lstm-over-transformer-layers/SKILL.md
Feeds CLS token embeddings from each transformer layer into a BiLSTM to learn an optimal combination across layer depth.
npx skillsauth add wenmin-wu/ds-skills nlp-lstm-over-transformer-layers
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Different transformer layers capture different levels of abstraction. Instead of using only the last layer or a weighted sum, stack the CLS token embeddings from all layers into a sequence (layer 1 → layer N) and pass that sequence through a BiLSTM. The LSTM learns which layers matter and how to combine them, and the resulting representation often outperforms static pooling strategies.
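For contrast, the two static strategies mentioned above (last-layer CLS and a weighted sum over layers) can be sketched as follows. This is a minimal illustration, not part of the skill itself: `WeightedLayerPooling`, the softmax parameterisation of the weights, and the dummy tensor sizes are all assumptions, and random tensors stand in for real transformer outputs.

```python
import torch
import torch.nn as nn


class WeightedLayerPooling(nn.Module):
    """Hypothetical baseline: softmax-weighted sum of per-layer CLS embeddings."""

    def __init__(self, num_layers):
        super().__init__()
        # One learnable scalar per layer; zeros give equal weights at initialisation.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, all_hidden_states):
        # Index 0 is assumed to be the embedding output, so it is skipped.
        cls_per_layer = torch.stack(
            [h[:, 0] for h in all_hidden_states[1:]], dim=1
        )  # (batch, num_layers, hidden_size)
        weights = torch.softmax(self.layer_logits, dim=0)  # (num_layers,)
        return (weights[None, :, None] * cls_per_layer).sum(dim=1)


# Dummy stand-in for `outputs.hidden_states`: embeddings + 4 layers (assumed sizes).
torch.manual_seed(0)
batch, seq_len, hidden_size, num_layers = 2, 8, 16, 4
hidden_states = tuple(
    torch.randn(batch, seq_len, hidden_size) for _ in range(num_layers + 1)
)

last_layer_cls = hidden_states[-1][:, 0]  # (batch, hidden_size)
weighted_cls = WeightedLayerPooling(num_layers)(hidden_states)  # (batch, hidden_size)
```

Both baselines apply the same mixing weights to every input. The BiLSTM approach differs in that the combination depends on the per-layer embeddings of each example.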
import torch
import torch.nn as nn


class LSTMPooling(nn.Module):
    def __init__(self, num_layers, hidden_size, lstm_hidden=256):
        super().__init__()
        self.num_layers = num_layers    # number of transformer layers
        self.hidden_size = hidden_size  # transformer hidden size
        self.lstm = nn.LSTM(hidden_size, lstm_hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.1)

    def forward(self, all_hidden_states):
        # With Hugging Face models, index 0 is the embedding output,
        # so the transformer layers are indices 1..num_layers.
        # Stack CLS token from each layer: (batch, num_layers, hidden_size)
        cls_per_layer = torch.stack(
            [all_hidden_states[i][:, 0] for i in range(1, self.num_layers + 1)], dim=1
        )
        out, _ = self.lstm(cls_per_layer)  # (batch, num_layers, 2 * lstm_hidden)
        return self.dropout(out[:, -1, :])  # last LSTM output

# Usage in model forward:
outputs = transformer(input_ids, attention_mask=attention_mask, output_hidden_states=True)
pooled = lstm_pooling(outputs.hidden_states)  # (batch, 2 * lstm_hidden)
logits = classifier_head(pooled)
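To make the tensor shapes concrete, here is a small sketch that runs the same BiLSTM step on random data. The sizes and the linear `classifier_head` are assumptions chosen for illustration. It also shows a possible variation that the skill does not use: concatenating the final hidden states of both directions instead of taking the last time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, num_layers, hidden_size, lstm_hidden = 2, 4, 16, 8  # assumed sizes

# Stand-in for the stacked per-layer CLS embeddings.
cls_per_layer = torch.randn(batch, num_layers, hidden_size)

lstm = nn.LSTM(hidden_size, lstm_hidden, batch_first=True, bidirectional=True)
out, (h_n, _) = lstm(cls_per_layer)

# out: (batch, num_layers, 2 * lstm_hidden), forward half first, backward half second.
pooled_last_step = out[:, -1, :]  # what LSTMPooling returns, before dropout

# h_n: (2, batch, lstm_hidden) for a single-layer BiLSTM.
forward_final, backward_final = h_n[0], h_n[1]

# The forward final state is the forward half of the last step, but the
# backward final state sits at step 0 of `out`, not at step -1.
pooled_final_states = torch.cat([forward_final, backward_final], dim=-1)

# The head must accept 2 * lstm_hidden features because the LSTM is bidirectional.
classifier_head = nn.Linear(2 * lstm_hidden, 1)
logits = classifier_head(pooled_last_step)  # (batch, 1)
```

At the last time step the backward direction has only processed the final layer's embedding, so `pooled_last_step` mixes a full forward summary with a one-step backward one. Whether `pooled_final_states` works better is an empirical question for the task at hand.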
Requires `output_hidden_states=True` in the transformer forward pass.