skills/llm/last-token-logit-binary-scoring/SKILL.md
Score a binary classification prompt by reading the logits of the True/False (or Yes/No) token IDs at the final position and softmaxing only those two values, skipping generation entirely for a 10-50x speedup over decoding
npx skillsauth add wenmin-wu/ds-skills llm-last-token-logit-binary-scoring
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
For binary classification with an instruct LLM, the standard recipe of calling generate(max_tokens=1) and parsing the string is wasteful. The model has already computed the full vocabulary distribution at the last input position, and you only need two scalar entries from it. Run one forward pass, take logits[:, -1, [true_id, false_id]] (valid as written only when the last position is a real token, i.e. left-padded or unpadded input), softmax over those two values, and you have P(True) as a continuous score rather than a hard 0/1. The score is not guaranteed to be calibrated, but it preserves ranking, which is what matters for AUC-style metrics where the threshold can be tuned post hoc. It is also a direct way to get a useful continuous score from a model trained with cross-entropy on a single label token.
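To make the two-token softmax concrete, here is a minimal sketch in plain Python that needs no model. The logit values and the helper names `pair_softmax` and `full_softmax_prob` are illustrative assumptions, not part of the skill. It shows that a softmax over two logits reduces to a sigmoid of their difference, and that the full-vocabulary probability of the same token can be far smaller when other tokens hold most of the mass.

```python
import math

def pair_softmax(true_logit: float, false_logit: float) -> float:
    """P(True) from a softmax restricted to the two label logits."""
    # A softmax over two values is a sigmoid of their difference.
    return 1.0 / (1.0 + math.exp(false_logit - true_logit))

def full_softmax_prob(logits: list[float], index: int) -> float:
    """Probability of logits[index] under a softmax over all entries."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[index] / sum(exps)

# Hypothetical last-position logits: index 0 = True, index 1 = False,
# followed by 1000 unrelated tokens that each score higher than either label.
vocab_logits = [2.0, 1.0] + [3.0] * 1000

p_pair = pair_softmax(vocab_logits[0], vocab_logits[1])
p_full = full_softmax_prob(vocab_logits, 0)
print(f"pair softmax P(True) = {p_pair:.4f}")
print(f"full softmax P(True) = {p_full:.6f}")
```

Only the difference between the two label logits matters for the pair score, so it is unaffected by how much mass the rest of the vocabulary absorbs.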
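import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# MODEL is a placeholder: set it to a Hugging Face model ID or local path before running.
tok = AutoTokenizer.from_pretrained(MODEL)
tok.padding_side = 'right'  # the sum-based index in score() assumes right padding
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # some causal LMs ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map='auto').eval()
# First token ID of each label; each label should encode to exactly one token.
true_id = tok.encode('True', add_special_tokens=False)[0]
false_id = tok.encode('False', add_special_tokens=False)[0]
@torch.no_grad()
def score(prompts):
    """Return P(True) for each prompt from a single forward pass."""
    enc = tok(prompts, return_tensors='pt', padding=True, truncation=True, max_length=2048).to(model.device)
    logits = model(**enc).logits                           # (B, T, V)
    last_idx = enc.attention_mask.sum(1) - 1               # last real token per row (right padding)
    rows = torch.arange(logits.size(0), device=logits.device)
    last = logits[rows, last_idx.to(logits.device)]        # (B, V)
    pair = last[:, [true_id, false_id]]                    # (B, 2)
    return torch.softmax(pair.float(), dim=-1)[:, 0]       # P(True)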
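The index `attention_mask.sum(1) - 1` assumes right padding. As a small illustration in plain Python, using toy mask rows and a hypothetical helper `last_real_index` (neither comes from the skill), the sketch below shows the sum-based formula pointing at the wrong position for a left-padded row.

```python
def last_real_index(mask_row: list[int]) -> int:
    """Index of the last 1 in an attention-mask row, for either padding side."""
    for i in range(len(mask_row) - 1, -1, -1):
        if mask_row[i] == 1:
            return i
    raise ValueError("row has no real tokens")

right_padded = [1, 1, 1, 0, 0]  # 3 real tokens, then padding
left_padded = [0, 0, 1, 1, 1]   # padding, then 3 real tokens

# sum(mask) - 1 counts real tokens, so it only equals the last real
# position when all real tokens sit at the start of the row.
sum_based_right = sum(right_padded) - 1
sum_based_left = sum(left_padded) - 1
idx_right = last_real_index(right_padded)
idx_left = last_real_index(left_padded)

print("right padding:", sum_based_right, idx_right)
print("left padding: ", sum_based_left, idx_left)
```

With left padding the last real token is always at position -1, so either fix the padding side explicitly or index accordingly.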
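import numpy as np
from vllm import LLM, SamplingParams
llm = LLM(model=MODEL)  # MODEL is the same placeholder as above
# One decoded token, keeping the top-20 logprobs at that position.
sp = SamplingParams(temperature=0, max_tokens=1, logprobs=20)
out = llm.generate(prompts, sp)
probs = []
for o in out:
    lp = o.outputs[0].logprobs[0]  # dict[token_id -> Logprob]
    # A label outside the returned top-k gets a very low logprob instead of failing.
    pT = lp[true_id].logprob if true_id in lp else -1e9
    pF = lp[false_id].logprob if false_id in lp else -1e9
    # Renormalize over the two labels in log space; yields 0.5 (not NaN) if both are missing.
    probs.append(float(np.exp(pT - np.logaddexp(pT, pF))))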
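When the scores come from a top-k logprob list, one or both labels may be absent. The sketch below is one way to handle that, written in plain Python under stated assumptions: the input is a plain dict of token ID to float logprob (vLLM returns objects with a `.logprob` attribute instead), the floor `MISSING_LOGPROB` is an arbitrary choice, and the token IDs 10, 11, 12 are made up.

```python
import math

MISSING_LOGPROB = -1e9  # assumed floor for a label outside the returned top-k

def pair_prob(logprobs: dict[int, float], true_id: int, false_id: int) -> float:
    """P(True) renormalized over the two labels, computed in log space."""
    lt = logprobs.get(true_id, MISSING_LOGPROB)
    lf = logprobs.get(false_id, MISSING_LOGPROB)
    m = max(lt, lf)
    # log(exp(lt) + exp(lf)), shifted by the max so nothing underflows to log(0)
    log_norm = m + math.log(math.exp(lt - m) + math.exp(lf - m))
    return math.exp(lt - log_norm)

# Hypothetical top-k entries: 10 = True, 11 = False, 12 = some other token.
toy = {10: math.log(0.6), 11: math.log(0.2), 12: math.log(0.2)}
print(pair_prob(toy, 10, 11))        # both labels present
print(pair_prob({11: -0.1}, 10, 11)) # True missing from top-k
print(pair_prob({12: 0.0}, 10, 11))  # both labels missing
```

If neither label appears, this returns 0.5, which carries no information; in practice it may be better to flag such rows or request more logprobs.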
1. Pick a label pair such as True/False, Yes/No, or A/B. Verify each label tokenizes to exactly one ID.
2. End the prompt where the label should come next (e.g. "... Answer: ", with no trailing space if the tokenizer eats it).
3. Tokenize with max_length=2048 (or whatever fits); use left-padding so the last position is meaningful, or index with attention_mask.sum(1) - 1 for right-padding.
4. Run one forward pass, take the [true_id, false_id] columns from the last-position logits, and softmax over just those two values.

- Pair softmax, not full softmax: a full-vocabulary softmax can give P(True) ≈ 1e-3 because mass leaks to thousands of irrelevant tokens; pair-softmax gives a usable probability.
- Single-token labels: Yes is one token in most tokenizers, but Positive may be split into two; use tok.encode(label, add_special_tokens=False) and assert len == 1.
- Leading space: ' True' vs 'True' are different tokens. Match what the prompt's trailing text demands.
- Cost versus generation: decoding costs B * max_tokens * forward_per_token and discards 99.9% of the logits; this trick costs one forward pass.
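The single-token check can be wrapped in a small helper. The following is a sketch only: `single_token_id` is a hypothetical name, and `toy_encode` with its `TOY_VOCAB` is a made-up stand-in for a real tokenizer so the example runs without downloading a model.

```python
from typing import Callable

def single_token_id(encode: Callable[[str], list[int]], label: str) -> int:
    """Return the token ID for label, or raise if it is not exactly one token."""
    ids = encode(label)
    if len(ids) != 1:
        raise ValueError(f"{label!r} encodes to {len(ids)} tokens: {ids}")
    return ids[0]

# Toy vocabulary (illustrative only): note 'True' and ' True' are distinct entries.
TOY_VOCAB = {"True": 101, "False": 102, " True": 201, "Pos": 301, "itive": 302}

def toy_encode(text: str) -> list[int]:
    """Greedy longest-match over TOY_VOCAB; not how real tokenizers work in detail."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            raise KeyError(f"cannot encode {text[i:]!r}")
    return ids

print(single_token_id(toy_encode, "True"))
print(single_token_id(toy_encode, " True"))
try:
    single_token_id(toy_encode, "Positive")
except ValueError as err:
    print("rejected:", err)
```

With a Hugging Face tokenizer, the same helper could be called as `single_token_id(lambda s: tok.encode(s, add_special_tokens=False), "True")`.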