skills/from-sparse-decisions-dense/SKILL.md
Build content moderation and safety classification systems using multi-attribute trajectory reasoning instead of binary labels. Decomposes monolithic safe/unsafe decisions into structured reasoning chains (evidence grounding, modality assessment, risk mapping, policy decision, response generation) with multi-head reward scoring. Use when asked to: 'build a content moderation pipeline', 'classify harmful content with explanations', 'create a safety filter with reasoning traces', 'design a multi-attribute content scorer', 'implement explainable content moderation', 'add dense safety reasoning to a classifier'.
npx skillsauth add ndpvt-web/arxiv-claude-skills from-sparse-decisions-dense
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build content moderation and safety classification systems that replace brittle binary (safe/unsafe) labels with structured, multi-stage reasoning trajectories. Based on the UniMod paradigm, the approach decomposes each moderation decision into five explicit stages -- evidence grounding, modality assessment, risk mapping, policy decision, and response generation -- then scores outputs along multiple safety dimensions (quality, privacy, bias, toxicity, legal risk) using a multi-head reward model. This eliminates shortcut learning where classifiers latch onto superficial features, and produces explainable, auditable moderation decisions.
The core insight: Binary moderation labels (safe/unsafe) create a sparse supervision signal that lets models learn shortcuts -- e.g., flagging any image with skin tones as unsafe, or any mention of weapons as harmful regardless of context. UniMod replaces this with dense supervision by requiring the model to produce a structured reasoning trajectory before its final decision. Each stage constrains the search space for the next, so the model cannot skip to a conclusion without grounding it in evidence. This sequential decomposition reduces sample complexity from exponential (searching the full decision space) to stepwise (searching within each stage's logical subspace).
Multi-head reward scoring: Instead of a single reward signal, the UniRM component uses a shared backbone with five parallel scoring heads: quality/compliance, privacy protection, bias mitigation, toxicity avoidance, and legal risk assessment. Each head computes r_k = sigmoid(w_k^T * h), where h is the shared representation and w_k is that head's weight vector. Heads are kept independent via soft orthogonal regularization (the sum of cos_sim(w_i, w_j)^2 over distinct head pairs, scaled by a penalty lambda=0.05) and stochastic scheduling that randomizes which head updates first each epoch. The final reward is an additive aggregate R = sum(w_k * r_k), where w_k here denotes a scalar per-dimension weight rather than the head's weight vector. Additive aggregation preserves reward variance and prevents the collapse-to-zero problem that multiplicative aggregation causes.
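The scoring and aggregation formulas above can be sketched in plain Python. This is a minimal illustration, not the UniRM implementation: the function names, the uniform weights, and the toy scores are assumptions made for the example, and only the formulas (sigmoid heads, squared cosine-similarity penalty with lambda=0.05, additive aggregate) come from the text.

```python
import math

DIMENSIONS = ["quality", "privacy", "bias", "toxicity", "legal"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def head_score(w: list[float], h: list[float]) -> float:
    """r_k = sigmoid(w_k^T h) for one head's weight vector w and shared state h."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)))


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # treat zero vectors as orthogonal


def orthogonal_penalty(head_weights: list[list[float]], lam: float = 0.05) -> float:
    """lam * sum of squared cosine similarity over distinct head pairs."""
    total = 0.0
    for i in range(len(head_weights)):
        for j in range(i + 1, len(head_weights)):
            total += cosine(head_weights[i], head_weights[j]) ** 2
    return lam * total


def additive_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """R = sum(w_k * r_k) with scalar per-dimension weights."""
    return sum(weights[k] * scores[k] for k in scores)


def multiplicative_reward(scores: dict[str, float]) -> float:
    """Product of head scores, shown only for comparison."""
    return math.prod(scores.values())


# Toy scores (hypothetical): four confident heads and one near-zero head.
toy_scores = {"quality": 0.9, "privacy": 0.9, "bias": 0.9, "toxicity": 0.9, "legal": 0.01}
uniform = {k: 1.0 / len(DIMENSIONS) for k in DIMENSIONS}

additive = additive_reward(toy_scores, uniform)
multiplicative = multiplicative_reward(toy_scores)
```

With these toy numbers the additive aggregate stays informative while the product is dragged close to zero by the single low head, which is the collapse the text warns about.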
Practical upshot: This approach achieves state-of-the-art multimodal moderation with under 40% of the training data used by competing methods, because structured trajectories extract far more learning signal per sample than flat labels.
Define your safety taxonomy: Enumerate the risk categories your system must detect (e.g., violence, hate speech, sexual content, self-harm, misinformation, privacy violations). Map each to a policy document or ruleset. This becomes the "risk mapping" vocabulary.
Design the trajectory schema: Create a structured output format with five sequential fields. Use XML tags or JSON keys:
- <evidence>: Specific quotes, regions, or features from the input that are safety-relevant
- <modality>: Which input modalities (text, image, audio, video) contain the flagged content
- <risk>: Which taxonomy categories apply, with severity (low/medium/high/critical)
- <decision>: Allow, flag, or block -- with the policy rule that justifies it
- <response>: The user-facing moderation message or explanation
Build the trajectory generation pipeline: For each input, generate the full five-stage trajectory rather than a single label. If training a model, use a teacher ensemble (multiple strong models vote on each stage, majority consensus wins for categorical stages, embedding similarity for free-text evidence). If building with an LLM, use structured prompting that forces sequential stage completion.
Implement multi-head scoring: Create separate scoring functions for each safety dimension (quality, privacy, bias, toxicity, legal). Each scorer evaluates the <response> stage output on its dimension, returning a bounded score (in (0, 1) when using sigmoid heads of the form r_k = sigmoid(w_k^T * h)). Keep scorer weights orthogonal -- if using learned heads, add the cosine-similarity penalty term to your loss.
Aggregate scores additively: Combine dimension scores as R = sum(w_k * r_k) with configurable per-dimension weights. Do NOT use multiplicative aggregation -- it collapses reward variance when any single head is uncertain.
Decouple training objectives: If fine-tuning a model, use head-wise weight subspace decoupling (orthogonal regularization on head weight vectors) and stochastic head scheduling (random permutation of head update order each epoch) to prevent gradient interference between safety dimensions.
Validate intermediate stages, not just final decisions: Evaluate accuracy at each trajectory stage independently. A system that gets the right final answer but wrong evidence or wrong risk category is still fragile -- it will fail on distribution shifts.
Implement cascading early-exit: For production latency, allow high-confidence cases to exit early. If evidence grounding finds zero safety-relevant features with high confidence, skip to "allow" without full risk mapping. Reserve the full trajectory for ambiguous cases.
Build feedback loops: Log full trajectories for human review. When reviewers disagree with a decision, annotate which stage went wrong (bad evidence extraction? wrong risk category? correct risk but wrong policy application?). This targeted feedback is far more valuable than flipping a binary label.
Test against shortcut scenarios: Create adversarial test cases designed to trigger shortcuts -- benign content with surface-level "dangerous" keywords, harmful content disguised in neutral language, images with misleading context. Verify the system's evidence grounding catches the true signal.
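The stage-level validation step above can be sketched as a small evaluation helper. This is a hedged illustration: the stage keys, the dict-based trajectory format, and exact-match comparison are assumptions made for the example (free-text evidence would more realistically be compared by similarity, as the teacher-ensemble step suggests).

```python
# Hypothetical stage keys; adapt to your own trajectory schema.
STAGES = ["evidence", "modality", "risk", "decision"]


def stage_accuracy(predicted: list[dict], gold: list[dict]) -> dict[str, float]:
    """Per-stage exact-match accuracy over paired predicted/gold trajectories."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return {stage: 0.0 for stage in STAGES}
    hits = {stage: 0 for stage in STAGES}
    for p, g in zip(predicted, gold):
        for stage in STAGES:
            if p.get(stage) == g.get(stage):
                hits[stage] += 1
    return {stage: hits[stage] / len(gold) for stage in STAGES}


def right_answer_wrong_reason(predicted: list[dict], gold: list[dict]) -> list[int]:
    """Indices where the final decision matches but an earlier stage does not."""
    fragile = []
    for i, (p, g) in enumerate(zip(predicted, gold)):
        decision_ok = p.get("decision") == g.get("decision")
        earlier_wrong = any(p.get(s) != g.get(s) for s in STAGES[:-1])
        if decision_ok and earlier_wrong:
            fragile.append(i)
    return fragile


# Toy data (hypothetical): sample 0 blocks correctly but maps the wrong risk.
gold = [
    {"evidence": ["a"], "modality": "text", "risk": "hate_speech", "decision": "block"},
    {"evidence": [], "modality": "none", "risk": "none", "decision": "allow"},
]
predicted = [
    {"evidence": ["a"], "modality": "text", "risk": "violence", "decision": "block"},
    {"evidence": [], "modality": "none", "risk": "none", "decision": "allow"},
]

per_stage = stage_accuracy(predicted, gold)
fragile_cases = right_answer_wrong_reason(predicted, gold)
```

Here decision accuracy is perfect while risk accuracy is only 50%, and `fragile_cases` points at the sample that reached the right decision through the wrong risk category.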
Example 1: Building a moderation API with trajectory reasoning
User: "Build me a content moderation function that takes text and an optional image URL, and returns a structured safety assessment instead of just safe/unsafe."
Approach:
Output:
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    NONE = "none"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Enum members iterate in definition order, which here is ascending severity.
SEVERITY_ORDER = list(Severity)


class Decision(Enum):
    ALLOW = "allow"
    FLAG = "flag"
    BLOCK = "block"


@dataclass
class ModerationTrajectory:
    evidence: list[str]               # Specific excerpts/regions flagged
    modality_flags: dict[str, bool]   # {"text": True, "image": False}
    risk_categories: dict[str, str]   # {"violence": "none", "hate_speech": "high"}
    decision: Decision
    policy_rule: str                  # Which policy triggered the decision
    explanation: str                  # Human-readable justification


RISK_TAXONOMY = [
    "violence", "hate_speech", "sexual_content",
    "self_harm", "misinformation", "privacy_violation",
]


# extract_safety_evidence, classify_risk, apply_policy, and generate_explanation
# are not defined here; supply implementations for your own models and policies.
# classify_risk is expected to return a Severity member.
def moderate(text: str, image_url: Optional[str] = None) -> ModerationTrajectory:
    # Stage 1: Evidence grounding
    evidence = extract_safety_evidence(text, image_url)

    # Stage 2: Modality assessment
    modality_flags = {
        "text": any(e["source"] == "text" for e in evidence),
        "image": any(e["source"] == "image" for e in evidence),
    }

    # Stage 3: Risk mapping
    risk_categories = {}
    for category in RISK_TAXONOMY:
        risk_categories[category] = classify_risk(evidence, category)

    # Stage 4: Policy decision
    # Rank by severity order; comparing the string values would sort alphabetically.
    max_risk = max(risk_categories.values(), key=SEVERITY_ORDER.index)
    decision, policy_rule = apply_policy(risk_categories, max_risk)

    # Stage 5: Response generation
    explanation = generate_explanation(evidence, risk_categories, decision, policy_rule)

    return ModerationTrajectory(
        evidence=[e["text"] for e in evidence],
        modality_flags=modality_flags,
        risk_categories={k: v.value for k, v in risk_categories.items()},
        decision=decision,
        policy_rule=policy_rule,
        explanation=explanation,
    )
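The cascading early-exit idea from the workflow can be layered on top of a staged pipeline like this one. The sketch below is an assumption-laden illustration: `ground_evidence` and `full_trajectory` are hypothetical callables standing in for your own stage implementations, and the 0.95 confidence threshold is an arbitrary placeholder to be tuned.

```python
from typing import Callable


def moderate_with_early_exit(
    item: str,
    ground_evidence: Callable[[str], tuple[list[str], float]],
    full_trajectory: Callable[[str, list[str]], dict],
    exit_confidence: float = 0.95,  # hypothetical threshold, tune per deployment
) -> dict:
    """Run evidence grounding first; skip later stages for clearly benign input."""
    evidence, confidence = ground_evidence(item)

    # Early exit: nothing safety-relevant was found, and the grounding stage
    # is confident about that, so allow without risk mapping.
    if not evidence and confidence >= exit_confidence:
        return {"evidence": [], "decision": "allow", "early_exit": True}

    # Ambiguous or flagged input gets the full five-stage trajectory.
    result = dict(full_trajectory(item, evidence))
    result["early_exit"] = False
    return result


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_grounding(item: str) -> tuple[list[str], float]:
    if "threat" in item:
        return ([item], 0.9)
    return ([], 0.99)


def toy_full_trajectory(item: str, evidence: list[str]) -> dict:
    return {"evidence": evidence, "decision": "flag" if evidence else "allow"}


benign = moderate_with_early_exit("have a nice day", toy_grounding, toy_full_trajectory)
flagged = moderate_with_early_exit("this is a threat", toy_grounding, toy_full_trajectory)
```

Low-confidence "no evidence" results deliberately fall through to the full trajectory, so the shortcut only applies where the grounding stage is sure.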
Example 2: Multi-head reward scorer for evaluating moderation quality
User: "I have a moderation model that generates responses. I need a scoring system that evaluates responses across multiple safety dimensions, not just accuracy."
Approach:
Output:
import torch
import torch.nn as nn


class MultiHeadRewardModel(nn.Module):
    DIMENSIONS = ["quality", "privacy", "bias", "toxicity", "legal"]

    def __init__(self, backbone_dim: int = 768, lambda_ortho: float = 0.05):
        super().__init__()
        self.lambda_ortho = lambda_ortho
        self.heads = nn.ModuleDict({
            dim: nn.Linear(backbone_dim, 1) for dim in self.DIMENSIONS
        })
        self.dim_weights = nn.Parameter(torch.ones(len(self.DIMENSIONS)) / len(self.DIMENSIONS))

    def forward(self, hidden_state: torch.Tensor) -> dict:
        scores = {}
        for dim in self.DIMENSIONS:
            scores[dim] = torch.sigmoid(self.heads[dim](hidden_state)).squeeze(-1)

        # Additive aggregation (NOT multiplicative)
        aggregate = sum(
            self.dim_weights[i] * scores[dim]
            for i, dim in enumerate(self.DIMENSIONS)
        )
        scores["aggregate"] = aggregate
        return scores

    def orthogonal_loss(self) -> torch.Tensor:
        """Penalize cosine similarity between head weight vectors."""
        weights = [self.heads[d].weight.squeeze() for d in self.DIMENSIONS]
        loss = torch.tensor(0.0, device=weights[0].device)
        for i in range(len(weights)):
            for j in range(i + 1, len(weights)):
                cos_sim = torch.cosine_similarity(weights[i], weights[j], dim=0)
                loss += cos_sim ** 2
        return self.lambda_ortho * loss

    def total_loss(self, predictions: dict, targets: dict, active_dim: str) -> torch.Tensor:
        """SSSL: single-sample single-label -- only one dimension has a label per sample."""
        mse = nn.functional.mse_loss(predictions[active_dim], targets[active_dim])
        return mse + self.orthogonal_loss()
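The stochastic head scheduling mentioned in the workflow (a random permutation of head update order each epoch) is not shown in the model above. One way it might look, sketched with only the standard library: the `head_schedule` helper and its seed parameter are hypothetical, and the actual per-head update is left as a placeholder comment.

```python
import random

DIMENSIONS = ["quality", "privacy", "bias", "toxicity", "legal"]


def head_schedule(num_epochs: int, seed: int = 0) -> list[list[str]]:
    """Return one independently shuffled head update order per epoch."""
    rng = random.Random(seed)  # seeded so a training run is reproducible
    schedule = []
    for _ in range(num_epochs):
        order = DIMENSIONS.copy()
        rng.shuffle(order)
        schedule.append(order)
    return schedule


schedule = head_schedule(num_epochs=3, seed=42)
for epoch, order in enumerate(schedule):
    for dim in order:
        # Placeholder: run the optimizer step for head `dim` here, e.g. on
        # the batches that carry a label for that dimension.
        pass
```

Every head is still updated once per epoch; only the order changes, so no single dimension consistently gets the first gradient step.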
Example 3: Converting a binary classifier into trajectory-based moderation
User: "I have an existing content filter that returns True/False. How do I refactor it to use the trajectory approach for better accuracy and explainability?"
Approach:
Output:
# Before: binary classifier
def is_harmful(text: str) -> bool:
    return classifier.predict(text) > 0.5


# After: trajectory-wrapped moderation
# (classifier, extract_evidence, map_to_taxonomy, and build_explanation are
# assumed to exist in your codebase.)
def moderate_with_trajectory(text: str, image: bytes | None = None) -> dict:
    trajectory = {}

    # Stage 1: Evidence grounding -- extract what's actually concerning
    trajectory["evidence"] = extract_evidence(text, image)

    # Stage 2: Modality assessment
    trajectory["modality"] = {
        "text_flagged": bool(trajectory["evidence"]["text_spans"]),
        "image_flagged": bool(trajectory["evidence"].get("image_regions")),
    }

    # Stage 3: Risk mapping -- categorize, don't just score
    trajectory["risks"] = map_to_taxonomy(trajectory["evidence"])

    # Stage 4: Policy decision -- use original classifier PLUS trajectory context
    raw_score = classifier.predict(text)
    # Count evidence from either modality so image-only findings are not ignored
    has_grounded_evidence = any(trajectory["modality"].values())
    has_mapped_risk = any(r["severity"] != "none" for r in trajectory["risks"])

    # Override shortcuts: high score but no evidence = likely false positive
    if raw_score > 0.5 and not has_grounded_evidence:
        trajectory["decision"] = "allow"
        trajectory["override_reason"] = "classifier_flagged_but_no_evidence_found"
    elif has_grounded_evidence and has_mapped_risk:
        trajectory["decision"] = "block"
    else:
        trajectory["decision"] = "allow"

    # Stage 5: Explanation
    trajectory["explanation"] = build_explanation(trajectory)
    return trajectory
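To check that a refactor like this actually removes shortcuts, the adversarial testing step from the workflow can be sketched as a small harness. Everything here is illustrative: the test cases, the `run_shortcut_suite` helper, and the deliberately naive keyword filter are assumptions made for the example, not part of UniMod.

```python
from typing import Callable

# Hypothetical shortcut probes: surface-level "dangerous" words in benign text,
# and concerning intent phrased without any alarming keyword.
SHORTCUT_CASES = [
    {"text": "This knife is perfect for slicing tomatoes.", "expected": "allow"},
    {"text": "That presentation killed, the audience loved it.", "expected": "allow"},
    {"text": "We know where you live. It would be a shame if something happened.",
     "expected": "block"},
    {"text": "Looking forward to the picnic on Saturday.", "expected": "allow"},
]


def run_shortcut_suite(moderate_fn: Callable[[str], str], cases: list[dict]) -> list[dict]:
    """Return the cases where the moderation decision differs from the expected one."""
    failures = []
    for case in cases:
        actual = moderate_fn(case["text"])
        if actual != case["expected"]:
            failures.append({**case, "actual": actual})
    return failures


# A naive keyword filter, included only to show what the suite catches.
KEYWORDS = {"knife", "kill", "attack"}


def keyword_filter(text: str) -> str:
    lowered = text.lower()
    return "block" if any(k in lowered for k in KEYWORDS) else "allow"


failures = run_shortcut_suite(keyword_filter, SHORTCUT_CASES)
```

The keyword filter fails three of the four probes: it blocks both benign keyword sentences and misses the veiled threat. A trajectory-based system should be run through the same suite, passing a wrapper that returns its final decision string.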
From Sparse Decisions to Dense Reasoning (arXiv:2602.02536) -- Focus on Section 3 (trajectory decomposition and the three lemmas justifying it), Section 4 (UniRM multi-head architecture with orthogonal regularization), and Table 2 (ablation showing each trajectory stage's contribution to final accuracy).