skills/are-open-weight-ready-social/SKILL.md
Build LLM-based content moderation pipelines using zero-shot classification with open-weight models. Implements the structured prompting and evaluation methodology from Chou et al. (2026) for detecting rudeness, intolerance, and threats in social media posts. Use when: 'build a content moderation system', 'detect harmful posts with LLMs', 'moderate social media content', 'set up open-weight moderation pipeline', 'evaluate LLM moderation accuracy', 'privacy-preserving content filtering'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills are-open-weight-ready-social
This skill enables Claude to build and evaluate content moderation systems that use open-weight LLMs for zero-shot harmful content detection. Based on Chou et al. (2026), the approach uses structured role-definition prompts with precise violation definitions to classify social media posts into categories (rudeness, intolerance, threats) without fine-tuning. The key finding: open-weight models running on a single 24GB GPU achieve sensitivity (81-97%) and specificity (91-100%) competitive with proprietary models, making privacy-preserving, on-premise moderation viable.
The core method is definition-anchored zero-shot classification: each moderation category is paired with a precise natural-language definition, injected into a standardized prompt template with the role "You are a social media moderator." The model outputs a binary yes/no decision. No few-shot examples or fine-tuning are needed. The prompt template follows this structure:
Instruction: You are a social media moderator. Decide whether a given post is {category}
according to this definition: '{definition}'. Respond with either yes or no (lowercase).
Input: Post: {post_text}
Output: Response:
What makes this effective is the tight coupling between definition and decision boundary. The definition text (e.g., "Rude or impolite, including crude language and disrespectful comments, without constructive purpose") acts as the classification specification. By varying the definition, the same pipeline handles different violation types without architectural changes. The study found that specificity exceeds sensitivity for subjective categories like rudeness (fewer false positives, more missed violations), while sensitivity exceeds specificity for clear-cut violations like threats and intolerance (catches more true positives). This asymmetry is important for calibrating moderation thresholds.
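As a minimal sketch of this pattern, the snippet below fills the template above from a dictionary of definitions, so that adding a violation type means adding one entry. The names `DEFINITIONS`, `TEMPLATE`, and `build_prompt` are illustrative assumptions rather than identifiers from the paper; the definition strings are the ones quoted in this document.

```python
# Hypothetical helper names; only the template text and the definition
# strings come from this document.
DEFINITIONS = {
    "rude": (
        "Rude or impolite, including crude language and disrespectful "
        "comments, without constructive purpose"
    ),
    "threat": (
        "Promotes violence or harm towards others, including threats, "
        "incitement, or advocacy of harm"
    ),
}

TEMPLATE = (
    "Instruction: You are a social media moderator. Decide whether a given "
    "post is {category} according to this definition: '{definition}'. "
    "Respond with either yes or no (lowercase).\n"
    "Input: Post: {post_text}\n"
    "Output: Response:"
)


def build_prompt(category: str, post_text: str) -> str:
    """Fill the shared template; only the category and definition vary."""
    return TEMPLATE.format(
        category=category,
        definition=DEFINITIONS[category],
        post_text=post_text,
    )


if __name__ == "__main__":
    print(build_prompt("rude", "Nobody asked for your opinion."))
```

Because the template is shared, any change to the decision boundary lives entirely in the definition text, which keeps per-category behavior easy to audit.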
For deployment, open-weight models (9B-30B parameters with mixture-of-experts architectures) run on a single NVIDIA RTX 3090 (24GB VRAM) using vLLM for inference. Temperature is set to 0 with a fixed random seed for deterministic outputs. Cap the maximum output tokens (the study used 10,000 to accommodate reasoning traces); the classification itself is read from the short final answer, which for models that emit no reasoning trace is the first token.
Define your moderation taxonomy. Write precise, one-sentence definitions for each violation category. Follow the pattern: "[Category name]: [Observable behavior], including [specific examples], [scope qualifier]." Example: "Rude: Rude or impolite, including crude language and disrespectful comments, without constructive purpose."
Construct prompt templates for each category. Use the role-definition-binary format: assign the moderator role, inject the category definition, present the post, and constrain output to "yes" or "no". Keep the template identical across categories except for the definition string.
Select and deploy the model. For privacy-preserving local deployment, choose an open-weight model that fits in 24GB VRAM (e.g., Qwen3-30B-A3B, Nemotron-Nano-9B). Serve via vLLM with temperature=0 and a fixed seed for reproducibility. For cloud deployment, any reasoning-capable model works.
Implement the classification endpoint. Build an API that accepts post text, runs it through each category's prompt template, parses the binary response, and returns a structured moderation decision. Handle response parsing defensively: extract "yes"/"no" from potentially verbose reasoning output.
Handle model refusals as positive signals. When a model's safety filter triggers and refuses to process a post, interpret the refusal as a positive classification (the content was harmful enough to trigger built-in guardrails). Log these cases separately for review.
Cap output tokens and extract the decision. Set max_tokens high enough for reasoning models (4096-10000) but parse only the final answer. For reasoning models that emit <think> blocks, strip the reasoning and extract the yes/no from the response section.
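The parsing and refusal rules described so far can be sketched as a small parser that keeps refusals and unparseable outputs distinct from plain yes/no answers, so they can be logged separately. This is a sketch under assumptions: the `Decision` enum, the function names, and the refusal phrases in `REFUSAL_MARKERS` are hypothetical and should be adapted to the model actually deployed.

```python
import re
from enum import Enum


class Decision(Enum):
    VIOLATION = "violation"
    CLEAN = "clean"
    REFUSAL = "refusal"          # counted as a positive, logged separately
    UNPARSEABLE = "unparseable"  # e.g. output truncated mid-reasoning


# Assumed refusal phrasings; extend for the model you deploy.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable", "i am unable")


def extract_decision(raw_output: str) -> Decision:
    """Drop any reasoning trace, then read the yes/no answer."""
    lowered = raw_output.lower()
    if "</think>" in lowered:
        # Keep only the text after the last closing tag.
        answer = lowered.rsplit("</think>", 1)[1]
    elif "<think>" in lowered:
        # Opening tag without a close: the token cap cut the trace short.
        return Decision.UNPARSEABLE
    else:
        answer = lowered
    answer = answer.strip()

    # Match "yes" or "no" as whole words so "not sure" is not read as "no".
    match = re.match(r"(yes|no)\b", answer)
    if match:
        return Decision.VIOLATION if match.group(1) == "yes" else Decision.CLEAN
    if any(marker in answer for marker in REFUSAL_MARKERS):
        return Decision.REFUSAL
    return Decision.UNPARSEABLE


def is_flagged(decision: Decision) -> bool:
    """Refusals are treated as positive classifications."""
    return decision in (Decision.VIOLATION, Decision.REFUSAL)
```

Returning a separate `UNPARSEABLE` state, rather than guessing, makes truncation visible in logs and signals when `max_tokens` is set too low.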
Evaluate with sensitivity and specificity, not just accuracy. Compute true positive rate (sensitivity) and true negative rate (specificity) separately per category. Accuracy alone is misleading because harmful posts are rare (class imbalance). Report both metrics side by side.
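To make the class-imbalance point concrete, here is a small dependency-free sketch. The data is synthetic and purely illustrative (it is not from the study): 50 harmful posts out of 1,000, scored against a degenerate classifier that never flags anything. The function name is an assumption.

```python
def sensitivity_specificity(
    y_true: list[bool], y_pred: list[bool]
) -> dict[str, float]:
    """Compute sensitivity, specificity, and accuracy from boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    return {
        # NaN marks a metric that is undefined for this sample.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs),
    }


# Synthetic illustration: 5% of posts are harmful.
y_true = [True] * 50 + [False] * 950
never_flags = [False] * 1000

metrics = sensitivity_specificity(y_true, never_flags)
print(metrics)
```

The classifier that flags nothing scores 95% accuracy and 100% specificity while catching none of the harmful posts (sensitivity 0), which is why the two rates need to be reported side by side.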
Compute inter-rater agreement if using multiple annotators or models. Use Cohen's kappa for pairwise agreement and Fleiss' kappa for multi-rater scenarios. Agreement between LLMs and humans on par with human-human agreement (kappa > 0.6) indicates deployment readiness.
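For the pairwise case, scikit-learn's `cohen_kappa_score` can be used directly. For the multi-rater case, the following is a minimal pure-Python sketch of Fleiss' kappa for binary labels, assuming every post is labeled by the same number of raters. The function name and the handling of the degenerate case (all ratings identical) are choices made for this sketch.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[bool]]) -> float:
    """Fleiss' kappa; ratings[i] holds every rater's label for item i."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / len(ratings)
    total = len(ratings) * n_raters
    # Chance agreement from the overall share of each label.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Every rating is identical, so kappa is undefined;
        # this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Three raters (for example two humans and one LLM) on four posts.
    votes = [
        [True, True, True],
        [False, False, False],
        [True, True, False],
        [False, False, False],
    ]
    print(round(fleiss_kappa(votes), 3))
```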
Calibrate per-category thresholds. For subjective categories (rudeness), accept higher false-negative rates to avoid over-censorship. For safety-critical categories (threats), bias toward higher sensitivity even at the cost of more false positives. Make these trade-offs configurable.
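One way to make this trade-off explicit is to choose each category's threshold from labeled validation data. The sketch below assumes each post has a numeric score, such as the fraction of "yes" votes over repeated samples, and picks the highest threshold that still meets a target sensitivity; a higher threshold means fewer false positives. The function name, the scores, and the targets are illustrative assumptions, not values from the study.

```python
def pick_threshold(
    scores: list[float], labels: list[bool], target_sensitivity: float
) -> float:
    """Highest threshold whose validation sensitivity meets the target.

    A post counts as flagged when score >= threshold.
    """
    if not 0.0 < target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be in (0, 1]")
    positives = [s for s, y in zip(scores, labels) if y]
    if not positives:
        raise ValueError("need at least one positive example")

    # Try candidate thresholds from strictest evidence requirement downward.
    for threshold in sorted(set(scores), reverse=True):
        caught = sum(s >= threshold for s in positives)
        if caught / len(positives) >= target_sensitivity:
            return threshold
    # Not reached: the lowest score always catches every positive.
    raise ValueError("no threshold reaches the target")


# Synthetic validation scores (vote fractions) and ground-truth labels.
scores = [1.0, 0.8, 0.6, 0.2, 0.4, 0.0]
labels = [True, True, True, True, False, False]

# Safety-critical category: demand high sensitivity, accept more flags.
threat_threshold = pick_threshold(scores, labels, target_sensitivity=0.9)
# Subjective category: tolerate more misses to limit over-censorship.
rude_threshold = pick_threshold(scores, labels, target_sensitivity=0.7)
print(threat_threshold, rude_threshold)
```

Storing the target sensitivity per category, rather than the threshold itself, keeps the policy readable and lets thresholds be recomputed when the model or the definitions change.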
Build the personalization layer. Allow users to adjust their own sensitivity thresholds per category. The same underlying model serves both platform-wide enforcement (strict definitions) and personal filtering (user-tuned definitions or thresholds).
Example 1: Building a Bluesky moderation microservice
User: "I want to build a content moderation API for my Bluesky-compatible service that runs locally for privacy."
Approach: serve the classifier behind a /moderate endpoint.
Output:

    # moderation_service.py
    from fastapi import FastAPI
    from pydantic import BaseModel
    from vllm import LLM, SamplingParams

    app = FastAPI()

    CATEGORIES = {
        "rude": "Rude or impolite, including crude language and disrespectful comments, without constructive purpose",
        "intolerant": "Discrimination against protected groups",
        "threat": "Promotes violence or harm towards others, including threats, incitement, or advocacy of harm",
    }

    PROMPT_TEMPLATE = (
        "Instruction: You are a social media moderator. Decide whether a given post "
        "is {category} according to this definition: '{definition}'. "
        "Respond with either yes or no (lowercase).\n"
        "Input: Post: {post_text}\n"
        "Output: Response:"
    )

    # Assumption: unquantized 16-bit weights for a 30B-parameter model take
    # roughly 60 GB, so a 24GB GPU needs a quantized checkpoint of this model.
    llm = LLM(model="Qwen/Qwen3-30B-A3B", gpu_memory_utilization=0.9)
    # temperature=0 plus a fixed seed keeps decisions reproducible
    params = SamplingParams(temperature=0, max_tokens=4096, seed=350)


    class Post(BaseModel):
        text: str


    class ModerationResult(BaseModel):
        rude: bool
        intolerant: bool
        threat: bool
        flagged: bool


    def parse_decision(output_text: str) -> bool:
        """Extract yes/no from model output, handling reasoning traces."""
        cleaned = output_text.strip().lower()
        # For reasoning models, the answer follows the closing </think> tag
        if "</think>" in cleaned:
            cleaned = cleaned.split("</think>")[-1].strip()
        if cleaned.startswith("yes"):
            return True
        if cleaned.startswith("no"):
            return False
        # Model refusal = treat as flagged
        if "i cannot" in cleaned or "i'm unable" in cleaned:
            return True
        # Fallback: look for "yes" anywhere on the first line of the answer
        return "yes" in cleaned.split("\n")[0]


    @app.post("/moderate", response_model=ModerationResult)
    async def moderate(post: Post):
        # One prompt per category, in the same order as CATEGORIES
        prompts = [
            PROMPT_TEMPLATE.format(
                category=cat, definition=defn, post_text=post.text
            )
            for cat, defn in CATEGORIES.items()
        ]
        # llm.generate is a blocking call, so requests are handled one at a time
        outputs = llm.generate(prompts, params)
        decisions = {
            cat: parse_decision(out.outputs[0].text)
            for cat, out in zip(CATEGORIES.keys(), outputs)
        }
        # A post is flagged if any category fires
        decisions["flagged"] = any(decisions.values())
        return ModerationResult(**decisions)
Example 2: Evaluating moderation model accuracy
User: "I have a labeled dataset of posts. How do I evaluate which LLM is best for moderation?"
Approach:
Output:

    # evaluate_moderation.py
    import json

    from sklearn.metrics import confusion_matrix, cohen_kappa_score


    def evaluate_model(predictions: list[bool], ground_truth: list[bool], label: str):
        # Fixing the label order guarantees a 2x2 matrix even when one
        # class is missing from the sample
        tn, fp, fn, tp = confusion_matrix(
            ground_truth, predictions, labels=[False, True]
        ).ravel()
        sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        kappa = cohen_kappa_score(ground_truth, predictions)
        return {
            "category": label,
            "sensitivity": round(sensitivity, 3),
            "specificity": round(specificity, 3),
            "accuracy": round(accuracy, 3),
            "cohens_kappa": round(kappa, 3),
            "tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn),
        }


    # Example usage with results from multiple models.
    # load_predictions and load_ground_truth are placeholders for your own
    # data loading; each returns a dict mapping category -> list of bool.
    models = ["qwen3-30b", "nemotron-9b", "gpt-4o", "gemini-2.5-pro"]
    truth = load_ground_truth()
    for model_name in models:
        preds = load_predictions(model_name)
        for category in ["rude", "intolerant", "threat"]:
            result = evaluate_model(
                preds[category], truth[category], category
            )
            print(f"{model_name} | {json.dumps(result)}")
Example 3: Adding personalized moderation filters
User: "I want users to be able to set their own moderation sensitivity for different content types."
Approach:
Output:

    # personalized_moderation.py
    from dataclasses import dataclass


    @dataclass
    class UserPreferences:
        # 0.0 = show everything, 1.0 = hide aggressively
        rude_threshold: float = 0.5
        intolerant_threshold: float = 0.7
        threat_threshold: float = 0.9  # most users want strict threat filtering


    def personalized_moderate(
        post_text: str,
        user_prefs: UserPreferences,
        llm_client,
        n_samples: int = 5,
    ) -> dict[str, bool]:
        """Run moderation with user-specific sensitivity thresholds.

        build_prompt, parse_decision, and llm_client.generate are assumed
        helpers: build the category prompt, map model text to a bool, and
        return the model's text for a prompt.
        """
        results = {}
        for category, threshold in [
            ("rude", user_prefs.rude_threshold),
            ("intolerant", user_prefs.intolerant_threshold),
            ("threat", user_prefs.threat_threshold),
        ]:
            # Sample several times with slightly varied temperature; this trades
            # strict determinism for a vote-based confidence estimate
            votes = []
            for i in range(n_samples):
                prompt = build_prompt(category, post_text)
                response = llm_client.generate(prompt, temperature=0.1 * i, seed=350 + i)
                votes.append(parse_decision(response))
            confidence = sum(votes) / len(votes)
            # A stricter (higher) setting needs fewer "yes" votes to hide a post;
            # the original `confidence >= threshold` inverted this meaning
            results[category] = confidence > 1.0 - threshold
        return results
Use temperature=0 and fix the random seed for reproducible moderation decisions. Non-deterministic moderation erodes user trust.
Reasoning models emit <think> blocks before the answer. Always strip reasoning traces and extract the final yes/no. Set max_tokens high enough (4096+) to avoid truncation before the answer appears.
If a hosted model's built-in safety filter blocks posts, relax the filter where the API allows it (e.g., HarmBlockThreshold.BLOCK_NONE) or host the model locally where you control the safety layer.
Chou, H.-Y., Naveed, W., Zhou, S., & Yang, X. (2026). Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky. arXiv:2602.05189v1. https://arxiv.org/abs/2602.05189v1
Key takeaway: Open-weight models (9B-30B) match proprietary LLMs on moderation accuracy using zero-shot definition-anchored prompts, enabling privacy-preserving deployment on a single 24GB consumer GPU.