skills/beyond-holistic-scores-automatic/SKILL.md
Build trait-based essay scoring systems that evaluate argumentative writing across multiple rubric dimensions (Content, Organization, Word Choice, Sentence Fluency, Conventions) using structured in-context learning prompts and ordinal regression. Trigger phrases: 'score essays by trait', 'build essay grading rubric', 'argumentative essay evaluation', 'trait-based writing assessment', 'rubric-aligned essay scoring', 'automated writing feedback'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills beyond-holistic-scores-automatic
This skill enables Claude to build automated essay scoring systems that go beyond single holistic scores by evaluating argumentative writing across five distinct quality traits: Ideas & Content, Organization, Word Choice, Sentence Fluency, and Conventions. The approach uses two complementary techniques from the paper: (1) structured in-context learning prompts with rubric-aligned exemplars for LLM-based scoring, and (2) a CORAL ordinal regression formulation that explicitly models the ordered nature of rubric scores, substantially outperforming standard classification and regression baselines.
Structured In-Context Learning for Trait Scoring. Rather than asking an LLM to produce a single essay score, this approach prompts the model as an expert evaluator with a carefully structured prompt: (1) a role specification ("expert evaluator of students' argumentative essays"), (2) a trait definition pulled directly from the rubric, (3) rubric guidelines with one exemplar essay per score level, (4) the student essay, and (5) an output specification requesting JSON with justification, score, and confidence. Critically, the output specification must appear at the very end of the prompt -- placing it elsewhere causes models to ignore formatting instructions. The paper found that a single exemplar per score level works best; adding more exemplars degrades output quality for non-reasoning models.
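The five-part prompt layout described above can be sketched as a small builder. This is a minimal illustration, not the paper's exact template: `ScoreLevel`, `build_trait_prompt`, the section wording, and the `<essay>` delimiters are all assumptions made for the example. The only properties taken from the text are the section order, one exemplar per score level, and the output specification coming last.

```python
from dataclasses import dataclass


@dataclass
class ScoreLevel:
    """One rubric level (hypothetical container for this sketch)."""
    score: int        # numeric rubric level, e.g. 1, 2, 3
    label: str        # e.g. "weak", "fair", "strong"
    descriptor: str   # rubric guideline for this level
    exemplar: str     # the single exemplar essay for this level


def build_trait_prompt(trait_name: str, trait_definition: str,
                       levels: list[ScoreLevel], essay: str) -> str:
    """Assemble role, trait, rubric, essay, and output spec, in that order."""
    ordered = sorted(levels, key=lambda lv: lv.score)

    role = "You are an expert evaluator of students' argumentative essays."
    trait = f"Trait: {trait_name}\n{trait_definition}"

    rubric_parts = ["Rubric guidelines (one exemplar per score level):"]
    for lv in ordered:
        rubric_parts.append(
            f"Score {lv.score} ({lv.label}): {lv.descriptor}\n"
            f"Exemplar essay:\n{lv.exemplar}"
        )
    rubric = "\n\n".join(rubric_parts)

    # Assumed delimiter style; any unambiguous delimiter would do.
    student = f"Student essay:\n<essay>\n{essay}\n</essay>"

    valid_scores = ", ".join(str(lv.score) for lv in ordered)
    output_spec = (
        "Respond with a single JSON object with keys "
        '"justification" (string), '
        f'"score" (one of {valid_scores}), '
        '"confidence" (number from 0.0 to 1.0).'
    )

    # The output specification is deliberately the final section.
    return "\n\n".join([role, trait, rubric, student, output_spec])
```

Calling `build_trait_prompt` once per trait, with that trait's definition and exemplars, keeps each request focused on a single rubric dimension.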
CORAL Ordinal Regression. For supervised models, standard cross-entropy classification treats score levels as independent categories, discarding the fact that a score of 1 is closer to 2 than to 3. The CORAL (Consistent Rank Logits) framework fixes this by converting a K-level ordinal task into K-1 binary threshold predictions. For a 3-level scale (weak/fair/strong), the model predicts P(score > 1) and P(score > 2) from shared features, using a shared weight vector and a separate bias (threshold) for each binary task. Training uses binary cross-entropy on each threshold with per-threshold class weights to handle imbalance. The decision cutoffs (c1, c2) applied to these probabilities at inference are grid-searched on a validation set with the monotonicity constraint c2 >= c1. This approach achieved Quadratic Weighted Kappa (QWK) scores of 0.59 (Content), 0.53 (Organization), 0.61 (Word Choice), 0.47 (Sentence Fluency), and 0.48 (Conventions) on the ASAP++ dataset -- consistently outperforming both LLMs and standard classification/regression baselines.
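Written out, the formulation above looks roughly as follows. The notation is an assumption for this sketch: scores y run from 1 to K, z_k is the logit for threshold k, w_k is the positive-class weight for that threshold, and c_k is its decision cutoff. Decoding by counting the exceeded cutoffs is one common choice and matches the `decode_coral` helper in Example 2.

```latex
\begin{aligned}
y_k &= \mathbb{1}[\, y > k \,], \qquad k = 1, \dots, K-1 \\
p_k &= \sigma(z_k) \approx P(y > k \mid x) \\
\mathcal{L} &= -\sum_{k=1}^{K-1} \bigl[\, w_k \, y_k \log p_k + (1 - y_k) \log (1 - p_k) \,\bigr] \\
\hat{y} &= 1 + \sum_{k=1}^{K-1} \mathbb{1}[\, p_k > c_k \,], \qquad c_1 \le c_2 \le \dots \le c_{K-1}
\end{aligned}
```

For K = 3, an essay scored "fair" (y = 2) yields the binary targets (y_1, y_2) = (1, 0): it exceeds the first threshold but not the second.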
Score Remapping for Pedagogical Clarity. The original 6-point essay scores are remapped to 3 levels (1-2 -> weak, 3-4 -> fair, 5-6 -> strong). This coarser scale improves inter-rater agreement, is more pedagogically actionable, and aligns better with how teachers actually use rubrics. When building scoring systems, collapsing fine-grained scales to meaningful tiers often improves both model performance and practical utility.
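The remapping is simple enough to express directly. The sketch below assumes integer scores on the original 1-6 scale; the function name `remap_score` and the 0-based class ids in `TIER_TO_CLASS` are illustrative choices, not part of the paper.

```python
def remap_score(score: int) -> str:
    """Collapse a 1-6 rubric score to a tier: 1-2 weak, 3-4 fair, 5-6 strong."""
    if not 1 <= score <= 6:
        raise ValueError(f"score must be in 1..6, got {score}")
    return ("weak", "fair", "strong")[(score - 1) // 2]


# Assumed 0-based class ids, convenient for training an ordinal model.
TIER_TO_CLASS = {"weak": 0, "fair": 1, "strong": 2}
```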
Define the trait rubric. For each quality dimension, write a concise definition and describe what each score level looks like. Use established frameworks: Content (idea development and clarity), Organization (structure and coherence), Word Choice (vocabulary precision), Sentence Fluency (rhythm and variety), Conventions (grammar, spelling, punctuation).
Select one exemplar per score level. For each trait, choose one representative text per score tier (e.g., weak/fair/strong). These exemplars must not overlap with the texts you will score. Keep the total prompt under the model's context window -- one exemplar per level is optimal.
Construct the structured prompt. Assemble the prompt in this exact order: (a) Role: "You are an expert evaluator of [text type]"; (b) Context: genre, audience, expected length; (c) Trait definition from rubric; (d) Score level descriptions with one exemplar each; (e) The text to evaluate, clearly delimited; (f) Output specification requesting JSON with reasoning, score, and confidence fields. The output specification MUST be the final section.
Score each trait independently. Run the prompt once per trait. Do not ask the model to score all five traits in a single pass -- this reduces accuracy and makes it harder to debug.
Parse and validate outputs. Extract the JSON response. Flag any scores where confidence < 0.6 for human review. Confidence values are not calibrated probabilities but serve as useful triage signals.
Aggregate and report. Present per-trait scores with the model's justification text. Optionally compute a weighted holistic score from the trait scores.
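The parsing, triage, and aggregation steps above might look like the following sketch. All names (`parse_trait_response`, `weighted_holistic`, `VALID_SCORES`) are hypothetical, the 3-level score set is assumed, and the parser accepts either `reasoning` or `justification` as the explanation key since both names appear in this document. The trait weights are left to the caller because the source does not prescribe any.

```python
import json

VALID_SCORES = {1, 2, 3}     # assumed 3-level scale
REVIEW_THRESHOLD = 0.6       # confidence below this is flagged for human review


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_trait_response(raw: str) -> dict:
    """Extract the JSON object from a model reply and decide whether it needs review."""
    failed = {"score": None, "reasoning": "", "confidence": None, "needs_review": True}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return failed
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return failed

    score = data.get("score")
    confidence = data.get("confidence")
    reasoning = data.get("reasoning") or data.get("justification") or ""

    valid_score = isinstance(score, int) and not isinstance(score, bool) and score in VALID_SCORES
    valid_conf = _is_number(confidence) and 0.0 <= confidence <= 1.0
    # Short-circuit keeps the comparison from running on a non-numeric confidence.
    needs_review = (not (valid_score and valid_conf)) or confidence < REVIEW_THRESHOLD

    return {"score": score if valid_score else None, "reasoning": reasoning,
            "confidence": confidence if valid_conf else None, "needs_review": needs_review}


def weighted_holistic(trait_scores: dict[str, int], weights: dict[str, float]) -> float:
    """Optional weighted average of per-trait scores (weights are caller-chosen)."""
    total_weight = sum(weights[t] for t in trait_scores)
    return sum(trait_scores[t] * weights[t] for t in trait_scores) / total_weight
```

Malformed replies are routed to human review rather than silently dropped, which keeps the triage signal conservative.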
Prepare the dataset. Collect texts with human-assigned ordinal scores per trait. If scores are on a fine-grained scale (e.g., 1-6), remap to 3 pedagogically meaningful tiers. Split into train (60%), validation (20%), test (20%).
Encode with a long-sequence transformer. Use BigBird, Longformer, or any encoder supporting your max document length. Extract the [CLS] token representation from the final hidden layer.
Build the CORAL head. Replace the standard classification head with K-1 binary sigmoid outputs (where K is the number of ordinal levels). Each output predicts P(score > threshold_k). Apply dropout before the linear projection.
Train with weighted binary cross-entropy. For each threshold, compute positive-class weight as (num_negative / num_positive) from training data. Minimize the sum of K-1 binary cross-entropy losses. Train for up to 6 epochs with early stopping (patience=2) on validation QWK.
Optimize decision thresholds. After training, grid-search cutoff pairs (c1, c2, ..., c_{K-1}) on the validation set to maximize QWK, enforcing the monotonicity constraint c_{k+1} >= c_k.
Evaluate with QWK. Report Quadratic Weighted Kappa per trait on the test set. Run with 5+ random seeds and report mean +/- std.
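The threshold optimization and evaluation steps above can be sketched in plain Python. This is an illustrative implementation under stated assumptions: labels are 0-based integers in [0, K-1], `val_probs` holds the K-1 sigmoid outputs per example, and the function names and the 0.05 grid step are choices made for this example. In practice a library metric such as scikit-learn's `cohen_kappa_score` with `weights="quadratic"` could replace the hand-written QWK.

```python
import itertools


def quadratic_weighted_kappa(y_true: list[int], y_pred: list[int], num_classes: int) -> float:
    """QWK for 0-based integer labels; returns 0.0 when kappa is undefined."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2   # quadratic disagreement penalty
            num += weight * observed[i][j]
            den += weight * hist_true[i] * hist_pred[j] / n  # expected under independence
    return 1.0 - num / den if den > 0 else 0.0


def decode(probs: list[list[float]], cutoffs: tuple[float, ...]) -> list[int]:
    """Predicted class = number of thresholds whose probability exceeds its cutoff."""
    return [sum(p > c for p, c in zip(row, cutoffs)) for row in probs]


def search_cutoffs(val_probs: list[list[float]], val_labels: list[int],
                   num_classes: int, step: float = 0.05) -> tuple[list[float], float]:
    """Grid-search non-decreasing cutoffs that maximize validation QWK."""
    grid = [round(i * step, 10) for i in range(1, int(round(1 / step)))]
    best_cutoffs, best_qwk = None, -2.0
    # Over a sorted grid, combinations_with_replacement yields only
    # non-decreasing tuples, which enforces c_{k+1} >= c_k.
    for cutoffs in itertools.combinations_with_replacement(grid, num_classes - 1):
        qwk = quadratic_weighted_kappa(val_labels, decode(val_probs, cutoffs), num_classes)
        if qwk > best_qwk:
            best_cutoffs, best_qwk = cutoffs, qwk
    return list(best_cutoffs), best_qwk
```

The selected cutoffs are then frozen and applied to the test set, so the reported per-trait QWK never reflects tuning on test data.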
Example 1: Building an LLM-based essay scorer
User: "I want to score student argumentative essays on Content and Organization using an LLM."
Approach:
Output -- Content scoring prompt:
{
  "role": "You are an expert evaluator of students' argumentative essays written by 13-year-old students.",
  "trait": "Ideas and Content: How fully the essay develops its central argument, how clearly ideas are expressed, and how well evidence supports the thesis.",
  "rubric": {
    "weak (1)": "Ideas are unclear or undeveloped. Little or no evidence supports the argument. Example: [exemplar essay text]",
    "fair (2)": "Ideas are present but inconsistently developed. Some evidence is provided but may be irrelevant. Example: [exemplar essay text]",
    "strong (3)": "Ideas are clearly articulated and well-developed. Evidence is relevant and effectively supports the argument. Example: [exemplar essay text]"
  },
  "essay": "[student essay text here]",
  "output_format": "Respond in JSON with keys: reasoning (2-3 sentences justifying your score), score (1, 2, or 3), confidence (0.0 to 1.0)"
}
Expected model output:
{
  "reasoning": "The essay presents a clear thesis about school uniforms but develops only two supporting points, one of which relies on personal anecdote rather than evidence. The counterargument is acknowledged but not fully addressed.",
  "score": 2,
  "confidence": 0.75
}
Example 2: Implementing CORAL ordinal regression in PyTorch
User: "I have essay scores on a 1-3 scale. Help me implement CORAL loss for a BigBird encoder."
Approach:
import torch
import torch.nn as nn
import torch.nn.functional as F


class CORALHead(nn.Module):
    """CORAL ordinal regression head for K ordinal classes."""

    def __init__(self, hidden_size: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.num_thresholds = num_classes - 1  # K-1 binary tasks
        self.dropout = nn.Dropout(dropout)
        # Shared feature projection, separate bias per threshold
        self.fc = nn.Linear(hidden_size, 1, bias=False)
        self.thresholds = nn.Parameter(torch.zeros(self.num_thresholds))

    def forward(self, cls_hidden_state: torch.Tensor) -> torch.Tensor:
        """Returns logits of shape (batch, num_thresholds)."""
        x = self.dropout(cls_hidden_state)
        # (batch, 1) + (num_thresholds,) broadcasts to (batch, num_thresholds)
        logits = self.fc(x) + self.thresholds
        return logits


def coral_loss(logits: torch.Tensor, labels: torch.Tensor,
               threshold_weights: list[float]) -> torch.Tensor:
    """
    Binary cross-entropy CORAL loss, averaged over thresholds.
    logits: (batch, K-1) raw logits per threshold
    labels: (batch,) integer labels in [0, K-1]
    threshold_weights: list of K-1 positive-class weight scalars
    """
    num_thresholds = logits.size(1)
    total_loss = 0.0
    for k in range(num_thresholds):
        # Target for threshold k is 1 when the true label exceeds k
        binary_target = (labels > k).float()
        pos_weight = torch.tensor([threshold_weights[k]],
                                  dtype=logits.dtype, device=logits.device)
        total_loss += F.binary_cross_entropy_with_logits(
            logits[:, k], binary_target, pos_weight=pos_weight
        )
    return total_loss / num_thresholds


def decode_coral(logits: torch.Tensor, cutoffs: list[float]) -> torch.Tensor:
    """Convert CORAL logits to ordinal class using optimized cutoffs."""
    probs = torch.sigmoid(logits)  # (batch, K-1)
    preds = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    # Predicted class = number of thresholds whose probability exceeds its cutoff
    for k, c in enumerate(cutoffs):
        preds += (probs[:, k] > c).long()
    return preds
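The `threshold_weights` argument of `coral_loss` can be derived from the training labels using the (num_negative / num_positive) rule from the training step. The helper below is a hypothetical sketch in plain Python, assuming 0-based integer labels; the name `threshold_pos_weights` is not from the paper.

```python
def threshold_pos_weights(labels: list[int], num_classes: int) -> list[float]:
    """Positive-class weight per threshold: num_negative / num_positive."""
    weights = []
    for k in range(num_classes - 1):
        num_pos = sum(1 for y in labels if y > k)   # examples above threshold k
        num_neg = len(labels) - num_pos
        if num_pos == 0:
            raise ValueError(f"no training examples with label > {k}")
        weights.append(num_neg / num_pos)
    return weights
```

For training labels `[0, 0, 0, 1, 1, 2]` with three classes, this gives `[1.0, 5.0]`: the second threshold has only one positive example, so its positive class is upweighted fivefold.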
Example 3: Adapting trait scoring for code review quality
User: "I want to evaluate code review comments on multiple quality dimensions."
Approach:
Trait: Specificity
- Weak: Comment is vague ("this looks wrong") with no actionable guidance
- Fair: Comment identifies the issue but suggestion is incomplete or generic
- Strong: Comment pinpoints the exact problem, explains why, and suggests a concrete fix
Prompt structure:
Role -> Trait definition -> Rubric with exemplar comments -> Target comment -> JSON output spec
Paper: Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays (Favero et al., 2026). Focus on Section 3 for the CORAL formulation, Section 4 for prompt design, and Table 1 for per-trait QWK results across all methods.