skills/compass-contrastive-learning-automated/SKILL.md
Assess patch correctness using contrastive learning on code representations. Applies semantic-preserving code transformations and multi-view embedding comparison to determine whether a code patch is genuinely correct or merely overfitting to test suites. Trigger phrases: 'is this patch correct', 'check patch correctness', 'assess this fix', 'validate this bug fix', 'detect overfitting patch', 'contrastive patch assessment'
npx skillsauth add ndpvt-web/arxiv-claude-skills compass-contrastive-learning-automated
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to assess whether a code patch (bug fix) is genuinely correct or merely overfitting -- passing existing tests without truly fixing the underlying bug. It applies the ComPass technique: encoding buggy and patched code separately, computing multi-view difference representations (subtraction, concatenation, cosine similarity, etc.), and reasoning about whether the semantic change matches the intended fix. The approach draws from contrastive learning principles where semantically equivalent code variants are pulled together in embedding space while semantically different code is pushed apart.
Contrastive Representation of Patches. ComPass treats patch assessment as a binary classification problem over paired code representations. Rather than feeding a raw diff to a classifier, it encodes the buggy code (C_b) and patched code (C_p) independently through a pre-trained language model, producing embedding vectors E_b and E_p. The critical insight is that correctness signal lives in the relationship between these embeddings, not in either one alone. ComPass computes six views of this relationship: direct concatenation [E_b; E_p], element-wise addition (E_b + E_p), subtraction (E_b - E_p), multiplication (E_b * E_p), cosine similarity, and Euclidean distance. Ablation studies show subtraction is the most informative single operation (7.23% recall drop without it), as it directly captures what changed.
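The six views described above can be sketched in a few lines of Java. This is a minimal illustration, not the ComPass implementation: the class name MultiViewFeatures, its method names, and the order in which the views are concatenated are all assumptions, and the embeddings are plain double arrays standing in for the output of a pre-trained encoder.

```java
import java.util.Arrays;

/** Hypothetical helper: builds six comparison views from two embedding vectors. */
final class MultiViewFeatures {

    /** Direct concatenation [E_b; E_p]. */
    static double[] concat(double[] eb, double[] ep) {
        double[] out = Arrays.copyOf(eb, eb.length + ep.length);
        System.arraycopy(ep, 0, out, eb.length, ep.length);
        return out;
    }

    static double[] add(double[] eb, double[] ep) {
        double[] out = new double[eb.length];
        for (int i = 0; i < eb.length; i++) out[i] = eb[i] + ep[i];
        return out;
    }

    /** E_b - E_p: the view the ablation identifies as most informative. */
    static double[] subtract(double[] eb, double[] ep) {
        double[] out = new double[eb.length];
        for (int i = 0; i < eb.length; i++) out[i] = eb[i] - ep[i];
        return out;
    }

    static double[] multiply(double[] eb, double[] ep) {
        double[] out = new double[eb.length];
        for (int i = 0; i < eb.length; i++) out[i] = eb[i] * ep[i];
        return out;
    }

    /** Cosine similarity; assumes neither vector is all zeros. */
    static double cosine(double[] eb, double[] ep) {
        double dot = 0, nb = 0, np = 0;
        for (int i = 0; i < eb.length; i++) {
            dot += eb[i] * ep[i];
            nb += eb[i] * eb[i];
            np += ep[i] * ep[i];
        }
        return dot / (Math.sqrt(nb) * Math.sqrt(np));
    }

    static double euclidean(double[] eb, double[] ep) {
        double sum = 0;
        for (int i = 0; i < eb.length; i++) sum += (eb[i] - ep[i]) * (eb[i] - ep[i]);
        return Math.sqrt(sum);
    }

    /** Joins all six views into one feature vector of length 5d + 2 (assumed layout). */
    static double[] allViews(double[] eb, double[] ep) {
        if (eb.length != ep.length) throw new IllegalArgumentException("dimension mismatch");
        double[] scalars = {cosine(eb, ep), euclidean(eb, ep)};
        double[] out = concat(eb, ep);
        for (double[] view : new double[][] {add(eb, ep), subtract(eb, ep), multiply(eb, ep), scalars}) {
            out = concat(out, view);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] eb = {1.0, 2.0, 3.0}; // toy embedding of the buggy code
        double[] ep = {1.0, 2.5, 2.0}; // toy embedding of the patched code
        System.out.println(Arrays.toString(subtract(eb, ep)));
        System.out.println(allViews(eb, ep).length);
    }
}
```

The subtraction view is zero wherever the two embeddings agree, which is one way to see why it isolates what the patch changed.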
Semantic-Preserving Data Augmentation. The second key idea is using 18 code transformation rules (variable renaming, for-to-while conversion, if-else reversal, infix expression splitting, statement reordering, switch-to-if conversion, etc.) to generate semantically equivalent code variants. During pre-training, contrastive learning with InfoNCE loss pulls together the representations of a code snippet and its transformed variants while pushing apart the representations of unrelated code. This teaches the model that for(int i=0; i<n; i++) and int i=0; while(i<n) { ... i+=1; } are semantically equivalent, making it robust to surface-level syntactic variation in patches.
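To make the contrastive objective concrete, here is a minimal sketch of the standard InfoNCE loss for a single anchor. It assumes cosine similarity and a temperature parameter, which is the common formulation; the exact form of the paper's Equation 1 may differ. The class name InfoNce and the toy vectors are hypothetical.

```java
/** Hypothetical sketch of the standard InfoNCE loss for one anchor embedding. */
final class InfoNce {

    /** Cosine similarity; assumes neither vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /**
     * loss = -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp(s_neg_j / t)) )
     * where the positive is a transformed variant of the anchor and the
     * negatives are unrelated code snippets.
     */
    static double loss(double[] anchor, double[] positive, double[][] negatives, double temperature) {
        double posLogit = cosine(anchor, positive) / temperature;
        double[] negLogits = new double[negatives.length];
        double max = posLogit;
        for (int i = 0; i < negatives.length; i++) {
            negLogits[i] = cosine(anchor, negatives[i]) / temperature;
            max = Math.max(max, negLogits[i]);
        }
        // Log-sum-exp with the maximum subtracted for numerical stability.
        double sum = Math.exp(posLogit - max);
        for (double logit : negLogits) sum += Math.exp(logit - max);
        return max + Math.log(sum) - posLogit;
    }

    public static void main(String[] args) {
        double[] anchor = {1.0, 0.0};       // toy embedding of the original code
        double[] variant = {0.9, 0.1};      // toy embedding of a transformed variant
        double[][] unrelated = {{0.0, 1.0}, {-1.0, 0.2}};
        System.out.println(loss(anchor, variant, unrelated, 0.1));
    }
}
```

The loss shrinks as the anchor moves closer to its variant and further from the unrelated snippets, which is the pull-together, push-apart behavior the paragraph describes.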
Joint Fine-Tuning. The pre-trained encoder is fine-tuned jointly with a two-layer fully connected classifier on labeled correct/overfitting patches. The multi-view concatenated vector passes through FC layers with softmax to produce P(correct) vs P(overfitting). This achieves 88.35% accuracy on 2,274 real-world Defects4J patches, a 6.33% improvement over the prior state-of-the-art (APPT).
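The classification head can be pictured as a forward pass through two fully connected layers followed by softmax. The sketch below is illustrative only: the ReLU between the layers, the class name PatchClassifierSketch, and the toy weights are assumptions, and in ComPass the weights would be learned jointly with the encoder rather than written by hand.

```java
/** Hypothetical forward pass of a two-layer fully connected classifier with softmax. */
final class PatchClassifierSketch {

    /** One fully connected layer: out[i] = bias[i] + sum_j weights[i][j] * input[j]. */
    static double[] dense(double[][] weights, double[] bias, double[] input) {
        double[] out = new double[bias.length];
        for (int i = 0; i < out.length; i++) {
            double sum = bias[i];
            for (int j = 0; j < input.length; j++) sum += weights[i][j] * input[j];
            out[i] = sum;
        }
        return out;
    }

    /** Assumed hidden activation; the source does not name one. */
    static double[] relu(double[] x) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = Math.max(0.0, x[i]);
        return out;
    }

    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double logit : logits) max = Math.max(max, logit);
        double[] out = new double[logits.length];
        double sum = 0;
        for (int i = 0; i < logits.length; i++) {
            out[i] = Math.exp(logits[i] - max); // subtract max for stability
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    /** Returns {P(correct), P(overfitting)} for a multi-view feature vector. */
    static double[] predict(double[] features, double[][] w1, double[] b1, double[][] w2, double[] b2) {
        return softmax(dense(w2, b2, relu(dense(w1, b1, features))));
    }

    public static void main(String[] args) {
        double[] features = {2.0, 0.0};                  // toy multi-view vector
        double[][] w1 = {{1.0, 0.0}, {0.0, 1.0}};        // toy weights, not learned
        double[][] w2 = {{1.0, 0.0}, {0.0, 1.0}};
        double[] zeros = {0.0, 0.0};
        double[] p = predict(features, w1, zeros, w2, zeros);
        System.out.println("P(correct)=" + p[0] + " P(overfitting)=" + p[1]);
    }
}
```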
Extract the buggy and patched code. Isolate the changed method or code block in both its pre-fix (buggy) and post-fix (patched) versions. Include sufficient surrounding context (the full method body) but exclude unchanged boilerplate. If given a unified diff, reconstruct both complete versions.
Normalize both code versions. Strip comments, normalize whitespace, and standardize formatting so that superficial differences do not dominate the comparison. Retain all semantic content (variable names, control flow, expressions).
Apply semantic-preserving transformations mentally. Consider whether the patch could be expressed equivalently in another syntactic form (e.g., a ternary instead of if-else, a while instead of for). This helps distinguish true semantic changes from syntactic reshuffling. If the patch is only a syntactic transformation with no semantic effect, flag it as suspicious.
Compute the multi-view difference analysis. Systematically compare the buggy and patched versions across these dimensions:
Assess control flow and data flow impact. Trace how the patch modifies execution paths. A correct patch should address the root cause path. An overfitting patch often guards a specific failing input without fixing the general case (e.g., adding if (x == null) return default; instead of fixing why x is null).
Check boundary conditions and edge cases. Determine if the patch handles the bug's edge cases or only the specific test inputs. Overfitting patches frequently hard-code values, add narrow guards, or handle exact test scenarios.
Evaluate semantic coherence. Does the patched code express a coherent programming intent? Correct patches typically modify code in ways that a human developer would recognize as meaningful. Overfitting patches often feel arbitrary or disconnected from the surrounding logic.
Classify and explain. Provide a binary assessment (likely correct / likely overfitting) with a confidence level and specific reasoning tied to the analysis dimensions above.
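The first two steps above (reconstructing both versions from a unified diff, then normalizing them) can be sketched as follows. This is a rough illustration under stated assumptions: the class PatchPreprocessor and its methods are hypothetical, the diff is assumed to be a well-formed single-file unified diff, only the region covered by the hunks is reconstructed, and the regex-based comment stripping does not account for comment markers inside string literals.

```java
/** Hypothetical preprocessing helpers for the extraction and normalization steps. */
final class PatchPreprocessor {

    /** Returns {buggyText, patchedText} for the region covered by the diff hunks. */
    static String[] splitUnifiedDiff(String unifiedDiff) {
        StringBuilder buggy = new StringBuilder();
        StringBuilder patched = new StringBuilder();
        boolean inHunk = false;
        for (String line : unifiedDiff.split("\n")) {
            if (line.startsWith("@@")) {
                inHunk = true;                       // hunk header: body lines follow
            } else if (!inHunk || line.startsWith("\\")) {
                continue;                            // file headers or "\ No newline..." marker
            } else if (line.startsWith("-")) {
                buggy.append(line.substring(1)).append('\n');    // removed: buggy only
            } else if (line.startsWith("+")) {
                patched.append(line.substring(1)).append('\n');  // added: patched only
            } else {
                // Context line: present in both versions; drop the leading space.
                String text = line.isEmpty() ? line : line.substring(1);
                buggy.append(text).append('\n');
                patched.append(text).append('\n');
            }
        }
        return new String[] {buggy.toString(), patched.toString()};
    }

    /** Strips comments and collapses whitespace; identifiers and expressions are kept. */
    static String normalize(String source) {
        String noBlockComments = source.replaceAll("(?s)/\\*.*?\\*/", " ");
        String noLineComments = noBlockComments.replaceAll("//[^\\n]*", " ");
        return noLineComments.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String diff = "--- a/Foo.java\n+++ b/Foo.java\n@@ -1,3 +1,3 @@\n"
                + " int a = 1; // seed\n-int b = a - 1;\n+int b = a + 1;\n return b;\n";
        String[] versions = splitUnifiedDiff(diff);
        System.out.println(normalize(versions[0]));
        System.out.println(normalize(versions[1]));
    }
}
```

Normalizing after splitting keeps the two versions directly comparable, so the remaining differences are the ones the later steps need to reason about.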
Example 1: Null-guard patch for a NullPointerException
User: "Is this patch correct? It fixes a NullPointerException in Apache Commons Math."
// Buggy version
public double evaluate(double[] values, int begin, int length) {
    double sum = 0;
    for (int i = begin; i < begin + length; i++) {
        sum += values[i];
    }
    return sum / length;
}
// Patched version
public double evaluate(double[] values, int begin, int length) {
    if (length == 0) {
        return Double.NaN;
    }
    double sum = 0;
    for (int i = begin; i < begin + length; i++) {
        sum += values[i];
    }
    return sum / length;
}
Approach:
The patch adds an early return for length == 0.
The buggy version divides sum by length, which is a division by zero when length == 0. This guard directly prevents that.
Returning NaN for an empty input is standard mathematical convention (mean of nothing is undefined).
Assessment: Likely correct (high confidence). The patch addresses the root cause (division by zero) with a semantically appropriate return value. The guard is general, not test-specific.
Example 2: Overfitting constant insertion
User: "A repair tool generated this patch for a failing test in JFreeChart. Is it correct?"
// Buggy version
public Range getDataRange(ValueAxis axis) {
    Range result = null;
    Iterator iterator = this.datasets.keySet().iterator();
    while (iterator.hasNext()) {
        Integer key = (Integer) iterator.next();
        XYDataset dataset = getDataset(key.intValue());
        // ... combines ranges from datasets
    }
    return result;
}
// Patched version
public Range getDataRange(ValueAxis axis) {
    Range result = null;
    if (axis == null) {
        return result;
    }
    Iterator iterator = this.datasets.keySet().iterator();
    while (iterator.hasNext()) {
        Integer key = (Integer) iterator.next();
        XYDataset dataset = getDataset(key.intValue());
        // ... combines ranges from datasets
    }
    return result;
}
Approach:
The patch adds a null check on axis that returns null early.
The method's signature suggests it should use axis to filter the datasets whose ranges are combined.
axis is never used in the shown code, which itself is suspicious.
The guard handles getDataRange(null) but does not fix incorrect range computation for non-null axes.
Assessment: Likely overfitting (high confidence). The patch adds a null guard unrelated to the core range computation logic. It would pass a null-axis test case without correcting the actual dataset-axis mapping behavior.
Example 3: Operator change in conditional
User: "Check if this one-line fix for Apache Commons Lang is correct."
// Buggy: uses <= causing off-by-one
if (index <= size) {
    throw new IndexOutOfBoundsException("...");
}
// Patched: uses < to fix boundary
if (index < size) {
    throw new IndexOutOfBoundsException("...");
}
Approach:
The patch changes the comparison operator from <= to <.
If size is the valid upper bound (exclusive), then index == size should throw, and the buggy code with <= does throw in that case. However, the condition guards when to throw: index <= size also throws for every valid index below size, and changing it to < still throws for those indices, which is wrong. Whether either version is right depends on what size represents.
Assessment: Uncertain -- requires broader method context. One-line operator changes are the most common overfitting pattern in APR. Request the full method to determine whether the boundary condition aligns with the data structure's contract.
Paper: ComPass: Contrastive Learning for Automated Patch Correctness Assessment in Program Repair (Zhang et al., 2026). Look for: Table 1 (18 transformation rules), Equation 1 (InfoNCE contrastive loss), Figure 1 (multi-view representation integration architecture), and Table 6 (ablation study showing subtraction as most critical operation).