skills/biasscope-automated-detection-bias/SKILL.md
Automatically discover and test for hidden biases in LLM-as-a-Judge evaluation pipelines using the BiasScope framework. Generates bias hypotheses, perturbs test cases, and validates whether judge models are susceptible. Use when: 'audit my LLM judge for bias', 'find biases in my evaluation pipeline', 'stress test my LLM evaluator', 'check if my model judge is robust', 'discover unknown biases in my scoring system', 'build adversarial eval sets for my judge'.
npx skillsauth add ndpvt-web/arxiv-claude-skills biasscope-automated-detection-bias
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to systematically discover, validate, and report hidden biases in LLM-based evaluation systems using the BiasScope framework (ICLR 2026). Rather than relying on a fixed checklist of known biases, this approach uses an iterative hypothesis-generation-and-perturbation pipeline to actively surface unknown cognitive biases that cause judge models to misjudge response quality — then produces adversarial test cases that expose those weaknesses.
BiasScope transforms bias discovery from a passive, manual process into an active, automated exploration. The core insight is a two-phase iterative loop: (1) inject known or candidate biases into rejected responses via targeted perturbation, then observe which perturbations cause the judge to flip its verdict; and (2) use an "error cascading" strategy called DeeperExplain to prompt the judge to articulate deeper reasoning for its own mistakes, from which a teacher model extracts novel bias hypotheses not present in any predefined list.
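The two-phase loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the callables `perturb`, `judge_prefers_rejected`, `deeper_explain`, and `propose_biases` are hypothetical stand-ins for calls to the teacher and judge models, and the `(prompt, chosen, rejected)` tuple layout is assumed.

```python
from typing import Callable, Iterable

# One seed example: (prompt, chosen response, rejected response). Assumed layout.
Triple = tuple[str, str, str]


def discover_biases(
    triples: Iterable[Triple],
    library: list[str],
    perturb: Callable[[str, str], str],
    judge_prefers_rejected: Callable[[str, str, str], bool],
    deeper_explain: Callable[[str, str, str], str],
    propose_biases: Callable[[list[str], list[str]], list[str]],
    rounds: int = 3,
) -> list[str]:
    """Hypothetical sketch of the iterative loop; every callable is a stand-in.

    perturb(rejected, bias)                -> rejected response rewritten to show the bias
    judge_prefers_rejected(p, chosen, cand) -> True if the judge picks the wrong response
    deeper_explain(p, chosen, cand)        -> the judge's in-depth rationale for its mistake
    propose_biases(explanations, library)  -> candidate bias names from the teacher model
    """
    triples = list(triples)
    library = list(library)  # copy, so the caller's list is left untouched
    for _ in range(rounds):
        explanations: list[str] = []
        # Phase 1: inject each bias and keep the cases where the verdict flips.
        for prompt, chosen, rejected in triples:
            if judge_prefers_rejected(prompt, chosen, rejected):
                continue  # already misjudged before perturbation, so not a flip
            for bias in library:
                perturbed = perturb(rejected, bias)
                if judge_prefers_rejected(prompt, chosen, perturbed):
                    explanations.append(deeper_explain(prompt, chosen, perturbed))
        # Phase 2: mine the explanations for biases not yet in the library.
        proposed = dict.fromkeys(propose_biases(explanations, library))  # de-duplicate
        new = [b for b in proposed if b not in library]
        if not new:
            break  # nothing new surfaced, so stop early
        library.extend(new)
    return library
```

The sketch adds every proposed bias directly to the library to keep the control flow visible; a fuller version would validate each candidate on held-out data before accepting it.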
The perturbation step is carefully controlled — a teacher model rewrites the rejected response to exhibit a specific bias (e.g., adding unnecessary complexity, inserting novel-sounding but irrelevant information) while preserving the original factual outcome. Validation confirms that only ~2% of perturbations accidentally change the correct answer. Each candidate bias is validated by comparing error rates on original vs. perturbed test sets: if perturbation significantly increases the judge's error rate, the bias is confirmed real.
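The validation rule can be illustrated with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, judge verdicts are represented as booleans (`True` means the judge picked the correct response), and a plain percentage-point threshold stands in for whatever significance criterion the paper applies.

```python
def error_rate(verdicts_correct: list[bool]) -> float:
    """Fraction of cases the judge got wrong."""
    if not verdicts_correct:
        raise ValueError("need at least one verdict")
    return sum(not ok for ok in verdicts_correct) / len(verdicts_correct)


def bias_confirmed(
    original: list[bool],
    perturbed: list[bool],
    threshold_pp: float = 5.0,  # assumed default, in percentage points
) -> tuple[bool, float]:
    """Compare error rates on original vs. perturbed data.

    Returns (confirmed, delta in percentage points).
    """
    delta_pp = 100.0 * (error_rate(perturbed) - error_rate(original))
    return delta_pp > threshold_pp, delta_pp
```

For example, a judge that is wrong on 12 of 100 original pairs and 47 of 100 perturbed pairs yields a delta of about 35 percentage points, well above the default threshold.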
This approach is model-agnostic and scale-aware. Experiments show that smaller models (1.5B–7B) surface 30–48 distinct biases, while stronger models (14B–72B) exhibit fewer (~14–19). Crucially, biases discovered on cheap open-source models transfer to closed-source judges like GPT-4o, making the pipeline cost-effective. Discovered biases are predominantly cognitive (complexity bias, novelty bias, completeness bias, confirmation bias) rather than social, which aligns with how judge models typically fail.
1. Define the evaluation setup. Identify the judge model under test, the evaluation format (pairwise comparison, pointwise scoring, or reference-based grading), and collect or define a seed dataset of (prompt, chosen response, rejected response) triples with known ground-truth labels.
2. Initialize the bias library. Start with 5–7 known biases from prior literature: position bias (preferring the first/last response), length bias (preferring longer answers), verbosity bias, authority bias (preferring responses citing sources), format bias (preferring markdown/structured output), and sycophancy bias (preferring agreeable tone).
3. Generate perturbations for each seed bias. For each bias in the library, rewrite each rejected response to exhibit that bias while preserving its factual content. Use explicit prompt instructions like: "Rewrite this response to appear more [complex/novel/authoritative] without changing the core answer. Keep the response within 10% of the original length to avoid introducing length bias."
4. Evaluate the judge on perturbed data. Run the judge model on both the original (prompt, chosen, rejected) triples and the perturbed (prompt, chosen, perturbed-rejected) triples. Record the verdict, the judge's explanation, and whether the verdict flipped (the judge now prefers the perturbed-rejected response).
5. Extract misjudgment cases and apply DeeperExplain. Collect all cases where the judge flipped its verdict. Prompt the judge to explain its reasoning in more depth: "You chose Response B over Response A. Explain in detail what specific qualities of Response B led to your preference." This surfaces the implicit criteria the judge relied on.
6. Identify novel bias hypotheses. Feed the DeeperExplain outputs to a teacher model with the prompt: "Analyze these judge explanations for misjudged cases. Identify recurring patterns of flawed reasoning that are NOT already in this known bias list: [current library]. Name each new bias, describe it in one sentence, and provide an example of how it manifests." Treat each new bias as a candidate; it joins the library only after passing validation in step 7.
7. Validate candidate biases on a held-out test set. For each newly hypothesized bias, generate perturbations on a separate test split. Compute the error rate on original vs. perturbed data. Accept the bias as valid only if `error_rate(perturbed) - error_rate(original) > threshold` (typically 5+ percentage points).
8. Iterate the discovery loop. Repeat steps 3–7 for 2–3 rounds using the expanded bias library. Each round typically discovers 3–10 additional biases, with diminishing returns after round 3.
9. Produce the bias audit report. For each validated bias, report: bias name, description, error rate delta, number of affected samples, severity tier (high/medium/low), and 2–3 concrete example pairs showing the judge's failure.
10. Optionally construct adversarial training data. Apply discovered biases to a preference dataset (e.g., UltraFeedback) to create bias-perturbed pairs for DPO training. This produces a hardened judge model with lower bias susceptibility.
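The prompts quoted in the steps above can be kept as small template helpers, which makes each round reproducible. The sketch below is one possible arrangement: the function names and the character-based length check are assumptions, and the description given for verbosity bias is a guess because the list above does not define it.

```python
# Seed library from the workflow; descriptions follow the list above.
SEED_BIASES: dict[str, str] = {
    "position": "preferring the first/last response",
    "length": "preferring longer answers",
    "verbosity": "preferring wordier phrasing",  # assumed wording, not given in the text
    "authority": "preferring responses citing sources",
    "format": "preferring markdown/structured output",
    "sycophancy": "preferring agreeable tone",
}


def perturbation_prompt(trait: str, rejected: str) -> str:
    """Teacher-model prompt that injects one bias into a rejected response."""
    return (
        f"Rewrite this response to appear more {trait} without changing the core answer. "
        "Keep the response within 10% of the original length to avoid introducing length bias."
        f"\n\nResponse:\n{rejected}"
    )


def deeper_explain_prompt(preferred: str = "B", other: str = "A") -> str:
    """Follow-up prompt sent to the judge after a flipped verdict."""
    return (
        f"You chose Response {preferred} over Response {other}. Explain in detail what "
        f"specific qualities of Response {preferred} led to your preference."
    )


def hypothesis_prompt(explanations: list[str], library: list[str]) -> str:
    """Teacher-model prompt that asks for biases missing from the library."""
    bullet_list = "\n".join(f"- {text}" for text in explanations)
    return (
        "Analyze these judge explanations for misjudged cases. Identify recurring patterns "
        "of flawed reasoning that are NOT already in this known bias list: "
        f"{', '.join(library)}. Name each new bias, describe it in one sentence, and "
        f"provide an example of how it manifests.\n\n{bullet_list}"
    )


def within_length_budget(original: str, perturbed: str, tolerance: float = 0.10) -> bool:
    """Check the 10% length constraint, measured in characters (an assumption)."""
    return abs(len(perturbed) - len(original)) <= tolerance * len(original)
```

Rejecting any rewrite that fails `within_length_budget` before it reaches the judge keeps length bias from confounding the bias actually under test.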
Example 1: Auditing a GPT-4o pairwise judge for coding tasks
User: "I'm using GPT-4o as a judge to compare code solutions. Some of its ratings seem off — can you help find systematic biases?"
Approach:
Rewrite this Python solution to use more sophisticated constructs
(comprehensions, type hints, f-strings, walrus operator) but do NOT
fix the logical error. Keep the same line count (within 10%).
Original rejected solution:
def find_max(lst):
    result = lst[0]
    for i in lst:
        if i > result:
            result = i
    return result  # Bug: fails on empty list
Perturbed version:
def find_max(lst: list[int]) -> int:
    result = lst[0]  # Bug: still fails on empty list, but looks sophisticated
    [(result := x) for x in lst if x > result]
    return result
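A perturbation is only useful if it leaves behavior unchanged, so it is worth checking a rewrite mechanically before showing it to the judge. The sketch below compares a plain loop against a behavior-preserving walrus rewrite; the names `find_max_plain`, `find_max_fancy`, `outcome`, and `same_behavior` are hypothetical, and the test inputs are illustrative.

```python
def find_max_plain(lst):
    result = lst[0]  # raises IndexError on an empty list
    for i in lst:
        if i > result:
            result = i
    return result


def find_max_fancy(lst: list[int]) -> int:
    result = lst[0]  # same IndexError on an empty list
    # The walrus target binds in the enclosing function scope, so each
    # assignment updates the value the filter condition reads next.
    [(result := x) for x in lst if x > result]
    return result


def outcome(fn, arg):
    """Return ('ok', value) or ('error', exception type name)."""
    try:
        return ("ok", fn(list(arg)))
    except Exception as exc:
        return ("error", type(exc).__name__)


def same_behavior(f, g, cases) -> bool:
    """True if both functions agree on results and on failure modes."""
    return all(outcome(f, case) == outcome(g, case) for case in cases)


CASES = [[], [1], [3, 1, 2], [-5, -2, -9], [2, 2, 2]]
assert same_behavior(find_max_plain, find_max_fancy, CASES)
```

Comparing exception types as well as return values matters here, because the whole point of the perturbation is that the empty-list bug survives the rewrite.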
Output:
## Bias Audit Report: GPT-4o Code Judge
| Bias                | Error Rate (orig) | Error Rate (perturbed) | Delta | Severity |
|---------------------|-------------------|------------------------|-------|----------|
| Complexity bias     | 12%               | 47%                    | +35pp | HIGH     |
| Docstring bias      | 12%               | 31%                    | +19pp | MEDIUM   |
| Pythonic idiom bias | 12%               | 26%                    | +14pp | MEDIUM   |
| Length bias         | 12%               | 18%                    | +6pp  | LOW      |
### High-Severity: Complexity Bias
The judge consistently prefers solutions using advanced Python features
(type hints, walrus operator, comprehensions) even when those solutions
contain the same logical error as the simpler rejected version.
**Example case (ID #23):**
- Prompt: "Write a function to find the maximum element in a list"
- Correct solution: Simple loop, handles empty list with ValueError
- Perturbed rejected: Uses walrus operator + comprehension, still crashes on empty list
- Judge verdict: Preferred the perturbed (incorrect) solution
- Judge reasoning: "Response B demonstrates more Pythonic style and better type safety..."
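A report table like the one in this example can be generated from raw error rates. In the sketch below the severity cut-offs (20 and 10 percentage points) are assumptions chosen only because they reproduce the tiers shown above; the source does not state the actual thresholds, and the function names are hypothetical.

```python
def severity(delta_pp: float, high: float = 20.0, medium: float = 10.0) -> str:
    """Map an error-rate delta (percentage points) to a tier. Cut-offs are assumed."""
    if delta_pp >= high:
        return "HIGH"
    if delta_pp >= medium:
        return "MEDIUM"
    return "LOW"


def report_rows(results: dict[str, tuple[float, float]]) -> list[str]:
    """Render markdown table rows from {bias: (orig error %, perturbed error %)}."""
    rows = [
        "| Bias | Error Rate (orig) | Error Rate (perturbed) | Delta | Severity |",
        "|---|---|---|---|---|",
    ]
    # Largest delta first, so the most damaging bias leads the report.
    ranked = sorted(results.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True)
    for name, (orig, pert) in ranked:
        delta = pert - orig
        rows.append(
            f"| {name} | {orig:.0f}% | {pert:.0f}% | {delta:+.0f}pp | {severity(delta)} |"
        )
    return rows
```

Calling `"\n".join(report_rows(results))` yields a table that can be pasted directly into the audit report.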
Example 2: Building an adversarial robustness benchmark for a reward model
User: "I need to create a harder version of my eval benchmark to stress-test our reward model before deployment."
Approach:
Output:
## Benchmark-Pro Construction Summary
Original benchmark: 500 pairs
Perturbed candidates: 3,200 variants (avg 6.4 per original)
Adversarial filter pass: 890 variants
Position-swap filter: 612 variants
Expert verification: 95.3% agreement (Fleiss' kappa = 0.91)
Final benchmark size: 582 high-quality adversarial pairs
Reward model error rate:
- Original benchmark: 18.2%
- Benchmark-Pro: 61.4% (+43.2pp)
Top exploited biases in final set:
1. Completeness bias (28% of adversarial samples)
2. Novelty bias (22%)
3. Exact-match bias (17%)
4. Authority/citation bias (14%)
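The summary lists a position-swap filter without defining it. One plausible reading, sketched below as an assumption rather than the documented method, is to keep an adversarial variant only if the judge prefers the perturbed response in both presentation orders, so that position bias alone cannot explain the error. The `Judge` interface and function names are hypothetical.

```python
from typing import Callable

# judge(prompt, first, second) returns "first" or "second". Hypothetical interface.
Judge = Callable[[str, str, str], str]

# One adversarial variant: (prompt, chosen response, perturbed rejected response).
Variant = tuple[str, str, str]


def fooled_in_both_orders(judge: Judge, prompt: str, chosen: str, perturbed: str) -> bool:
    """True only if the judge picks the perturbed response regardless of position."""
    picks_perturbed_when_second = judge(prompt, chosen, perturbed) == "second"
    picks_perturbed_when_first = judge(prompt, perturbed, chosen) == "first"
    return picks_perturbed_when_second and picks_perturbed_when_first


def position_swap_filter(judge: Judge, variants: list[Variant]) -> list[Variant]:
    """Keep variants whose misjudgment survives swapping the response order."""
    return [v for v in variants if fooled_in_both_orders(judge, *v)]
```

A judge that always answers "first" passes nothing through this filter, which is the intended effect: its errors reflect position, not the injected bias.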
Example 3: Investigating judge-human disagreement in summarization evaluation
User: "Our LLM judge and human annotators disagree on 30% of summarization quality ratings. What's going on?"
Approach:
Output:
## Disagreement Root Cause Analysis
Disagreement cases analyzed: 150 / 500 total (30%)
Primary bias drivers:
1. Completeness bias (accounts for ~40% of disagreements)
- Judge prefers summaries covering more source details
- Humans prefer concise summaries capturing key points only
2. Fluency-over-accuracy bias (accounts for ~25%)
- Judge rates fluent but slightly inaccurate summaries higher
- Humans penalize factual errors regardless of fluency
3. Structure bias (accounts for ~20%)
- Judge prefers bullet-pointed or sectioned summaries
- Humans rate narrative summaries equally when content is equivalent
Recommendation: Add explicit instruction to judge prompt:
"Prioritize factual accuracy over coverage. A concise summary
capturing 3 key facts correctly is better than a comprehensive
summary with 1 factual error. Do not prefer structured formatting
over narrative prose when content quality is equivalent."
BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation (Lai et al., ICLR 2026) https://arxiv.org/abs/2602.09383v1
Key sections to consult: Section 3 for the iterative bias discovery algorithm and DeeperExplain strategy; Section 4 for JudgeBench-Pro construction methodology; Table 2 for error rates across model families; Appendix for perturbation prompt templates.