BiasScope: Automated Bias Detection in LLM-as-a-Judge Evaluation

This skill enables Claude to systematically discover, validate, and report hidden biases in LLM-based evaluation systems using the BiasScope framework (ICLR 2026). Rather than relying on a fixed checklist of known biases, this approach uses an iterative hypothesis-generation-and-perturbation pipeline to actively surface unknown cognitive biases that cause judge models to misjudge response quality — then produces adversarial test cases that expose those weaknesses.

When to Use

When the user wants to audit an LLM judge (e.g., GPT-4o, Claude, Llama) for evaluation biases before deploying it in a scoring pipeline.
When building or improving a reward model and needing adversarial preference data to harden it against biases.
When constructing a robustness benchmark for LLM evaluators (similar to JudgeBench-Pro).
When investigating why an LLM judge disagrees with human annotators on specific evaluation tasks.
When the user has a pairwise comparison pipeline (response A vs. response B) and wants to find systematic failure modes.
When designing DPO or RLHF training data that accounts for judge biases to improve alignment.

Key Technique

BiasScope transforms bias discovery from a passive, manual process into an active, automated exploration. The core insight is a two-phase iterative loop: (1) inject known or candidate biases into rejected responses via targeted perturbation, then observe which perturbations cause the judge to flip its verdict; and (2) use an "error cascading" strategy called DeeperExplain to prompt the judge to articulate deeper reasoning for its own mistakes, from which a teacher model extracts novel bias hypotheses not present in any predefined list.

The perturbation step is carefully controlled — a teacher model rewrites the rejected response to exhibit a specific bias (e.g., adding unnecessary complexity, inserting novel-sounding but irrelevant information) while preserving the original factual outcome. Validation confirms that only ~2% of perturbations accidentally change the correct answer. Each candidate bias is validated by comparing error rates on original vs. perturbed test sets: if perturbation significantly increases the judge's error rate, the bias is confirmed real.
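The validation rule in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `error_rate` and `validate_bias` are hypothetical, verdicts are represented as booleans (True when the judge agreed with ground truth), and the 5-percentage-point default is the threshold the workflow section mentions rather than a statistical significance test.

```python
def error_rate(judge_was_correct: list[bool]) -> float:
    """Fraction of pairs on which the judge preferred the rejected response."""
    if not judge_was_correct:
        raise ValueError("need at least one verdict")
    return judge_was_correct.count(False) / len(judge_was_correct)


def validate_bias(
    original: list[bool],
    perturbed: list[bool],
    threshold_pp: float = 5.0,  # assumed default, in percentage points
) -> tuple[bool, float]:
    """Return (confirmed, error-rate delta in percentage points)."""
    delta_pp = 100.0 * (error_rate(perturbed) - error_rate(original))
    return delta_pp > threshold_pp, delta_pp


# Toy verdicts: 12 errors out of 100 before perturbation, 47 out of 100 after.
original_verdicts = [True] * 88 + [False] * 12
perturbed_verdicts = [True] * 53 + [False] * 47
confirmed, delta_pp = validate_bias(original_verdicts, perturbed_verdicts)
print(f"confirmed={confirmed}, delta={delta_pp:.1f}pp")  # confirmed=True, delta=35.0pp
```

With small test sets a fixed threshold can be noisy, so a significance test on the two error rates may be a reasonable substitute.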

This approach is model-agnostic and scale-aware. Experiments show that smaller models (1.5B–7B) surface 30–48 distinct biases, while stronger models (14B–72B) exhibit fewer (~14–19). Crucially, biases discovered on cheap open-source models transfer to closed-source judges like GPT-4o, making the pipeline cost-effective. Discovered biases are predominantly cognitive (complexity bias, novelty bias, completeness bias, confirmation bias) rather than social, which aligns with how judge models typically fail.

Step-by-Step Workflow

1. Define the evaluation setup. Identify the judge model under test, the evaluation format (pairwise comparison, pointwise scoring, or reference-based grading), and collect or define a seed dataset of (prompt, chosen response, rejected response) triples with known ground-truth labels.
2. Initialize the bias library. Start with 5–7 known biases from prior literature: position bias (preferring the first/last response), length bias (preferring longer answers), verbosity bias, authority bias (preferring responses citing sources), format bias (preferring markdown/structured output), and sycophancy bias (preferring agreeable tone).
3. Generate perturbations for each seed bias. For each bias in the library, rewrite each rejected response to exhibit that bias while preserving its factual content. Use explicit prompt instructions like: "Rewrite this response to appear more [complex/novel/authoritative] without changing the core answer. Keep the response within 10% of the original length to avoid introducing length bias."
4. Evaluate the judge on perturbed data. Run the judge model on both the original (prompt, chosen, rejected) pairs and the perturbed (prompt, chosen, perturbed-rejected) pairs. Record the verdict, the judge's explanation, and whether the verdict flipped (the judge was correct on the original pair but now prefers the perturbed-rejected response).
5. Extract misjudgment cases and apply DeeperExplain. Collect all cases where the judge flipped its verdict. Prompt the judge to explain its reasoning in more depth: "You chose Response B over Response A. Explain in detail what specific qualities of Response B led to your preference." This surfaces the implicit criteria the judge relied on.
6. Identify novel bias hypotheses. Feed the DeeperExplain outputs to a teacher model with the prompt: "Analyze these judge explanations for misjudged cases. Identify recurring patterns of flawed reasoning that are NOT already in this known bias list: [current library]. Name each new bias, describe it in one sentence, and provide an example of how it manifests." Treat each proposed bias as a candidate; add it to the library only after it passes the validation in step 7.
7. Validate candidate biases on a held-out test set. For each newly hypothesized bias, generate perturbations on a separate test split. Compute error rate on original vs. perturbed data. Accept the bias as valid only if error_rate(perturbed) - error_rate(original) > threshold (typically 5+ percentage points).
8. Iterate the discovery loop. Repeat steps 3–7 for 2–3 rounds using the expanded bias library. Each round typically discovers 3–10 additional biases, with diminishing returns after round 3.
9. Produce the bias audit report. For each validated bias, report: bias name, description, error rate delta, number of affected samples, severity tier (high/medium/low), and 2–3 concrete example pairs showing the judge's failure.
10. Optionally construct adversarial training data. Apply discovered biases to a preference dataset (e.g., UltraFeedback) to create bias-perturbed pairs for DPO training. The goal is a hardened judge model with lower bias susceptibility.
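The control flow of the discovery loop described in this workflow might look roughly like the sketch below. Everything model-facing is passed in as a callable, and all names (`discover_biases`, `perturb`, `judge_prefers_chosen`, `deeper_explain`, `propose_biases`, `validate`) are hypothetical placeholders rather than an API from the paper. The toy stand-ins at the bottom exist only so the sketch runs without any model calls.

```python
from typing import Callable

Triple = tuple[str, str, str]  # (prompt, chosen, rejected)


def discover_biases(
    seed_biases: list[str],
    triples: list[Triple],
    perturb: Callable[[str, str], str],                     # (rejected, bias) -> perturbed rejected
    judge_prefers_chosen: Callable[[str, str, str], bool],  # (prompt, chosen, other) -> bool
    deeper_explain: Callable[[str, str, str], str],         # judge's in-depth explanation
    propose_biases: Callable[[list[str], list[str]], list[str]],  # teacher model
    validate: Callable[[str], bool],                        # held-out validation
    rounds: int = 3,
) -> list[str]:
    library = list(seed_biases)
    # Only pairs the judge gets right before perturbation can "flip".
    usable = [t for t in triples if judge_prefers_chosen(*t)]
    for _ in range(rounds):
        explanations: list[str] = []
        for bias in library:
            for prompt, chosen, rejected in usable:
                perturbed = perturb(rejected, bias)
                if not judge_prefers_chosen(prompt, chosen, perturbed):
                    explanations.append(deeper_explain(prompt, chosen, perturbed))
        candidates = propose_biases(explanations, library)
        # Deduplicate, skip known biases, and keep only validated candidates.
        accepted = [b for b in dict.fromkeys(candidates)
                    if b not in library and validate(b)]
        if not accepted:
            break  # nothing new this round
        library.extend(accepted)
    return library


# Toy stand-ins (assumptions for illustration only).
def toy_perturb(rejected: str, bias: str) -> str:
    return f"[{bias}] {rejected}"

def toy_judge(prompt: str, chosen: str, other: str) -> bool:
    return "[complexity]" not in other and "[docstring]" not in other

def toy_explain(prompt: str, chosen: str, other: str) -> str:
    return "The response is well documented."

def toy_propose(explanations: list[str], library: list[str]) -> list[str]:
    return ["docstring"] if explanations else []

demo_library = discover_biases(
    ["length", "complexity"], [("q", "good", "bad")],
    toy_perturb, toy_judge, toy_explain, toy_propose,
    validate=lambda bias: True,
)
print(demo_library)  # ['length', 'complexity', 'docstring']
```

Passing the model calls in as callables keeps the loop testable offline; in practice each callable would wrap a judge or teacher model request.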

Concrete Examples

Example 1: Auditing a GPT-4o pairwise judge for coding tasks

User: "I'm using GPT-4o as a judge to compare code solutions. Some of its ratings seem off — can you help find systematic biases?"

Approach:

Collect 50+ (prompt, correct_solution, incorrect_solution) triples from coding benchmarks where the correct answer is known.
Initialize bias library: length bias, complexity bias, comment density bias, naming convention bias, import/library bias.
For complexity bias: rewrite each incorrect solution to use more advanced language features (list comprehensions, decorators, type hints) without fixing the actual bug. Example perturbation prompt:

Rewrite this Python solution to use more sophisticated constructs
(comprehensions, type hints, f-strings, walrus operator) but do NOT
fix the logical error. Keep the same line count (within 10%).

Original rejected solution:
def find_max(lst):
    result = lst[0]
    for i in lst:
        if i > result:
            result = i
    return result  # Bug: fails on empty list

Perturbed version:
def find_max(lst: list[int]) -> int:
    result = lst[0]  # Bug: still fails on empty list, but looks sophisticated
    _ = [(result := x) for x in lst if x > result]  # walrus updates the running max
    return result

Run the GPT-4o judge on the original pairs and the perturbed pairs. Record the flip rate.
If complexity bias is confirmed (e.g., a 35% flip rate), apply DeeperExplain to the flipped cases.
The teacher model proposes additional biases from the explanations: "docstring bias" (preferring solutions with docstrings regardless of correctness) and "Pythonic idiom bias" (preferring idiomatic Python over correct-but-verbose code).
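For code tasks, one way to confirm that a perturbation left the original bug intact is to compare the two versions on a handful of probe inputs, treating raised exceptions as part of the behaviour. The sketch below is an assumption about how such a check could be done; the helper names `outcome` and `same_behaviour` and the probe inputs are hypothetical, and the perturbed function is one walrus-based rewrite that preserves the empty-list failure.

```python
from typing import Any, Callable


def find_max_original(lst):
    result = lst[0]  # raises IndexError on an empty list
    for i in lst:
        if i > result:
            result = i
    return result


def find_max_perturbed(lst: list[int]) -> int:
    result = lst[0]  # same IndexError on an empty list
    _ = [(result := x) for x in lst if x > result]  # walrus updates the running max
    return result


def outcome(fn: Callable[[Any], Any], arg: Any) -> tuple[str, Any]:
    """Return ("ok", value) or ("raised", exception type name)."""
    try:
        return ("ok", fn(arg))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def same_behaviour(f: Callable, g: Callable, inputs: list) -> bool:
    """True when both functions return or raise identically on every input."""
    return all(outcome(f, x) == outcome(g, x) for x in inputs)


probe_inputs = [[], [3], [1, 5, 2], [-4, -1, -9], [2, 2, 2]]
behaviour_preserved = same_behaviour(find_max_original, find_max_perturbed, probe_inputs)
print(behaviour_preserved)  # True
```

A perturbation that fails this check has changed more than style and should be discarded or regenerated.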

Output:

## Bias Audit Report: GPT-4o Code Judge

| Bias                | Error Rate (orig) | Error Rate (perturbed) | Delta | Severity |
|---------------------|-------------------|------------------------|-------|----------|
| Complexity bias     | 12%               | 47%                    | +35pp | HIGH     |
| Docstring bias      | 12%               | 31%                    | +19pp | MEDIUM   |
| Pythonic idiom bias | 12%               | 26%                    | +14pp | MEDIUM   |
| Length bias         | 12%               | 18%                    | +6pp  | LOW      |

### High-Severity: Complexity Bias
The judge consistently prefers solutions using advanced Python features
(type hints, walrus operator, comprehensions) even when those solutions
contain the same logical error as the simpler rejected version.

**Example case (ID #23):**
- Prompt: "Write a function to find the maximum element in a list"
- Correct solution: Simple loop, handles empty list with ValueError
- Perturbed rejected: Uses walrus operator + comprehension, still crashes on empty list
- Judge verdict: Preferred the perturbed (incorrect) solution
- Judge reasoning: "Response B demonstrates more Pythonic style and better type safety..."

Example 2: Building an adversarial robustness benchmark for a reward model

User: "I need to create a harder version of my eval benchmark to stress-test our reward model before deployment."

Approach:

Take the existing benchmark of 500 preference pairs with ground-truth labels.
Run BiasScope discovery loop (3 rounds) against the reward model, surfacing 15–25 biases.
For each benchmark item, generate 5–10 bias-perturbed variants of the rejected response (one per top bias).
Filter adversarially: keep only variants where the reward model flips its verdict AND the flip survives position swapping (the model prefers the biased response regardless of which position it appears in).
Have domain experts verify a random 20% sample to confirm ground truth is preserved (target: >90% agreement).
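The position-swap part of the filter in this approach can be sketched as follows. This is an illustrative assumption, not the benchmark's actual code: the judge is modelled as a callable returning "first" or "second", and the names `prefers_variant_in_both_orders` and `adversarial_filter` are hypothetical. The check that the model was correct on the unperturbed pair is assumed to have happened upstream.

```python
from typing import Callable

# judge(prompt, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]
Item = tuple[str, str, str]  # (prompt, chosen, perturbed_rejected)


def prefers_variant_in_both_orders(judge: Judge, item: Item) -> bool:
    """True when the judge picks the perturbed rejected response in A/B and B/A order."""
    prompt, chosen, variant = item
    picks_variant_when_second = judge(prompt, chosen, variant) == "second"
    picks_variant_when_first = judge(prompt, variant, chosen) == "first"
    return picks_variant_when_second and picks_variant_when_first


def adversarial_filter(judge: Judge, items: list[Item]) -> list[Item]:
    """Keep only items that fool the judge regardless of presentation order."""
    return [item for item in items if prefers_variant_in_both_orders(judge, item)]


# Toy judges (assumptions): one is length-biased, one is purely position-biased.
def longer_wins(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

def first_wins(prompt: str, first: str, second: str) -> str:
    return "first"

items = [
    ("q", "short answer", "a much longer but wrong answer"),
    ("q", "a fairly long correct answer", "wrong"),
]
kept_by_length_judge = adversarial_filter(longer_wins, items)
kept_by_position_judge = adversarial_filter(first_wins, items)
print(len(kept_by_length_judge), len(kept_by_position_judge))  # 1 0
```

The position-biased toy judge yields no surviving items, which is the point of the filter: failures caused only by ordering are not attributed to the injected bias.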

Output:

## Benchmark-Pro Construction Summary

Original benchmark:      500 pairs
Perturbed candidates:    3,200 variants (avg 6.4 per original)
Adversarial filter pass: 890 variants
Position-swap filter:    612 variants
Expert verification:     95.3% agreement (Fleiss' kappa = 0.91)
Final benchmark size:    582 high-quality adversarial pairs

Reward model error rate:
  - Original benchmark:  18.2%
  - Benchmark-Pro:       61.4%  (+43.2pp)

Top exploited biases in final set:
  1. Completeness bias (28% of adversarial samples)
  2. Novelty bias (22%)
  3. Exact-match bias (17%)
  4. Authority/citation bias (14%)

Example 3: Investigating judge-human disagreement in summarization evaluation

User: "Our LLM judge and human annotators disagree on 30% of summarization quality ratings. What's going on?"

Approach:

Collect the 30% disagreement cases. Label each with the human verdict and the judge verdict.
Run targeted perturbation on the agreement cases (where the judge matches the human verdict) to find which injected biases push the judge toward the same kind of errors it makes on the disagreement cases.
Apply DeeperExplain to the disagreement cases directly: prompt the judge to explain its preference in detail, then cluster the explanations by theme.
Cross-reference discovered biases with the disagreement distribution.

Output:

## Disagreement Root Cause Analysis

Disagreement cases analyzed: 150 / 500 total (30%)

Primary bias drivers:
1. Completeness bias (accounts for ~40% of disagreements)
   - Judge prefers summaries covering more source details
   - Humans prefer concise summaries capturing key points only

2. Fluency-over-accuracy bias (accounts for ~25%)
   - Judge rates fluent but slightly inaccurate summaries higher
   - Humans penalize factual errors regardless of fluency

3. Structure bias (accounts for ~20%)
   - Judge prefers bullet-pointed or sectioned summaries
   - Humans rate narrative summaries equally when content is equivalent

Recommendation: Add explicit instruction to judge prompt:
"Prioritize factual accuracy over coverage. A concise summary
capturing 3 key facts correctly is better than a comprehensive
summary with 1 factual error. Do not prefer structured formatting
over narrative prose when content quality is equivalent."

Best Practices

Do always validate that perturbations preserve ground truth: check that at least ~98% of perturbed rejected responses are still genuinely worse than the chosen response. Discard any perturbation that accidentally fixes the rejected answer.
Do control for length when injecting biases. Constrain perturbation prompts to produce responses within 10% of the original length, and separately measure length bias so it does not confound other bias measurements.
Do use position swapping (A/B vs. B/A) as a secondary filter. A robust bias finding should cause failures regardless of presentation order.
Do iterate the discovery loop at least 2–3 times. The DeeperExplain cascading strategy surfaces biases that only become visible after earlier biases are already accounted for.
Avoid treating all discovered biases as equally severe. Rank by error rate delta and affected sample count. A bias causing a 5pp increase on 10 samples is less actionable than one causing 20pp on 100 samples.
Avoid using the same data split for discovery and validation. Always hold out a separate test set for bias verification to prevent overfitting bias hypotheses to noise.
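Keeping discovery and validation data separate, as recommended above, can be as simple as a seeded shuffle and split. The sketch below is one possible approach; the function name `split_discovery_validation`, the 30% validation fraction, and the seed are all assumptions chosen for illustration.

```python
import random


def split_discovery_validation(
    triples: list,
    validation_fraction: float = 0.3,  # assumed default, not from the paper
    seed: int = 0,
) -> tuple[list, list]:
    """Shuffle deterministically and return (discovery, validation) with no overlap."""
    if len(triples) < 2:
        raise ValueError("need at least two items to split")
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(triples)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    n_val = min(len(shuffled) - 1, max(1, n_val))  # keep both splits non-empty
    return shuffled[n_val:], shuffled[:n_val]


discovery_split, validation_split = split_discovery_validation(list(range(10)))
print(len(discovery_split), len(validation_split))  # 7 3
```

Fixing the seed makes the split reproducible, so a bias validated in one run is checked against the same held-out items in the next.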

Error Handling

Perturbation corrupts the correct answer. If more than 5% of perturbations accidentally change the ground-truth outcome, the perturbation prompt is too aggressive. Tighten constraints: specify exactly which aspects to change and which to preserve. Add a verification step where a second model confirms the perturbed response is still incorrect.
Judge refuses to explain its reasoning. Some judge configurations return only a label without explanation. DeeperExplain requires explanations. Switch to a prompt format that forces chain-of-thought output (e.g., "First explain your reasoning step by step, then give your final verdict").
No biases found above threshold. If no candidate bias produces a >5pp error rate increase, the judge may be relatively robust for this dataset, or the perturbation quality is insufficient. Try using a stronger teacher model for perturbation generation, or lower the threshold to 3pp for exploratory analysis.
Discovered bias does not transfer across models. A bias found on Qwen-7B may not affect GPT-4o. Always re-validate on the target judge model. Use open-source model discoveries as candidates, not confirmed findings, for closed-source judges.
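For the label-only judge case above, once the prompt asks for reasoning followed by a verdict, the output still has to be split into explanation and label. The sketch below assumes the judge ends with a line of the form "Final verdict: A" or "Final verdict: B"; that output format, the regular expression, and the function name `split_explanation_and_verdict` are assumptions for illustration.

```python
import re
from typing import Optional

# Assumed output convention: the judge's last verdict line reads "Final verdict: A" or "... B".
VERDICT_RE = re.compile(r"final verdict\s*:\s*([AB])\b", re.IGNORECASE)


def split_explanation_and_verdict(judge_output: str) -> tuple[str, Optional[str]]:
    """Return (explanation, verdict); verdict is None when no verdict line is found."""
    matches = list(VERDICT_RE.finditer(judge_output))
    if not matches:
        return judge_output.strip(), None
    last = matches[-1]  # use the final verdict if the judge restates it
    return judge_output[: last.start()].strip(), last.group(1).upper()


sample_output = "Response B uses clearer typing and idioms.\nFinal verdict: B"
explanation, verdict = split_explanation_and_verdict(sample_output)
print(verdict)      # B
print(explanation)  # Response B uses clearer typing and idioms.
```

Outputs with a verdict of None can be routed back for a retry, since DeeperExplain has nothing to work with when the explanation is missing.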

Limitations

BiasScope requires ground-truth labels (known correct preference ordering) to detect judge errors. It cannot discover biases in fully unsupervised evaluation settings where no reference answer exists.
The framework is designed for pairwise comparison and reference-graded evaluation formats. Adapting it to open-ended pointwise scoring (e.g., "rate this essay 1–10") requires defining what constitutes a "flip" in that context.
Perturbation quality depends on the teacher model's ability to inject biases subtly. If the teacher model itself has strong biases, it may produce unnatural perturbations that are easy to detect, reducing the adversarial value.
The approach discovers biases at the population level (systematic tendencies across many samples). It does not explain individual judge failures that stem from one-off reasoning errors rather than systematic bias.
Cognitive biases dominate discoveries; social biases (gender, race) are rarely surfaced because they require identity-specific perturbations that this general-purpose pipeline does not target by default.
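As a starting point for the pointwise-scoring limitation above, one possible definition of a "flip" compares the score gap before and after perturbation. This is only a sketch of one option, not something the paper prescribes: the function name `is_pointwise_flip` and the `margin` parameter are hypothetical.

```python
def is_pointwise_flip(
    chosen_score: float,
    rejected_score: float,
    perturbed_score: float,
    margin: float = 0.0,  # assumed tolerance for ties and scoring noise
) -> bool:
    """True when the judge ranked the pair correctly before perturbation but not after."""
    was_correct = chosen_score - rejected_score > margin
    now_wrong = chosen_score - perturbed_score <= margin
    return was_correct and now_wrong


# Scores on a 1-10 scale (toy values).
flip_examples = [
    is_pointwise_flip(8, 5, 9),  # perturbed rejected now outscores chosen
    is_pointwise_flip(8, 5, 6),  # judge still ranks chosen higher
    is_pointwise_flip(5, 6, 9),  # judge was already wrong, so not a flip
]
print(flip_examples)  # [True, False, False]
```

A non-zero margin treats near-ties as errors, which may suit scales where adjacent scores are not reliably distinguishable.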

Reference

BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation (Lai et al., ICLR 2026) https://arxiv.org/abs/2602.09383v1

Key sections to consult: Section 3 for the iterative bias discovery algorithm and DeeperExplain strategy; Section 4 for JudgeBench-Pro construction methodology; Table 2 for error rates across model families; Appendix for perturbation prompt templates.
