skills/beyond-needles-illusion-decoupled/SKILL.md
Decouple evidence access from evidence use when evaluating or building long-context and RAG systems under semantic interference. Use this skill when the user says: 'evaluate my RAG pipeline against hard negatives', 'stress-test retrieval with semantic distractors', 'build a decoupled retrieval benchmark', 'diagnose why my long-context QA is failing', 'create collision-tested evaluation data', 'measure evidence access vs answer quality separately'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills beyond-needles-illusion-decoupled
This skill teaches Claude to apply the EverMemBench-S methodology: evaluating and hardening long-context LLM and RAG systems by decoupling evidence access from evidence use and introducing collision-tested semantic interference. Standard Needle-in-a-Haystack (NIAH) benchmarks use near-unique needles in irrelevant haystacks, producing inflated scores that collapse in production where documents overlap, paraphrase each other, and partially satisfy queries while violating key constraints. This skill enables you to build adversarial evaluation harnesses that expose the real bottleneck -- semantic discrimination, not context length.
The core insight: separate the measurement of "can the system find the right evidence?" (evidence access) from "can it use that evidence to answer correctly?" (evidence use). Most benchmarks conflate the two. A system might retrieve the right document but hallucinate the answer, or it might fail to retrieve but guess correctly from world knowledge. Decoupling reveals which stage is broken.
Semantic interference is what makes this adversarial. Instead of padding context with unrelated text, you inject collision-tested hard negatives -- documents that are semantically close to the gold evidence (same domain, overlapping entities, similar language) but do not actually support the answer. This forces the system to discriminate between near-miss and true-match documents. The construction uses embedding-based retrieval to find candidate distractors, then LLM verification to classify each as a hard negative (similar but unsupportive), a conflict (contradicts the answer -- discard), or a false negative (actually valid evidence -- add to gold set).
The reference-corpus ladder systematically increases interference. Start with domain-isolated contexts where gold documents have minimal competition. Then mix in cross-domain documents. Then merge all domains into a shared corpus. Finally inject global distractors from the full document pool. At each rung, measure both access metrics (Recall@K, full-set Recall@K for multi-source queries) and QA quality separately. The degradation curve reveals the scale at which your system breaks.
Curate a document pool with domain labels. Collect your corpus and assign each document a domain tag (e.g., finance, medical, legal). Each document gets a unique DocID. For native long-context evaluation, prepend [DocID=<id>] headers so models can output IDs directly.
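As a minimal sketch of this step, the snippet below tags each document with a domain and a unique DocID, then renders a context with `[DocID=<id>]` headers. The `Document` class, the `render_context` helper, and the sample texts are hypothetical names and data introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one pooled document."""
    doc_id: int   # unique DocID
    domain: str   # domain tag, e.g. "finance", "medical", "legal"
    text: str


def render_context(docs):
    """Prepend a [DocID=<id>] header to each document and join them."""
    return "\n\n".join(f"[DocID={d.doc_id}]\n{d.text}" for d in docs)


# Illustrative pool; real corpora would be loaded from storage.
pool = [
    Document(1, "medical", "Sample medical passage."),
    Document(2, "finance", "Sample finance passage."),
]
context = render_context(pool)
print(context)
```

Keeping the header on its own line makes the IDs easy for a model to echo back and easy to parse afterwards.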
Build (Query, Answer, RefDocSet) tuples. For each evaluation query, identify the minimal set of reference documents that together contain sufficient evidence. Multi-source queries (requiring 2+ documents) are critical -- they expose brittleness that single-source queries miss.
Mine hard negatives via embedding retrieval. For each query, retrieve the top-K nearest non-reference documents using a dense embedder (e.g., any sentence-transformer or embedding API). These candidates are semantically close to the query but not in the gold set.
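The mining step could look roughly like the sketch below. It assumes embeddings have already been computed by some dense encoder and are available as plain lists of floats; the `cosine` and `mine_candidates` helpers and the toy vectors are illustrative, not part of any particular library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_candidates(query_vec, doc_vecs, gold_ids, k):
    """Return the k non-reference DocIDs most similar to the query.

    doc_vecs maps DocID -> embedding; gold_ids is the reference set to exclude.
    """
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in gold_ids
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 2-d embeddings standing in for real encoder output.
doc_vecs = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.7, 0.3]}
candidates = mine_candidates([1.0, 0.0], doc_vecs, gold_ids={1}, k=2)
print(candidates)
```

For corpora beyond a few thousand documents, a vector index would normally replace this brute-force scan; the exclusion of gold IDs is the part that matters here.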
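Classify candidates with LLM verification. For each retrieved candidate, prompt an LLM to classify it into one of three categories: hard negative (semantically similar but does not support the answer; keep as a distractor), conflict (contradicts the answer; discard), or false negative (actually valid evidence; add to the gold set).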
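One way to act on the verifier's labels is sketched below. The `classify` argument is a hypothetical callable standing in for the LLM call, and the label strings follow the ones used in Example 2; how conflicts are ultimately handled (dropping only the document or the whole query) is left to the caller.

```python
def apply_labels(gold_ids, candidates, classify):
    """Sort candidates into hard negatives, promoted gold docs, and conflicts.

    classify(doc_id) is assumed to return one of
    "HARD_NEGATIVE", "CONFLICT", or "FALSE_NEGATIVE".
    """
    gold = set(gold_ids)
    hard_negatives, conflicts = [], []
    for doc_id in candidates:
        label = classify(doc_id)
        if label == "HARD_NEGATIVE":
            hard_negatives.append(doc_id)   # keep as a distractor
        elif label == "FALSE_NEGATIVE":
            gold.add(doc_id)                # valid evidence: promote to gold
        elif label == "CONFLICT":
            conflicts.append(doc_id)        # discard from the evaluation set
        else:
            raise ValueError(f"unexpected label {label!r} for doc {doc_id}")
    return gold, hard_negatives, conflicts


# Stub verifier with fixed answers, used here instead of a real LLM call.
fake_labels = {2: "HARD_NEGATIVE", 4: "FALSE_NEGATIVE", 5: "CONFLICT"}
gold, hard_negatives, conflicts = apply_labels({1}, [2, 4, 5], fake_labels.get)
print(gold, hard_negatives, conflicts)
```

Raising on an unknown label keeps malformed verifier output from silently entering the evaluation set.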
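Construct the reference-corpus ladder. Build evaluation contexts at increasing scales: domain-isolated contexts where gold documents face minimal competition, contexts with cross-domain documents mixed in, a shared corpus that merges all domains, and finally that corpus with global distractors injected from the full document pool.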
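A rough sketch of a four-rung ladder follows. The rung names, the `build_ladder` helper, and the `n_cross` parameter are assumptions made for illustration; the exact composition of each rung should follow your own corpus and token budget.

```python
def build_ladder(docs, gold_domain, distractor_ids, n_cross):
    """Build nested corpora of increasing interference for one query.

    docs: list of (doc_id, domain) pairs; distractor_ids: extra DocIDs drawn
    from the wider pool; n_cross: how many out-of-domain docs the second rung adds.
    """
    in_domain = [doc_id for doc_id, dom in docs if dom == gold_domain]
    other = [doc_id for doc_id, dom in docs if dom != gold_domain]
    return {
        "domain_isolated": in_domain,
        "cross_domain": in_domain + other[:n_cross],
        "merged": in_domain + other,
        "global": in_domain + other + list(distractor_ids),
    }


docs = [(1, "medical"), (2, "medical"), (3, "finance"), (4, "legal")]
ladder = build_ladder(docs, "medical", distractor_ids=[90, 91], n_cross=1)
for rung, ids in ladder.items():
    print(rung, ids)
```

Because each rung is a superset of the previous one, any drop in a metric between rungs can be attributed to the documents that rung added.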
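Measure evidence access independently. Have the system output its top-K retrieved document IDs (or, for native long-context models, extract integer DocIDs from the output). Score with SR@K (recall of gold documents within the top K) and FR@K (full-set recall, which credits a multi-source query only when all of its gold documents appear in the top K).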
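The access metrics can be sketched as below, under a stated assumption: SR@K is taken to be the per-query fraction of gold documents found in the top K, and FR@K is 1 only when every gold document is found. The text does not spell out the exact formulas, so treat these definitions and the `access_scores` helper as one plausible reading.

```python
def access_scores(ranked_ids, gold_ids, k=10):
    """Return (SR@k, FR@k) for one query under the assumed definitions."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("each query needs at least one gold document")
    hits = len(gold & set(ranked_ids[:k]))
    sr = hits / len(gold)                      # partial credit per gold doc
    fr = 1.0 if hits == len(gold) else 0.0     # all-or-nothing credit
    return sr, fr


# A two-source query where only one gold document was retrieved.
print(access_scores([5, 2, 9], gold_ids=[2, 7], k=3))
# A two-source query where both gold documents were retrieved.
print(access_scores([7, 2, 9], gold_ids=[2, 7], k=3))
```

Averaging both values over all queries gives tier-level scores; for single-source queries the two metrics coincide, so any difference comes from multi-source queries.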
Measure evidence use independently. Feed the system the full context (or retrieved passages) and score the generated answer on accuracy, completeness, and relevance (0-5 scale via LLM-as-judge or human evaluation). Compare this against an oracle condition where gold documents are provided directly.
Compute the access-use gap. For each corpus tier, compare evidence access scores against QA quality. A high access score with low QA score indicates a reasoning failure. A low access score with high QA score suggests reliance on parametric knowledge (fragile). Both metrics declining together confirms retrieval is the bottleneck.
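The decision rule above can be approximated per tier with a small helper. The thresholds `access_ok` and `qa_ok` are arbitrary placeholders, not values from the methodology, and a single-tier snapshot is only a proxy for the "declining together" pattern, which is best confirmed by comparing tiers.

```python
def diagnose(access, qa_score, access_ok=0.8, qa_ok=3.5):
    """Label one tier from its access score (0-1) and QA score (0-5).

    The cutoffs are illustrative defaults; tune them to your own baseline.
    """
    high_access = access >= access_ok
    high_qa = qa_score >= qa_ok
    if high_access and not high_qa:
        return "reasoning failure"       # evidence found but misused
    if not high_access and high_qa:
        return "parametric reliance"     # right answers without the evidence
    if not high_access and not high_qa:
        return "retrieval bottleneck"    # evidence never reached the model
    return "healthy"


print(diagnose(access=0.9, qa_score=2.0))
print(diagnose(access=0.3, qa_score=4.5))
print(diagnose(access=0.31, qa_score=2.4))
```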
Track the multi-source gap. At each tier, compute SR@10 - FR@10. This gap widens as interference increases and directly measures how well the system handles evidence aggregation across documents.
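As a worked example of the gap, the snippet below uses the illustrative SR@10 and FR@10 figures from the diagnostic table in Example 1. The `tiers` dictionary is just a convenient container for those numbers.

```python
# (SR@10, FR@10) per tier, copied from the Example 1 diagnostic table.
tiers = {
    "Easy baseline": (0.96, 0.88),
    "Hard negatives": (0.71, 0.43),
    "Full scale": (0.65, 0.31),
}

# Multi-source gap: how much recall is lost when all gold docs are required.
gaps = {name: round(sr - fr, 2) for name, (sr, fr) in tiers.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f}")
```

With these numbers the gap grows from 0.08 to 0.28 to 0.34 as interference increases, which is the widening pattern the step describes.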
Iterate on the weakest link. If access degrades first: improve retriever discrimination (reranking, hard-negative fine-tuning, hybrid search). If use degrades first: improve downstream reasoning (better prompting, chain-of-thought, answer verification). If both degrade: prioritize access -- you cannot reason over evidence you never found.
Example 1: Diagnosing a RAG pipeline that passes NIAH but fails in production
User: My RAG system gets 95% on NIAH but users report wrong answers
on our medical knowledge base. Help me figure out what's breaking.
Approach:
1. Sample 50 real user queries that produced wrong answers.
2. For each query, identify the gold reference documents in the corpus.
3. Run embedding retrieval (top-50) against the full corpus for each query.
4. Classify non-gold retrieved docs with an LLM:
   - "Given query Q, answer A, and reference docs R, does document D
     support the answer? Classify as: hard_negative, conflict, or false_negative."
5. Build three evaluation tiers:
   - Tier 1: Gold docs + 10 random unrelated docs (easy baseline)
   - Tier 2: Gold docs + 10 hard negatives (semantic interference)
   - Tier 3: Gold docs + 10 hard negatives + 30 random docs (scale + interference)
6. At each tier, measure:
   - Access: Did the retriever rank gold docs in top-10? (SR@10)
   - Access (strict): Did it find ALL gold docs? (FR@10)
   - Use: Score generated answers 0-5 with LLM-as-judge.
Output (example diagnostic table):
| Tier | SR@10 | FR@10 | Avg QA Score |
|---------------|-------|-------|--------------|
| Easy baseline | 0.96 | 0.88 | 4.2 |
| Hard negatives| 0.71 | 0.43 | 2.8 |
| Full scale | 0.65 | 0.31 | 2.4 |
Diagnosis: FR@10 drops 57 points from easy to full scale. The retriever
cannot discriminate gold docs from hard negatives, especially for
multi-source queries. QA tracks access -- this is a retrieval problem.
Recommendation: Fine-tune the retriever with these hard negatives as
training signal, or add a reranker stage.
Example 2: Building a collision-tested evaluation set from scratch
User: I have 10,000 documents in my knowledge base and 200 QA pairs.
Create an adversarial evaluation set with hard negatives.
Approach:
1. Embed all 10,000 documents using a dense encoder.
2. For each of the 200 QA pairs:
   a. Retrieve top-20 non-reference documents by cosine similarity.
   b. For each candidate, call LLM classifier:
      Prompt: "Query: {q}\nAnswer: {a}\nReference: {ref_summary}\n
      Candidate: {candidate_text}\n
      Classify this candidate as:
      - HARD_NEGATIVE: Topically similar but does not support the answer
      - CONFLICT: Contradicts the answer or contains incompatible info
      - FALSE_NEGATIVE: Actually supports the answer (missed gold doc)
      Output only the label and a one-sentence reason."
   c. Keep HARD_NEGATIVEs, discard CONFLICTs (and their queries),
      promote FALSE_NEGATIVEs to gold set.
3. Build corpus ladder:
   - 64K tier: gold docs + top-5 hard negatives per query
   - 256K tier: add cross-domain random documents
   - Full corpus: all 10,000 documents
4. Output: eval_set.jsonl with fields:
   {query, answer, gold_doc_ids, hard_negative_ids, tier, corpus_doc_ids}
Output:
- 200 queries -> 183 retained (17 had conflict-only candidates)
- 14 false negatives discovered and added to gold sets
- Average 8.3 hard negatives per query
- eval_set.jsonl ready for three-tier evaluation
Example 3: Measuring a long-context model's semantic discrimination
User: I want to test whether GPT-4-turbo or Claude can actually find
specific evidence in a 128K context with distractors, not just pass NIAH.
Approach:
1. Select 30 multi-source queries (each requiring 2-3 gold documents).
2. For each query, construct a 128K-token context:
   - Gold documents with [DocID=N] headers
   - 15 collision-tested hard negatives with their own DocIDs
   - Random padding documents to fill to 128K
   - Shuffle document order randomly
3. Prompt format for access measurement:
   "Given the documents above, which document IDs contain evidence
   needed to answer: {query}? Output up to 10 DocIDs as integers."
4. Prompt format for use measurement:
   "Given the documents above, answer: {query}"
5. Score:
   - Access: Parse integer IDs from output, compute SR@10 and FR@10
   - Use: LLM-as-judge scores answer 0-5
Output:
| Model | SR@10 | FR@10 | QA Score | Access-Use Gap |
|---------------|-------|-------|----------|----------------|
| Model A | 0.82 | 0.47 | 3.1 | Retrieval-bound |
| Model B | 0.78 | 0.52 | 3.8 | Balanced |
Model A finds individual docs but misses multi-source sets (low FR@10).
Model B retrieves better multi-source sets and reasons more faithfully.
Parse integer DocIDs from the model output with a regex (\b\d+\b) and take the first K unique matches. If extraction yields zero IDs, score as zero recall rather than skipping the sample.
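A minimal sketch of this extraction rule, using Python's standard `re` module; the `extract_doc_ids` helper name and the sample outputs are illustrative.

```python
import re


def extract_doc_ids(output, k=10):
    """Return the first k unique integers found in a model's output."""
    ids = []
    for match in re.findall(r"\b\d+\b", output):
        doc_id = int(match)
        if doc_id not in ids:        # keep first occurrence only
            ids.append(doc_id)
        if len(ids) == k:
            break
    return ids


print(extract_doc_ids("Relevant DocIDs: 12, 7, 12 and 3", k=2))
print(extract_doc_ids("I could not find any supporting documents."))
```

An empty list should flow into scoring as a miss on every gold document, so the sample counts as zero recall instead of being dropped. Note that the word-boundary pattern also matches stray numbers in free text (years, quantities), so constraining the output format in the prompt helps keep the parse clean.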