skills/epistemic-context-learning-building/SKILL.md
Build trust-aware multi-agent systems using Epistemic Context Learning (ECL). Constructs peer reliability profiles from interaction history so agents weight information by source credibility instead of blindly conforming. Use when: 'build a multi-agent pipeline with trust', 'add peer reliability tracking to my agent system', 'prevent sycophancy in my LLM agents', 'implement history-aware trust for multi-agent', 'make agents evaluate peer credibility', 'add epistemic context to agent orchestration'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills epistemic-context-learning-building
This skill teaches you to implement Epistemic Context Learning (ECL), a two-stage reasoning framework where LLM-based agents build explicit trust profiles from peer interaction history before incorporating peer responses into their decisions. Instead of treating all peer input equally (which causes sycophantic conformity), ECL forces agents to first compress historical evidence into a reliability belief, then use that belief to weight current-round peer suggestions. According to the paper, this lets small models outperform much larger ones, because they accurately identify which peers to trust.
The core insight: In multi-agent LLM systems, agents fail not because they lack reasoning ability, but because they cannot distinguish trustworthy peers from unreliable ones. Standard approaches either treat all peer opinions equally (majority vote) or evaluate individual reasoning quality per-response. ECL shifts the problem: instead of judging what a peer says right now, judge how reliable the peer has been historically, then condition your decision on that reliability estimate.
The two-stage architecture: ECL splits each decision into Stage 1 (Trust Estimation) and Stage 2 (Trust-Informed Aggregation). In Stage 1, the agent receives only the historical interaction record for each peer (a sequence of past questions, that peer's past answers, and whether each answer was correct) with no access to the current question or current peer responses. This information bottleneck forces the agent to compress history into a structured belief profile (e.g., "Peer A was correct on 4/5 past rounds; Peer B on 1/5"). In Stage 2, the agent receives the compressed belief profiles alongside the current question and current peer responses, then generates its final answer by weighting peer input according to estimated reliability.
Why the separation matters: If you give the model history and current responses simultaneously, it takes shortcuts -- matching surface patterns between historical and current answers instead of genuinely modeling trust. The two-stage split, acting as an information bottleneck, forces real reliability estimation. The paper further shows this can be optimized with reinforcement learning using a Peer Recognition Reward (did Stage 1 correctly identify the most reliable peer?) alongside the standard outcome reward (did Stage 2 get the right answer?).
Define your peer set and interaction schema. Identify the N agents (models, tools, or retrieval sources) that will serve as peers. For each peer, define what a "historical interaction" looks like: a tuple of (query, peer_response, ground_truth_or_outcome). Store these in a structured format per peer.
Collect interaction history. Run each peer through a calibration set of T questions (the paper uses T=5 rounds with 4 peers). Record each peer's response and whether it was correct. Structure this as: history[peer_id] = [(q1, response1, correct1), (q2, response2, correct2), ...].
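A minimal Python sketch of the history store described in the two steps above. The `Interaction` class, the `record_round` and `accuracy` helpers, and the exact-match grading rule are illustrative assumptions rather than anything the paper prescribes; swap in whatever grader your task needs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One calibration round for one peer (field names are illustrative)."""
    query: str
    response: str
    correct: bool


def record_round(history, peer_id, query, response, ground_truth):
    """Append one (query, response, correct) record to a peer's history."""
    # Assumption: case-insensitive exact match stands in for a real grader.
    correct = response.strip().lower() == ground_truth.strip().lower()
    history[peer_id].append(Interaction(query, response, correct))


def accuracy(history, peer_id):
    """Fraction of recorded rounds answered correctly, or None with no history."""
    rounds = history.get(peer_id, [])
    if not rounds:
        return None
    return sum(r.correct for r in rounds) / len(rounds)


# history[peer_id] is a list of Interaction records, one per calibration round.
history = defaultdict(list)
record_round(history, "peer_a", "2 + 2?", "4", "4")
record_round(history, "peer_a", "Capital of France?", "Paris", "paris")
record_round(history, "peer_b", "2 + 2?", "5", "4")
```

Keeping one list per peer means peers can be added or dropped later without touching anyone else's record.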
Implement Stage 1: Trust Estimation (history-only). Build a prompt that presents ONLY the historical interactions for each peer. Explicitly withhold the current question and current peer responses. Instruct the model to analyze past behavior and output a structured belief profile. The output should name each peer and summarize its reliability (accuracy rate, domain strengths, failure patterns), concluding with an explicit top-peer designation: "The most reliable peer is: [PEER_NAME]".
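One way to sketch the Stage 1 prompt in Python. The function name `build_stage1_prompt`, the section layout, and the instruction wording are assumptions chosen for illustration; the only property taken from the text is that the current question and current peer responses are not inputs at all.

```python
def build_stage1_prompt(history):
    """Render a history-only prompt.

    `history` maps peer name -> list of (query, response, correct) tuples.
    The current question is deliberately not a parameter.
    """
    lines = [
        "Analyze the historical accuracy of each peer below.",
        "For each peer, summarize accuracy rate, strengths, and failure patterns.",
        'Conclude with: "The most reliable peer is: [PEER_NAME]"',
        "",
    ]
    for peer_id, rounds in history.items():
        lines.append(f"## {peer_id} History")
        for i, (query, response, correct) in enumerate(rounds, start=1):
            verdict = "correct" if correct else "incorrect"
            lines.append(f"- Round {i}: {query} -> {response} -> {verdict}")
        lines.append("")  # blank line between peers
    return "\n".join(lines).rstrip()
```

Because the function signature has no slot for the current round, a caller cannot accidentally leak it into Stage 1.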
Implement the information bottleneck. Pass only the compressed belief profile string from Stage 1 into Stage 2. Do NOT pass raw history into Stage 2. This forces the model to rely on its own trust summary rather than re-processing raw data or pattern-matching.
Implement Stage 2: Trust-Informed Aggregation. Build a prompt that presents: (a) the belief profiles from Stage 1, (b) the current question, and (c) the current peer responses labeled by peer name. Instruct the model to weight peer suggestions according to their reliability profiles and produce a final answer.
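The bottleneck and Stage 2 steps above can be sketched as two model calls joined by a single string. Everything named here (`build_stage2_prompt`, `run_ecl`, the `llm` callable, and the stub model) is hypothetical; `llm` stands for any function that maps a prompt string to a completion string.

```python
def build_stage2_prompt(belief_profile, question, peer_responses):
    """Stage 2 input: belief profile, current question, current peer responses.

    Raw history is not a parameter, so it cannot pass the bottleneck.
    """
    lines = [
        "Weight each peer's suggestion by its reliability profile.",
        "## Reliability Profiles",
        belief_profile.strip(),
        "## Current Question",
        question.strip(),
        "## Peer Responses",
    ]
    lines += [f"{peer} says: {answer}" for peer, answer in peer_responses.items()]
    lines.append("Give your final answer.")
    return "\n".join(lines)


def run_ecl(llm, stage1_prompt, question, peer_responses):
    """Two model calls; only the Stage 1 output string crosses between them."""
    belief_profile = llm(stage1_prompt)
    final_answer = llm(build_stage2_prompt(belief_profile, question, peer_responses))
    return belief_profile, final_answer


# Stub model for demonstration only: records prompts and returns canned text.
seen_prompts = []


def stub_llm(prompt):
    seen_prompts.append(prompt)
    if "## Current Question" in prompt:
        return "A"
    return "Peer A: 4/5 correct. Peer B: 1/5 correct. The most reliable peer is: Peer A"


profile, answer = run_ecl(
    stub_llm,
    "## Peer A History\n- Round 1: RAW-HISTORY-MARKER -> ... -> correct",
    "Which option is right?",
    {"Peer A": "A", "Peer B": "B"},
)
```

Recording the prompts the stub receives makes the bottleneck testable: the second prompt should contain the profile text and none of the raw history.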
Add structured output parsing. Parse Stage 1 output to extract per-peer reliability scores and the top-peer designation. Parse Stage 2 output to extract the final answer. Use these for logging, evaluation, and the reward signals described next.
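A hedged sketch of the parsing step using Python's standard `re` module. The patterns assume Stage 1 writes lines shaped like `Name: 16/20 ...` and ends with a "most reliable ..." sentence, as in the examples in this document. The `Final answer:` label for Stage 2 is an assumption you would need to request in your own prompt.

```python
import re

# "Name: hits/total" at the start of a line; names are single tokens here.
SCORE_RE = re.compile(r"^\s*([\w.\-]+):\s*(\d+)\s*/\s*(\d+)", re.MULTILINE)
# Accepts "The most reliable peer is: X" and "Most reliable source: X".
TOP_RE = re.compile(
    r"most reliable (?:peer|reviewer|source)(?: is)?:\s*(.+)", re.IGNORECASE
)
# Assumed Stage 2 convention: a line starting with "Final answer:".
FINAL_RE = re.compile(r"^final answer:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def parse_belief_profile(text):
    """Return ({peer: accuracy}, top_peer_or_None) from Stage 1 output."""
    scores = {
        name: int(hits) / int(total)
        for name, hits, total in SCORE_RE.findall(text)
        if int(total) > 0
    }
    match = TOP_RE.search(text)
    top = match.group(1).strip().strip('".') if match else None
    return scores, top


def parse_final_answer(text):
    """Return the last 'Final answer:' value in Stage 2 output, or None."""
    found = FINAL_RE.findall(text)
    return found[-1].strip() if found else None


sample = (
    "GPT-4o: 16/20 correct (80%), tends to miss null-safety issues.\n"
    "Claude: 18/20 correct (90%), strong on logic bugs, occasionally verbose.\n"
    "Gemini: 12/20 correct (60%), high false-positive rate on style issues.\n"
    "The most reliable reviewer is: Claude"
)
scores, top = parse_belief_profile(sample)
```

Returning `None` instead of raising on a missing field lets the caller log a parse failure and still count the round in evaluation.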
Implement dual reward signals for optimization (optional but powerful). If fine-tuning or using RL: assign a Peer Recognition Reward (PRR) of +1.0 to Stage 1 when it correctly identifies the most reliable peer, and an Outcome Reward (OR) of +1.0 to Stage 2 when the final answer is correct. Training with both rewards provides denser feedback than outcome-only.
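The two reward signals above can be written as plain functions. This is a sketch under stated assumptions: the text only fixes +1.0 for success, so returning 0.0 on failure, accepting any peer tied for best accuracy, grading by exact match, and summing the two rewards with equal weight are all illustrative choices, not the paper's specification.

```python
def peer_recognition_reward(predicted_top, peer_accuracy):
    """+1.0 if Stage 1 named a peer with the best historical accuracy."""
    best = max(peer_accuracy.values())
    # Assumption: any peer tied for the best accuracy counts as correct.
    best_peers = {peer for peer, acc in peer_accuracy.items() if acc == best}
    return 1.0 if predicted_top in best_peers else 0.0


def outcome_reward(final_answer, ground_truth):
    """+1.0 if the Stage 2 answer matches the reference answer."""
    # Assumption: case-insensitive exact match stands in for a real grader.
    same = final_answer.strip().lower() == ground_truth.strip().lower()
    return 1.0 if same else 0.0


def combined_reward(predicted_top, peer_accuracy, final_answer, ground_truth):
    """Assumed combination: unweighted sum of the two signals."""
    return (
        peer_recognition_reward(predicted_top, peer_accuracy)
        + outcome_reward(final_answer, ground_truth)
    )
```

With this shape, a rollout that picks the right peer but answers wrongly still earns 1.0, which is the denser feedback the step describes.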
Handle dynamic peer sets. When peers change between rounds (new agents join, old ones leave), maintain per-peer history independently. For new peers with no history, instruct Stage 1 to assign a neutral prior ("insufficient history to assess reliability") and have Stage 2 treat their input with moderate skepticism.
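A small sketch of per-peer bookkeeping when the peer set changes. The helper `summarize_peers` and the boolean-list history format are assumptions; the neutral note is the wording suggested in the step above. The summary lines are meant as input to the Stage 1 prompt, not a replacement for the model's own profile.

```python
NEUTRAL_NOTE = "insufficient history to assess reliability"


def summarize_peers(active_peers, history):
    """Return {peer: summary line} for the peers active in this round.

    `history` maps peer -> list of booleans (True = that round was correct).
    Peers that have left stay in `history` but are not summarized.
    """
    summaries = {}
    for peer in active_peers:
        rounds = history.get(peer, [])
        if not rounds:
            # New peer: state the neutral prior instead of inventing a score.
            summaries[peer] = NEUTRAL_NOTE
        else:
            summaries[peer] = f"{sum(rounds)}/{len(rounds)} correct"
    return summaries


history = {"veteran": [True, True, False], "departed": [False, False]}
summaries = summarize_peers(["veteran", "newcomer"], history)
```

Keeping the departed peer's record means its profile is available again if it rejoins later.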
Validate with adversarial testing. Test with a configuration where one peer is reliably correct and others inject plausible-but-wrong answers. Verify that Stage 1 correctly identifies the reliable peer and Stage 2 follows that peer's guidance over the majority.
Log and iterate on trust calibration. Track Stage 1's peer identification accuracy and Stage 2's final answer accuracy separately. If Stage 1 accuracy is low, increase the history window (more rounds) or improve the Stage 1 prompt. If Stage 1 is accurate but Stage 2 still fails, the aggregation prompt needs refinement.
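The triage logic in the last step can be sketched as a small function over logged rounds. The name `diagnose` and the 0.8 threshold are arbitrary assumptions for illustration; pick a threshold that matches your own accuracy target.

```python
def diagnose(records, threshold=0.8):
    """Summarize logged rounds and suggest where to look first.

    `records` is a list of (stage1_correct, stage2_correct) boolean pairs.
    Returns (stage1_accuracy, stage2_accuracy, advice).
    """
    if not records:
        raise ValueError("need at least one logged round")
    n = len(records)
    stage1_acc = sum(1 for s1, _ in records if s1) / n
    stage2_acc = sum(1 for _, s2 in records if s2) / n
    if stage1_acc < threshold:
        advice = "extend the history window or revise the Stage 1 prompt"
    elif stage2_acc < threshold:
        advice = "revise the Stage 2 aggregation prompt"
    else:
        advice = "no change indicated"
    return stage1_acc, stage2_acc, advice


# Stage 1 always right, Stage 2 right once in five rounds.
log = [(True, False)] * 4 + [(True, True)]
stage1_acc, stage2_acc, advice = diagnose(log)
```

Checking Stage 1 first mirrors the order in the text: an inaccurate trust profile makes Stage 2 failures uninformative.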
Example 1: Multi-Model Code Review Pipeline
User: I have three LLM reviewers (GPT-4o, Claude, Gemini) checking PRs for bugs.
Sometimes the weaker model's wrong suggestion overrides the correct one.
Add trust-based weighting so the system learns which reviewer to trust.
Approach:
1. Collect calibration data: run all three reviewers on 20 known-buggy code
snippets where ground truth is established. Record each reviewer's
verdicts (bug found / missed / false positive).
2. Build per-reviewer history records:
history = {
  "gpt4o": [
    {"snippet": "off-by-one in loop", "response": "flagged correctly", "correct": true},
    {"snippet": "null deref", "response": "missed", "correct": false},
    ...
  ],
  "claude": [...],
  "gemini": [...]
}
3. Stage 1 prompt (history only, no current PR):
"""
Analyze the historical review accuracy of each peer reviewer below.
For each reviewer, summarize their accuracy rate and typical failure modes.
Conclude with: "The most reliable reviewer is: <NAME>"
## GPT-4o History
- Round 1: [snippet summary] -> [response] -> [correct/incorrect]
- Round 2: ...
## Claude History
...
## Gemini History
...
"""
Stage 1 output:
"GPT-4o: 16/20 correct (80%), tends to miss null-safety issues.
Claude: 18/20 correct (90%), strong on logic bugs, occasionally verbose.
Gemini: 12/20 correct (60%), high false-positive rate on style issues.
The most reliable reviewer is: Claude"
4. Stage 2 prompt:
"""
You are reviewing a pull request. Use these peer reliability profiles
to weight reviewer feedback appropriately.
## Reliability Profiles
[Stage 1 output inserted here]
## Current PR Diff
[diff content]
## Reviewer Feedback
GPT-4o says: "No bugs found."
Claude says: "Line 42 has a race condition in the mutex acquisition."
Gemini says: "Variable naming could be improved on line 15."
Provide your final review, weighting feedback by reviewer reliability.
"""
Output: Final review flags the race condition (trusting Claude's high-reliability
assessment) while deprioritizing Gemini's style comment and treating GPT-4o's
"No bugs found" as weak evidence, since its profile shows it misses some bug classes.
Example 2: Multi-Source RAG with Source Reliability
User: My RAG pipeline pulls from internal docs, Stack Overflow, and a legacy
wiki. The legacy wiki often has outdated info that poisons answers.
Help me add source trust tracking.
Approach:
1. Treat each retrieval source as a "peer agent." Collect history by
sampling 10 past queries where you know the correct answer, recording
which source provided correct vs. outdated/wrong passages.
2. Build source history:
history = {
  "internal_docs": [{"query": "auth flow", "relevant": true, "accurate": true}, ...],
  "stackoverflow": [{"query": "auth flow", "relevant": true, "accurate": true}, ...],
  "legacy_wiki": [{"query": "auth flow", "relevant": true, "accurate": false}, ...]
}
3. Stage 1 (run once, cache the profile, refresh periodically):
Prompt with history only. Output:
"internal_docs: 9/10 accurate, authoritative for current architecture.
stackoverflow: 7/10 accurate, good for general patterns, sometimes outdated versions.
legacy_wiki: 3/10 accurate, frequently references deprecated APIs.
Most reliable source: internal_docs"
4. Stage 2 (per query):
Present source profiles + current retrieved passages + user question.
The model prioritizes internal_docs passages, cross-checks stackoverflow,
and treats legacy_wiki content with high skepticism -- citing it only
when corroborated by a reliable source.
Output: Answers grounded in internal docs, with legacy wiki info either
excluded or explicitly flagged as "from a lower-reliability source, verify
against current docs."
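Step 3 of this example says to run Stage 1 once, cache the profile, and refresh periodically. A minimal sketch of that caching pattern follows. The `CachedProfile` class, the 24-hour refresh interval, and the `run_stage1_stub` placeholder are all assumptions; in practice the placeholder would be the real history-only model call.

```python
import time


class CachedProfile:
    """Recompute a Stage 1 profile only when the cached copy exceeds its TTL."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute      # zero-argument callable returning the profile
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._value = None
        self._stamp = None           # None means "never computed"

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._compute()
            self._stamp = now
        return self._value


def run_stage1_stub():
    # Placeholder for the real history-only model call.
    return "legacy_wiki: 3/10 accurate. Most reliable source: internal_docs"


source_profile = CachedProfile(run_stage1_stub, ttl_seconds=24 * 60 * 60)
profile_text = source_profile.get()  # computed on first use, reused afterwards
```

Per-query Stage 2 calls then read `source_profile.get()` and pay for Stage 1 only when the cache has expired.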
Example 3: Adversarial-Robust Agent Debate
User: I'm running a 4-agent debate system for complex reasoning tasks.
Three agents sometimes converge on a wrong answer and outvote the correct one.
Implement ECL so the system trusts track records over majority.
Approach:
1. Run all 4 agents through 5 calibration rounds on known questions.
Record per-agent correctness.
2. Stage 1 identifies that Agent-2 has 5/5 accuracy while Agents 1, 3, 4
average 2/5. Profile: "Agent-2 is the most reliable peer."
3. Stage 2 receives the current debate. Even though Agents 1, 3, 4 agree
on answer "B", the system weights Agent-2's answer "A" more heavily
based on the trust profile, and selects "A".
Result: The system breaks free of the majority-rules failure mode by
conditioning on historical reliability rather than current vote counts.
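For sanity-checking an adversarial setup like this one, a deterministic baseline can stand in for Stage 2. The sketch below is not the ECL method, where the model itself aggregates; it is a conventional log-odds weighted vote with Laplace smoothing, and it assumes each of Agents 1, 3, and 4 scored exactly 2/5. Note that summing raw accuracies would still pick "B" here (3 × 0.4 = 1.2 against 1.0), which is why the weights use log-odds.

```python
import math
from collections import Counter, defaultdict


def majority_vote(answers):
    """Pick the most common answer, ignoring track records."""
    return Counter(answers.values()).most_common(1)[0][0]


def log_odds_weight(hits, total):
    """Log-odds of a Laplace-smoothed accuracy; 0.0 when there is no history."""
    p = (hits + 1) / (total + 2)  # smoothing keeps a 5/5 record finite
    return math.log(p / (1 - p))


def trust_weighted_vote(answers, track_record):
    """Sum log-odds weights per answer and return the highest-scoring one."""
    totals = defaultdict(float)
    for peer, answer in answers.items():
        hits, total = track_record.get(peer, (0, 0))
        totals[answer] += log_odds_weight(hits, total)
    return max(totals, key=totals.get)


answers = {"Agent-1": "B", "Agent-2": "A", "Agent-3": "B", "Agent-4": "B"}
track_record = {
    "Agent-1": (2, 5),
    "Agent-2": (5, 5),
    "Agent-3": (2, 5),
    "Agent-4": (2, 5),
}

majority = majority_vote(answers)                      # follows the three agents
weighted = trust_weighted_vote(answers, track_record)  # follows Agent-2
```

Peers with below-chance records get negative weights under this scheme, so their agreement counts against an answer rather than for it.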