skills/echoes-loop-diagnosing-risks/SKILL.md
Diagnose and mitigate feedback-loop risks (bias amplification, hallucination propagation, exposure polarization) in LLM-powered recommender systems using a role-aware, phase-wise diagnostic framework. Use when: 'audit my recommendation pipeline for bias', 'check feedback loop risks in my LLM recommender', 'diagnose hallucination propagation in recommendations', 'build a feedback-loop simulation for my recommender', 'trace popularity bias through my recommendation cycles', 'add risk monitoring to my LLM-based ranking system'.
npx skillsauth add ndpvt-web/arxiv-claude-skills echoes-loop-diagnosing-risks
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to systematically diagnose, simulate, and mitigate the compounding risks that emerge when LLMs operate within recommender system feedback loops. Based on the role-aware, phase-wise diagnostic framework from Park, Lee & Lee (2026), it provides a structured methodology for tracing how LLM-specific risks — popularity bias amplification, hallucination-induced spurious signals, and self-reinforcing exposure polarization — emerge from specific LLM functional roles, manifest in ranking outcomes, and accumulate across repeated recommendation cycles. Claude applies this framework to audit existing pipelines, build simulation harnesses, and instrument production systems with risk-aware monitoring.
Traditional recommender system audits evaluate a single snapshot of recommendations. This framework recognizes that LLMs introduce qualitatively different risks depending on their functional role in the pipeline, and that these risks compound through feedback loops across multiple recommendation cycles (phases).
The framework identifies three primary LLM functional roles, each with distinct risk profiles:
Data Augmentation Role — LLMs generate synthetic user reviews, item descriptions, or features to fill data gaps. Risk: hallucinated attributes (e.g., fabricating that a niche book is a "bestseller") inject spurious signals that downstream models treat as ground truth. Over cycles, these hallucinations self-reinforce as the system recommends items based on fabricated properties, collects implicit feedback on those recommendations, and feeds that feedback back into training.
Profiling Role — LLMs summarize user preferences, extract interests from interaction histories, or generate user embeddings. Risk: LLMs inherit and amplify popularity bias from training data, producing profiles that over-index on mainstream preferences. Over cycles, users receive increasingly homogeneous recommendations, their interaction histories converge, and subsequent profiles become even more biased — a classic positive feedback loop.
Decision-Making Role — LLMs directly score, rank, or select items for recommendation. Risk: LLMs exhibit position bias, verbosity bias, and anchoring effects that distort rankings. Over cycles, items that benefit from these biases accumulate more exposure, more interactions, and thus more evidence to justify continued promotion — creating exposure polarization where popular items monopolize attention.
The framework measures risk at three levels, each corresponding to a progressively wider blast radius: (a) Content Level — evaluating the LLM-generated artifacts themselves for hallucination rates, factual consistency, and distributional skew; (b) Ranking Level — measuring how content-level risks translate into ranking distortions such as popularity bias in top-K lists, diversity collapse, and fairness violations; (c) Ecosystem Level — tracking system-wide dynamics over multiple cycles including Gini concentration of item exposure, user preference homogenization, and self-reinforcing feedback patterns.
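As a rough illustration of the ranking-level and ecosystem-level measurements, the sketch below computes a Gini coefficient over item exposure and the share of recommendation slots taken by popular items. The function names `exposure_gini` and `head_share`, and the input shapes (a user-to-item-list mapping plus a list of catalog IDs), are assumptions made for this example rather than part of the framework.

```python
from collections import Counter


def exposure_gini(recommendations: dict[str, list[str]], catalog_ids: list[str]) -> float:
    """Gini coefficient of item exposure: 0 is perfectly even, values near 1 are concentrated."""
    counts = Counter(item for recs in recommendations.values() for item in recs)
    # Catalog items that were never recommended count as zero exposure.
    exposures = sorted(counts.get(item_id, 0) for item_id in catalog_ids)
    n, total = len(exposures), sum(exposures)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(exposures))
    return 2 * weighted / (n * total) - (n + 1) / n


def head_share(recommendations: dict[str, list[str]], head_ids: set[str]) -> float:
    """Fraction of recommended slots occupied by head (popular) items."""
    slots = [item for recs in recommendations.values() for item in recs]
    return sum(item in head_ids for item in slots) / len(slots) if slots else 0.0


if __name__ == "__main__":
    recs = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["a", "b"]}
    print(exposure_gini(recs, ["a", "b", "c", "d"]))
    print(head_share(recs, {"a"}))
```

Recomputing both values after every cycle and storing them next to the cycle index gives the time series that the later steps inspect for amplification.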
Enumerate every point where an LLM is invoked in the recommender pipeline. For each invocation, classify it as data augmentation, profiling, or decision-making. Document the input/output contract: what data flows in, what the LLM produces, and where its output is consumed downstream.
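One way to record this inventory is a small typed structure, sketched below. The class names (`LLMRole`, `LLMInvocation`), the field names, and the three example entries are hypothetical and only illustrate the input/output contract described above.

```python
from dataclasses import dataclass
from enum import Enum


class LLMRole(Enum):
    DATA_AUGMENTATION = "data_augmentation"
    PROFILING = "profiling"
    DECISION_MAKING = "decision_making"


@dataclass(frozen=True)
class LLMInvocation:
    name: str                    # where in the pipeline the LLM is called
    role: LLMRole                # functional role used for risk mapping
    inputs: tuple[str, ...]      # data flowing into the LLM
    outputs: tuple[str, ...]     # artifacts the LLM produces
    consumers: tuple[str, ...]   # downstream components that read the outputs


def group_by_role(invocations: list[LLMInvocation]) -> dict[LLMRole, list[LLMInvocation]]:
    """Group invocation points by role so each group can be audited with the matching risks."""
    grouped: dict[LLMRole, list[LLMInvocation]] = {role: [] for role in LLMRole}
    for inv in invocations:
        grouped[inv.role].append(inv)
    return grouped


# Hypothetical inventory for illustration only.
inventory = [
    LLMInvocation("describe_items", LLMRole.DATA_AUGMENTATION,
                  ("catalog_rows",), ("item_descriptions",), ("cf_model",)),
    LLMInvocation("summarize_history", LLMRole.PROFILING,
                  ("interaction_log",), ("interest_profile",), ("ranker",)),
    LLMInvocation("rerank_results", LLMRole.DECISION_MAKING,
                  ("query", "candidates"), ("scored_candidates",), ("serving",)),
]

if __name__ == "__main__":
    for role, invs in group_by_role(inventory).items():
        print(role.value, [inv.name for inv in invs])
```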
For each LLM role identified in Step 1, enumerate the applicable risk categories:
Add measurement points at each LLM output to capture:
Add measurement points after the ranking stage to capture:
Construct a simulation that runs the recommendation pipeline for N cycles:
Across simulation cycles, track:
Execute the simulation for 10-20 cycles. Plot each metric over time. Look for:
When amplification is detected, trace backward through the pipeline to the originating LLM role. Compare metrics with and without each LLM component (ablation) to isolate the contribution of each role to the observed risk.
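The ablation comparison can be sketched as follows, assuming each simulation run has already produced one metric value per cycle. The least-squares slope is used here as a simple stand-in for "amplification"; the framework does not prescribe this statistic, and the names `amplification_slope` and `ablation_report` are hypothetical.

```python
def amplification_slope(series: list[float]) -> float:
    """Least-squares slope of a metric against the cycle index (0, 1, 2, ...)."""
    n = len(series)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def ablation_report(full_run: list[float], ablated_runs: dict[str, list[float]]) -> dict[str, float]:
    """Slope reduction observed when each LLM component is removed.

    A large positive value suggests the removed component drives the amplification.
    """
    base = amplification_slope(full_run)
    return {name: base - amplification_slope(run) for name, run in ablated_runs.items()}


if __name__ == "__main__":
    # Made-up exposure-Gini trajectories, for illustration only.
    full = [0.40, 0.50, 0.60, 0.70]
    ablated = {
        "without_profiling_llm": [0.40, 0.41, 0.42, 0.43],
        "without_augmentation_llm": [0.40, 0.50, 0.60, 0.70],
    }
    print(ablation_report(full, ablated))
```

In this made-up case, removing the profiling LLM flattens the trend while removing the augmentation LLM changes nothing, which would point to the profiling role as the origin.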
Based on the diagnosed role and risk:
Deploy the content-level and ranking-level metrics as production monitors. Set alerts for:
Example 1: Auditing an LLM-augmented product recommender
User: "We use GPT-4 to generate product descriptions for items missing catalog data, then feed those into our collaborative filtering model. Lately our recommendations seem very same-y. Can you help diagnose what's happening?"
Approach:
# content_audit.py
from collections import Counter


def audit_generated_descriptions(generated: list[dict], catalog: list[dict]):
    """Compare LLM-generated product attributes against ground-truth catalog."""
    hallucination_flags = []
    popularity_tokens = Counter()
    catalog_attrs = {item["id"]: set(item.get("attributes", [])) for item in catalog}

    for item in generated:
        item_id = item["id"]
        gen_attrs = set(item.get("generated_attributes", []))
        # Items with no catalog attributes have every generated attribute flagged,
        # so restrict the audit to items that have ground truth.
        true_attrs = catalog_attrs.get(item_id, set())

        # Hallucination: generated attributes not in catalog
        hallucinated = gen_attrs - true_attrs
        hallucination_flags.append({
            "item_id": item_id,
            "hallucinated_attributes": list(hallucinated),
            "hallucination_rate": len(hallucinated) / max(len(gen_attrs), 1),
        })

        # Track attribute frequency to detect distributional skew
        for attr in gen_attrs:
            popularity_tokens[attr] += 1

    # Guard against an empty input list to avoid dividing by zero
    avg_hallucination_rate = (
        sum(f["hallucination_rate"] for f in hallucination_flags) / len(hallucination_flags)
        if hallucination_flags
        else 0.0
    )

    # Compute Gini of attribute frequency distribution (0 = even, near 1 = skewed)
    freqs = sorted(popularity_tokens.values())
    n, total = len(freqs), sum(freqs)
    if n > 0 and total > 0:
        weighted = sum((i + 1) * f for i, f in enumerate(freqs))
        gini = 2 * weighted / (n * total) - (n + 1) / n
    else:
        gini = 0.0

    return {
        "avg_hallucination_rate": avg_hallucination_rate,
        "attribute_gini": gini,
        "top_hallucinated": sorted(
            hallucination_flags, key=lambda x: x["hallucination_rate"], reverse=True
        )[:10],
    }
# feedback_loop_sim.py
# simulate_clicks and the measure_*_level helpers are assumed to be defined
# elsewhere; `pipeline` must expose augment, recommend, and update.
def simulate_feedback_loop(pipeline, users, items, n_cycles=15):
    """Simulate recommendation cycles and track risk metrics."""
    metrics_over_time = []
    # Keep the original catalog as the fixed reference for content-level checks;
    # otherwise hallucinations absorbed in earlier cycles would stop being counted.
    ground_truth_items = items
    for cycle in range(n_cycles):
        # Step A: LLM augments item data
        augmented_items = pipeline.augment(items)

        # Step B: Generate recommendations
        recommendations = pipeline.recommend(users, augmented_items, top_k=10)

        # Step C: Simulate user interactions (click model)
        interactions = simulate_clicks(users, recommendations, position_bias=True)

        # Step D: Measure metrics at all three levels
        content_metrics = measure_content_level(augmented_items, ground_truth_items)
        ranking_metrics = measure_ranking_level(recommendations, items)
        ecosystem_metrics = measure_ecosystem_level(
            recommendations, users, metrics_over_time
        )
        metrics_over_time.append({
            "cycle": cycle,
            **content_metrics,
            **ranking_metrics,
            **ecosystem_metrics,
        })

        # Step E: Fold interactions back into training data
        pipeline.update(interactions)
        items = augmented_items  # Generated data becomes next cycle's input
    return metrics_over_time
Output: A diagnostic report showing, in this illustrative scenario, hallucination rate climbing from 8% to 23% over 15 cycles and the Gini coefficient of item exposure rising from 0.45 to 0.72, a pattern consistent with fabricated attributes on popular items creating a self-reinforcing loop.
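The simulation in this example calls a `simulate_clicks` helper that is not defined in the skill. A minimal sketch of such a click model is shown below, under several assumptions: `recommendations` maps each user ID to a ranked list of item IDs, examination probability decays as 1/log2(rank + 2), and every examined item is clicked with a constant probability. The `base_click_prob` and `seed` parameters are additions for this sketch.

```python
import math
import random


def simulate_clicks(users, recommendations, position_bias=True, base_click_prob=0.5, seed=None):
    """Return (user_id, item_id) click pairs from a simple position-based click model."""
    rng = random.Random(seed)  # local generator so runs are reproducible with a seed
    interactions = []
    for user_id in users:
        for rank, item_id in enumerate(recommendations.get(user_id, [])):
            # Examination probability: 1.0 at the top slot, decaying with rank.
            examine = 1.0 / math.log2(rank + 2) if position_bias else 1.0
            if rng.random() < examine * base_click_prob:
                interactions.append((user_id, item_id))
    return interactions


if __name__ == "__main__":
    recs = {"u1": ["a", "b", "c"], "u2": ["b", "a", "d"]}
    print(simulate_clicks(["u1", "u2"], recs, seed=7))
```

A real audit would replace the constant click probability with a relevance estimate per user and item; the constant keeps the sketch focused on how position bias alone can skew the feedback that is folded back into training.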
Example 2: Adding risk monitoring to an LLM-based news recommender
User: "Our news app uses an LLM to summarize user reading histories into interest profiles, then ranks articles by profile relevance. How do I monitor for echo chamber effects?"
Approach:
# echo_chamber_monitor.py
import numpy as np
from scipy.stats import entropy


def compute_echo_chamber_metrics(
    user_profiles: dict[str, list[str]],
    recommendations: dict[str, list[str]],
    previous_recommendations: dict[str, list[str]],
):
    """Detect preference homogenization and echo chamber formation."""
    # 1. User preference entropy (are profiles converging?)
    all_topics = set()
    for topics in user_profiles.values():
        all_topics.update(topics)
    topic_list = sorted(all_topics)

    entropies = []
    for user_id, topics in user_profiles.items():
        topic_counts = np.array([topics.count(t) for t in topic_list], dtype=float)
        if topic_counts.sum() > 0:
            topic_dist = topic_counts / topic_counts.sum()
            # scipy.stats.entropy uses the natural log by default, so values are in nats
            entropies.append(entropy(topic_dist))
    # NaN (rather than a warning from np.mean) when no profile has any topics
    avg_entropy = float(np.mean(entropies)) if entropies else float("nan")

    # 2. Recommendation overlap with previous cycle
    overlaps = []
    for user_id in recommendations:
        if user_id in previous_recommendations:
            current = set(recommendations[user_id][:10])
            previous = set(previous_recommendations[user_id][:10])
            overlap = len(current & previous) / max(len(current), 1)
            overlaps.append(overlap)
    avg_overlap = float(np.mean(overlaps)) if overlaps else 0.0

    # 3. Cross-user recommendation similarity (homogenization)
    # Jaccard similarity over the first 200 users only, to bound the pairwise cost
    user_ids = list(recommendations.keys())
    pairwise_overlaps = []
    for i in range(min(len(user_ids), 200)):
        for j in range(i + 1, min(len(user_ids), 200)):
            set_i = set(recommendations[user_ids[i]][:10])
            set_j = set(recommendations[user_ids[j]][:10])
            pairwise_overlaps.append(
                len(set_i & set_j) / max(len(set_i | set_j), 1)
            )
    cross_user_similarity = float(np.mean(pairwise_overlaps)) if pairwise_overlaps else 0.0

    return {
        "avg_profile_entropy": avg_entropy,
        "avg_cycle_overlap": avg_overlap,
        "cross_user_similarity": cross_user_similarity,
        "echo_chamber_alert": bool(avg_overlap > 0.8 or avg_entropy < 1.0),
    }
# alerts.py
RISK_THRESHOLDS = {
    "avg_profile_entropy": {"warn": 1.5, "critical": 1.0, "direction": "below"},
    "avg_cycle_overlap": {"warn": 0.6, "critical": 0.8, "direction": "above"},
    "cross_user_similarity": {"warn": 0.3, "critical": 0.5, "direction": "above"},
    "feedback_loop_gain": {"warn": 1.0, "critical": 1.5, "direction": "above"},
}


def check_alerts(metrics: dict) -> list[dict]:
    alerts = []
    for metric, thresholds in RISK_THRESHOLDS.items():
        value = metrics.get(metric)
        if value is None:
            continue
        if thresholds["direction"] == "above":
            if value >= thresholds["critical"]:
                alerts.append({"metric": metric, "value": value, "level": "CRITICAL"})
            elif value >= thresholds["warn"]:
                alerts.append({"metric": metric, "value": value, "level": "WARNING"})
        else:
            if value <= thresholds["critical"]:
                alerts.append({"metric": metric, "value": value, "level": "CRITICAL"})
            elif value <= thresholds["warn"]:
                alerts.append({"metric": metric, "value": value, "level": "WARNING"})
    return alerts
Output: A monitoring dashboard that tracks profile entropy, cycle-over-cycle overlap, and cross-user homogenization, alerting when echo chamber indicators exceed thresholds.
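The thresholds in this example include a `feedback_loop_gain` entry, but the skill does not show how that value is computed. One possible definition, offered purely as an assumption, is the ratio between the two most recent cycle-to-cycle changes of a tracked metric: a gain above 1 means the change is growing from cycle to cycle, and a gain below 1 means it is damping out.

```python
from typing import Optional


def feedback_loop_gain(series: list[float], eps: float = 1e-9) -> Optional[float]:
    """Ratio of the latest change in a metric to the change before it.

    Assumed definition for illustration; returns None when there are fewer
    than three observations or the previous change is effectively zero.
    """
    if len(series) < 3:
        return None
    prev_delta = series[-2] - series[-3]
    last_delta = series[-1] - series[-2]
    if abs(prev_delta) < eps:
        return None
    return last_delta / prev_delta


if __name__ == "__main__":
    # Made-up cross-user similarity values over four cycles.
    history = [0.20, 0.24, 0.32, 0.48]
    print(feedback_loop_gain(history))
```

Returning `None` for the undefined cases fits the alert checker above, which skips any metric whose value is `None`.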
Example 3: Designing a new LLM re-ranker with built-in risk controls
User: "I want to use an LLM to re-rank search results for our e-commerce site. How do I avoid the feedback loop problems?"
Approach:
# risk_aware_reranker.py
import json
import random


# format_products and rebalance_exposure are assumed to be defined elsewhere.
# llm_client.generate is assumed to return either parsed scores or the JSON text
# requested in the prompt.
def risk_aware_llm_rerank(
    query: str,
    candidates: list[dict],
    llm_client,
    exploration_rate: float = 0.1,
    max_popularity_ratio: float = 0.5,
):
    """LLM re-ranker with built-in feedback loop mitigations."""
    # Mitigation 1: Shuffle candidates before LLM scoring to reduce position bias
    shuffled = candidates.copy()
    random.shuffle(shuffled)

    # Mitigation 2: Prompt design that counters popularity anchoring
    prompt = f"""Score the relevance of each product to the query: "{query}"
Rate ONLY based on feature match to the query. Ignore popularity,
review count, or brand recognition. A niche product that perfectly
matches the query should score higher than a popular product that
partially matches.
Products:
{format_products(shuffled)}
Return scores as JSON: [{{"id": "...", "score": 0.0-1.0}}]"""
    raw = llm_client.generate(prompt)
    scores = json.loads(raw) if isinstance(raw, str) else raw

    # Join scores back onto the candidate records so popularity_tier is available
    by_id = {c["id"]: c for c in candidates}
    scored = sorted(
        ({**by_id[s["id"]], "score": s["score"]} for s in scores if s["id"] in by_id),
        key=lambda x: x["score"],
        reverse=True,
    )

    # Mitigation 3: Exposure fairness constraint
    top = scored[:10]
    head_count = sum(1 for s in top if s.get("popularity_tier") == "head")
    if top and head_count / len(top) > max_popularity_ratio:
        scored = rebalance_exposure(scored, max_popularity_ratio)

    # Mitigation 4: Exploration budget: swap random long-tail items into the lower half
    final = scored[:10]
    n_explore = max(1, int(len(final) * exploration_rate))
    shown = {s["id"] for s in final}
    long_tail = [
        c for c in candidates
        if c.get("popularity_tier") == "tail" and c["id"] not in shown
    ]
    random.shuffle(long_tail)
    lower_half = list(range(len(final) // 2, len(final)))
    positions = random.sample(lower_half, min(n_explore, len(lower_half)))
    for pos, item in zip(positions, long_tail):
        final[pos] = item  # distinct positions and items, so no duplicates
    return final
Output: A re-ranker that proactively mitigates position bias (shuffling), popularity anchoring (prompt design), exposure concentration (fairness constraint), and feedback loops (exploration budget).
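The re-ranker in this example relies on a `rebalance_exposure` helper that the skill leaves undefined. The sketch below shows one simple way it could work, assuming each scored item is a dict carrying a `popularity_tier` field and that the list is already sorted by score. It is a greedy cap, not a fairness-optimal re-ranking, and the `top_k` parameter is an addition for this sketch.

```python
def rebalance_exposure(scored: list[dict], max_popularity_ratio: float, top_k: int = 10) -> list[dict]:
    """Greedily cap the number of head items in the top-k, preserving score order otherwise."""
    max_head = int(top_k * max_popularity_ratio)
    top, deferred = [], []
    head_in_top = 0
    for item in scored:
        is_head = item.get("popularity_tier") == "head"
        if len(top) < top_k and (not is_head or head_in_top < max_head):
            top.append(item)
            head_in_top += int(is_head)
        else:
            # Head items over the cap, and everything after the top-k is full,
            # keep their relative order behind the top-k.
            deferred.append(item)
    # If too few non-head items exist to fill the top-k, the cap is relaxed
    # rather than returning a short list.
    return top + deferred


if __name__ == "__main__":
    items = [{"id": f"h{i}", "popularity_tier": "head"} for i in range(6)]
    items += [{"id": f"t{i}", "popularity_tier": "tail"} for i in range(6)]
    print([x["id"] for x in rebalance_exposure(items, 0.5, top_k=4)])
```

Deferring rather than dropping the excess head items keeps them available lower in the list, so the constraint limits exposure without removing relevant products outright.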
Do:
Avoid:
Park, D., Lee, D., & Lee, Y.-C. (2026). Echoes in the Loop: Diagnosing Risks in LLM-Powered Recommender Systems under Feedback Loops. arXiv:2602.07442. https://arxiv.org/abs/2602.07442v1
Key insight to look for: The role-aware decomposition (data augmentation / profiling / decision-making) combined with three-level measurement (content / ranking / ecosystem) across temporal phases, which reveals compounding risks invisible to single-snapshot evaluations.