skills/deep-search-hierarchical-meta-cognitive/SKILL.md
Implement hierarchical meta-cognitive monitoring for deep search agents. Embeds a two-tier self-monitoring system (fast consistency checks + slow experience-driven reflection) into multi-step retrieval-reasoning loops to detect anomalies, prevent reasoning drift, and trigger corrective interventions. Use when: 'build a deep search agent with self-monitoring', 'add metacognitive monitoring to my search pipeline', 'detect and fix reasoning failures in multi-step retrieval', 'implement DS-MCM for search quality', 'add anomaly detection to my RAG agent', 'build a self-correcting research agent'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills deep-search-hierarchical-meta-cognitive
This skill enables Claude to build and apply a two-tier metacognitive monitoring framework for deep search agents. Inspired by cognitive neuroscience — where the anterior cingulate cortex performs fast conflict detection and the prefrontal cortex handles deliberate control — DS-MCM embeds a Fast Consistency Monitor (lightweight anomaly detection at every reasoning step) and a Slow Experience-Driven Monitor (selectively triggered corrective reflection using past trajectory memory) directly into the retrieval-reasoning loop. This prevents common failure modes like confidence-evidence misalignment, query loops, contradiction blindness, and reasoning drift, while adding only 3-7% time overhead.
The Core Insight: A deep search agent operates correctly when its internal reasoning uncertainty is calibrated to the uncertainty of external evidence. When retrieved documents are highly ambiguous (high Searching Entropy), the model's reasoning should reflect proportional uncertainty (high Reasoning Entropy). A mismatch — e.g., the model is overconfident despite contradictory sources, or confused despite clear evidence — signals a metacognitive failure requiring intervention.
Two-Tier Architecture: The Fast Consistency Monitor runs at every step with minimal cost. It computes Searching Entropy (SE) by clustering retrieved documents in embedding space and measuring Shannon entropy over semantic clusters, and Reasoning Entropy (RE) from the token-level probability distributions during reasoning generation. A learned linear calibration function (RE_hat = a*SE + b) predicts expected reasoning entropy given evidence entropy. When the residual |RE - RE_hat| exceeds k standard deviations of the residuals observed on successful trajectories (k = 2), an anomaly is flagged.
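The quantities in this paragraph can be written compactly as below. The index names (cluster i, token position t, trace length T, top-K index k') and the symbol σ for the residual standard deviation are notation assumed here for readability; only the relationships themselves come from the text.

```latex
\begin{aligned}
\mathrm{SE} &= -\sum_{i} p(c_i)\,\log p(c_i) \\
\mathrm{RE} &= \frac{1}{T}\sum_{t=1}^{T}\Bigl(-\sum_{k'} p_{t,k'}\,\log p_{t,k'}\Bigr) \\
\widehat{\mathrm{RE}} &= a\,\mathrm{SE} + b \\
\text{anomaly} &\iff \bigl|\mathrm{RE} - \widehat{\mathrm{RE}}\bigr| > k\,\sigma, \qquad k = 2
\end{aligned}
```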
Experience-Driven Correction: Only when the Fast Monitor flags an anomaly does the Slow Experience-Driven Monitor activate. It maintains dual memory banks — M+ (successful trajectories) and M- (failed trajectories) — each storing session state, context summaries, and LLM-generated cognitive abstractions of what went right or wrong. The current anomalous state is encoded and matched against both banks via cosine similarity. A Critical Model receives the current state plus retrieved positive and negative examples, then outputs an error classification and a natural-language corrective suggestion that conditions the agent's next action. This keeps expensive reflection rare (triggered only on anomalies) while making corrections informed by concrete past experience.
Set up the ReAct reasoning-retrieval loop: Implement or identify the existing agent loop: query formulation -> reasoning/thought -> action (search/retrieve/terminate) -> observation/tool feedback -> next step. DS-MCM hooks into the gap between reasoning and action.
Implement Searching Entropy (SE) computation: After each retrieval step, take the top-K retrieved documents (K=5), embed them using a sentence embedding model, cluster them using k-means or DBSCAN in embedding space, compute cluster probability distribution p(c_i), and calculate Shannon entropy: SE = -sum(p_i * log(p_i)). High SE means retrieved evidence is semantically fragmented/contradictory.
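A minimal, dependency-free sketch of this step is shown here. It assumes embeddings arrive as plain lists of floats and swaps in a simple greedy cosine-threshold clustering for k-means or DBSCAN; the function names and the `sim_threshold` value are illustrative choices, not part of the original method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def greedy_cluster(embeddings, sim_threshold=0.8):
    """Assign each vector to the first cluster whose seed it resembles.

    Stand-in for k-means/DBSCAN; sim_threshold is an assumed value.
    """
    seeds, labels = [], []
    for vec in embeddings:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= sim_threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough: start a new one.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels

def searching_entropy(labels):
    """Shannon entropy (natural log) of the cluster-size distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Five toy 2-D "embeddings": three point one way, two point another.
docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
labels = greedy_cluster(docs)
se = searching_entropy(labels)
```

With a real sentence-embedding model, a library clusterer can replace `greedy_cluster`, and `searching_entropy` stays the same because it only needs the cluster labels.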
Implement Reasoning Entropy (RE) computation: During the reasoning generation step, capture the token-level probability distributions over the top-K vocabulary tokens. Compute per-token entropy and average across the generated reasoning trace: RE = mean(-sum(p_k * log(p_k))) for each token position. High RE means the model is internally uncertain.
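One way to compute this, sketched under the assumption that the model API returns a list of top-K log-probabilities for every generated token. Renormalizing over the top-K entries is an approximation introduced here, since the truncated tail of the vocabulary is unavailable; the function names are illustrative.

```python
import math

def token_entropy(top_logprobs):
    """Entropy of one token position, renormalized over the supplied top-K logprobs."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy

def reasoning_entropy(trace_logprobs):
    """Mean per-token entropy over a reasoning trace."""
    if not trace_logprobs:
        return 0.0
    return sum(token_entropy(pos) for pos in trace_logprobs) / len(trace_logprobs)

# Toy trace with two token positions (values invented for illustration).
trace = [
    [math.log(0.97), math.log(0.02), math.log(0.01)],  # confident token
    [math.log(0.25)] * 4,                              # uniform over 4 candidates
]
re_value = reasoning_entropy(trace)
```

A uniform distribution over four candidates gives the maximum entropy log(4) for that position, while a near-deterministic token contributes almost nothing, so the mean tracks how often the model hesitates.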
Fit the calibration function: Collect a set of successful agent trajectories (even 50-100 is useful). For each step, compute (SE, RE) pairs. Fit a simple linear regression RE_hat = a*SE + b. Compute the standard deviation sigma of residuals epsilon = RE - RE_hat across successful trajectories. Set the anomaly threshold tau = 2 * sigma.
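The fit described in this step can be sketched with ordinary least squares in plain Python. The sample pairs are invented for illustration, and using the population standard deviation (dividing by n) for sigma is an assumption; the text does not say which estimator is intended.

```python
import math

def fit_calibration(pairs, k=2.0):
    """Least-squares fit of RE_hat = a*SE + b; returns (a, b, sigma, tau)."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two (SE, RE) pairs")
    mean_se = sum(se for se, _ in pairs) / n
    mean_re = sum(re for _, re in pairs) / n
    var_se = sum((se - mean_se) ** 2 for se, _ in pairs)
    if var_se == 0:
        raise ValueError("SE values are all identical; slope is undefined")
    cov = sum((se - mean_se) * (re - mean_re) for se, re in pairs)
    a = cov / var_se
    b = mean_re - a * mean_se
    residuals = [re - (a * se + b) for se, re in pairs]
    # Least-squares residuals have zero mean, so sigma is their root mean square.
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    return a, b, sigma, k * sigma

# Hypothetical (SE, RE) pairs taken from steps of successful trajectories.
pairs = [(0.0, 0.2), (0.5, 0.4), (1.0, 0.9), (1.5, 1.1)]
a, b, sigma, tau = fit_calibration(pairs)
```

The returned `tau` is the anomaly threshold tau = 2 * sigma, and `a`, `b`, `tau` are the only values the live monitor needs to keep.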
Deploy the Fast Consistency Monitor: At each reasoning step in the live agent, compute SE and RE, calculate the residual epsilon = RE - RE_hat, and flag an anomaly if |epsilon| > tau. Log the anomaly type: overconfident (RE too low for given SE) or under-confident (RE too high for given SE).
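The per-step check reduces to a few lines once a, b, and tau are known. The sketch below keeps the residual signed so the anomaly type can be logged; the function name and the calibration values in the usage lines are made up for illustration.

```python
def fast_consistency_check(se, re, a, b, tau):
    """Return (is_anomaly, anomaly_type, epsilon) for one reasoning step."""
    epsilon = re - (a * se + b)  # signed residual RE - RE_hat
    if abs(epsilon) <= tau:
        return False, None, epsilon
    # RE below the calibrated expectation means the model is too sure of itself.
    kind = "overconfident" if epsilon < 0 else "under-confident"
    return True, kind, epsilon

# Illustrative calibration values, not taken from the paper.
A, B, TAU = 0.64, 0.17, 0.13
calm = fast_consistency_check(se=1.0, re=0.85, a=A, b=B, tau=TAU)
over = fast_consistency_check(se=1.5, re=0.30, a=A, b=B, tau=TAU)
under = fast_consistency_check(se=0.1, re=1.20, a=A, b=B, tau=TAU)
```

Here `calm` stays within the threshold, `over` is flagged as overconfident (fragmented evidence, low reasoning entropy), and `under` is flagged as under-confident (clear evidence, high reasoning entropy).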
Build the experience memory banks: From historical trajectories, label each step as belonging to a successful or failed trajectory. For each step, generate a cognitive abstraction using an LLM prompt like: "Summarize what cognitive behavior (search strategy, reasoning pattern, error type) this step exhibits." Store entries as (state, context_summary, abstraction, label) in a vector index (FAISS or similar) keyed by the state embedding.
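A small in-memory version of such a bank is sketched here to show the entry layout and the similarity lookup. It uses brute-force cosine search as a stand-in for FAISS; `MemoryEntry`, `MemoryBank`, and the toy embeddings are hypothetical names and values introduced for this example.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    embedding: list       # state embedding, used as the search key
    state: str
    context_summary: str
    abstraction: str      # LLM-generated cognitive abstraction
    label: str            # "success" or "failure"

class MemoryBank:
    """Brute-force cosine-similarity store; a stand-in for a FAISS-style index."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    @staticmethod
    def _cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    def search(self, query_embedding, k=2):
        """Return up to k (similarity, entry) pairs, most similar first."""
        scored = [(self._cosine(query_embedding, e.embedding), e) for e in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

# M+ holds successful steps; an M- bank for failures is built the same way.
m_plus = MemoryBank()
m_plus.add(MemoryEntry([1.0, 0.0], "s1", "ctx1", "decomposed the question", "success"))
m_plus.add(MemoryEntry([0.0, 1.0], "s2", "ctx2", "cross-checked two sources", "success"))
hits = m_plus.search([0.9, 0.1], k=1)
```

Brute-force search is adequate for a few thousand entries; a vector index becomes worthwhile only when the banks grow large.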
Implement the Slow Experience-Driven Monitor: When an anomaly is detected, encode the current agent state, retrieve the top-2 most similar entries from M+ and top-2 from M-. Feed the current state plus these four examples to a Critical Model with a prompt: "Given this search state and these examples of past successes and failures, identify whether a cognitive error is occurring and suggest a specific corrective action." The output is (error_flag, correction_suggestion).
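The prompt assembly and output handling for this step might look like the sketch below. It assumes the retrieved examples are already available as strings and that the Critical Model is any callable mapping a prompt to text. The JSON reply format and the `fake_critic` stub are assumptions added so the output can be parsed and the example can run without a model.

```python
import json

CRITIC_INSTRUCTION = (
    "Given this search state and these examples of past successes and failures, "
    "identify whether a cognitive error is occurring and suggest a specific "
    "corrective action."
)

def build_critic_prompt(state, positives, negatives):
    """Combine the current state with retrieved M+ and M- examples."""
    lines = [CRITIC_INSTRUCTION, "", f"Current state: {state}", "", "Similar successes:"]
    lines += [f"- {p}" for p in positives]
    lines += ["", "Similar failures:"]
    lines += [f"- {n}" for n in negatives]
    # The JSON reply format is an assumption made here so the reply is parseable.
    lines += ["", 'Reply as JSON: {"error": true|false, "correction": "<text>"}']
    return "\n".join(lines)

def slow_monitor(state, positives, negatives, critic):
    """Return (error_flag, correction_suggestion); critic is a prompt -> str callable."""
    reply = critic(build_critic_prompt(state, positives, negatives))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False, None  # unparseable reply: do not intervene
    if not isinstance(parsed, dict):
        return False, None
    error_flag = bool(parsed.get("error", False))
    correction = parsed.get("correction") if error_flag else None
    return error_flag, correction

def fake_critic(prompt):
    """Canned reply standing in for a real Critical Model call."""
    return json.dumps({"error": True, "correction": "Reformulate the query with different terms."})

flag, suggestion = slow_monitor(
    state="step 4: third near-identical query about X",
    positives=["decomposed the question into sub-questions", "switched source type"],
    negatives=["repeated the same query with synonyms", "ignored a contradiction"],
    critic=fake_critic,
)
```

Falling back to "no error" on an unparseable reply is a conservative design choice: a malformed critic response then never injects noise into the agent's context.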
Apply corrective interventions: If the Critical Model identifies an error, append the correction suggestion to the agent's context before generating the next action. Typical corrections include: reformulate the query with different terms, broaden/narrow the search scope, resolve a specific contradiction between sources, or abandon a dead-end line of investigation.
Implement online memory consolidation (optional): After each completed trajectory, generate a cognitive abstraction for the new experience and add it to M+ or M- based on outcome. Deduplicate against existing entries using an embedding similarity threshold (tau_dup = 0.95).
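Consolidation with deduplication can be sketched as follows, assuming each bank is a plain list of (embedding, abstraction) pairs. Only the 0.95 threshold comes from the text; the helper names and toy vectors are illustrative.

```python
import math

TAU_DUP = 0.95  # duplicate threshold given in the text

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def consolidate(bank, embedding, abstraction, tau_dup=TAU_DUP):
    """Append (embedding, abstraction) unless a near-duplicate exists; return True if added."""
    for existing_embedding, _ in bank:
        if cosine(embedding, existing_embedding) >= tau_dup:
            return False
    bank.append((embedding, abstraction))
    return True

m_plus, m_minus = [], []

def record_trajectory(embedding, abstraction, succeeded):
    """Route a finished trajectory to M+ or M- based on its outcome."""
    target = m_plus if succeeded else m_minus
    return consolidate(target, embedding, abstraction)

added_first = record_trajectory([1.0, 0.0], "decomposed question", succeeded=True)
added_dup = record_trajectory([0.99, 0.05], "decomposed question again", succeeded=True)
added_fail = record_trajectory([0.0, 1.0], "looped on one query", succeeded=False)
```

The second call is rejected because its embedding is nearly parallel to the first, which keeps the banks from filling up with restatements of the same experience.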
Monitor and tune: Track anomaly trigger rate (should be 10-25% of steps), false positive rate, and correction acceptance rate. Adjust k (anomaly sensitivity) and calibration parameters based on observed performance.
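A lightweight counter is enough to track two of these metrics during a run. The sketch below is one possible shape; `MonitorStats` and its fields are hypothetical. The false positive rate is left out because it needs ground-truth labels for whether each flagged step was really an error.

```python
from dataclasses import dataclass

@dataclass
class MonitorStats:
    steps: int = 0
    anomalies: int = 0
    corrections_offered: int = 0
    corrections_accepted: int = 0

    def record(self, is_anomaly, correction_offered=False, correction_accepted=False):
        """Update the counters after one agent step."""
        self.steps += 1
        self.anomalies += int(is_anomaly)
        self.corrections_offered += int(correction_offered)
        self.corrections_accepted += int(correction_accepted)

    @property
    def trigger_rate(self):
        return self.anomalies / self.steps if self.steps else 0.0

    @property
    def acceptance_rate(self):
        if not self.corrections_offered:
            return 0.0
        return self.corrections_accepted / self.corrections_offered

    def trigger_rate_in_band(self, low=0.10, high=0.25):
        """True when the trigger rate falls in the 10-25% band given in the text."""
        return low <= self.trigger_rate <= high

# Simulated run: 4 of 20 steps flagged, 3 of the 4 corrections accepted.
stats = MonitorStats()
for i in range(20):
    flagged = i % 5 == 0
    stats.record(flagged, correction_offered=flagged, correction_accepted=flagged and i != 0)
```

If the trigger rate drifts above the band, raising k makes the fast monitor less sensitive; if it falls well below, lowering k or refitting the calibration on fresher trajectories is the natural first adjustment.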
Example 1: Building a self-correcting research agent
User: "Build me a research agent that can search the web for complex questions and detect when it's going down the wrong path."
Approach:
Output structure:
class DSMCMAgent:
    """Skeleton: llm, search_tool, embedder, and FAISSIndex are caller-supplied
    placeholder interfaces, not a specific library's API."""

    def __init__(self, llm, search_tool, embedder, calibration_params):
        self.llm = llm
        self.search_tool = search_tool
        self.embedder = embedder
        self.a, self.b, self.sigma = calibration_params  # from fitted calibration
        self.memory_positive = FAISSIndex()  # M+: successful trajectories
        self.memory_negative = FAISSIndex()  # M-: failed trajectories

    def fast_monitor(self, retrieved_docs, reasoning_logprobs):
        # compute_searching_entropy / compute_reasoning_entropy are not shown here
        se = self.compute_searching_entropy(retrieved_docs)
        re = self.compute_reasoning_entropy(reasoning_logprobs)
        re_hat = self.a * se + self.b  # expected RE for this SE
        epsilon = re - re_hat  # signed: negative = overconfident, positive = under-confident
        return abs(epsilon) > 2 * self.sigma, {"se": se, "re": re, "epsilon": epsilon}

    def slow_monitor(self, current_state):
        # Runs only after the fast monitor flags an anomaly
        embedding = self.embedder.encode(current_state)
        pos_examples = self.memory_positive.search(embedding, k=2)
        neg_examples = self.memory_negative.search(embedding, k=2)
        correction = self.llm.generate(
            f"Current state: {current_state}\n"
            f"Similar successes: {pos_examples}\n"
            f"Similar failures: {neg_examples}\n"
            "Identify any cognitive error and suggest a correction."
        )
        return correction

    def step(self, query, history):
        docs = self.search_tool.search(query)
        reasoning, logprobs = self.llm.reason(query, docs, history, return_logprobs=True)
        is_anomaly, metrics = self.fast_monitor(docs, logprobs)
        correction = None
        if is_anomaly:
            correction = self.slow_monitor(self.format_state(query, docs, reasoning))
        # correction stays None on normal steps, so the agent proceeds unmodified
        action = self.llm.decide_action(query, docs, reasoning, correction=correction)
        return action, metrics
Example 2: Adding anomaly detection to an existing RAG pipeline
User: "My RAG agent sometimes gives confident but wrong answers when sources conflict. Add monitoring."
Approach:
Output:
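import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative placeholder thresholds (assumed, not from the paper); tune on your own data.
SE_THRESHOLD = 0.8
RE_THRESHOLD = 0.2

def monitor_rag_response(retrieved_chunks, query, llm, embedder):
    # Searching Entropy: Shannon entropy over semantic clusters of the chunks
    embeddings = embedder.encode([c.text for c in retrieved_chunks])
    n_clusters = min(3, len(retrieved_chunks))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    cluster_probs = np.bincount(labels) / len(labels)
    se = -np.sum(cluster_probs * np.log(cluster_probs + 1e-10))

    # Reasoning Entropy (sampling proxy): disagreement among sampled answers
    samples = [llm.generate(query, retrieved_chunks, temperature=0.7) for _ in range(3)]
    sample_embeddings = embedder.encode(samples)
    pairwise_sim = cosine_similarity(sample_embeddings)
    # Exclude the diagonal (self-similarity is always 1) before averaging
    off_diagonal = pairwise_sim[~np.eye(len(samples), dtype=bool)]
    re_proxy = 1.0 - off_diagonal.mean()  # high agreement = low entropy

    # Anomaly: high evidence uncertainty but low reasoning uncertainty
    if se > SE_THRESHOLD and re_proxy < RE_THRESHOLD:
        return {"anomaly": True, "type": "overconfident_despite_contradiction",
                "correction": "Sources conflict. Identify contradictions explicitly."}
    return {"anomaly": False}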
Example 3: Preventing query loops in a multi-step search agent
User: "My agent keeps searching for the same thing with slightly different queries and never converges."
Approach:
Output:
Step 5 anomaly detected:
  SE trend: [2.1, 2.0, 2.1, 1.9, 2.0] (flat - no information gain)
  Query similarity to previous: 0.92 (near-duplicate)
  Correction from experience memory:
    "Past success pattern: When search stalls, decompose the question into
    sub-questions and search for each independently. Similar failure avoided
    by switching from 'who invented X' to 'history of X' + 'patents filed for X'."
  -> Agent reformulates into two targeted sub-queries
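The two signals in this output, a flat SE trend and a near-duplicate query, can be combined into a simple stall detector. The sketch below uses bag-of-words cosine similarity as a stand-in for embedding similarity, which produces different numbers than the 0.92 shown above. The window size and both thresholds are assumed values.

```python
import math
from collections import Counter

def query_similarity(q1, q2):
    """Cosine similarity between bag-of-words vectors of two queries."""
    a, b = Counter(q1.lower().split()), Counter(q2.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def is_stalled(se_history, queries, window=5, se_range=0.3, sim_threshold=0.7):
    """Flag a loop when SE stays flat and the latest query nearly repeats the previous one."""
    if len(se_history) < window or len(queries) < 2:
        return False
    recent = se_history[-window:]
    flat = max(recent) - min(recent) <= se_range  # no information gain
    repeated = query_similarity(queries[-1], queries[-2]) >= sim_threshold
    return flat and repeated

se_trend = [2.1, 2.0, 2.1, 1.9, 2.0]
looping = is_stalled(se_trend, ["who invented X", "who first invented X"])
reformulated = is_stalled(se_trend, ["history of X", "patents filed for X"])
```

Requiring both conditions avoids flagging legitimate follow-up queries: a repeated query that still changes the evidence distribution, or a flat SE trend reached through genuinely different queries, does not count as a loop.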
Sun, Z., Wang, Q., Yu, W., Yang, J., & Lu, H. (2026). Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience. arXiv:2601.23188v1. https://arxiv.org/abs/2601.23188v1
Key sections to study: the Searching Entropy / Reasoning Entropy calibration framework (Section 3.2), the dual experience memory architecture (Section 3.3), and the ablation study showing that the two-tier hierarchy outperforms always-on reflection while adding 3-7% time overhead, compared with 12-22% for always-on reflection (Section 4.3).