skills/curiosity-driven-knowledge-retrieval/SKILL.md
Implements a curiosity-driven knowledge retrieval framework for autonomous agents. Formalizes agent uncertainty as a curiosity score, triggers external knowledge retrieval when uncertainty exceeds a threshold, and organizes retrieved knowledge into structured AppCards for selective integration into reasoning. Trigger phrases: "build an agent with curiosity-driven retrieval", "add uncertainty-aware knowledge lookup", "implement AppCard knowledge system", "create a curiosity-scored agent pipeline", "build adaptive knowledge retrieval", "implement uncertainty-triggered documentation lookup"
npx skillsauth add ndpvt-web/arxiv-claude-skills curiosity-driven-knowledge-retrieval
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to build agent systems that detect their own knowledge gaps during task execution and autonomously retrieve relevant external knowledge to fill them. The core technique, from "Curiosity Driven Knowledge Retrieval for Mobile Agents" (Li et al., 2026), formalizes uncertainty as a curiosity score using Jensen-Shannon divergence between predicted and observed states. When cumulative uncertainty for a domain exceeds a threshold, the system retrieves documentation, code, and historical trajectories, then organizes them into AppCards — structured knowledge cards encoding functional semantics, parameter conventions, interface mappings, and interaction patterns. The agent selectively injects relevant AppCards into its reasoning context, reducing ambiguity and shortening exploration.
Curiosity Score as Retrieval Gate. Traditional RAG retrieves on every query. This framework instead monitors the agent's epistemic uncertainty continuously. It models the agent's predicted next state as a prior distribution P and the actual observed state as a posterior Q. The divergence between them — computed as a tail-adjusted Jensen-Shannon divergence JS*(P||Q) = JS(P||Q) + λ · ½(P(<OTHER>) + Q(<OTHER>)) — quantifies information gain per step. The tail adjustment captures uncertainty from low-probability tokens that standard top-K sampling misses. Cumulative uncertainty per domain accumulates as U(domain) = Σ γ_t · JS*(P_t || Q_t), where γ_t is a decay-weighted coefficient. When U(domain) > τ, retrieval fires.
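Written out, the quantities in this paragraph look as follows. The mixture M and the Kullback-Leibler form are the standard definition of JS divergence, and the base-2 logarithm matches the Python example later in this skill. The geometric form of γ_t is an assumption, chosen because it is what a recursive update of the form U ← γ·U + JS* amounts to.

```latex
\begin{aligned}
M &= \tfrac{1}{2}(P + Q), \\
\mathrm{JS}(P \,\|\, Q) &= \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
  \qquad 0 \le \mathrm{JS}(P \,\|\, Q) \le 1 \ \ (\log_2), \\
\mathrm{JS}^{*}(P \,\|\, Q) &= \mathrm{JS}(P \,\|\, Q)
  + \lambda \cdot \tfrac{1}{2}\bigl(P(\langle\mathrm{OTHER}\rangle) + Q(\langle\mathrm{OTHER}\rangle)\bigr), \\
U(\mathrm{domain}) &= \sum_{t=1}^{T} \gamma_t \,\mathrm{JS}^{*}(P_t \,\|\, Q_t),
  \qquad \gamma_t = \gamma^{\,T-t} \ \ \text{(assumed)}.
\end{aligned}
```

Under these definitions a single step contributes at most 1 + λ, so with a small λ such as 0.1 a threshold like τ = 1.5 can only be crossed by uncertainty that persists over several steps.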
AppCards as Structured Knowledge. Retrieved content is not injected as raw text. Instead, it is consolidated into AppCards: compact, standardized knowledge cards with four fields — (1) functional semantics (what the tool/API does), (2) input/output constraints (parameter types, return values, boundary conditions), (3) interface-functionality mapping (UI elements or endpoints mapped to their triggered functions), and (4) interaction patterns (common workflows and exception handling). Each AppCard contains 5-10 entries covering major modules or workflows, not micro-actions. This structure prevents prompt pollution and allows the agent to selectively integrate only the relevant fragments.
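One possible in-memory shape for an AppCard is sketched below. The class and field names (`AppCardEntry`, `AppCard`, `render`, `max_entries`) are hypothetical; only the four fields and the 5-10 entry guideline come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AppCardEntry:
    """One module- or workflow-level entry (hypothetical schema)."""
    name: str
    functional_semantics: str   # what the tool/API does
    io_constraints: str         # parameter types, return values, boundary conditions
    interface_mapping: str      # UI element or endpoint mapped to its triggered function
    interaction_pattern: str    # common workflow and exception handling


@dataclass
class AppCard:
    domain: str
    entries: List[AppCardEntry] = field(default_factory=list)
    max_entries: int = 10       # upper end of the 5-10 entries guideline

    def add(self, entry: AppCardEntry) -> None:
        """Add an entry, replacing any existing entry with the same name."""
        self.entries = [e for e in self.entries if e.name != entry.name]
        if len(self.entries) >= self.max_entries:
            raise ValueError(f"AppCard '{self.domain}' is full ({self.max_entries} entries)")
        self.entries.append(entry)

    def render(self) -> str:
        """Render the card as compact text for prompt injection."""
        lines = [f"AppCard: {self.domain}"]
        for i, e in enumerate(self.entries, start=1):
            lines.append(
                f"{i}. {e.name}: {e.functional_semantics}; {e.io_constraints}; "
                f"{e.interface_mapping}; {e.interaction_pattern}"
            )
        return "\n".join(lines)
```

Keeping entries as typed records rather than free text makes it straightforward to render only some of them, which is what keeps the injected context small.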
Selective Integration. During execution, the agent identifies which domains are needed for the current task, retrieves their AppCards, and injects them into its system prompt. This is not a blanket context dump — only AppCards for domains with high cumulative uncertainty are included. The result: the agent compensates for knowledge gaps without overloading its context window.
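A minimal sketch of that selection step follows. The function names are hypothetical, AppCards are assumed to be already rendered to text, and `flagged` is assumed to be the set of domains whose cumulative uncertainty has crossed τ during the task (tracked separately, since the workflow in this skill resets U after retrieval).

```python
from typing import Dict, Iterable, List, Set


def select_appcards(task_domains: Iterable[str],
                    flagged: Set[str],
                    cards: Dict[str, str]) -> List[str]:
    """Return rendered AppCards for task domains that were flagged as uncertain."""
    selected = []
    for domain in task_domains:
        # Include a card only if the domain is both in use and flagged.
        if domain in flagged and domain in cards:
            selected.append(cards[domain])
    return selected


def build_system_prompt(base_prompt: str, appcards: List[str]) -> str:
    """Prepend the selected AppCards; leave the prompt unchanged if none apply."""
    if not appcards:
        return base_prompt
    return "\n\n".join(appcards + [base_prompt])
```

Domains the agent handles without surprises never reach the prompt, even if an AppCard for them exists.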
Define the domain registry. Enumerate every tool, API, or application the agent may interact with. For each, assign a domain key (e.g., "stripe-api", "gallery-app", "kubernetes-cli"). Initialize cumulative uncertainty U[domain] = 0 for each.
Implement the state prediction module. Before the agent takes an action, have it predict the expected outcome (next state, API response shape, UI state). Capture this as a probability distribution P over possible outcomes. Use the LLM's own token probabilities or a lightweight classifier.
Observe the actual outcome. After the action executes, capture the real result as distribution Q. For discrete outcomes, build Q from the observed state. For text-based outcomes, use token-level probability distributions from a forward pass.
Compute the curiosity score. Calculate the tail-adjusted JS divergence between P and Q. Aggregate low-probability tokens into an <OTHER> bucket and add the tail penalty weighted by λ (start with λ=0.1). Accumulate into U[domain] with temporal decay (γ=0.9 per step).
Check the retrieval threshold. If U[domain] > τ (start with τ=1.5, tune per domain), trigger knowledge retrieval for that domain. Reset U[domain] after retrieval to avoid redundant fetches.
Retrieve from heterogeneous sources. Query three source types in parallel: (a) web documentation and API guides (D_web), (b) source code and function definitions (D_git), (c) historical execution trajectories from past successful runs (D_traj). Use the domain key and current task context as the retrieval query.
Build or update the AppCard. Consolidate retrieved content into a structured AppCard with four sections: functional semantics, input/output constraints, interface-functionality mappings, and interaction patterns. Limit to 5-10 entries per card. If an AppCard already exists for this domain, merge new information rather than replacing.
Inject the AppCard into agent context. Prepend the relevant AppCard(s) to the agent's system prompt or tool description before the next reasoning step. Include only AppCards for domains actively involved in the current task and flagged by high uncertainty.
Execute with augmented knowledge. The agent proceeds with the AppCard-enriched context. Monitor whether uncertainty drops after integration — if U[domain] remains high after AppCard injection, escalate to a human or flag the task as requiring manual documentation.
Log and iterate. Store the task trajectory (actions, states, curiosity scores, retrieved AppCards) as a new entry in D_traj for future retrieval. Over time, the trajectory store becomes the most valuable retrieval source.
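Several of the steps above (domain registry, accumulation, threshold check, retrieval, AppCard merge, logging) can be sketched as one small loop. This is a sketch under stated assumptions, not a reference implementation: `RetrievalGate`, the `Retriever` signature, and AppCard entries stored as plain strings are all hypothetical; curiosity scores are passed in precomputed; and the sources are queried one after another for simplicity rather than in parallel.

```python
from typing import Callable, Dict, List

# Hypothetical retriever signature: (domain, task_context) -> list of snippets
Retriever = Callable[[str, str], List[str]]


class RetrievalGate:
    """Accumulate per-domain curiosity scores and fire retrieval above a threshold."""

    def __init__(self, domains: List[str], retrievers: Dict[str, Retriever],
                 threshold: float = 1.5, decay: float = 0.9):
        self.uncertainty = {d: 0.0 for d in domains}  # domain registry, U[domain] = 0
        self.retrievers = retrievers                  # e.g. D_web, D_git, D_traj
        self.threshold = threshold
        self.decay = decay
        self.appcards: Dict[str, List[str]] = {}      # domain -> card entries
        self.trajectory_log: List[dict] = []          # stands in for new D_traj entries

    def step(self, domain: str, score: float, task_context: str) -> bool:
        """Record one step's curiosity score; return True if retrieval fired."""
        u = self.decay * self.uncertainty[domain] + score
        self.uncertainty[domain] = u
        fired = u > self.threshold
        if fired:
            snippets: List[str] = []
            for source in self.retrievers.values():
                snippets.extend(source(domain, task_context))
            self._merge(domain, snippets)
            self.uncertainty[domain] = 0.0            # reset to avoid redundant fetches
        self.trajectory_log.append(
            {"domain": domain, "score": score, "U": u, "retrieved": fired}
        )
        return fired

    def _merge(self, domain: str, snippets: List[str], max_entries: int = 10) -> None:
        """Merge new snippets into the existing card, skipping duplicates."""
        card = self.appcards.setdefault(domain, [])
        for snippet in snippets:
            if snippet not in card and len(card) < max_entries:
                card.append(snippet)
```

Merging into the existing card instead of replacing it means knowledge gathered in earlier retrievals survives later ones, up to the entry cap.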
Example 1: API automation agent hitting an unknown endpoint
User: Build an agent that automates Stripe payment workflows. It should
handle subscriptions, invoices, and refunds.
Approach:
1. Define domains: ["stripe-subscriptions", "stripe-invoices", "stripe-refunds"]
2. Agent starts a subscription creation task. It predicts the API will
return a Subscription object with `status: "active"`.
3. Actual response includes `status: "incomplete"` with a `pending_setup_intent`.
JS divergence spikes; cumulative uncertainty U for stripe-subscriptions rises to 1.8.
4. Threshold exceeded (τ=1.5). Retrieve from Stripe docs, stripe-python
source code, and past successful subscription trajectories.
5. Build AppCard:
AppCard: stripe-subscriptions
1. Create: POST /v1/subscriptions — requires customer, price; returns
Subscription with status in [incomplete, active, past_due, canceled]
2. PaymentIntent: incomplete status means payment_intent needs confirmation;
use latest_invoice.payment_intent to retrieve and confirm
3. Lifecycle: incomplete → active (on payment) or incomplete_expired (after 23h)
4. Trials: Set trial_period_days or trial_end; status becomes "trialing"
5. Webhooks: Listen for customer.subscription.updated, invoice.paid
6. Agent re-plans with AppCard context, correctly handles the
pending_setup_intent, and completes the subscription flow.
Example 2: Multi-tool task with cross-domain uncertainty
User: Create an agent that reads data from a Google Sheet, processes it,
and posts a summary to Slack.
Approach:
1. Domains: ["google-sheets-api", "data-processing", "slack-api"]
2. Agent reads the sheet successfully (low uncertainty for google-sheets-api).
3. Agent attempts to post to Slack using chat.postMessage with blocks.
It predicts a 200 OK but receives {"error": "invalid_blocks"}.
Cumulative uncertainty U for slack-api: 2.1 (exceeds τ).
4. Retrieve Slack Block Kit documentation, slack-sdk source, and past
Slack posting trajectories.
5. Build AppCard:
AppCard: slack-api
1. PostMessage: chat.postMessage — channel (required), text (fallback),
blocks (array of Block objects, max 50)
2. BlockKit: Each block needs "type" field; section blocks require
"text" object with "type":"mrkdwn" and "text" string
3. RichText: For formatted messages, use rich_text block type with
rich_text_section elements
4. Errors: invalid_blocks means malformed JSON in blocks array;
validate in Block Kit Builder first
5. RateLimit: Tier 2 = 20 requests/min; use retry-after header
6. Agent rebuilds the Slack message with valid Block Kit JSON and posts
successfully.
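The rules in the slack-api AppCard above can be turned into a cheap local pre-check before the agent calls chat.postMessage. The sketch below encodes only the rules listed in that AppCard (a "type" on every block, a "text" object on section blocks, at most 50 blocks). `check_blocks` is a hypothetical helper, not part of any Slack SDK, and the real Block Kit schema has more cases than these.

```python
from typing import Any, Dict, List


def check_blocks(blocks: List[Dict[str, Any]], max_blocks: int = 50) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(blocks) > max_blocks:
        problems.append(f"too many blocks: {len(blocks)} > {max_blocks}")
    for i, block in enumerate(blocks):
        if "type" not in block:
            problems.append(f"block {i}: missing 'type'")
            continue
        if block["type"] == "section":
            text = block.get("text")
            # Per the AppCard: a section needs a text object with its own type and text.
            if (not isinstance(text, dict) or "type" not in text
                    or not isinstance(text.get("text"), str)):
                problems.append(
                    f"block {i}: section needs a 'text' object with 'type' and 'text'"
                )
    return problems


summary_blocks = [
    {"type": "section", "text": {"type": "mrkdwn", "text": "*Summary*: 42 rows processed"}}
]
print(check_blocks(summary_blocks))
```

A failed pre-check is also a useful signal in its own right: it tells the agent the AppCard was not applied correctly before any API call is spent.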
Example 3: Implementing the curiosity scoring module in Python
import numpy as np


def compute_curiosity_score(predicted_probs: dict, observed_probs: dict,
                            lambda_tail: float = 0.1, top_k: int = 50) -> float:
    """
    Compute tail-adjusted Jensen-Shannon divergence between predicted
    and observed outcome distributions.

    Args:
        predicted_probs: {outcome: probability} from agent's prediction
        observed_probs: {outcome: probability} from actual observation
        lambda_tail: weight for the tail penalty term
        top_k: number of top outcomes to keep; rest go to <OTHER>
    """
    def bucket_distribution(probs, k):
        # Keep the k most probable outcomes and fold the remaining mass into <OTHER>.
        sorted_items = sorted(probs.items(), key=lambda x: -x[1])[:k]
        top = dict(sorted_items)
        other_mass = max(0.0, 1.0 - sum(top.values()))
        top["<OTHER>"] = other_mass
        return top

    P = bucket_distribution(predicted_probs, top_k)
    Q = bucket_distribution(observed_probs, top_k)

    # Union of all keys. Missing or zero entries are floored at eps so the
    # logarithms below stay finite (an exact 0 would produce NaN).
    eps = 1e-10
    all_keys = sorted(set(P) | set(Q))
    p = np.array([max(P.get(k, 0.0), eps) for k in all_keys])
    q = np.array([max(Q.get(k, 0.0), eps) for k in all_keys])

    # Normalize
    p = p / p.sum()
    q = q / q.sum()

    # Jensen-Shannon divergence (base 2, so 0 <= js <= 1)
    m = 0.5 * (p + q)
    js = 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))

    # Tail penalty
    tail_penalty = lambda_tail * 0.5 * (P.get("<OTHER>", 0) + Q.get("<OTHER>", 0))
    return float(js + tail_penalty)


class CuriosityTracker:
    """Track cumulative uncertainty per domain and trigger retrieval."""

    def __init__(self, threshold: float = 1.5, decay: float = 0.9):
        self.threshold = threshold
        self.decay = decay
        self.uncertainty = {}  # domain -> cumulative score

    def update(self, domain: str, predicted: dict, observed: dict) -> bool:
        """Update uncertainty for a domain. Returns True if retrieval needed."""
        score = compute_curiosity_score(predicted, observed)
        current = self.uncertainty.get(domain, 0.0)
        self.uncertainty[domain] = self.decay * current + score
        return self.uncertainty[domain] > self.threshold

    def reset(self, domain: str):
        """Reset after successful retrieval."""
        self.uncertainty[domain] = 0.0
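To see how the threshold and decay in the tracker above interact, it helps to work through the update U ← γ·U + s for a constant per-step score s. This is a sketch for building intuition only: real scores vary from step to step, and `steady_state` and `steps_to_trigger` are hypothetical helper names.

```python
from typing import Optional


def steady_state(score: float, decay: float) -> float:
    """Limit of U <- decay * U + score for a constant score (geometric series)."""
    return score / (1.0 - decay)


def steps_to_trigger(score: float, threshold: float = 1.5, decay: float = 0.9,
                     max_steps: int = 1000) -> Optional[int]:
    """Number of steps until U exceeds the threshold, or None if it never does."""
    u = 0.0
    for step in range(1, max_steps + 1):
        u = decay * u + score
        if u > threshold:
            return step
    return None


for s in (0.1, 0.2, 0.9):
    print(s, steady_state(s, 0.9), steps_to_trigger(s))
```

With γ = 0.9 a constant score s settles at 10·s, so a domain whose per-step score stays above 0.15 will eventually cross τ = 1.5, while one that stays below it never triggers retrieval.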
The trajectory store (D_traj) becomes the highest-value retrieval source over time because it captures proven action sequences.

| Problem | Cause | Fix |
|---|---|---|
| Retrieval triggers constantly | Threshold τ too low or decay γ too high | Raise τ to 2.0+; lower γ to 0.8 |
| Agent ignores AppCard content | AppCard injected too far from relevant reasoning step | Move AppCard injection to immediately before the action requiring that domain |
| Curiosity score stays high after retrieval | Retrieved docs are irrelevant or outdated | Check retrieval query quality; add domain-specific filters; fall back to D_traj |
| AppCards grow too large | No pruning of stale entries | Cap at 10 entries per card; replace lowest-utility entries based on access frequency |
| Observed distribution is degenerate (single outcome) | Deterministic API responses | Use output schema matching instead of token distributions; compare structural similarity |
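
The last row of the table suggests comparing response structure when outcomes are deterministic. One way to do that is sketched below: flatten the predicted and observed JSON into sets of key paths with value types, then score 1 minus their Jaccard similarity. This scoring rule is an assumption made for illustration, not the paper's method, and `schema_paths` and `structural_mismatch` are hypothetical names.

```python
from typing import Any, Set


def schema_paths(value: Any, prefix: str = "") -> Set[str]:
    """Flatten a JSON-like value into 'path:type' strings."""
    if isinstance(value, dict):
        paths: Set[str] = set()
        for key, child in value.items():
            paths |= schema_paths(child, f"{prefix}.{key}" if prefix else key)
        return paths or {f"{prefix}:dict"}
    if isinstance(value, list):
        paths = set()
        for child in value:
            paths |= schema_paths(child, f"{prefix}[]")
        return paths or {f"{prefix}:list"}
    return {f"{prefix}:{type(value).__name__}"}


def structural_mismatch(predicted: Any, observed: Any) -> float:
    """1 - Jaccard similarity of the two schemas; 0.0 means identical structure."""
    p, q = schema_paths(predicted), schema_paths(observed)
    union = p | q
    if not union:
        return 0.0
    return 1.0 - len(p & q) / len(union)


predicted = {"ok": True, "ts": "1700000000.000100"}
observed = {"ok": False, "error": "invalid_blocks"}
print(structural_mismatch(predicted, observed))
```

The result lies in [0, 1], the same range as a base-2 JS divergence, so it could plausibly be fed into the same per-domain accumulator in place of the token-level score.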
With no prior trajectories (D_traj empty), the system relies entirely on web documentation and code, which may be incomplete or hard to parse. Seed the trajectory store with manually curated examples for critical domains.

Paper: "Curiosity Driven Knowledge Retrieval for Mobile Agents", by Li, Tan, Ali, Schmidt, Ma (2026). arXiv:2601.19306v1. https://arxiv.org/abs/2601.19306v1
Key insight: Formalizing agent uncertainty as a continuous curiosity signal via tail-adjusted JS divergence, then using it to gate retrieval rather than retrieving on every step, yields a 6 percentage point average improvement and state-of-the-art 88.8% success on AndroidWorld. The structured AppCard format is what makes the retrieved knowledge actually usable — raw document injection performs significantly worse.
Task trajectories and AppCard examples: https://lisalsj.github.io/Droidrun-appcard/