skills/from-task-solving-robust/SKILL.md
Build LLM agent workflows that stay robust under partial observability, noisy signals, shifting environments, and internal state drift. Applies the four-stressor robustness framework from Pezeshkpour & Hruschka (2026) to real automation pipelines. Use when: 'make this agent more robust', 'handle unreliable API responses', 'add fallback logic to my pipeline', 'my agent breaks when the environment changes', 'add verification steps to my workflow', 'build a fault-tolerant automation'.
npx skillsauth add ndpvt-web/arxiv-claude-skills from-task-solving-robust
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to design and harden LLM agent systems against the four deployment stressors identified in Pezeshkpour & Hruschka (2026): partial observability (incomplete state), dynamic environments (conditions shift mid-execution), noisy signals (unreliable feedback from tools/APIs), and dynamic agent state (the agent's own capabilities degrade or change). The core insight is that task-solving ability alone does not predict deployment robustness -- agents must explicitly budget for verification, adapt strategies when conditions shift, and infer unstated objectives from context rather than assuming a clean interface.
The paper benchmarks five state-of-the-art LLMs in a grid game called WildGrid that deliberately violates the "clean interface" assumption most agent benchmarks rely on. The game has simple rules (collect three keys, reach the exit) but introduces four stressors that mirror real deployment conditions:
Partial observability: The agent sees only a local window, not the full state. Hidden rules govern tile behavior based on context the agent hasn't observed. In real systems, this maps to incomplete API responses, missing database fields, or undocumented service behavior. Agents must experiment, form hypotheses, and act conservatively under ambiguity.
Dynamic environments with distribution shift: At fixed intervals, a latent "weather" variable changes regimes -- altering action reliability and resource costs. Hazards spread. Teleport events relocate the agent. This mirrors production systems where rate limits change, dependencies update, or infrastructure shifts. The key finding: agents that front-load information gathering (Scan/Measure early) dramatically outperform those that dive into action immediately. Robust agents transition from an exploration phase to an exploitation phase, while fragile agents use myopic trial-and-error throughout.
Noisy signals and costly verification: Observations are corrupted with per-cell noise. Actions fail stochastically (movement "slip"). Verification costs energy -- Scan expands visibility temporarily, Measure reveals hidden structure, but both drain the resource that also powers actions. This is the verification budget problem: you can't check everything, so you must decide what is worth verifying. The paper shows moderate noise can paradoxically improve performance by forcing agents into more cautious strategies.
Dynamic agent state (capability drift): Mid-episode "drift events" change the agent's action reliability and cost profiles without notification. The agent must detect degradation from outcome patterns and recalibrate. In production, this maps to model degradation, token budget exhaustion, or infrastructure slowdowns that change what the agent can reliably do.
When asked to build or harden an agent workflow, apply these steps:
Audit the interface assumptions. List every external dependency (APIs, databases, file systems, other services) and classify each as reliable, intermittent, or unknown. For each, identify what information the agent receives vs. what it assumes -- this is your partial observability surface.
Map the four stressors to the specific system. For each dependency and step in the pipeline, ask: (a) Can the agent observe the full state? (b) Can conditions change mid-execution? (c) Can signals be noisy, stale, or corrupted? (d) Can the agent's own capabilities degrade (rate limits, token budgets, credential expiry)?
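The audit and stressor mapping from the two steps above can be recorded as a small inventory. The following is a minimal sketch, not something taken from the paper: `Reliability`, `Stressor`, `Dependency`, `audit_report`, and the sample dependencies are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Reliability(Enum):
    RELIABLE = "reliable"
    INTERMITTENT = "intermittent"
    UNKNOWN = "unknown"


class Stressor(Enum):
    PARTIAL_OBSERVABILITY = "partial_observability"
    DYNAMIC_ENVIRONMENT = "dynamic_environment"
    NOISY_SIGNALS = "noisy_signals"
    AGENT_STATE_DRIFT = "agent_state_drift"


@dataclass
class Dependency:
    name: str
    reliability: Reliability
    stressors: set = field(default_factory=set)  # stressors this dependency is exposed to


def audit_report(deps):
    """Group dependency names by the stressors they are exposed to."""
    report = {stressor: [] for stressor in Stressor}
    for dep in deps:
        for stressor in dep.stressors:
            report[stressor].append(dep.name)
    return report


# Hypothetical inventory for a small pipeline.
deps = [
    Dependency("billing_api", Reliability.INTERMITTENT,
               {Stressor.NOISY_SIGNALS, Stressor.DYNAMIC_ENVIRONMENT}),
    Dependency("orders_db", Reliability.RELIABLE,
               {Stressor.PARTIAL_OBSERVABILITY}),
]
report = audit_report(deps)
```

A stressor with an empty list in the report is worth a second look: it may mean the system is not exposed to it, or that the audit has not asked the question yet.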
Design an explicit verification budget. Not every step needs verification. Prioritize verification for: actions with high penalty on failure, actions whose preconditions depend on stale or partial information, and actions taken after a detected environment shift. Implement verification as a callable check (not just a retry) that confirms the expected postcondition before proceeding.
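One possible way to turn the verification budget into code is to score each step and verify greedily until the budget runs out. This is a sketch under stated assumptions: the scoring formula, the `Step` fields, and the cost units are illustrative choices, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    failure_penalty: float  # cost if the step silently fails (assumed scale)
    staleness: float        # 0.0 = fresh inputs, 1.0 = very stale or partial
    after_shift: bool       # True if the step runs after a detected shift
    verify_cost: float      # cost of running the postcondition check


def priority(step):
    """Heuristic score: higher means verification is more worthwhile."""
    shift_bonus = 2.0 if step.after_shift else 1.0
    return step.failure_penalty * (0.5 + step.staleness) * shift_bonus


def select_for_verification(steps, budget):
    """Greedily pick the highest-priority steps whose checks fit the budget."""
    chosen = []
    remaining = budget
    for step in sorted(steps, key=priority, reverse=True):
        if step.verify_cost <= remaining:
            chosen.append(step.name)
            remaining -= step.verify_cost
    return chosen


# Hypothetical steps; only the two riskiest fit a budget of 5.
steps = [
    Step("write_to_prod_db", 10.0, 0.8, False, 3.0),
    Step("send_invoice", 8.0, 0.2, True, 2.0),
    Step("update_cache", 1.0, 0.1, False, 1.0),
    Step("log_metrics", 0.5, 0.0, False, 1.0),
]
selected = select_for_verification(steps, budget=5.0)
```

The greedy pass is a deliberate simplification. It spends the budget on the riskiest steps first, which matches the priorities listed above, but it is not an optimal knapsack solution.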
Implement phased execution: explore-then-exploit. Structure the agent to front-load information gathering before committing to costly actions. In practice: probe API health, validate schema assumptions, check rate limit headers, and confirm data freshness before launching the main workflow. Budget 15-25% of total compute/calls for this reconnaissance phase.
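The explore-then-exploit structure can be sketched as a budget split followed by a gated main task. The helper names (`split_budget`, `run_phased`) and the probe functions are hypothetical, and the 20% default is simply a value inside the 15-25% range suggested above.

```python
def split_budget(total_calls, recon_fraction=0.2):
    """Split a call budget into (reconnaissance, execution) portions."""
    if not 0.0 <= recon_fraction <= 1.0:
        raise ValueError("recon_fraction must be between 0 and 1")
    recon = round(total_calls * recon_fraction)
    return recon, total_calls - recon


def run_phased(probes, task, total_calls, recon_fraction=0.2):
    """Run probes first (bounded by the recon budget), then the main task.

    `probes` is a list of zero-argument functions returning True when healthy.
    `task` is a callable that receives the call budget left for execution.
    Probes beyond the recon budget are skipped.
    """
    recon_budget, exec_budget = split_budget(total_calls, recon_fraction)
    failed = []
    used = 0
    for probe in probes[:recon_budget]:
        used += 1
        if not probe():
            failed.append(probe.__name__)
    if failed:
        # Do not commit to costly actions on a failed precondition.
        return {"status": "replan", "failed_probes": failed}
    # Unused reconnaissance calls roll over to execution.
    return {"status": "done", "result": task(exec_budget + recon_budget - used)}


# Hypothetical probes standing in for real health and schema checks.
def api_healthy():
    return True


def schema_matches():
    return True


def rate_limit_ok():
    return False


ok_run = run_phased([api_healthy, schema_matches], lambda budget: budget, 100)
bad_run = run_phased([api_healthy, rate_limit_ok], lambda budget: budget, 100)
```

Here the task simply returns the budget it was given, which makes the rollover visible. A real task would consume that budget.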
Add change detection with replanning triggers. Monitor key signals for distribution shift: response latencies, error rates, schema changes, unexpected null fields. When a shift is detected, pause the current plan, re-evaluate assumptions, and replan from the current state rather than retrying the failed step blindly.
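A replanning trigger needs some definition of "shift". The sketch below uses a simple one: the mean of a recent window exceeds a multiple of a learned baseline. The `ShiftDetector` class, the ratio of 2.0, and the sample latencies are assumptions for illustration. A production system might prefer a proper statistical test.

```python
from collections import deque
from statistics import mean


class ShiftDetector:
    """Flags a shift when the recent mean exceeds ratio * baseline mean."""

    def __init__(self, baseline_size=20, window_size=5, ratio=2.0):
        self.baseline = []
        self.baseline_size = baseline_size
        self.window = deque(maxlen=window_size)
        self.ratio = ratio

    def observe(self, value):
        """Record one observation (e.g. latency in ms); return True on shift."""
        if len(self.baseline) < self.baseline_size:
            self.baseline.append(value)  # still learning what normal looks like
            return False
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent evidence yet
        return mean(self.window) > self.ratio * mean(self.baseline)

    def reset(self):
        """After replanning, relearn the baseline from fresh observations."""
        self.baseline.clear()
        self.window.clear()


# Hypothetical latency trace: stable, then a regime change.
latencies_ms = [100, 100, 100, 100, 100, 110, 105, 120, 400, 450, 500]
detector = ShiftDetector(baseline_size=5, window_size=3, ratio=2.0)
flags = [detector.observe(v) for v in latencies_ms]
```

When `observe` returns True, the caller would pause the plan, re-probe the affected dependency, and call `reset` once a new plan is in place.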
Implement graduated fallback, not binary retry. Design a fallback hierarchy: (a) retry with backoff, (b) retry with degraded parameters (smaller batch, simpler query), (c) switch to alternative data source or method, (d) return partial results with explicit uncertainty markers, (e) escalate to human with a structured summary of what was tried and what failed.
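The fallback hierarchy above can be expressed as an ordered list of strategies that is walked until one succeeds. This is a minimal sketch: `Outcome`, `run_with_fallback`, and the sample strategies are hypothetical, and backoff delays between retries are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Outcome:
    value: Any
    level: str                                    # which level produced the value
    attempts: list = field(default_factory=list)  # what was tried, for escalation


def run_with_fallback(levels):
    """Try each (name, strategy) pair in order; stop at the first success.

    A strategy signals failure by raising an exception. If every level
    fails, the outcome carries no value and is marked for escalation.
    """
    attempts = []
    for name, strategy in levels:
        try:
            value = strategy()
        except Exception as exc:  # broad on purpose: any failure moves down a level
            attempts.append(f"{name}: {type(exc).__name__}: {exc}")
            continue
        attempts.append(f"{name}: ok")
        return Outcome(value, level=name, attempts=attempts)
    return Outcome(None, level="escalate_to_human", attempts=attempts)


# Hypothetical strategies: the full batch times out, the smaller one works.
def full_batch():
    raise TimeoutError("upstream timed out")


def small_batch():
    return [1, 2, 3]


outcome = run_with_fallback([
    ("retry_full_batch", full_batch),
    ("degraded_small_batch", small_batch),
    ("alternative_source", lambda: []),
])
```

Returning the level alongside the value lets downstream code treat a degraded result differently from a clean one, and the `attempts` list doubles as the structured summary for human escalation.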
Add capability drift detection. Track the agent's own success rate over a rolling window. If action reliability drops (e.g., API calls that used to succeed now fail 30% of the time), trigger recalibration: reduce concurrency, increase verification frequency, or switch to more conservative action selection.
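Capability drift detection can be as simple as a rolling success rate with a recalibration hook. The `DriftMonitor` class below is an illustrative sketch: the 0.7 threshold mirrors the 30% failure example above, while the window size and the halving of concurrency are assumptions.

```python
from collections import deque


class DriftMonitor:
    """Tracks the agent's own success rate over a rolling window."""

    def __init__(self, window=20, threshold=0.7, concurrency=8):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.concurrency = concurrency
        self.conservative = False

    @property
    def success_rate(self):
        if not self.outcomes:
            return 1.0  # no evidence of degradation yet
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, success):
        """Record one action outcome and recalibrate if reliability dropped."""
        self.outcomes.append(success)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.success_rate < self.threshold:
            self.recalibrate()

    def recalibrate(self):
        """Halve concurrency (never below 1) and switch to conservative mode."""
        self.concurrency = max(1, self.concurrency // 2)
        self.conservative = True
        self.outcomes.clear()  # judge the new settings on fresh outcomes


# Ten successes followed by four failures push the rate from 1.0 to 0.6.
monitor = DriftMonitor(window=10, threshold=0.7, concurrency=8)
for ok in [True] * 10 + [False] * 4:
    monitor.record(ok)
```

Clearing the window after recalibration is a design choice: it avoids triggering repeatedly on the same bad outcomes, at the cost of a short blind period.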
Infer implicit objectives from context. Real tasks have unstated goals beyond the explicit request. The paper shows agents naturally trade off completion, efficiency, and penalty avoidance. Make these trade-offs explicit in the agent prompt: "Your job is NOT just to finish, but to act robustly under uncertainty. Minimize penalties. Prefer partial correct results over complete but unreliable ones."
Test under stressor combinations, not just individual failures. Single-stressor testing misses interaction effects. Test with: (a) one stressor at a time, (b) pairs of stressors, (c) all stressors simultaneously. The paper finds non-monotonic and model-specific sensitivities -- what helps under one stressor can hurt under another.
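The test plan above (singles, pairs, all at once) can be generated rather than written by hand. In this sketch the stressor identifiers are hypothetical labels for the four stressors, and the clean baseline run is an added assumption so that results have a reference point.

```python
from itertools import combinations

# Hypothetical identifiers for the four stressors.
STRESSORS = [
    "partial_observability",
    "dynamic_environment",
    "noisy_signals",
    "agent_state_drift",
]


def stressor_test_matrix(stressors):
    """Return a clean baseline, then singles, pairs, and the full set."""
    matrix = [()]  # baseline with no stressors enabled (an added assumption)
    matrix += list(combinations(stressors, 1))
    matrix += list(combinations(stressors, 2))
    matrix.append(tuple(stressors))
    return matrix


matrix = stressor_test_matrix(STRESSORS)
```

Each tuple names the stressors to enable for one test run, so the matrix can feed directly into a parametrized test.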
Log decisions, not just outcomes. Record why the agent chose each action, what it verified, what it assumed, and what triggered any fallback. This decision log is essential for debugging robustness failures that don't reproduce under clean conditions.
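A decision log in the sense described above could be written as JSON Lines, one object per decision. The `DecisionLog` class, its field names, and the sample entry are hypothetical. An in-memory buffer stands in for a real log file.

```python
import io
import json
import time


class DecisionLog:
    """Appends one JSON object per decision to a text stream (JSON Lines)."""

    def __init__(self, stream):
        self.stream = stream

    def record(self, action, reason, verified=None, assumed=None,
               fallback_trigger=None):
        entry = {
            "ts": time.time(),
            "action": action,                      # what the agent chose to do
            "reason": reason,                      # why it chose that action
            "verified": verified or [],            # facts it actually checked
            "assumed": assumed or [],              # facts it took on trust
            "fallback_trigger": fallback_trigger,  # None if no fallback fired
        }
        self.stream.write(json.dumps(entry) + "\n")
        return entry


buffer = io.StringIO()  # stand-in for a log file
log = DecisionLog(buffer)
log.record(
    action="switch_to_cached_prices",
    reason="primary API returned inconsistent totals on two retries",
    verified=["cache age < 1h"],
    assumed=["currency unchanged"],
    fallback_trigger="noisy_signal",
)
entries = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Keeping `verified` and `assumed` as separate fields makes it easy to search later for decisions that rested on unchecked assumptions.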
Example 1: Hardening a data ingestion pipeline
User: "My agent pulls data from three APIs, transforms it, and loads it into a database. It works fine in dev but keeps failing in production."
Approach:
GET /health for each API, check rate limit headers, validate a sample response against expected schema
Output structure:

    class RobustPipeline:
        def run(self):
            # Phase 1: Reconnaissance (explore)
            health = self.probe_all_sources()
            if not health.all_ok:
                self.replan(health.degraded_sources)

            # Phases 2 and 3 run per source so no source's data is overwritten
            for source in self.sources:
                # Phase 2: Execute with verification
                data = self.fetch_with_fallback(source)
                if not self.verify_postcondition(data, source.expected_schema):
                    data = self.fallback_strategy(source)

                # Phase 3: Transform with drift detection
                result = self.transform(data)
                if self.drift_detected(result):
                    self.log_decision("Drift detected, recalibrating")
                    result = self.transform_conservative(data)
                self.load(result)

Example 2: Adding robustness to a web scraping agent
User: "Build me a scraper that collects pricing data from competitor sites. Some sites change layout frequently."
Approach:
Output structure:

    def scrape_with_robustness(site):
        # Reconnaissance: detect current layout regime
        page = fetch(site.url)
        layout_fingerprint = detect_layout(page)
        if layout_fingerprint != site.last_known_layout:
            log_decision(f"Layout shift detected for {site.name}")
            selectors = discover_selectors(page, site.target_fields)
        else:
            selectors = site.primary_selectors

        # Extract with graduated fallback
        results = {}
        for field in site.target_fields:
            for strategy in selectors[field].fallback_chain:
                value = strategy.extract(page)
                if value is not None:
                    results[field] = VerifiedValue(value, strategy.confidence)
                    break
            else:
                # for/else: runs only when no strategy produced a value
                results[field] = UnverifiedValue(None, reason="all_strategies_failed")

        # Capability drift: track extraction success rate
        site.update_success_rate(results)
        if site.success_rate < 0.7:
            alert(f"{site.name} extraction degraded, pausing")
        return results

Example 3: Making a deployment automation agent robust
User: "My CI/CD agent sometimes deploys broken builds because health checks pass initially but the service degrades after 30 seconds."
Approach:
Output:

    deploy_verification:
      phase_1:  # t=0, reconnaissance
        check: health_endpoint
        threshold: 200_ok
        action_on_fail: abort_deploy
      phase_2:  # t=30s, post-stabilization
        check: [health_endpoint, error_rate, p99_latency]
        thresholds: {error_rate: "<2%", p99: "<500ms"}
        action_on_fail: hold_canary
      phase_3:  # t=60s, confirmation
        check: [error_rate, p99_latency, business_metrics]
        thresholds: {error_rate: "<1%", p99: "<300ms"}
        action_on_fail: rollback
      phase_4:  # t=120s, full promotion
        check: all_metrics_stable_for_60s
        action_on_pass: promote_to_100_percent
        action_on_fail: rollback_with_alert

| Failure Mode | Detection | Response |
|---|---|---|
| Partial observability gap | Unexpected null fields, missing data, undocumented behavior | Log the gap, attempt discovery probe, fall back to conservative defaults |
| Environment shift mid-execution | Schema changes, new error codes, latency spikes, changed rate limits | Pause current plan, re-probe affected dependencies, replan from current state |
| Noisy/corrupted signal | Inconsistent responses across retries, values outside expected range | Cross-validate with secondary source, apply confidence threshold before acting |
| Capability drift | Rolling success rate drop, increased latency in agent's own actions | Reduce concurrency, increase verification frequency, switch to conservative mode |
| Compounding stressors | Multiple simultaneous anomalies | Return partial results with explicit uncertainty markers, escalate to human |
Paper: Pezeshkpour, P. & Hruschka, E. (2026). "From Task Solving to Robust Real-World Adaptation in LLM Agents." arXiv:2602.02760v1. https://arxiv.org/abs/2602.02760v1
What to look for: Section 3 for the WildGrid environment design and four-stressor definitions; Section 5 for ablation results showing non-monotonic stressor effects and model-specific failure modes; the system prompt structure in Section 4 for the robustness-oriented agent framing.