skills/langgraph-error-handling/SKILL.md
Implement LangGraph error handling with current v1 patterns. Use when users need to classify failures, add RetryPolicy for transient issues, build LLM recovery loops with Command routing, add human-in-the-loop with interrupt()/resume, handle ToolNode errors, or choose a safe strategy between retry, recovery, and escalation.
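Use this skill to:

- Add `RetryPolicy` to flaky nodes (API, DB, model/tool calls)
- Build LLM recovery loops (`Command` + error state + retry counters)
- Add human-in-the-loop escalation with `interrupt()` and resume
- Handle `ToolNode` failures

Use this order:

1. Transient (`429`, timeout, 5xx, temporary DB lock) -> `RetryPolicy`
2. LLM-recoverable -> `Command`
3. User-fixable -> `interrupt()` + resume

| Error Type | Owner | Primary Mechanism |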
|---|---|---|
| Transient | System | RetryPolicy |
| LLM-recoverable | LLM | State update + Command(goto=...) |
| User-fixable | Human | interrupt() + Command(resume=...) |
| Unexpected | Developer | Raise/log/debug |
For full taxonomy, load references/error-types.md.
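
The table above can be read as a lookup from exception to handling strategy. The following is a minimal sketch of that idea, not the logic of `scripts/classify_error.py`: the `classify` function, the category constants, and the choice of which built-in exceptions fall into which category are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical category names mirroring the rows of the table above.
TRANSIENT = "transient"
LLM_RECOVERABLE = "llm_recoverable"
USER_FIXABLE = "user_fixable"
UNEXPECTED = "unexpected"

# Primary mechanism per category, copied from the table.
MECHANISM = {
    TRANSIENT: "RetryPolicy",
    LLM_RECOVERABLE: "State update + Command(goto=...)",
    USER_FIXABLE: "interrupt() + Command(resume=...)",
    UNEXPECTED: "Raise/log/debug",
}


def classify(exc: BaseException) -> str:
    """Map an exception to one of the four categories (illustrative rules only)."""
    # Timeouts and dropped connections usually succeed on a later attempt.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return TRANSIENT
    # A temporary DB lock is also worth retrying.
    if isinstance(exc, sqlite3.OperationalError) and "locked" in str(exc).lower():
        return TRANSIENT
    # Assumption: bad arguments can be repaired by the model once it sees the error.
    if isinstance(exc, (ValueError, TypeError)):
        return LLM_RECOVERABLE
    # Assumption: missing files or permissions need a person to intervene.
    if isinstance(exc, (PermissionError, FileNotFoundError)):
        return USER_FIXABLE
    # Everything else is a bug until proven otherwise.
    return UNEXPECTED
```

Checking transient cases first matches the recommended order: a cheap retry is attempted before involving the model or a human.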
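```python
from langgraph.types import RetryPolicy

# Retry the node up to 3 times; initial_interval is in seconds.
builder.add_node(
    "call_api",
    call_api,
    retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)
```

```typescript
// In JS, initialInterval is in milliseconds, so 1000 matches the 1.0 s used in Python.
builder.addNode("callApi", callApi, {
  retryPolicy: { maxAttempts: 3, initialInterval: 1000 },
});
```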
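
To see what a policy like the one above costs in wall-clock time, it can help to list the waits it would produce. The sketch below is an illustration of exponential backoff, not LangGraph's internal code: `backoff_delays` is a hypothetical helper, the `backoff_factor=2.0` and `max_interval=128.0` defaults are assumptions, and jitter is ignored.

```python
def backoff_delays(
    max_attempts: int = 3,
    initial_interval: float = 1.0,
    backoff_factor: float = 2.0,   # assumed default
    max_interval: float = 128.0,   # assumed default
) -> list[float]:
    """Seconds waited before each retry; the first attempt has no wait."""
    delays = []
    # max_attempts counts the first call, so there are max_attempts - 1 retries.
    for retry in range(max_attempts - 1):
        delay = initial_interval * backoff_factor ** retry
        delays.append(min(max_interval, delay))
    return delays


# max_attempts=3, initial_interval=1.0 -> two retries
print(backoff_delays())  # [1.0, 2.0]
```

Under these assumptions a node with `max_attempts=3` blocks for about three seconds in the worst case before the exception propagates, which is worth weighing against any request timeout upstream.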
Notes:

- Set `retry_on`/`retryOn` explicitly for non-transient domains, where the default filter may exclude the exception you want retried.

Use `MessagesState` in Python for message state.
```python
from typing import Literal

from typing_extensions import NotRequired
from langgraph.graph import MessagesState
from langgraph.types import Command


class State(MessagesState):
    error: NotRequired[str]
    retry_count: NotRequired[int]


def agent(state: State) -> Command[Literal["tool", "__end__"]]:
    # Terminal branch: stop once the recovery budget is spent.
    if state.get("retry_count", 0) >= 3:
        return Command(goto="__end__")
    # A recorded error sends the run back to the tool node for another attempt.
    if state.get("error"):
        return Command(goto="tool")
    return Command(goto="tool")
```

```typescript
import { StateGraph, Command, END } from "@langchain/langgraph";

// If a node returns Command in JS, add `ends` on addNode.
builder.addNode("agent", agentNode, { ends: ["tool", END] });
```
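
The other half of a recovery loop is the step that records the error and increments the counter. The sketch below shows that bookkeeping in plain Python so it can be run without LangGraph installed; `run_tool`, `tool_step`, and `MAX_RETRIES` are hypothetical names, and the returned `(update, next_node)` pair stands in for what a real node would wrap as `Command(update=..., goto=...)`.

```python
from typing import TypedDict

MAX_RETRIES = 3  # same bound as the retry_count check in the agent node


class RecoveryState(TypedDict, total=False):
    error: str
    retry_count: int
    result: str


def run_tool(args: dict) -> str:
    """Hypothetical tool that rejects a missing `city` argument."""
    if "city" not in args:
        raise ValueError("missing required argument: city")
    return f"forecast for {args['city']}"


def tool_step(state: RecoveryState, args: dict) -> tuple[RecoveryState, str]:
    """Return (state update, next node) for one tool attempt."""
    try:
        result = run_tool(args)
    except ValueError as exc:
        count = state.get("retry_count", 0) + 1
        update: RecoveryState = {"error": str(exc), "retry_count": count}
        # Hand the error text back to the agent until the budget is spent.
        next_node = "agent" if count < MAX_RETRIES else "__end__"
        return update, next_node
    # Success clears the error state so later turns start clean.
    return {"error": "", "retry_count": 0, "result": result}, "__end__"
```

Keeping the counter in state rather than in a local variable is what lets the bound survive across graph steps.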
```python
from langgraph.types import interrupt, Command


def human_review(state):
    # Pauses the graph; the value passed to Command(resume=...) is returned here.
    approved = interrupt({
        "question": "Proceed?",
        "payload": state["pending_action"],
    })
    return Command(goto="execute" if approved else "cancel")


# resume
graph.invoke(Command(resume=True), config={"configurable": {"thread_id": "t-1"}})
```

```typescript
import { Command, interrupt } from "@langchain/langgraph";

// inside a node
const approved = interrupt({ question: "Proceed?" });

// later
await graph.invoke(new Command({ resume: true }), {
  configurable: { thread_id: "t-1" },
});
```
Requirements:

- Compile the graph with a checkpointer.
- Reuse the same `thread_id` on resume.

For deep HITL patterns, load references/human-escalation.md.
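
A node that calls `interrupt()` runs again from its first line when the graph resumes, so a side effect placed before the call happens twice unless it is guarded. One way to guard it is an idempotency key, sketched below in plain Python. The `run_once` helper, the in-memory `completed` set, and `review_node_body` are hypothetical; a real graph would keep the record in durable storage or checkpointed state.

```python
from typing import Callable

# Hypothetical record of side effects that have already run.
completed: set[str] = set()


def run_once(key: str, effect: Callable[[], None]) -> bool:
    """Run `effect` only the first time `key` is seen. Return True if it ran."""
    if key in completed:
        return False
    effect()
    completed.add(key)
    return True


emails: list[str] = []


def review_node_body(action_id: str) -> None:
    """Stand-in for the code that sits before interrupt() in a review node."""
    run_once(f"notify:{action_id}", lambda: emails.append(action_id))


review_node_body("a-1")  # first execution, up to the pause
review_node_body("a-1")  # second execution, triggered by resume
print(emails)  # ['a-1']
```

When a guard like this is awkward, moving the side effect after the `interrupt()` call or into its own node avoids the problem without extra bookkeeping.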
```python
from langgraph.prebuilt import ToolNode

# Pick one of the following:
tool_node = ToolNode(tools, handle_tool_errors=True)  # catch all errors, default message
tool_node = ToolNode(tools, handle_tool_errors="Please try again.")  # fixed message
tool_node = ToolNode(tools, handle_tool_errors=(ValueError, TypeError))  # only these types
```
Use custom handlers when you need deterministic error shaping for model recovery. For broader tool-recovery design, load references/llm-recovery.md.
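
A custom handler is a plain function from exception to message text. The sketch below shows one possible shape; `format_tool_error` and `MAX_DETAIL_CHARS` are hypothetical, and passing the function as `handle_tool_errors=format_tool_error` is an assumption to verify against the `ToolNode` version in use.

```python
MAX_DETAIL_CHARS = 200  # hypothetical cap so long messages do not flood the prompt


def format_tool_error(e: Exception) -> str:
    """Turn any tool exception into a short, predictable message for the model."""
    detail = str(e).strip() or "no details provided"
    if len(detail) > MAX_DETAIL_CHARS:
        detail = detail[:MAX_DETAIL_CHARS] + "..."
    return (
        f"Tool call failed ({type(e).__name__}): {detail}. "
        "Correct the arguments and try again."
    )


# Assumed wiring, not executed here:
# tool_node = ToolNode(tools, handle_tool_errors=format_tool_error)
```

A fixed template keeps the error text stable across runs, which makes recovery behavior easier to test than raw exception strings.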
Keep in mind:

- `interrupt()` re-runs the node on resume: side effects before `interrupt()` must be idempotent, or moved after `interrupt()` or into a separate node.
- In JS, `Command` routing requires `ends` metadata on `addNode(...)`.
- Bound every loop (`max_attempts`, plus state counters for recovery loops).

Scripts:

- `scripts/classify_error.py`: classify exception category and recommended handling
- `scripts/wrap_with_retry.py`: generate boilerplate node wrappers with retry/recovery/escalation options

Run from repo root:
```bash
uv run skills/langgraph-error-handling/scripts/classify_error.py TimeoutError --verbose
uv run skills/langgraph-error-handling/scripts/wrap_with_retry.py call_llm --with-llm-recovery
```
Examples:

- `assets/examples/retry-example/`: retry + recovery loop (Python and JS)
- `assets/examples/human-loop-example/`: interrupt/resume approval flow (Python and JS)

References:

- `references/error-types.md`: error taxonomy and classification rules
- `references/retry-strategies.md`: retry tuning, backoff, circuit-breaker-style patterns
- `references/llm-recovery.md`: recovery-loop and ToolNode strategies
- `references/human-escalation.md`: human approval, interrupts, and escalation patterns

Troubleshooting:

| Symptom | Root Cause | Fix |
|---|---|---|
| interrupt() fails at runtime | no checkpointer | compile with checkpointer |
| Resume starts new run | different thread_id | reuse same thread_id |
| JS Command route not taken | missing ends | add ends to addNode |
| Infinite loop | no termination counter/condition | add retry counter + terminal branch |
| Retry never triggers | exception excluded by retry filter | set explicit retry_on/retryOn |