skills/beyond-accuracy-cognitive-load/SKILL.md
Analyze and reduce cognitive load in tool-use agent workflows using the Cognitive Load Framework from AAAI 2026. Diagnoses why agent pipelines fail by decomposing task complexity into Intrinsic Load (tool dependency depth/branching) and Extraneous Load (ambiguity/parameter confusion). Use when: 'diagnose why my agent keeps failing', 'reduce tool-call complexity', 'optimize my agent workflow', 'analyze cognitive load of this pipeline', 'map capability boundaries', 'simplify my tool orchestration'.
npx skillsauth add ndpvt-web/arxiv-claude-skills beyond-accuracy-cognitive-load
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill applies the Cognitive Load Framework (Wang et al., AAAI 2026) to diagnose, analyze, and reduce the complexity of tool-use agent workflows. Instead of treating agent failures as opaque accuracy drops, this framework decomposes task complexity into two quantifiable axes -- Intrinsic Load (structural complexity of the tool dependency chain) and Extraneous Load (ambiguity in how the task and tools are presented) -- enabling you to identify exactly where and why an agent pipeline breaks, then restructure it to stay within capability boundaries.
The framework borrows from Cognitive Load Theory in educational psychology, which distinguishes between intrinsic load (inherent difficulty of the material) and extraneous load (unnecessary difficulty from poor presentation). Applied to tool-use agents:
Intrinsic Load is formalized via the Tool Interaction Graph (TIG) -- a directed acyclic graph where nodes are tool calls and edges represent data dependencies (one tool's output feeds another's input). The key measurable properties are: (1) depth -- the longest chain of sequential tool calls, (2) branching factor -- how many tool choices exist at each decision point, and (3) dependency density -- how many cross-tool data handoffs are required. Empirically, models hit sharp performance cliffs at TIG depth >= 5, branching factor > 6, and when these compound together.
Extraneous Load captures difficulty from how tools and tasks are presented, not their inherent structure. It is quantified by: (1) parameter semantic overlap -- tools with similarly-named but differently-behaved parameters (e.g., id meaning user ID in one tool and order ID in another), (2) distractor tool density -- how many functionally similar tools the agent must discriminate between, and (3) specification clarity -- how unambiguous the tool descriptions are. Parameter ambiguity above 40% semantic overlap causes catastrophic selection errors in most models. The critical insight is that total cognitive load is multiplicative, not additive -- high intrinsic load combined with high extraneous load produces compound failures far worse than either alone.
Enumerate every tool the agent can call. For each task or workflow, draw the dependency graph: which tool outputs feed into which tool inputs. Record the graph as an adjacency list or visual DAG.
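A minimal sketch of recording the TIG as an adjacency list. The tool names come from the refund example in this document; the input and output field names, and the rule that links two tools when an output field name matches an input field name, are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical catalog: tool names are taken from the refund example in this
# document; the input/output field names are assumptions for illustration.
TOOLS: Dict[str, Dict[str, List[str]]] = {
    "get_customer": {"inputs": ["email"], "outputs": ["customer_id"]},
    "get_orders": {"inputs": ["customer_id"], "outputs": ["order_id"]},
    "get_order_details": {"inputs": ["order_id"], "outputs": ["product_id", "amount"]},
}


def build_tig(tools: Dict[str, Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Build an adjacency list: producer tool -> tools that consume its output.

    Assumption: two tools are linked when an output field name of one
    matches an input field name of the other.
    """
    tig: Dict[str, List[str]] = {name: [] for name in tools}
    for producer, p_spec in tools.items():
        for consumer, c_spec in tools.items():
            if producer != consumer and set(p_spec["outputs"]) & set(c_spec["inputs"]):
                tig[producer].append(consumer)
    return tig


tig = build_tig(TOOLS)
for tool, consumers in tig.items():
    print(f"{tool} -> {consumers}")
```

Matching on field names is only a starting point. Real dependencies often have to be confirmed by hand, because the same value can travel under different names.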
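From the TIG, compute the depth (longest chain of sequential tool calls), the average branching factor (tool choices available at each decision point), and the dependency density (how many cross-tool data handoffs are required).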
Flag the workflow if depth >= 5, average branching > 6, or dependency density > 0.7.
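The flagging rule above could be sketched as follows. Depth is counted in tools along the longest chain. The density formula here (share of tools that consume another tool's output) is an assumed stand-in, not the paper's definition, and the per-step candidate counts are hypothetical.

```python
from functools import lru_cache
from statistics import mean
from typing import Dict, List


def tig_depth(tig: Dict[str, List[str]]) -> int:
    """Longest chain of sequential tool calls, counted in tools (assumes a DAG)."""

    @lru_cache(maxsize=None)
    def longest_from(node: str) -> int:
        return 1 + max((longest_from(nxt) for nxt in tig.get(node, [])), default=0)

    return max((longest_from(node) for node in tig), default=0)


def dependency_density(tig: Dict[str, List[str]]) -> float:
    """Assumed definition: share of tools that consume another tool's output.

    Every tool is expected to appear as a key in the adjacency list.
    """
    if not tig:
        return 0.0
    consumers = {nxt for targets in tig.values() for nxt in targets}
    return len(consumers) / len(tig)


def load_flags(depth: int, avg_branching: float, density: float) -> List[str]:
    """Apply the thresholds from the text: depth >= 5, branching > 6, density > 0.7."""
    flags = []
    if depth >= 5:
        flags.append("depth")
    if avg_branching > 6:
        flags.append("branching")
    if density > 0.7:
        flags.append("density")
    return flags


steps = ["get_customer", "get_orders", "get_order_details", "check_refund_policy",
         "calculate_refund", "process_refund", "send_confirmation"]
chain = {a: [b] for a, b in zip(steps, steps[1:])}
chain[steps[-1]] = []
candidates_per_step = [3, 2, 2, 1, 1, 1, 1]  # hypothetical tool choices per step

depth = tig_depth(chain)
density = dependency_density(chain)
flags = load_flags(depth, mean(candidates_per_step), density)
print(depth, round(density, 2), flags)
```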
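For each pair of tools in the available set, measure parameter semantic overlap, count the functionally similar distractor tools, and check whether the descriptions disambiguate the two tools.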
Flag if parameter overlap > 40%, distractor count > 3 per decision point, or descriptions lack disambiguation.
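One rough way to screen tool pairs is a Jaccard overlap on parameter names. This is an assumed proxy: it only finds shared names, so each flagged pair still needs a manual check of whether the shared name means the same thing in both tools. The parameter sets below are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def name_overlap(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two parameter-name sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_confusable_pairs(
    tools: Dict[str, Set[str]], threshold: float = 0.4
) -> List[Tuple[str, str, float]]:
    """Return tool pairs whose parameter-name overlap exceeds the threshold."""
    flagged = []
    for (name_a, params_a), (name_b, params_b) in combinations(tools.items(), 2):
        overlap = name_overlap(params_a, params_b)
        if overlap > threshold:
            flagged.append((name_a, name_b, overlap))
    return flagged


# Hypothetical parameter sets, for illustration only.
TOOLS = {
    "get_customer": {"id"},
    "get_orders": {"id", "limit"},
    "send_confirmation": {"email", "refund_id"},
}

flagged = flag_confusable_pairs(TOOLS)
for name_a, name_b, overlap in flagged:
    print(f"{name_a} / {name_b}: {overlap:.0%} shared parameter names")
```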
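Estimate total load as: Compound Load = Depth x max(Branching, 1) x (1 + Extraneous_Overlap), with Extraneous_Overlap expressed as a fraction (0.45 for 45% overlap). This captures the multiplicative interaction. Compare against known thresholds: the worked examples below treat loads of 5.0 and 6.6 as safe, 18 as the risk zone, and anything above 30 as the failure zone.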
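The formula translates directly into a small helper. This sketch assumes the overlap is passed as a fraction (0.45 for 45%), and the sample inputs are the figures quoted in the worked examples of this document.

```python
def compound_load(depth: int, branching: float, extraneous_overlap: float) -> float:
    """Depth x max(Branching, 1) x (1 + Extraneous_Overlap), overlap as a fraction."""
    return depth * max(branching, 1) * (1 + extraneous_overlap)


# Inputs taken from the worked examples in this document.
refund = compound_load(depth=7, branching=3, extraneous_overlap=0.45)
pipeline = compound_load(depth=5, branching=1, extraneous_overlap=0.0)

# max(Branching, 1) keeps a workflow with no branching from collapsing to zero.
no_branching = compound_load(depth=3, branching=0, extraneous_overlap=0.1)

print(round(refund, 2), round(pipeline, 2), round(no_branching, 2))
```

Because the terms multiply, lowering any one factor lowers the load in proportion, which is why renaming parameters and scoping the tool set still help a workflow whose depth cannot be reduced.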
In the TIG, find the longest dependency chain (critical path). This is the primary bottleneck. Errors on this path cascade to all downstream tools. Prioritize reducing load along this path first.
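A minimal sketch of extracting the critical path from an adjacency-list TIG using a memoized depth-first search. It assumes the graph is acyclic. The `log_query` side branch is hypothetical and is there only to show that shorter branches are ignored.

```python
from typing import Dict, List


def critical_path(tig: Dict[str, List[str]]) -> List[str]:
    """Return the longest dependency chain in a DAG given as an adjacency list."""
    memo: Dict[str, List[str]] = {}

    def best_from(node: str) -> List[str]:
        if node not in memo:
            tails = [best_from(nxt) for nxt in tig.get(node, [])]
            memo[node] = [node] + max(tails, key=len, default=[])
        return memo[node]

    return max((best_from(node) for node in tig), key=len, default=[])


tig = {
    "query_db": ["transform_data", "log_query"],  # log_query is a hypothetical side branch
    "transform_data": ["generate_chart"],
    "generate_chart": ["compose_email"],
    "compose_email": ["send_email"],
    "send_email": [],
    "log_query": [],
}

path = critical_path(tig)
print(" -> ".join(path))
```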
If depth > 5, break the workflow into sub-agents or staged pipelines:
Use explicit parameter names (e.g., user_id, order_id -- never bare id).
Recompute the TIG metrics after restructuring. Confirm compound load is in the safe zone. Run a small set of test cases to verify the failure rate dropped.
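Decomposing and then recomputing can be sketched as below. Splitting into fixed-size stages is an assumption made for brevity; in practice a semantic boundary (retrieval versus action, say) is usually a better cut. The branching of 2 and overlap of 0.1 are the projected post-fix values from the refund example.

```python
from typing import List


def split_into_stages(chain: List[str], max_depth: int) -> List[List[str]]:
    """Cut a linear tool chain into consecutive stages of at most max_depth tools."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    return [chain[i:i + max_depth] for i in range(0, len(chain), max_depth)]


def compound_load(depth: int, branching: float, extraneous_overlap: float) -> float:
    """Depth x max(Branching, 1) x (1 + Extraneous_Overlap), overlap as a fraction."""
    return depth * max(branching, 1) * (1 + extraneous_overlap)


REFUND_CHAIN = [
    "get_customer", "get_orders", "get_order_details", "check_refund_policy",
    "calculate_refund", "process_refund", "send_confirmation",
]

stages = split_into_stages(REFUND_CHAIN, max_depth=4)
# Branching 2 and overlap 0.1 are the projected values from the refund example.
loads = [round(compound_load(len(stage), 2, 0.1), 2) for stage in stages]

for stage, load in zip(stages, loads):
    print(f"depth {len(stage)}, load {load}: {' -> '.join(stage)}")
```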
Record the load thresholds where your target model fails. This becomes a design constraint for future workflows: new tool additions or workflow changes must not push the compound load past the boundary.
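One way to turn a recorded boundary into a design constraint is a small check that runs whenever a workflow changes. The `CapabilityBoundary` type, the model name, and the limits below are all hypothetical; the numbers should come from your own failure measurements for the target model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CapabilityBoundary:
    """Limits observed for one model (hypothetical structure and values)."""
    model: str
    max_depth: int
    max_compound_load: float


def violations(
    boundary: CapabilityBoundary, depth: int, branching: float, extraneous_overlap: float
) -> List[str]:
    """List every way a proposed workflow exceeds the recorded boundary."""
    load = depth * max(branching, 1) * (1 + extraneous_overlap)
    problems = []
    if depth > boundary.max_depth:
        problems.append(f"depth {depth} > {boundary.max_depth}")
    if load > boundary.max_compound_load:
        problems.append(f"compound load {load:.1f} > {boundary.max_compound_load}")
    return problems


# Placeholder limits; replace with values measured for your model.
BOUNDARY = CapabilityBoundary(model="example-model", max_depth=5, max_compound_load=30.0)

ok = violations(BOUNDARY, depth=5, branching=1, extraneous_overlap=0.0)
bad = violations(BOUNDARY, depth=7, branching=3, extraneous_overlap=0.45)
print(ok, bad)
```

Running a check like this in code review or CI keeps new tools from silently pushing a workflow past the boundary.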
Example 1: Diagnosing a Flaky Customer Support Agent
User: "My customer support agent uses 12 tools and fails about 40% of the time on refund requests. Why?"
Approach:
get_customer(email) -> get_orders(customer_id) -> get_order_details(order_id)
-> check_refund_policy(order_id, product_id) -> calculate_refund(order_id, amount)
-> process_refund(refund_amount, payment_method) -> send_confirmation(email, refund_id)
get_customer and get_orders both have an id parameter with different semantics. 4 tools have overlapping "get" descriptions.
Output -- Diagnosis Report:
COGNITIVE LOAD ANALYSIS
=======================
Workflow: Customer refund processing
Intrinsic Load:
- TIG Depth: 7 (CRITICAL: exceeds threshold of 5)
- Max Branching: 3
- Dependency Density: 0.85
Extraneous Load:
- Parameter Overlap: 45% (WARNING: >40% threshold)
- Distractor Tools: 5 unused tools in scope
- Ambiguous Descriptions: get_customer vs get_orders
Compound Load: 30.5 (FAILURE ZONE: >30)
RECOMMENDED FIXES:
1. Split into 2 sub-agents:
- Retrieval Agent (depth 3): get_customer -> get_orders -> get_order_details
- Action Agent (depth 4): check_policy -> calculate -> process -> confirm
2. Rename 'id' params: customer_id, order_id, refund_id
3. Scope tool set: remove 5 irrelevant tools from each sub-agent
Projected Compound Load after fix:
- Retrieval Agent: 3 x 2 x 1.1 = 6.6 (SAFE)
- Action Agent: 4 x 2 x 1.1 = 8.8
Example 2: Designing a New Data Pipeline Agent
User: "I'm building an agent that queries a database, transforms data, generates a chart, and emails it. How should I structure the tools?"
Approach:
query_db(sql) -> transform_data(data, operations) -> generate_chart(data, chart_type)
-> compose_email(recipient, subject, attachment) -> send_email(email_obj)
Output -- Design Recommendation:
PROACTIVE LOAD ANALYSIS
=======================
Proposed Workflow: Data pipeline + email
TIG Depth: 5 (borderline -- no room for added complexity)
Branching: 1 (linear chain, good)
Compound Load: 5.0 (SAFE, but tight)
DESIGN GUIDELINES:
- Keep the linear structure -- do not add optional branching tools
- Use distinct parameter names: sql_query, raw_data, chart_data, email_body
- If you later add "choose chart type" branching (depth 6, branching 3),
load jumps to 6 x 3 x 1.0 = 18 (RISK ZONE) -- decompose at that point
- Tool descriptions should specify: "Use query_db for initial data fetch,
use transform_data ONLY for post-query reshaping"
Example 3: Reducing Parameter Confusion
User: "My agent keeps calling the wrong API -- it confuses search_users with search_accounts."
Approach:
search_users(query, limit, filter) -- "Search for users matching criteria"
search_accounts(query, limit, filter) -- "Search for accounts matching criteria"
Output -- Fix:
EXTRANEOUS LOAD FIX
====================
Problem: search_users and search_accounts have 100% parameter overlap
and 90% description overlap.
Fix tool descriptions:
search_users:
"Find individual people by name or email. Use when the task
mentions a person, employee, or human user. Returns: name, email, role."
search_accounts:
"Find organization/company accounts by company name or domain.
Use when the task mentions a company, organization, or billing entity.
Returns: company_name, domain, plan_tier."
Rename parameters:
search_users(user_query, max_users, user_filter)
search_accounts(account_query, max_accounts, account_filter)
Projected overlap after fix: <15% (SAFE)
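The rename in this fix can be applied at the schema level and checked mechanically. This sketch assumes tool schemas are held as simple name-to-parameter lists. The rename map mirrors the one above, and the wrapper that maps the new names back to the underlying API is left out.

```python
from typing import Dict, List, Set

ORIGINAL: Dict[str, List[str]] = {
    "search_users": ["query", "limit", "filter"],
    "search_accounts": ["query", "limit", "filter"],
}

# Rename map mirroring the fix above.
RENAMES: Dict[str, Dict[str, str]] = {
    "search_users": {"query": "user_query", "limit": "max_users", "filter": "user_filter"},
    "search_accounts": {
        "query": "account_query", "limit": "max_accounts", "filter": "account_filter",
    },
}


def rename_params(
    schemas: Dict[str, List[str]], renames: Dict[str, Dict[str, str]]
) -> Dict[str, List[str]]:
    """Return new schemas with each tool's parameters renamed; unknown names are kept."""
    return {
        tool: [renames.get(tool, {}).get(param, param) for param in params]
        for tool, params in schemas.items()
    }


def shared_params(schemas: Dict[str, List[str]], a: str, b: str) -> Set[str]:
    """Parameter names that two tools still have in common."""
    return set(schemas[a]) & set(schemas[b])


before = shared_params(ORIGINAL, "search_users", "search_accounts")
renamed = rename_params(ORIGINAL, RENAMES)
after = shared_params(renamed, "search_users", "search_accounts")
print(sorted(before), sorted(after))
```

An empty set after renaming only confirms that the names no longer collide. The rewritten descriptions still carry most of the disambiguation.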
Avoid generic parameter names such as id, data, query, or result.
Parameters with a consistent meaning across tools (e.g., limit always means max results) are not confusion sources. Only flag genuinely ambiguous overlaps.
Wang, Q., Hu, Y., Lu, M., Wu, J., & Liu, Y. (2026). Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents. AAAI 2026. arXiv:2601.20412 -- Read Sections 3-4 for the TIG formalism and load quantification, Section 5 for ToolLoad-Bench construction, and Section 6 for the performance cliff analysis and capability boundary maps.