Cognitive Load Framework for Tool-Use Agent Design

This skill applies the Cognitive Load Framework (Wang et al., AAAI 2026) to diagnose, analyze, and reduce the complexity of tool-use agent workflows. Instead of treating agent failures as opaque accuracy drops, the framework decomposes task complexity into two quantifiable axes: Intrinsic Load (structural complexity of the tool dependency chain) and Extraneous Load (ambiguity in how the task and tools are presented). This lets you identify where and why an agent pipeline breaks, then restructure it to stay within the model's capability boundaries.

When to Use

When a multi-tool agent pipeline has unreliable success rates and you need to diagnose the root cause
When designing a new tool-calling workflow and you want to predict whether an LLM can handle the complexity
When an agent must choose among many similar tools and keeps picking the wrong one (parameter confusion)
When you need to simplify an existing agent orchestration to improve reliability
When evaluating whether a task should be decomposed into sub-agents vs. handled by a single agent
When building tool descriptions/schemas and you want to minimize ambiguity that causes failures
When comparing whether a workflow is within the capability boundary of a given model tier

Key Technique

The framework borrows from Cognitive Load Theory in educational psychology, which distinguishes between intrinsic load (inherent difficulty of the material) and extraneous load (unnecessary difficulty from poor presentation). Applied to tool-use agents:

Intrinsic Load is formalized via the Tool Interaction Graph (TIG) -- a directed acyclic graph where nodes are tool calls and edges represent data dependencies (one tool's output feeds another's input). The key measurable properties are: (1) depth -- the longest chain of sequential tool calls, (2) branching factor -- how many tool choices exist at each decision point, and (3) dependency density -- how many cross-tool data handoffs are required. Empirically, models hit sharp performance cliffs at TIG depth >= 5, at branching factor > 6, and when high depth and high branching occur together.

Extraneous Load captures difficulty from how tools and tasks are presented, not their inherent structure. It is quantified by: (1) parameter semantic overlap -- tools with similarly-named but differently-behaved parameters (e.g., id meaning user ID in one tool and order ID in another), (2) distractor tool density -- how many functionally similar tools the agent must discriminate between, and (3) specification clarity -- how unambiguous the tool descriptions are. Parameter ambiguity above 40% semantic overlap causes catastrophic selection errors in most models. The critical insight is that total cognitive load is multiplicative, not additive -- high intrinsic load combined with high extraneous load produces compound failures far worse than either alone.

Step-by-Step Workflow

1. Map the Tool Interaction Graph (TIG)

Enumerate every tool the agent can call. For each task or workflow, draw the dependency graph: which tool outputs feed into which tool inputs. Record the graph as an adjacency list or visual DAG.
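One way to record the graph is a plain adjacency list. The sketch below is illustrative only: the tool names are borrowed from the refund example later in this document, and `is_dag` is a hypothetical helper (not part of the framework) that uses the standard-library `graphlib` module, available from Python 3.9, to confirm the recorded graph has no cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Adjacency list: each tool maps to the tools that consume its output.
tig = {
    "get_customer": ["get_orders"],
    "get_orders": ["get_order_details"],
    "get_order_details": ["check_refund_policy"],
    "check_refund_policy": [],
}


def is_dag(graph):
    """Return True if the dependency graph contains no cycles."""
    try:
        # static_order() raises CycleError when a cycle exists.
        # Edge direction does not matter for cycle detection.
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


print(is_dag(tig))
```

If the check fails, the workflow contains a loop (for example a retry edge) and would need to be unrolled before the depth metrics make sense.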

2. Measure Intrinsic Load Metrics

From the TIG, compute:

Depth: Longest path from any entry tool to the final output (count tool calls, i.e. nodes, so a seven-tool chain has depth 7).
Branching factor: At each decision node, count the number of valid tool choices. Take the average and max.
Dependency density: Total number of cross-tool data handoffs divided by total tool calls.

Flag the workflow if depth >= 5, average branching > 6, or dependency density > 0.7.
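The three metrics can be computed mechanically once the TIG is written down. The following is a minimal sketch under stated assumptions: depth counts tool calls on the longest chain (which matches the depth values used in the examples in this document), each tool is called once, and the number of valid choices per decision point is supplied by hand in `choices` because it cannot be read off the graph alone. The function name `intrinsic_load` and the five tool names are hypothetical.

```python
from functools import lru_cache
from statistics import mean


def intrinsic_load(tig, choices):
    """tig: tool -> downstream tools. choices: decision point -> valid tool count."""
    tools = set(tig) | {t for nxt in tig.values() for t in nxt}

    @lru_cache(maxsize=None)
    def chain_len(tool):
        # Number of tool calls on the longest chain that starts at `tool`.
        return 1 + max((chain_len(t) for t in tig.get(tool, ())), default=0)

    depth = max(chain_len(t) for t in tools)
    handoffs = sum(len(nxt) for nxt in tig.values())
    density = handoffs / len(tools)
    avg_branching = mean(choices.values())
    return {
        "depth": depth,
        "avg_branching": avg_branching,
        "max_branching": max(choices.values()),
        "dependency_density": density,
        # Same flag rule as in the text above.
        "flagged": depth >= 5 or avg_branching > 6 or density > 0.7,
    }


# Hypothetical linear chain with a single valid tool at every step.
chain = ["fetch_page", "extract_text", "summarize", "translate", "publish"]
tig = {a: [b] for a, b in zip(chain, chain[1:])}
metrics = intrinsic_load(tig, {tool: 1 for tool in chain})
print(metrics)
```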

3. Audit Extraneous Load

For each pair of tools in the available set:

Compare parameter names and types. Score parameter semantic overlap as the fraction of parameter names that two tools share but use with different meanings.
Count distractor tools: tools whose descriptions overlap > 60% in purpose.
Rate specification clarity: are parameter descriptions unambiguous? Do tool docstrings explain when to use the tool vs. alternatives?

Flag if parameter overlap > 40%, distractor count > 3 per decision point, or descriptions lack disambiguation.
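The pairwise audit can be roughed out in a few lines. This sketch makes several assumptions that are not in the source: each parameter carries a hand-written meaning string, overlap for a pair is the number of clashing names divided by the number of distinct names across both tools, and the tool `list_refunds` is invented for illustration. Judging whether two meanings really differ remains a manual call.

```python
from itertools import combinations

# Hypothetical schemas: parameter name -> what the parameter means.
tools = {
    "get_customer": {"id": "customer identifier"},
    "get_orders": {"id": "order identifier", "limit": "max results"},
    "list_refunds": {"limit": "max results"},
}


def pair_overlap(a, b):
    """Share of the pair's distinct parameter names that clash in meaning."""
    names = set(a) | set(b)
    clashing = {n for n in set(a) & set(b) if a[n] != b[n]}
    return len(clashing) / len(names) if names else 0.0


def audit(tools, threshold=0.40):
    """Return every tool pair whose overlap exceeds the threshold."""
    flagged = {}
    for (name_a, params_a), (name_b, params_b) in combinations(tools.items(), 2):
        score = pair_overlap(params_a, params_b)
        if score > threshold:
            flagged[(name_a, name_b)] = score
    return flagged


print(audit(tools))
```

Note that `limit` is shared by two tools but means the same thing in both, so it does not count toward the score.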

4. Compute Compound Load Score

Estimate total load as: Compound Load = Depth x max(Branching, 1) x (1 + Extraneous_Overlap). This captures the multiplicative interaction. Compare against known thresholds:

Safe zone: Compound Load < 15
Risk zone: 15-30 (expect 20-40% failure rate)
Failure zone: > 30 (expect > 50% failure rate, restructure required)
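As a quick sanity check, the formula and zone boundaries above translate directly into code. This is a sketch of the heuristic as stated in this section, with hypothetical function names; the numbers passed in are the ones used in the examples in this document.

```python
def compound_load(depth, branching, extraneous_overlap):
    """Depth x max(Branching, 1) x (1 + Extraneous_Overlap)."""
    return depth * max(branching, 1) * (1 + extraneous_overlap)


def zone(load):
    """Map a compound load score onto the three zones."""
    if load < 15:
        return "safe"
    if load <= 30:
        return "risk"
    return "failure"


for depth, branching, overlap in [(5, 1, 0.0), (6, 3, 0.0), (7, 3, 0.45)]:
    score = compound_load(depth, branching, overlap)
    print(depth, branching, overlap, round(score, 2), zone(score))
```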

5. Identify the Critical Path

In the TIG, find the longest dependency chain (critical path). This is the primary bottleneck. Errors on this path cascade to all downstream tools. Prioritize reducing load along this path first.
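Finding the critical path is a longest-path search over the DAG. The sketch below assumes the TIG is an adjacency list (tool mapped to downstream tools) and is small enough for plain recursion; `critical_path` and the side-branch tool `log_query` are hypothetical names used only for illustration.

```python
from functools import lru_cache


def critical_path(tig):
    """Return the longest chain of dependent tool calls in the TIG."""
    tools = set(tig) | {t for nxt in tig.values() for t in nxt}

    @lru_cache(maxsize=None)
    def longest_from(tool):
        # Longest chain starting at `tool`, as a tuple of tool names.
        tail = max((longest_from(t) for t in tig.get(tool, ())), key=len, default=())
        return (tool,) + tail

    return list(max((longest_from(t) for t in sorted(tools)), key=len))


tig = {
    "query_db": ["transform_data", "log_query"],  # log_query is a side branch
    "transform_data": ["generate_chart"],
    "generate_chart": ["send_email"],
}
path = critical_path(tig)
print(path)
```

The side branch does not lengthen the path, so load reduction should start with the four-tool chain rather than the logging call.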

6. Reduce Intrinsic Load via Decomposition

If depth > 5, break the workflow into sub-agents or staged pipelines:

Group tightly-coupled tool calls (depth <= 4 each) into sub-tasks.
Use an orchestrator agent that calls sub-agents rather than all tools directly.
Ensure each sub-agent's TIG stays within the safe zone.
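For a linear chain, the simplest possible decomposition is to cut it into stages of at most four calls. The sketch below does exactly that and nothing more; `split_chain` is a hypothetical helper, and a real split should follow how tightly the calls are coupled (for instance retrieval steps versus action steps) rather than fixed-size chunks.

```python
def split_chain(chain, max_depth=4):
    """Cut a linear tool chain into consecutive stages of at most max_depth calls."""
    return [chain[i:i + max_depth] for i in range(0, len(chain), max_depth)]


refund_chain = [
    "get_customer", "get_orders", "get_order_details", "check_refund_policy",
    "calculate_refund", "process_refund", "send_confirmation",
]
stages = split_chain(refund_chain)
for number, stage in enumerate(stages, start=1):
    print(f"sub-agent {number}: depth {len(stage)} -> {stage}")
```

Each stage would then be handed to its own sub-agent, with the orchestrator passing the output of one stage as the input of the next.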

7. Reduce Extraneous Load via Tool Design

Rename parameters to be globally unambiguous (user_id, order_id -- never bare id).
Add "when to use" preambles to each tool description that distinguish it from similar tools.
Remove distractor tools from the available set when they are irrelevant to the current task scope.
Use hierarchical tool organization: group related tools under namespaces.
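A small lint pass can catch bare parameter names before they reach the agent. This is a sketch under assumptions: the set of names to reject is a starting point chosen for illustration, the `namespace.tool` naming is one possible convention for hierarchical grouping, and `bare_params` is a hypothetical helper.

```python
# Names considered too generic to be safe across tools (an assumed list).
AMBIGUOUS_NAMES = {"id", "data", "query", "result"}


def bare_params(tool_params):
    """Return, per tool, the parameter names that are too generic."""
    findings = {}
    for tool, params in tool_params.items():
        clashes = sorted(set(params) & AMBIGUOUS_NAMES)
        if clashes:
            findings[tool] = clashes
    return findings


# Hypothetical namespaced tools: "<namespace>.<tool>" -> parameter names.
tool_params = {
    "crm.search_users": ["query", "max_users"],
    "orders.get_order": ["order_id"],
}
findings = bare_params(tool_params)
print(findings)
```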

8. Validate the Restructured Workflow

Recompute the TIG metrics after restructuring. Confirm compound load is in the safe zone. Run a small set of test cases to verify the failure rate dropped.

9. Document Capability Boundaries

Record the load thresholds where your target model fails. This becomes a design constraint for future workflows: new tool additions or workflow changes must not push the compound load past the boundary.
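Recorded boundaries are most useful when they are checked automatically against proposed changes. The sketch below is one possible shape for that check; the model name and all three limits are placeholders rather than measured values, and `Boundary` and `violations` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    max_depth: int
    max_branching: int
    max_compound_load: float


# Placeholder limits; replace with thresholds measured for your own model.
BOUNDARIES = {
    "example-model": Boundary(max_depth=4, max_branching=6, max_compound_load=15.0),
}


def violations(model, depth, branching, overlap):
    """Return the name of every limit the proposed workflow would exceed."""
    limit = BOUNDARIES[model]
    load = depth * max(branching, 1) * (1 + overlap)
    checks = {
        "depth": depth > limit.max_depth,
        "branching": branching > limit.max_branching,
        "compound_load": load > limit.max_compound_load,
    }
    return [name for name, exceeded in checks.items() if exceeded]


print(violations("example-model", depth=3, branching=2, overlap=0.1))
print(violations("example-model", depth=7, branching=3, overlap=0.45))
```

A check like this could run in review whenever a tool is added or a workflow changes, so the boundary acts as a design constraint rather than a note in a document.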

Concrete Examples

Example 1: Diagnosing a Flaky Customer Support Agent

User: "My customer support agent uses 12 tools and fails about 40% of the time on refund requests. Why?"

Approach:

Map the TIG for the refund workflow:

get_customer(email) -> get_orders(customer_id) -> get_order_details(order_id)
  -> check_refund_policy(order_id, product_id) -> calculate_refund(order_id, amount)
    -> process_refund(refund_amount, payment_method) -> send_confirmation(email, refund_id)

Measure: Depth = 7, Branching = 3 (at order selection), 12 total tools available.
Audit: get_customer and get_orders both have an id parameter with different semantics. 4 tools have overlapping "get" descriptions.
Compound Load = 7 x 3 x (1 + 0.45) = 30.45 (about 30.5) -- failure zone.

Output -- Diagnosis Report:

COGNITIVE LOAD ANALYSIS
=======================
Workflow: Customer refund processing
Intrinsic Load:
  - TIG Depth: 7 (CRITICAL: exceeds threshold of 5)
  - Max Branching: 3
  - Dependency Density: 0.85
Extraneous Load:
  - Parameter Overlap: 45% (WARNING: >40% threshold)
  - Distractor Tools: 5 unused tools in scope
  - Ambiguous Descriptions: get_customer vs get_orders
Compound Load: 30.5 (FAILURE ZONE: >30)

RECOMMENDED FIXES:
1. Split into 2 sub-agents:
   - Retrieval Agent (depth 3): get_customer -> get_orders -> get_order_details
   - Action Agent (depth 4): check_policy -> calculate -> process -> confirm
2. Rename 'id' params: customer_id, order_id, refund_id
3. Scope tool set: remove 5 irrelevant tools from each sub-agent
Projected Compound Load after fix (per sub-agent):
   - Retrieval Agent: 3 x 2 x 1.1 = 6.6 (SAFE)
   - Action Agent: 4 x 2 x 1.1 = 8.8 (SAFE)

Example 2: Designing a New Data Pipeline Agent

User: "I'm building an agent that queries a database, transforms data, generates a chart, and emails it. How should I structure the tools?"

Approach:

Draft the TIG before building:

query_db(sql) -> transform_data(data, operations) -> generate_chart(data, chart_type)
  -> compose_email(recipient, subject, attachment) -> send_email(email_obj)

Measure: Depth = 5 (borderline), Branching = 1 (linear), Dependency Density = 0.8 (4 handoffs across 5 tool calls).
Compound Load = 5 x 1 x (1 + 0) = 5.0 -- safe zone, but at the depth limit (assumes no parameter overlap).

Output -- Design Recommendation:

PROACTIVE LOAD ANALYSIS
=======================
Proposed Workflow: Data pipeline + email
TIG Depth: 5 (borderline -- no room for added complexity)
Branching: 1 (linear chain, good)
Compound Load: 5.0 (SAFE, but tight)

DESIGN GUIDELINES:
- Keep the linear structure -- do not add optional branching tools
- Use distinct parameter names: sql_query, raw_data, chart_data, email_body
- If you later add "choose chart type" branching (depth 6, branching 3),
  load jumps to 6 x 3 x 1.0 = 18 (RISK ZONE) -- decompose at that point
- Tool descriptions should specify: "Use query_db for initial data fetch,
  use transform_data ONLY for post-query reshaping"

Example 3: Reducing Parameter Confusion

User: "My agent keeps calling the wrong API -- it confuses search_users with search_accounts."

Approach:

Audit the two tool schemas:
- search_users(query, limit, filter) -- "Search for users matching criteria"
- search_accounts(query, limit, filter) -- "Search for accounts matching criteria"
Parameter overlap: 100% (identical names). Description overlap: ~90%.
Extraneous Load is the dominant failure mode here.

Output -- Fix:

EXTRANEOUS LOAD FIX
====================
Problem: search_users and search_accounts have 100% parameter overlap
         and 90% description overlap.

Fix tool descriptions:
  search_users:
    "Find individual people by name or email. Use when the task
     mentions a person, employee, or human user. Returns: name, email, role."
  search_accounts:
    "Find organization/company accounts by company name or domain.
     Use when the task mentions a company, organization, or billing entity.
     Returns: company_name, domain, plan_tier."

Rename parameters:
  search_users(user_query, max_users, user_filter)
  search_accounts(account_query, max_accounts, account_filter)

Projected overlap after fix: <15% (SAFE)
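Written out as tool definitions, the fix might look like the sketch below. The dictionary layout is a generic JSON-Schema-like shape chosen for illustration, not the format of any particular agent framework, and the parameter types are assumptions; only the tool names, descriptions, and parameter names come from the fix above.

```python
# Hypothetical tool definitions; adapt the layout to your framework's format.
search_users = {
    "name": "search_users",
    "description": (
        "Find individual people by name or email. Use when the task mentions "
        "a person, employee, or human user. Returns: name, email, role."
    ),
    "parameters": {
        "user_query": {"type": "string"},    # types are assumed
        "max_users": {"type": "integer"},
        "user_filter": {"type": "string"},
    },
}

search_accounts = {
    "name": "search_accounts",
    "description": (
        "Find organization/company accounts by company name or domain. "
        "Use when the task mentions a company, organization, or billing entity. "
        "Returns: company_name, domain, plan_tier."
    ),
    "parameters": {
        "account_query": {"type": "string"},
        "max_accounts": {"type": "integer"},
        "account_filter": {"type": "string"},
    },
}

# After the rename, the two tools no longer share any parameter name.
shared = set(search_users["parameters"]) & set(search_accounts["parameters"])
print(sorted(shared))
```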

Best Practices

Do: Always map the TIG before building a multi-tool workflow. Catching depth > 5 at design time prevents flaky production agents.
Do: Scope the tool set per task. An agent with 20 available tools but only needing 4 has unnecessary extraneous load from the 16 distractors.
Do: Use globally unique, semantically clear parameter names across all tools in an agent's scope. Never use bare id, data, query, or result.
Do: Include "when to use this tool vs. alternatives" in every tool description.
Avoid: Assuming that a more capable model eliminates the need for load management. Even GPT-4-class models collapse at compound load > 30.
Avoid: Adding "nice to have" tools to an agent's tool set. Each additional tool increases branching and distractor density.
Avoid: Chains deeper than 4-5 without decomposition into sub-agents, even if each individual step seems simple.

Error Handling

Misidentified critical path: If the workflow has parallel branches, identify all paths and measure the longest. Parallel branches add branching factor, not depth.
Overestimated extraneous load: Parameter names that are identical and carry the same semantics across tools (e.g., limit always means max results) are not confusion sources. Only flag genuinely ambiguous overlaps.
Restructuring introduces new failure modes: When splitting into sub-agents, verify that the orchestrator's own tool set (calling sub-agents) doesn't itself exceed load thresholds. An orchestrator calling 8 sub-agents is branching factor 8.
Dynamic tool sets: If the available tools change per turn (e.g., tools unlock after authentication), re-evaluate load at each stage rather than computing a single static score.
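For dynamic tool sets, the per-stage re-evaluation can be sketched as follows. The assumptions are deliberately crude and are not from the source: every tool in scope at a stage is treated as a candidate choice (so branching equals the number of available tools), the remaining depth per stage is supplied by hand, and the `login` tool and stage names are hypothetical.

```python
def load_per_stage(stages, overlap=0.0):
    """stages: list of (stage name, remaining depth, tools available at that stage)."""
    report = {}
    for name, depth, tools in stages:
        # Assumption: every tool in scope is a candidate at the decision point.
        report[name] = depth * max(len(tools), 1) * (1 + overlap)
    return report


stages = [
    ("before_auth", 1, ["login"]),
    ("after_auth", 3, ["get_orders", "get_order_details",
                       "check_refund_policy", "process_refund"]),
]
report = load_per_stage(stages)
print(report)
```

Scoring each stage separately shows where the load actually concentrates, which a single static score over the full tool list would hide.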

Limitations

The compound load formula is a heuristic approximation. Real-world failure rates depend on the specific model, prompt engineering, and task domain -- use the thresholds as guidelines, not guarantees.
The framework assumes tool calls are the primary source of complexity. Tasks with complex reasoning between tool calls (e.g., multi-step math) have cognitive demands not captured by TIG metrics alone.
Extraneous load measurement requires subjective judgment about semantic similarity of parameters and descriptions. Two reviewers may score overlap differently.
The framework is calibrated primarily on English-language tool descriptions and tasks. Multilingual tool use may shift the thresholds.
ToolLoad-Bench, the benchmark introduced in the paper, uses synthetic tasks. Real-world tool APIs have additional complexity (rate limits, auth, error codes) not modeled by the load metrics.

Reference

Wang, Q., Hu, Y., Lu, M., Wu, J., & Liu, Y. (2026). Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents. AAAI 2026. arXiv:2601.20412 -- Read Sections 3-4 for the TIG formalism and load quantification, Section 5 for ToolLoad-Bench construction, and Section 6 for the performance cliff analysis and capability boundary maps.
