skills/from-features-actions-explainability/SKILL.md
Diagnose and explain failures in agentic AI systems using trace-based rubric evaluation, bridging static feature attribution (SHAP/LIME) with trajectory-level diagnostics. Use when: 'debug why my agent failed', 'explain agent behavior', 'evaluate agent traces', 'add explainability to my agent pipeline', 'diagnose agentic failures', 'trace-based agent analysis'.
npx skillsauth add ndpvt-web/arxiv-claude-skills from-features-actions-explainability
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the unified XAI evaluation framework from Chaduvula et al. (2026) to diagnose failures in both traditional ML models and multi-step agentic AI systems. The core insight: attribution methods (SHAP, LIME) work well for static predictions (Spearman rho = 0.86) but fail to diagnose execution-level failures in agentic trajectories. Trace-grounded rubric evaluation, which scores agent runs against six behavioral dimensions, reliably localizes breakdowns instead: state tracking inconsistency is 2.7x more prevalent in failed runs and is associated with a 49% relative drop in success probability.
The problem with applying SHAP/LIME to agents: Traditional attribution methods explain which input features drove a single prediction. Agentic systems produce trajectories — sequences of (state, action, observation) tuples — where failure emerges from compounding decisions, not a single input-output mapping. Applying SHAP to raw agent outputs yields correlational rankings but cannot localize which step caused failure or why.
Trace-grounded rubric evaluation: Instead of attributing outcomes to inputs, this approach evaluates each agent run against six binary behavioral dimensions derived from the execution trace alone (no outcome leakage). An LLM judge reads the full trace — tool calls, parameters, returns, intermediate states — and scores each dimension as satisfied (0) or violated (1). This produces a structured diagnostic vector per run that directly localizes failure modes.
The bridge methodology: To rigorously compare paradigms, the framework encodes rubric scores as binary features, trains a logistic regression predicting task success, then applies SHAP to the rubric-level features. This recovers sensible importance rankings (intent alignment: 0.473, state tracking: 0.422) but the authors emphasize these remain correlational — the rubric evaluation itself is the actionable diagnostic, not the SHAP values computed over it.
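The bridge step can be sketched in plain Python. This is a minimal illustration under loud assumptions, not the authors' implementation: the data is synthetic, the logistic regression is a hand-rolled gradient descent rather than a library fit, and the SHAP values use the closed form for linear models with independent features, `phi_j = w_j * (x_j - mean_j)`, in log-odds space. The importances it prints will not match the paper's figures.

```python
import math
import random

DIMS = ["intent_alignment", "plan_adherence", "tool_correctness",
        "tool_choice", "state_tracking", "error_recovery"]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=400, l2=1e-3):
    """Batch gradient descent on the L2-regularized log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
    return w, b

def linear_shap(w, X):
    """Per-run SHAP values (log-odds) for a linear model, assuming
    independent features: phi_j = w_j * (x_j - mean_j)."""
    means = [sum(col) / len(X) for col in zip(*X)]
    phi = [[wj * (xj - mj) for wj, xj, mj in zip(w, xi, means)] for xi in X]
    return phi, means

# SYNTHETIC rubric vectors (1 = violated). By construction, intent_alignment
# and state_tracking violations hurt success the most. Not the paper's data.
random.seed(0)
X, y = [], []
for _ in range(300):
    flags = [int(random.random() < 0.3) for _ in DIMS]
    logit = 1.5 - 2.0 * flags[0] - 2.0 * flags[4] - 0.3 * sum(flags)
    X.append(flags)
    y.append(int(random.random() < sigmoid(logit)))

w, b = fit_logistic(X, y)
phi, means = linear_shap(w, X)
base_value = b + sum(wj * mj for wj, mj in zip(w, means))

# Global importance: mean absolute SHAP value per rubric dimension.
importance = {dim: sum(abs(row[j]) for row in phi) / len(phi)
              for j, dim in enumerate(DIMS)}
top2 = sorted(importance, key=importance.get, reverse=True)[:2]
for dim in sorted(importance, key=importance.get, reverse=True):
    print(f"{dim}: {importance[dim]:.3f}")
```

The ranking only says which violations co-occur with failure across runs. It does not say which step broke in a given run, which is why the per-run rubric vector stays the primary diagnostic.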
Each dimension receives a binary violation flag per agent run:
| Dimension | What It Tests | Failure Signal |
|---|---|---|
| Intent Alignment | Actions align with stated goals and task requirements | Agent pursues wrong objective |
| Plan Adherence | Maintains coherent multi-step plans throughout execution | Agent abandons or contradicts its own plan |
| Tool Correctness | Invokes tools with valid parameters and correct syntax | Malformed API calls, wrong argument types |
| Tool Choice Accuracy | Selects the optimal tool for each sub-task | Uses search when it should use update, etc. |
| State Tracking Consistency | Maintains coherent internal state across steps | Latent drift between agent memory and environment |
| Error Recovery | Detects and recovers from execution failures | Ignores error responses, repeats failed actions |
Collect execution traces. Ensure the agent pipeline logs complete trajectories: each step's action (tool name + parameters), observation (tool return value), and any intermediate reasoning or state. Store as structured JSON with fields step, action, parameters, observation, reasoning.
Define the rubric prompt template. For each of the six dimensions, write a judge prompt that receives ONLY the trace (not the final outcome) and outputs a binary violation label with a one-sentence justification. Example for State Tracking:
    Given the following agent execution trace, determine whether the agent
    maintained consistent internal state across all steps. A violation occurs
    when the agent acts on outdated, contradictory, or fabricated state
    information. Output: {"violated": true/false, "evidence": "..."}
Run the LLM judge over each trace. Apply each rubric prompt to the full trace using a capable model (GPT-4-class or above) with low temperature (0.1) in a single pass. Collect the six binary flags per run into a diagnostic vector.
Compute failure-mode prevalence. For a batch of N runs with known outcomes, calculate P(violation | failure) and P(violation | success) for each dimension. The prevalence ratio reveals which failure modes are disproportionately associated with task failure.
Compute reliability correlates. Calculate P(success | violation) vs P(success | no violation) for each dimension. The relative risk RR = P(success|flag) / P(success|no flag) quantifies how much each violation reduces success probability.
Identify the dominant failure pattern. Classify the failure regime:
Generate the Minimal Explanation Packet (MEP). Bundle three components:
Act on diagnostics. Use the localized failure to fix the agent: add state validation checkpoints for state drift, improve tool selection prompting for tool choice failures, add plan-checking loops for plan adherence issues.
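The prevalence and reliability steps above reduce to a 2x2 count table per dimension. The sketch below is one possible way to compute them; `DimensionStats` and `dimension_stats` are hypothetical names introduced here, and the ten runs in the usage lines are made-up toy data.

```python
import math
from dataclasses import dataclass

@dataclass
class DimensionStats:
    p_violation_given_failure: float
    p_violation_given_success: float
    prevalence_ratio: float          # P(violation | failure) / P(violation | success)
    p_success_given_violation: float
    p_success_given_clean: float
    relative_risk: float             # P(success | flag) / P(success | no flag)

def _ratio(numer: float, denom: float) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numer / denom if denom else math.nan

def dimension_stats(flags: list[int], outcomes: list[bool]) -> DimensionStats:
    """flags[i] is 1 if run i violated the dimension; outcomes[i] is True on success."""
    if len(flags) != len(outcomes):
        raise ValueError("flags and outcomes must have the same length")
    pairs = list(zip(flags, outcomes))
    viol_fail = sum(1 for f, o in pairs if f and not o)
    viol_succ = sum(1 for f, o in pairs if f and o)
    clean_fail = sum(1 for f, o in pairs if not f and not o)
    clean_succ = sum(1 for f, o in pairs if not f and o)

    p_v_fail = _ratio(viol_fail, viol_fail + clean_fail)
    p_v_succ = _ratio(viol_succ, viol_succ + clean_succ)
    p_s_viol = _ratio(viol_succ, viol_fail + viol_succ)
    p_s_clean = _ratio(clean_succ, clean_fail + clean_succ)
    return DimensionStats(
        p_violation_given_failure=p_v_fail,
        p_violation_given_success=p_v_succ,
        prevalence_ratio=_ratio(p_v_fail, p_v_succ),
        p_success_given_violation=p_s_viol,
        p_success_given_clean=p_s_clean,
        relative_risk=_ratio(p_s_viol, p_s_clean),
    )

# Toy data: 4 failed runs (3 flagged) followed by 6 successful runs (1 flagged).
flags = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
outcomes = [False] * 4 + [True] * 6
stats = dimension_stats(flags, outcomes)
print(f"prevalence ratio = {stats.prevalence_ratio:.2f}, RR = {stats.relative_risk:.2f}")
```

Returning NaN for empty cells, rather than flooring the denominator, keeps a dimension that never fires in successful runs from showing up as an arbitrarily large ratio. Either choice is defensible as long as it is reported.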
Example 1: Diagnosing a failing customer-service agent
User: "My airline booking agent succeeds only 56% of the time on TAU-bench tasks. Help me figure out why it fails."
Approach:
Output:
    Failure Diagnosis Report
    ========================
    Dominant failure mode: State Tracking Inconsistency
    Prevalence ratio: 2.7x (63.6% in failures vs 23.8% in successes)
    Success impact: -36 percentage points when violated (RR = 0.51)

    Pattern: Agent retrieves customer record, applies modification via
    update_booking tool, but subsequent tool calls reference stale pre-update
    fields (e.g., old seat assignment after seat change).

    Recommendation: Add a state re-fetch step after every mutating tool call.
    Validate that the agent's working memory matches the latest API response
    before proceeding to the next action.
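The re-fetch recommendation in the report could be enforced at the tool layer rather than left to prompting. The following is a minimal sketch, assuming an in-memory stand-in for the booking API: `FreshStateToolbox`, `get_booking`, and the `seat` parameter are hypothetical, and only the `update_booking` tool name comes from the report.

```python
from typing import Any, Callable

class FreshStateToolbox:
    """Routes tool calls and refreshes cached state after any mutating call."""

    def __init__(self, tools: dict[str, Callable[..., Any]],
                 mutating: set[str], refetch: Callable[[], dict]):
        self._tools = tools
        self._mutating = mutating
        self._refetch = refetch
        self.state = refetch()  # working memory starts from a fresh read

    def call(self, name: str, **params: Any) -> Any:
        result = self._tools[name](**params)
        if name in self._mutating:
            # Discard the pre-update snapshot so later steps cannot read stale fields.
            self.state = self._refetch()
        return result

# Hypothetical in-memory backend standing in for the airline API.
_db = {"booking_id": "B1", "seat": "12A"}

def get_booking() -> dict:
    return dict(_db)

def update_booking(seat: str) -> dict:
    _db["seat"] = seat
    return {"status": "ok"}

toolbox = FreshStateToolbox(
    tools={"get_booking": get_booking, "update_booking": update_booking},
    mutating={"update_booking"},
    refetch=get_booking,
)
stale_seat = toolbox.state["seat"]
toolbox.call("update_booking", seat="14C")
print(stale_seat, "->", toolbox.state["seat"])
```

Keeping the set of mutating tools explicit is a design choice: read-only calls skip the extra round trip, and the set doubles as documentation of which tools can invalidate the agent's memory.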
Example 2: Building a rubric evaluator for a code-generation agent
User: "I have a ReAct agent that writes and executes code. I want to add trace-based diagnostics to my eval pipeline."
Approach:
Output (evaluation harness pseudocode):
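    from statistics import mean

    RUBRIC_DIMENSIONS = [
        "intent_alignment", "plan_adherence", "tool_correctness",
        "tool_choice", "state_tracking", "error_recovery",
    ]

    # load_rubric_prompt, call_llm, format_trace, and parse_binary_judgment are
    # placeholders to be supplied by your own pipeline.

    def evaluate_run(trace: list[dict], judge_model: str) -> dict:
        """Score one trace on every rubric dimension (1 = violated, 0 = satisfied)."""
        results = {}
        for dim in RUBRIC_DIMENSIONS:
            prompt = load_rubric_prompt(dim)
            judgment = call_llm(judge_model, prompt + format_trace(trace))
            results[dim] = parse_binary_judgment(judgment)
        return results

    def compute_failure_analysis(all_results: list[dict], outcomes: list[bool]):
        """Print P(violation | failure) / P(violation | success) per dimension."""
        for dim in RUBRIC_DIMENSIONS:
            # mean() raises StatisticsError if there are no failed or no successful runs.
            violations_in_failures = mean(r[dim] for r, o in zip(all_results, outcomes) if not o)
            violations_in_successes = mean(r[dim] for r, o in zip(all_results, outcomes) if o)
            # Floor the denominator at 0.01 to avoid division by zero.
            ratio = violations_in_failures / max(violations_in_successes, 0.01)
            print(f"{dim}: prevalence ratio = {ratio:.1f}x")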
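The harness above leaves `format_trace` and `parse_binary_judgment` undefined. One possible sketch of both follows, assuming traces use the `step`, `action`, `parameters`, `observation`, `reasoning` fields and the judge replies with the `{"violated": ..., "evidence": ...}` JSON object described earlier. The regex-based extraction is an assumption about how a judge might wrap its answer in extra prose, not a documented behavior of any model.

```python
import json
import re

TRACE_FIELDS = ("step", "action", "parameters", "observation", "reasoning")

def format_trace(trace: list[dict]) -> str:
    """Render each step as one JSON line, keeping only the known trace fields."""
    lines = []
    for step in trace:
        record = {key: step.get(key) for key in TRACE_FIELDS}
        lines.append(json.dumps(record, default=str))
    return "\n".join(lines)

def parse_binary_judgment(judgment: str) -> int:
    """Return 1 if the judge reported a violation, 0 if not.

    Raises ValueError when no JSON object with a boolean "violated"
    field can be found, so malformed judgments are never silently scored.
    """
    match = re.search(r"\{.*\}", judgment, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in judge output")
    payload = json.loads(match.group(0))  # JSONDecodeError is a ValueError
    violated = payload.get("violated")
    if not isinstance(violated, bool):
        raise ValueError('"violated" must be true or false')
    return int(violated)

# Illustrative two-step trace with made-up values.
example_trace = [
    {"step": 1, "action": "get_booking", "parameters": {"id": "B1"},
     "observation": {"seat": "12A"}, "reasoning": "Look up the booking."},
    {"step": 2, "action": "update_booking", "parameters": {"seat": "14C"},
     "observation": {"status": "ok"}, "reasoning": "Apply the seat change."},
]
rendered = format_trace(example_trace)
flag = parse_binary_judgment('Verdict: {"violated": true, "evidence": "stale seat"}')
print(rendered)
print(flag)
```

Raising on unparseable output, instead of defaulting to 0, avoids undercounting violations when the judge drifts from the requested format.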
Example 3: Deciding between SHAP and trace diagnostics
User: "Should I use SHAP or trace-based evaluation to explain my system's behavior?"
Approach:
Output:
    Decision Matrix
    ===============
    Static model (classifier, regressor, single-call LLM):
      -> Use SHAP/LIME. Expect stable rankings (rho > 0.80 for linear models).
      -> Cross-validate with at least 2 methods to confirm stability.

    Agentic system (multi-step, tool-calling, stateful):
      -> Use trace-based rubric evaluation with LLM judge.
      -> Attribution methods on raw features will be correlational, not causal.
      -> Optionally apply SHAP to rubric-level features as a summary view,
         but rely on per-run rubric diagnostics for actionable debugging.

    Hybrid (agent with a static component, e.g., classifier gating tool calls):
      -> Use SHAP/LIME on the static component.
      -> Use trace diagnostics on the overall trajectory.
      -> Combine into a layered MEP (Minimal Explanation Packet).
Paper: Chaduvula, Ho, Kim, Narayanan, Alinoori. "From Features to Actions: Explainability in Traditional and Agentic AI Systems." arXiv:2602.06841v2, 2026. https://arxiv.org/abs/2602.06841v2
Code: https://github.com/VectorInstitute/unified-xai-evaluation-framework
Key takeaway: Look for Tables 7-8 (failure-mode prevalence and reliability correlates) and the MEP structure definition — these are the actionable components for building trace-based diagnostics into your own agent evaluation pipeline.