skills/cam-causality-based-analysis-framework/SKILL.md
Analyze and optimize multi-agent code generation pipelines using causality-based importance ranking of intermediate features. Identifies which pipeline stages matter most, enables targeted failure repair, token-efficient pruning, and hybrid LLM backend assignment. Triggers: 'analyze my multi-agent pipeline', 'optimize agent code generation', 'which pipeline stages matter most', 'reduce token usage in my agent system', 'fix failing multi-agent code generation', 'assign LLMs to pipeline stages'
npx skillsauth add ndpvt-web/arxiv-claude-skills cam-causality-based-analysis-framework
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the CAM framework to analyze, diagnose, and optimize multi-agent code generation system (MACGS) pipelines. CAM treats each intermediate output of a multi-agent pipeline (requirements, design docs, data structures, etc.) as a causal feature and quantifies its contribution to final code correctness using counterfactual error injection and a Feature Responsibility (FR) metric. This allows targeted failure repair (73.3% success rate on top-3 features), aggressive token pruning (up to 66.8% reduction), and hybrid LLM backend assignment (up to 7.2% Pass@1 gain).
Counterfactual Feature Importance via Error Injection. CAM works by injecting realistic errors into each intermediate output of a MACGS pipeline and measuring whether the final code breaks. Unlike random perturbation, CAM uses an LLM to generate semantically coherent but incorrect variants of each feature — plausible mistakes that mirror real failures. For each problem, CAM discovers minimal feature combinations whose simultaneous corruption causes failure, then aggregates these across problems into a Feature Responsibility (FR) score.
Feature Responsibility (FR) Metric. The FR score is computed as: FR(f_i) = sum over problems p, sum over important sets S containing f_i, of (1/|S|)^2. The squared inverse of set size amplifies features that cause failure on their own (|S|=1) versus those that only matter in combination. This distinguishes independently critical features (like Data_Struct and Req_Stat, which rank #1 across all configurations) from context-dependent features (like Program_Lang, which matters in 78.8% of cases only when corrupted alongside other features).
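Written out, the FR definition above reads as follows. The symbol $P$ (the benchmark problem set) is notation assumed here for illustration, and $\mathcal{S}_p$ denotes the minimal important sets found for problem $p$. The worked case is hypothetical: three problems whose important sets are $\{A\}$, $\{A, B\}$, and $\{B, C, D\}$ respectively.

```latex
\[
\mathrm{FR}(f_i) \;=\; \sum_{p \in P} \;\sum_{\substack{S \in \mathcal{S}_p \\ f_i \in S}} \left(\frac{1}{|S|}\right)^{2}
\]

% Hypothetical example: important sets {A}, {A, B}, {B, C, D}, one per problem.
\[
\mathrm{FR}(A) = 1 + \tfrac{1}{4} = 1.25, \qquad
\mathrm{FR}(B) = \tfrac{1}{4} + \tfrac{1}{9} \approx 0.36, \qquad
\mathrm{FR}(C) = \mathrm{FR}(D) = \tfrac{1}{9} \approx 0.11
\]
```

The single-feature set contributes a full point to A, while membership in a three-feature set contributes only one ninth, which is how the squared term separates independently critical features from context-dependent ones.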
Three Practical Applications. (1) Failure Repair: When code generation fails, inspect and fix the top-3 FR-ranked features first — this achieves 73.3% repair success versus 27-41% for baselines. (2) Feature Pruning: Remove the lowest-FR features from the pipeline to cut tokens by 6-67% with negligible quality loss. (3) Hybrid Backend: Compare FR distributions across LLMs and assign each pipeline stage to the model that is strongest on that feature category — yields up to 7.2% Pass@1 improvement.
Map the pipeline's intermediate features. Catalog every intermediate output your MACGS produces. Classify each into one of four categories:
- Specification: Req_Stat, programming language (Program_Lang), documentation language (Language)
- Analysis: Req_Anal, competitive analysis (Compet_Anal), logic breakdown (Logic_Anal), requirement pool (Req_Pool)
- Design: Implement, data structures (Data_Struct)
- Dependency: Req_Pack, file listing (File_List), shared knowledge (Share_Know)
Select a benchmark problem set. Choose 50-164 problems from a relevant benchmark (HumanEval, MBPP, CodeContest, or your own test suite). These must have verifiable correctness criteria (test cases).
Generate baseline outputs. Run the pipeline end-to-end on the problem set, recording every intermediate feature and the final Pass@1 rate. Store these as ground-truth feature values.
Inject counterfactual errors per feature. For each feature f_i and each problem, prompt an LLM to produce a realistic but incorrect variant of f_i that is superficially plausible. Validate that the corrupted variant differs meaningfully (sentence-transformer cosine similarity < 0.5 with original).
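The similarity check in this step could be sketched as below. It assumes the original and corrupted feature texts have already been embedded by some sentence-embedding model (not shown). The helper names `cosine_similarity` and `is_valid_corruption` are hypothetical. Only the 0.5 threshold comes from the step above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero embedding as dissimilar
    return dot / (norm_a * norm_b)


def is_valid_corruption(
    original_emb: Sequence[float],
    corrupted_emb: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Accept a corrupted variant only if it differs enough from the original."""
    return cosine_similarity(original_emb, corrupted_emb) < threshold
```

A variant that fails this check would be regenerated before it is injected into the pipeline.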
Discover minimal important feature sets. For each problem, test single-feature corruptions first. If corruption of f_i alone causes failure, record {f_i} as an important set. For surviving features, test combinations of increasing length (2 up to L_max=5), using an influence-set heuristic to prioritize combinations likely to cause failure. Cap total executions at N=100 per problem with early stopping patience k=10.
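A minimal sketch of this search follows, under stated assumptions. `fails_when_corrupted` is a hypothetical callback that re-runs the pipeline with the given features corrupted and reports whether the final code fails. The influence-set prioritization is replaced here by plain enumeration order. Patience is interpreted as the number of consecutive unsuccessful combination runs, which may differ from the paper's exact rule.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence


def find_important_sets(
    features: Sequence[str],
    fails_when_corrupted: Callable[[FrozenSet[str]], bool],
    l_max: int = 5,
    max_runs: int = 100,
    patience: int = 10,
) -> List[FrozenSet[str]]:
    """Search for minimal feature sets whose joint corruption breaks the final code."""
    important: List[FrozenSet[str]] = []
    runs = 0    # pipeline executions spent on this problem
    misses = 0  # consecutive combination runs that found nothing (assumed patience rule)

    for size in range(1, l_max + 1):
        for combo in combinations(features, size):
            candidate = frozenset(combo)
            # A superset of a known important set is not minimal, so skip it.
            if any(known <= candidate for known in important):
                continue
            if runs >= max_runs or (size > 1 and misses >= patience):
                return important
            runs += 1
            if fails_when_corrupted(candidate):
                important.append(candidate)
                misses = 0
            elif size > 1:
                misses += 1
    return important
```

Skipping supersets is what keeps the recorded sets minimal: once `{f_i}` alone causes failure, no larger combination containing `f_i` is tested for that problem.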
Compute Feature Responsibility scores. Aggregate across all problems using FR(f_i) = sum_p sum_{S in S_p, f_i in S} (1/|S|)^2. Rank features by FR. The top features are your optimization targets.
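The aggregation can be sketched in a few lines. The function names and the input layout (a mapping from problem id to its important sets) are assumptions for illustration, and the sample data is hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple


def feature_responsibility(
    sets_per_problem: Dict[str, Iterable[FrozenSet[str]]],
) -> Dict[str, float]:
    """Sum (1/|S|)^2 over every important set S that contains each feature."""
    fr: Dict[str, float] = defaultdict(float)
    for important_sets in sets_per_problem.values():
        for s in important_sets:
            if not s:
                continue  # ignore empty sets rather than divide by zero
            weight = (1.0 / len(s)) ** 2
            for feature in s:
                fr[feature] += weight
    return dict(fr)


def rank_features(fr: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (feature, FR) pairs, highest FR first."""
    return sorted(fr.items(), key=lambda item: item[1], reverse=True)


# Hypothetical important sets for three problems.
sets_per_problem = {
    "p1": [frozenset({"Data_Struct"})],
    "p2": [frozenset({"Req_Stat", "Program_Lang"})],
    "p3": [frozenset({"Data_Struct"}), frozenset({"Req_Stat", "Program_Lang"})],
}
scores = feature_responsibility(sets_per_problem)
print(rank_features(scores))
```

In this toy input Data_Struct scores 2.0 from two singleton sets, while Req_Stat and Program_Lang each score 0.5 from two pairs.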
Apply failure repair (if diagnosing failures). For each failing problem, inspect the top-3 FR-ranked features in order. Check each for semantic clarity, completeness, and consistency with the problem statement. Augment or correct the feature, then re-run the pipeline.
Apply feature pruning (if reducing cost). Remove the bottom-N FR-ranked features from the pipeline. Start conservatively (N=2) and measure the Pass@1 delta against the baseline. Increase N until quality degrades beyond your tolerance. Typical sweet spot: pruning 2-4 features for 6-34% token savings with little or no quality loss.
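The incremental sweep could look like the sketch below. `evaluate` is a hypothetical callback that re-runs the benchmark with the listed features removed and returns Pass@1 as a fraction. The default tolerance of one percentage point is an arbitrary illustrative choice.

```python
from typing import Callable, Dict, List


def choose_pruning_set(
    fr_scores: Dict[str, float],
    evaluate: Callable[[List[str]], float],
    baseline_pass1: float,
    tolerance: float = 0.01,
    start_n: int = 2,
) -> List[str]:
    """Grow the bottom-N pruning set until the Pass@1 drop exceeds the tolerance.

    Returns the last accepted set (empty if even start_n features is too many).
    """
    ascending = sorted(fr_scores, key=fr_scores.get)  # lowest FR first
    accepted: List[str] = []
    # Stop before len(ascending) so at least one feature always remains.
    for n in range(start_n, len(ascending)):
        candidate = ascending[:n]
        pass1 = evaluate(candidate)  # hypothetical benchmark re-run
        if baseline_pass1 - pass1 > tolerance:
            break
        accepted = candidate
    return accepted
```

Because the sweep stops at the first N that violates the tolerance, each accepted set costs exactly one benchmark run.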
Apply hybrid backend assignment (if using multiple LLMs). Compare FR distributions across candidate LLMs. Assign each pipeline stage to the model with the lowest FR (fewest failures) on that feature category. For example, if DeepSeek excels at Design features and GPT-4o-mini at Specification features, route accordingly.
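The assignment rule can be sketched as a per-category minimum. `assign_backends` is a hypothetical helper name. The input numbers are the illustrative per-category FR values shown in Example 3, not measurements.

```python
from typing import Dict


def assign_backends(fr_by_model: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each feature category to the model with the lowest FR on it."""
    categories = sorted({c for scores in fr_by_model.values() for c in scores})
    assignment: Dict[str, str] = {}
    for category in categories:
        # Only compare models that actually report a score for this category.
        candidates = {
            model: scores[category]
            for model, scores in fr_by_model.items()
            if category in scores
        }
        assignment[category] = min(candidates, key=candidates.get)
    return assignment


fr_by_model = {
    "GPT-4o-mini": {"Specification": 8.2, "Analysis": 6.9, "Design": 12.3, "Dependency": 5.1},
    "DeepSeek": {"Specification": 11.4, "Analysis": 7.1, "Design": 6.8, "Dependency": 4.9},
}
print(assign_backends(fr_by_model))
```

Where the gap between two models is marginal, the assignment may reasonably be decided by cost or latency instead.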
Validate and iterate. Re-run the full benchmark with your changes (repaired features, pruned pipeline, or hybrid backend). Compare Pass@1 against baseline. Iterate by adjusting the FR threshold or backend assignments.
Example 1: Diagnosing a Failing MetaGPT Pipeline
User: "My MetaGPT pipeline is failing on 40% of HumanEval problems. Which intermediate step should I fix first?"
Approach:
- Expect Data_Struct and Req_Stat to rank highest (consistent with paper findings across all configurations)
- Inspect Data_Struct outputs on failing problems for incorrect structure choices (e.g., using a list where a hashmap is needed)
Output:
Feature Responsibility Rankings (top 5):
1. Data_Struct (FR=14.2) — Design category
2. Req_Stat (FR=13.8) — Specification category
3. Implement (FR=9.1) — Design category
4. Language (FR=7.4) — Specification category
5. Req_Anal (FR=5.6) — Analysis category
Recommendation: Focus repair on Data_Struct and Req_Stat first.
These two features appear in the important set for 31 of 66 failing problems.
Repairing top-3 features should fix ~73% of failures based on CAM benchmarks.
Example 2: Reducing Token Costs in a Code Generation Pipeline
User: "My multi-agent pipeline uses 12 intermediate features and costs too much. Which features can I safely remove?"
Approach:
- Identify the lowest-FR features as pruning candidates (e.g., Compet_Anal, Share_Know, File_List)
Output:
Pruning Analysis:
- Remove bottom 2 (Compet_Anal, Share_Know): -6.4% tokens, +0.2% Pass@1 (slight improvement)
- Remove bottom 4 (+ File_List, Req_Pool): -33.6% tokens, -0.8% Pass@1 (negligible loss)
- Remove bottom 6 (+ Logic_Anal, Req_Pack): -51.2% tokens, -3.1% Pass@1 (moderate loss)
- Remove bottom 8: -66.8% tokens, -8.4% Pass@1 (significant loss)
Recommended: Prune bottom 4 features for 33.6% token savings with negligible quality impact.
Example 3: Hybrid LLM Backend Assignment
User: "I have access to GPT-4o-mini and DeepSeek. How should I split my pipeline between them?"
Approach:
Output:
Per-Category FR Comparison (lower = better):
                GPT-4o-mini   DeepSeek
Specification:  8.2           11.4       → GPT-4o-mini
Analysis:       6.9           7.1        → GPT-4o-mini (marginal)
Design:         12.3          6.8        → DeepSeek
Dependency:     5.1           4.9        → DeepSeek (marginal)
Hybrid Assignment:
- Specification + Analysis stages → GPT-4o-mini
- Design + Dependency stages → DeepSeek
Expected improvement: ~5-7% Pass@1 over uniform single-backend.
A feature that matters mainly in combination with others (e.g., Program_Lang) indicates cross-feature consistency issues; fix the interaction, not the feature in isolation.
The influence set of a feature f_i is the set of features f_j whose outputs change (similarity < theta) when f_i is corrupted. Prioritize combinations containing features with large influence sets.
Paper: CAM: A Causality-based Analysis Framework for Multi-Agent Code Generation Systems (Lyu et al., 2026). Look for Algorithm 1 (computing important feature sets), Equation 4 (FR metric), Tables 3-6 (importance rankings across configurations), and Section 5 (failure repair and pruning applications).