skills/agentxray-white-boxing-agentic-systems/SKILL.md
Reverse-engineer black-box agentic systems into editable, interpretable workflows using search-based reconstruction. Use when the user says 'reconstruct this agent workflow', 'reverse-engineer this pipeline', 'white-box this agentic system', 'explain what this agent chain is doing', 'approximate this black-box agent', or 'build an interpretable surrogate for this system'.
npx skillsauth add ndpvt-web/arxiv-claude-skills agentxray-white-boxing-agentic-systems
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to reconstruct interpretable, editable surrogate workflows from black-box agentic systems using only input-output access. Based on the AgentXRay framework, it formulates Agentic Workflow Reconstruction (AWR) as a combinatorial search over discrete agent roles and tool invocations, using Monte Carlo Tree Search with Red-Black Pruning to efficiently navigate the space of possible chain-structured workflows. The result is a transparent, modifiable pipeline that approximates the original system's behavior without needing access to its internals.
Agentic Workflow Reconstruction (AWR) treats workflow recovery as an optimization problem: given a black-box system that maps task inputs to outputs, find a sequential workflow s = [s1, s2, ..., sL] of primitives that maximizes output similarity. Each primitive sj is a 4-tuple (role, model, thought_pattern, toolset) representing one agent step. The search space is all sequences up to length Lmax, drawn from a catalog of available agent roles and tools. The key insight is the Linearity Hypothesis -- most deployed agentic systems, regardless of their internal architecture, execute as sequential chains at runtime, making linear reconstruction a practical approximation.
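The formulation above can be made concrete with a small data model. This is a minimal sketch, not the paper's code: the `Primitive` class, the `search_space_size` helper, and the model name `example-model` are hypothetical names introduced for illustration, and the count assumes primitives may repeat within a workflow.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Primitive:
    """One agent step: (role, model, thought_pattern, toolset)."""
    role: str
    model: str
    thought_pattern: str
    toolset: Tuple[str, ...] = ()


# A workflow s = [s1, ..., sL] is an ordered sequence of primitives.
Workflow = Tuple[Primitive, ...]


def search_space_size(catalog_size: int, l_max: int) -> int:
    """Count non-empty sequences of length 1..l_max (repeats allowed)."""
    return sum(catalog_size ** length for length in range(1, l_max + 1))


# "example-model" is a placeholder, not a real model identifier.
planner = Primitive("Planner", "example-model", "chain-of-thought")
coder = Primitive("Coder", "example-model", "direct", ("code_executor",))
workflow: Workflow = (planner, coder)

# 10 + 100 + 1000 + 10000 + 100000 candidate workflows.
size = search_space_size(catalog_size=10, l_max=5)
print(size)  # 111110
```

Even a 10-primitive catalog with Lmax = 5 yields over a hundred thousand candidate chains, which is why exhaustive evaluation is impractical and a guided search is used instead.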
Monte Carlo Tree Search (MCTS) with Red-Black Pruning navigates this combinatorial space efficiently. Each tree node represents a partial workflow prefix; each edge appends one primitive. The search cycles through four phases: (1) selection via UCB, which balances exploration and exploitation, (2) expansion, which adds new agent-tool combinations, (3) simulation, which completes the workflow randomly and scores it, and (4) backpropagation of rewards. Red-Black Pruning scores each node v as Score(v) = (Q(v)/N(v)) * ((d(v)+1)/(Lmax+1)) * (|C(v)|/M), where Q(v) is the cumulative reward, N(v) the visit count, d(v) the depth, |C(v)| the number of children, and M the branching cap. The three factors capture quality (average reward), depth (how far into the workflow), and width (branching diversity). Nodes scoring above a quantile threshold are marked Red (promising -- exploit via depth refinement); those below are Black (unpromising -- explore via width expansion or prune). The pruning reduces token consumption by 8-22% while maintaining or improving reconstruction fidelity.
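To make the scoring formula easier to check by hand, here is a small worked sketch. It assumes Q(v) is the cumulative reward and N(v) the visit count, so that Q(v)/N(v) is the average reward; the function name `red_black_score` and the zero-visit guard are additions for illustration.

```python
def red_black_score(q_total, visits, depth, l_max, num_children, branching_cap):
    """Score(v) = (Q/N) * ((d + 1) / (Lmax + 1)) * (|C| / M)."""
    if visits == 0:
        return 0.0  # assumption: an unvisited node has no quality estimate yet
    quality = q_total / visits               # average reward
    depth_term = (depth + 1) / (l_max + 1)   # how far into the workflow
    width_term = num_children / branching_cap  # branching diversity
    return quality * depth_term * width_term


# Q = 4.0 over N = 10 visits, depth 2 of Lmax 5, 4 children of cap M = 8:
# 0.4 * (3 / 6) * (4 / 8) = 0.1
score = red_black_score(q_total=4.0, visits=10, depth=2, l_max=5,
                        num_children=4, branching_cap=8)
```

Because the three factors are multiplied, a node with no children scores zero under the formula as written, regardless of its average reward.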
Output comparison uses Static Functional Equivalence (SFE), a proxy metric measuring structural and semantic similarity between the reconstructed workflow's output and the black-box target output. SFE correlates with human judgment (Spearman rho=0.61, p<0.001) and avoids requiring execution-based evaluation, making it practical for diverse output types including code, text, and structured data.
Define the black-box interface. Identify what inputs the target agentic system accepts and what outputs it produces. Collect 5-20 representative input-output pairs as your evaluation dataset D. Ensure diverse task coverage.
Build a primitive catalog. Enumerate candidate agent roles (e.g., "Planner", "Coder", "Reviewer", "Researcher") and available tools (e.g., code_executor, web_search, file_writer, calculator). Each primitive is a tuple (role, model, thought_pattern, toolset). Start with 8-15 primitives covering likely capabilities.
Set search parameters. Choose Lmax (maximum workflow length, typically 3-7 steps), branching cap M (children per node, typically 5-10), iteration budget (50-200 MCTS iterations), and pruning quantile beta (0.3-0.5 for moderate pruning).
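One way to keep these parameters together is a small validated config object. This is a sketch only: `SearchConfig` and its field names are hypothetical, the defaults are picked from inside the ranges quoted above, and `prune_every` stands for the pruning interval k.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    l_max: int = 5           # maximum workflow length (typically 3-7)
    branching_cap: int = 8   # M, children per node (typically 5-10)
    iterations: int = 100    # MCTS iteration budget (50-200)
    beta: float = 0.4        # pruning quantile (0.3-0.5 for moderate pruning)
    prune_every: int = 10    # k, iterations between pruning passes

    def __post_init__(self):
        # Reject values that would make the search meaningless.
        if self.l_max < 1 or self.branching_cap < 1 or self.iterations < 1:
            raise ValueError("l_max, branching_cap and iterations must be >= 1")
        if not 0.0 < self.beta < 1.0:
            raise ValueError("beta must be a quantile strictly between 0 and 1")
        if self.prune_every < 1:
            raise ValueError("prune_every must be >= 1")


config = SearchConfig(l_max=4, iterations=150)
```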
Initialize the MCTS tree. Create a root node representing the empty workflow prefix. Each expansion generates child nodes by appending one primitive from the catalog to the current prefix.
Run the search loop. For each iteration: (a) Select a leaf node via UCB traversal, (b) Expand by trying a new primitive, (c) Simulate by completing the workflow with random primitives up to Lmax, (d) Execute the complete workflow on a sampled input and compute SFE similarity against the target output, (e) Backpropagate the reward through ancestor nodes.
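The loop above can be sketched in a few dozen lines. This is an illustrative toy, not the AgentXRay implementation: primitives are plain role strings, every rollout is completed to exactly Lmax steps, and `toy_evaluate` is a hypothetical stand-in for executing the workflow on a sampled input and computing SFE against the target output. Pruning is omitted here.

```python
import math
import random


class Node:
    """One partial workflow prefix in the search tree."""

    def __init__(self, prefix, parent=None):
        self.prefix = prefix      # tuple of primitives chosen so far
        self.parent = parent
        self.children = {}        # primitive -> child Node
        self.visits = 0           # N(v)
        self.total_reward = 0.0   # Q(v), cumulative reward


def ucb(child, parent_visits, c=1.4):
    """UCB1: average reward plus an exploration bonus."""
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(catalog, evaluate, l_max, branching_cap, iterations, rng):
    root = Node(())
    width = min(branching_cap, len(catalog))
    stats = {}  # complete workflow -> [reward sum, count]
    for _ in range(iterations):
        node = root
        # (a) Selection: follow UCB while the node is fully expanded.
        while len(node.prefix) < l_max and len(node.children) >= width:
            n_parent = node.visits
            node = max(node.children.values(), key=lambda ch: ucb(ch, n_parent))
        # (b) Expansion: append one untried primitive.
        if len(node.prefix) < l_max:
            untried = [p for p in catalog if p not in node.children]
            primitive = rng.choice(untried)
            child = Node(node.prefix + (primitive,), parent=node)
            node.children[primitive] = child
            node = child
        # (c) Simulation: complete the prefix with random primitives.
        missing = l_max - len(node.prefix)
        rollout = node.prefix + tuple(rng.choice(catalog) for _ in range(missing))
        # (d) Scoring: stand-in for running the workflow and computing SFE.
        reward = evaluate(rollout)
        entry = stats.setdefault(rollout, [0.0, 0])
        entry[0] += reward
        entry[1] += 1
        # (e) Backpropagation through all ancestors.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    # Return the complete workflow with the highest average reward.
    best = max(stats, key=lambda w: stats[w][0] / stats[w][1])
    return best, stats[best][0] / stats[best][1]


# Toy black box: reward is the fraction of positions matching a hidden chain.
TARGET = ("Planner", "Coder", "Reviewer")


def toy_evaluate(workflow):
    return sum(a == b for a, b in zip(workflow, TARGET)) / len(TARGET)


catalog = ["Planner", "Coder", "Reviewer", "Researcher"]
best, score = search(catalog, toy_evaluate, l_max=3, branching_cap=4,
                     iterations=200, rng=random.Random(0))
```

In a real run, `evaluate` would be the expensive part (one workflow execution per iteration), so the iteration budget directly bounds token cost.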
Apply Red-Black Pruning periodically. Every k iterations (e.g., k=10), compute Score(v) for all active nodes. Color each node Red (score at or above the beta-quantile threshold) or Black (below it). Prune subtrees rooted at nodes that are consistently Black, and focus the remaining search budget on Red branches.
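A pruning pass can be sketched as follows. Several details are assumptions made for illustration: the threshold uses a nearest-rank quantile (one of several common conventions), "consistently Black" is read as being Black for `patience` consecutive passes, and the node names and scores are made up.

```python
import math


def quantile_threshold(scores, beta):
    """Nearest-rank beta-quantile of the scores."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(beta * len(ordered)))
    return ordered[rank - 1]


def color_nodes(scores, beta):
    """Map each node to RED (score >= threshold) or BLACK (below)."""
    threshold = quantile_threshold(scores.values(), beta)
    return {name: ("RED" if s >= threshold else "BLACK")
            for name, s in scores.items()}


def update_black_streaks(streaks, colors):
    """Count consecutive pruning passes in which each node was Black."""
    return {name: (streaks.get(name, 0) + 1 if color == "BLACK" else 0)
            for name, color in colors.items()}


# Hypothetical Score(v) values for five active nodes.
scores = {"a": 0.72, "b": 0.11, "c": 0.68, "d": 0.09, "e": 0.55}
colors = color_nodes(scores, beta=0.5)  # threshold is the median, 0.55

patience = 2  # hypothetical: prune after two consecutive Black passes
streaks = {}
for _ in range(patience):
    streaks = update_black_streaks(streaks, colors)
to_prune = sorted(name for name, k in streaks.items() if k >= patience)
print(to_prune)  # ['b', 'd']
```

Requiring a streak rather than pruning on the first Black pass gives a node a chance to recover if its early rollouts were unlucky.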
Extract the best workflow. After the budget is exhausted, select the complete workflow path with the highest average SFE score. This is your reconstructed white-box surrogate.
Validate on held-out inputs. Run the reconstructed workflow on inputs not used during search. Compare outputs against the black-box system. Acceptable SFE scores are typically 0.35+ (partial reconstruction) or 0.45+ (strong reconstruction).
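The held-out check can be sketched as a split plus a threshold lookup. The helper names, the "insufficient" label, and the example scores are hypothetical; the per-example SFE scores are assumed to be computed elsewhere, since SFE itself is not reproduced here.

```python
import random
from statistics import mean


def split_dataset(pairs, holdout_fraction, rng):
    """Shuffle the input-output pairs and reserve a fraction for validation."""
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def classify_reconstruction(sfe_scores, partial=0.35, strong=0.45):
    """Average held-out SFE scores and label them using the thresholds above."""
    average = mean(sfe_scores)
    if average >= strong:
        label = "strong"
    elif average >= partial:
        label = "partial"
    else:
        label = "insufficient"  # hypothetical label for scores below 0.35
    return average, label


pairs = [(f"input-{i}", f"output-{i}") for i in range(10)]
search_set, held_out = split_dataset(pairs, holdout_fraction=0.3,
                                     rng=random.Random(0))

# Made-up held-out scores, for illustration only.
average, label = classify_reconstruction([0.41, 0.38, 0.47, 0.36])
```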
Refine and edit. The reconstructed workflow is now an explicit, modifiable sequence. Remove steps that ablation shows are redundant. Swap agent roles or tools to improve quality. This editability is the core advantage over distillation.
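A leave-one-out ablation is one simple way to find redundant steps. The sketch below assumes an `evaluate` callback that returns a similarity score for a workflow; `toy_evaluate`, the per-role contributions, and the 0.02 tolerance are made up for illustration.

```python
def ablate(workflow, evaluate, tolerance=0.02):
    """Remove each step in turn and record how much the score drops."""
    baseline = evaluate(workflow)
    report = []
    for i, step in enumerate(workflow):
        reduced = workflow[:i] + workflow[i + 1:]
        drop = baseline - evaluate(reduced)
        # A step is flagged redundant if removing it costs at most `tolerance`.
        report.append((i + 1, step, drop, drop <= tolerance))
    return baseline, report


# Hypothetical per-role contributions standing in for real workflow runs.
CONTRIBUTION = {"Planner": 0.15, "Coder": 0.2, "Reviewer": 0.08, "Formatter": 0.0}


def toy_evaluate(workflow):
    return sum(CONTRIBUTION[step] for step in workflow)


workflow = ("Planner", "Coder", "Reviewer", "Formatter")
baseline, report = ablate(workflow, toy_evaluate)
redundant = [step for _, step, _, is_redundant in report if is_redundant]
print(redundant)  # ['Formatter']
```

Single-step ablation ignores interactions between steps, so it is worth re-validating the pruned workflow as a whole before removing several steps at once.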
Deploy the surrogate. Use the reconstructed workflow as a transparent, cheaper replacement or as documentation of the original system's effective behavior.
Example 1: Reconstructing a Code-Generation Agent Pipeline
User: "I have a black-box coding agent that takes feature requests and produces Python modules. I can query it but don't know its internal workflow. Help me reconstruct what it's doing."
Approach:
Output:
Reconstructed Workflow (SFE: 0.43):
Step 1: Planner (chain-of-thought, no tools)
-> Decomposes feature request into subtasks
Step 2: Coder (direct, tools: [code_executor])
-> Generates initial implementation
Step 3: Reviewer (critique, tools: [code_executor])
-> Runs tests, identifies issues
Step 4: Coder (direct, tools: [code_executor])
-> Fixes issues from review
Token savings vs. black-box: ~35% fewer tokens
Editable: You can remove Step 3 for simple features or add a second Reviewer pass for complex ones.
Example 2: Pruning a Bloated Multi-Agent Data Analysis Pipeline
User: "Our data analysis agent chain has 8 steps and costs too much. Can you figure out which steps actually matter?"
Approach:
Output:
Original pipeline: 8 steps, ~45K tokens/query
Red-Black Analysis:
Step 1: DataLoader [RED] Score: 0.72 -- essential
Step 2: SchemaValidator [BLACK] Score: 0.11 -- redundant
Step 3: Analyst [RED] Score: 0.68 -- essential
Step 4: Statistician [RED] Score: 0.55 -- essential
Step 5: Formatter [BLACK] Score: 0.09 -- redundant
Step 6: Visualizer [RED] Score: 0.61 -- essential
Step 7: SummaryWriter [BLACK] Score: 0.13 -- redundant (Visualizer covers this)
Step 8: ReportAssembler [RED] Score: 0.58 -- essential
Recommended pruned pipeline: [DataLoader, Analyst, Statistician, Visualizer, ReportAssembler]
Estimated token reduction: ~22%
SFE retention: 0.98 of original
Example 3: Building an Interpretable Surrogate for a Competitor's Agent
User: "A competitor has an AI tutoring agent. I can send it questions and get responses. Help me understand its workflow structure."
Approach:
Output:
Reconstructed Tutoring Workflow (SFE: 0.38):
Step 1: SubjectExpert -- retrieves domain knowledge, sets context
Step 2: Socratic -- generates guiding questions rather than direct answers
Step 3: Simplifier -- rephrases complex concepts using analogies
Step 4: Assessor -- creates a check-for-understanding question
Insight: The competitor's agent likely uses a 4-stage pedagogical
pipeline. No "Encourager" step was detected -- their system is
knowledge-focused rather than motivational.
M or reduce Lmax to focus on shorter, more achievable workflows.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction -- Shi et al., 2026. Focus on Section 3 (AWR formalization), Section 4 (MCTS + Red-Black Pruning algorithm), and Table 1 (benchmark results across five domains showing 0.426 average SFE vs. 0.339 for the best baseline).