skills/agent-architecture-audit/SKILL.md
Full-stack diagnostic for agent and LLM applications. Audits the 12-layer agent stack for wrapper regression, memory pollution, tool discipline failures, hidden repair loops, and rendering corruption. Produces severity-ranked findings with code-first fixes. Essential for developers building agent applications, autonomous loops, or any LLM-powered feature.
npx skillsauth add affaan-m/everything-claude-code agent-architecture-audit
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
A diagnostic workflow for agent systems that hide failures behind wrapper layers, stale memory, retry loops, or transport/rendering mutations.
MANDATORY for:
Especially critical when:
Do not use for:
- agent-introspection-debugging
- security-review or security-review/scan
- agent-eval

Every agent system has these layers. Any of them can corrupt the answer:
| # | Layer | What Goes Wrong |
|---|-------|----------------|
| 1 | System prompt | Conflicting instructions, instruction bloat |
| 2 | Session history | Stale context injection from previous turns |
| 3 | Long-term memory | Pollution across sessions, old topics in new conversations |
| 4 | Distillation | Compressed artifacts re-entering as pseudo-facts |
| 5 | Active recall | Redundant re-summary layers wasting context |
| 6 | Tool selection | Wrong tool routing, model skips required tools |
| 7 | Tool execution | Hallucinated execution (claims to call but doesn't) |
| 8 | Tool interpretation | Misread or ignored tool output |
| 9 | Answer shaping | Format corruption in final response |
| 10 | Platform rendering | Transport-layer mutation (UI, API, CLI mutates valid answers) |
| 11 | Hidden repair loops | Silent fallback/retry agents running second LLM pass |
| 12 | Persistence | Expired state or cached artifacts reused as live evidence |
The base model produces correct answers, but the wrapper layers make them worse.
Symptoms:
Old topics leak into new conversations through history, memory retrieval, or distillation.
Symptoms:
Tools are declared in the prompt but not enforced in code. The model skips them or hallucinates execution.
Symptoms:
The agent's internal answer is correct, but the platform layer mutates it during delivery.
Symptoms:
Silent repair, retry, summarization, or recall agents run without explicit contracts.
Symptoms:
Define what you're auditing:
Gather evidence from the codebase:
Use rg to search for anti-patterns:
```bash
# Tool requirements expressed only in prompt text (not code)
rg "must.*tool|必须.*工具|required.*call" --type md

# Tool execution without validation
rg "tool_call|toolCall|tool_use" --type py --type ts

# Hidden LLM calls outside main agent loop
rg "completion|chat\.create|messages\.create|llm\.invoke"

# Memory admission without user-correction priority
rg "memory.*admit|long.*term.*update|persist.*memory" --type py --type ts

# Fallback loops that run additional LLM calls
rg "fallback|retry.*llm|repair.*prompt|re-?prompt" --type py --type ts

# Silent output mutation
rg "mutate|rewrite.*response|transform.*output|shap" --type py --type ts
```
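The searches above can also be scripted so that each hit becomes a file:line evidence reference. The following is a minimal sketch, assuming plain-text sources under a root directory. The `PATTERNS` labels and the `scan` helper are hypothetical names, and only two of the patterns above are carried over.

```python
import re
from pathlib import Path

# Hypothetical labels mapped to patterns copied from the rg searches above.
PATTERNS = {
    "hidden_llm_call": re.compile(
        r"completion|chat\.create|messages\.create|llm\.invoke"
    ),
    "fallback_loop": re.compile(
        r"fallback|retry.*llm|repair.*prompt|re-?prompt"
    ),
}


def scan(root, suffixes=(".py", ".ts")):
    """Return (label, "file:line") pairs for every matching line under root."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for label, pattern in PATTERNS.items():
                if pattern.search(line):
                    hits.append((label, f"{path}:{lineno}"))
    return hits
```

A hit only marks a line worth reading. Whether it is an actual hidden LLM call or fallback loop still has to be confirmed by inspecting the code.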
For each finding, document:
Default fix order (code-first, not prompt-first):
| Level | Meaning | Action |
|-------|---------|--------|
| critical | Agent can confidently produce wrong operational behavior | Fix before next release |
| high | Agent frequently degrades correctness or stability | Fix this sprint |
| medium | Correctness usually survives but output is fragile or wasteful | Plan for next cycle |
| low | Mostly cosmetic or maintainability issues | Backlog |
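The severity levels above give findings a natural ordering for a report. A small sketch of that ranking follows, assuming findings are dictionaries with a `severity` field and an optional `confidence` field as in the report shape. Breaking ties by higher confidence is an assumption of this sketch and is not specified by the skill, and `rank_findings` is a hypothetical helper name.

```python
# Lower number means more urgent, following the severity table.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def rank_findings(findings):
    """Sort findings most severe first; ties go to higher confidence (assumed)."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_ORDER[f["severity"]], -f.get("confidence", 0.0)),
    )
```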
Present findings to the user in this order:
Do not lead with compliments or summaries. If the system is broken, say so directly.
When auditing an agent system, answer these:
| # | Question | If Yes → |
|---|----------|----------|
| 1 | Can the model skip a required tool and still answer? | Tool not code-gated |
| 2 | Does old conversation content appear in new turns? | Memory contamination |
| 3 | Is the same info in system prompt AND memory AND history? | Context duplication |
| 4 | Does the platform run a second LLM pass before delivery? | Hidden repair loop |
| 5 | Does the output differ between internal generation and user delivery? | Rendering corruption |
| 6 | Are "must use tool X" rules only in prompt text? | Tool discipline failure |
| 7 | Can the agent's own monologue become persistent memory? | Memory poisoning |
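Questions 1 and 6 come down to whether a required tool is enforced in code and not only requested in prompt text. The following is a minimal sketch of such a code gate, assuming the runtime keeps a log of the tool calls it actually executed. `ToolGateError`, `require_tools`, `deliver`, and the tool names in the usage note are all hypothetical.

```python
class ToolGateError(RuntimeError):
    """Raised when an answer is produced without the tool calls it requires."""


def require_tools(required, executed_calls):
    """Check that every required tool appears in the executed-call log.

    executed_calls is a list of dicts written by the runtime when a tool
    actually runs (not parsed from model text), e.g. {"name": "search"}.
    """
    executed = {call["name"] for call in executed_calls}
    missing = sorted(set(required) - executed)
    if missing:
        raise ToolGateError(f"required tools not executed: {missing}")


def deliver(answer, required, executed_calls):
    """Release the answer only after the gate passes."""
    require_tools(required, executed_calls)
    return answer
```

With this shape, `deliver("...", ["search"], [])` raises instead of returning, so the model cannot skip the tool and still answer. The gate is only as trustworthy as the call log, which is why the sketch assumes the log is written by the executor and not derived from the model's own claims.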
Audits should produce structured reports following this shape:
```json
{
  "schema_version": "ecc.agent-architecture-audit.report.v1",
  "executive_verdict": {
    "overall_health": "high_risk",
    "primary_failure_mode": "string",
    "most_urgent_fix": "string"
  },
  "scope": {
    "target_name": "string",
    "model_stack": ["string"],
    "layers_to_audit": ["string"]
  },
  "findings": [
    {
      "severity": "critical|high|medium|low",
      "title": "string",
      "mechanism": "string",
      "source_layer": "string",
      "root_cause": "string",
      "evidence_refs": ["file:line"],
      "confidence": 0.0,
      "recommended_fix": "string"
    }
  ],
  "ordered_fix_plan": [
    { "order": 1, "goal": "string", "why_now": "string", "expected_effect": "string" }
  ]
}
```
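A report in this shape can be checked mechanically before it is handed to the user. The sketch below tests only for the keys shown above and for the four severity values. It is not an official validator for the schema, and `check_report` is a hypothetical helper name.

```python
import json

REQUIRED_TOP = (
    "schema_version", "executive_verdict", "scope",
    "findings", "ordered_fix_plan",
)
REQUIRED_FINDING = (
    "severity", "title", "mechanism", "source_layer",
    "root_cause", "evidence_refs", "confidence", "recommended_fix",
)
SEVERITIES = {"critical", "high", "medium", "low"}


def check_report(raw):
    """Return a list of problems found in a JSON report string (empty if none)."""
    report = json.loads(raw)
    problems = [f"missing key: {k}" for k in REQUIRED_TOP if k not in report]
    for i, finding in enumerate(report.get("findings", [])):
        for k in REQUIRED_FINDING:
            if k not in finding:
                problems.append(f"findings[{i}] missing key: {k}")
        if finding.get("severity") not in SEVERITIES:
            problems.append(f"findings[{i}] has unknown severity")
    return problems
```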
- agent-introspection-debugging: Debug agent runtime failures (loops, timeouts, state errors)
- agent-eval: Benchmark agent performance head-to-head
- security-review: Security audit for code and configuration
- autonomous-agent-harness: Set up autonomous agent operations
- agent-harness-construction: Build agent harnesses from scratch