skills/agent-design-review/SKILL.md
Designs, reviews, and iterates on LLM agents and agent-like workflows. Use when asked to "design an agent", "review this agent", "improve our system prompt", "optimize prompts for caching", "improve tool calling", "reduce hallucinated tool calls", "add structured outputs", "decide if this should be multi-agent", "reduce false positives", "tune agent thresholds", or "build evals for this agent". Covers architecture choice, cache-friendly prompt templates, tool and schema design, runtime loops, trust boundaries, and eval-driven iteration.
npx skillsauth add dcramer/peated agent-design-reviewInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Design or review agents by identifying the success contract first, mapping the real execution path second, and changing the smallest layer that is actually causing failures.
Load only what applies:
| Need | Read |
| --- | --- |
| Choose architecture or multi-agent shape | references/principles.md |
| Rewrite prompts or improve cache reuse | references/prompt-and-caching.md |
| Draft or rewrite an actual system prompt | references/system-prompt-templates.md |
| Draft a provider-specific prompt for OpenAI Responses or Anthropic tool use | references/provider-specific-templates.md |
| Improve tool calling, tool schemas, or final outputs | references/tool-and-schema-design.md |
| Draft or fix actual tool schemas | references/tool-schema-examples.md |
| Review loops, approvals, side effects, or trust boundaries | references/runtime-and-guardrails.md |
| Build evals or decide how to iterate | references/evals-and-iteration.md |
| Review classifier, matcher, router, extractor, ranker, or moderation agent | references/classifier-agents.md |
| Need examples of strong and weak output | references/review-examples.md |
Set the task mode first:
design: a new agent or major redesignreview: assess an existing agent and prioritize changesdebug: explain why a current agent is failing and what to change firstThen write a short success contract:
If the user asks only for a prompt rewrite, still check whether retrieval, tools, thresholds, or runtime policy dominate the failures.
Use references/principles.md.
Classify the system before proposing changes:
| Pattern | Use when | Avoid when | | --- | --- | --- | | Deterministic workflow | The task is mostly rule-based or decomposes cleanly in code | The model must explore or use tools adaptively | | Single agent | One prompt plus tools can reliably solve the task in a loop | Prompt complexity or tool overload makes behavior unstable | | Multi-agent system | Distinct roles, tools, or trust boundaries must stay separate | You are adding agents without a measured bottleneck |
Prefer deterministic preprocessing, retrieval, routing, or thresholds before adding more agent autonomy.
Write an execution-path summary that names:
For classifier-style systems, separate deterministic stages from model-driven stages. Do not review only the prompt if code outside the prompt decides most of the behavior.
Inspect the highest-risk layer first:
| Layer | Check | | --- | --- | | Architecture | Is this over-agentized? | | Prompt | Is policy explicit, structured, and stable enough for caching? | | Retrieval | Is the right evidence or candidate set available before the model decides? | | Tools | Are tool interfaces narrow, typed, and easy to choose correctly? | | Output contract | Are actions and state machine-checkable? | | Runtime | Are retries, stop conditions, and fallbacks explicit? | | Boundaries | Are approvals, auth, and trust boundaries enforced outside the prompt? | | Thresholds | Do confidence and automation gates map to real consequences? | | Evals | Can proposed changes be measured? |
Do not default to prompt rewrites if retrieval, thresholds, or post-model guards dominate the failures.
For each finding, include:
If you write a prompt, return a cache-friendly prompt skeleton with clear slots for dynamic inputs rather than an unstructured wall of text. If you write tool schemas, return concrete schema drafts with parameter descriptions, enums, and required fields instead of only high-level advice.
When reviewing or debugging, produce:
When designing, produce:
The work is complete only when:
testing
Iterate on a PR until CI passes. Use when you need to fix CI failures, address review feedback, or continuously push fixes until all checks are green. Automates the feedback-fix-push-wait cycle.
tools
ALWAYS use this skill when creating pull requests — never create a PR directly without it. Follows Sentry conventions for PR titles, descriptions, and issue references. Trigger on any create PR, open PR, submit PR, make PR, push and create PR, or prepare changes for review task.
development
Simplifies and refines code for clarity, consistency, and maintainability while preserving all functionality. Use when asked to "simplify code", "clean up code", "refactor for clarity", "improve readability", or review recently modified code for elegance. Focuses on project-specific best practices.
tools
Use when work should span one or more detached tasks but still behave like one job with a single owner context. TaskFlow is the durable flow substrate under authoring layers like Lobster, ACPX, plugins, or plain code. Keep conditional logic in the caller; use TaskFlow for flow identity, child-task linkage, waiting state, revision-checked mutations, and user-facing emergence.