skills/agent-engineering/SKILL.md
Use this skill when designing, building, optimizing, or debugging AI agents — autonomous systems that use LLMs with tools to accomplish tasks. Triggers when the user asks about agent architecture, prompt engineering for agents, tool use optimization, token efficiency, context window management, agent loops, multi-agent systems, agent reliability, reducing agent cost, making agents faster, agent evaluation, or any discussion of building systems where an LLM orchestrates tool calls to achieve goals. Also triggers when an agent is working but is slow, expensive, unreliable, or producing inconsistent results. Do NOT use for simple single-turn LLM API calls without tool use or autonomy.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add kylejryan/better-code agent-engineering
An agent is an LLM that decides which tools to call, in what order, with what arguments, based on intermediate results — looping until the task is complete. Every loop iteration costs tokens, latency, and money. Every unnecessary tool call, every bloated prompt, every redundant context injection is waste that compounds across thousands of executions.
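The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Step`, `run_agent`, `scripted_decide`, and the `lookup` tool are all hypothetical names, and the model is replaced by a scripted function so the example runs without any API. The iteration cap is an assumption about how one might bound a runaway loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One model decision: either call a tool or finish with an answer."""
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


def run_agent(
    decide: Callable[[List[Tuple[str, str]]], Step],
    tools: Dict[str, Callable[..., str]],
    task: str,
    max_iterations: int = 8,
) -> Tuple[str, int]:
    """Loop until the model returns an answer; return (answer, LLM calls used)."""
    history: List[Tuple[str, str]] = [("task", task)]
    for iteration in range(1, max_iterations + 1):
        step = decide(history)  # one LLM call per iteration
        if step.answer is not None:
            return step.answer, iteration
        result = tools[step.tool](**step.args)  # one tool invocation
        history.append((step.tool, result))  # intermediate result feeds the next decision
    raise RuntimeError(f"no answer after {max_iterations} iterations")


def scripted_decide(history: List[Tuple[str, str]]) -> Step:
    """Stand-in for the LLM: look something up once, then answer."""
    if len(history) == 1:
        return Step(tool="lookup", args={"key": "capital_of_france"})
    return Step(answer=f"The answer is {history[-1][1]}")


TOOLS = {"lookup": lambda key: {"capital_of_france": "Paris"}[key]}

answer, llm_calls = run_agent(scripted_decide, TOOLS, "What is the capital of France?")
print(answer, llm_calls)
```

Returning the iteration count alongside the answer makes the number of LLM calls visible per task, which is the quantity the rest of this skill tries to reduce.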
The engineering discipline is: accomplish the task in the minimum number of LLM calls, with the minimum tokens per call, using the minimum tool invocations, while maintaining reliability. These four objectives are not in tension — wasteful agents are also unreliable agents, because every unnecessary step is another opportunity for the model to hallucinate, lose track of its goal, or choose the wrong tool.
Every agent design decision involves three competing forces:
Reliability: does the agent accomplish the task correctly? This is the constraint, not the optimization target. Establish a reliability floor first (e.g., 95% correct on representative tasks), then optimize cost and speed without dropping below it.
Cost: how many tokens does the agent consume? Token cost = input tokens + output tokens, summed across all LLM calls. Cost scales at least linearly with loop iterations and context size; when each call resends the accumulated history, cumulative input tokens grow faster than linearly with the number of iterations.
Latency: how long does the task take end-to-end? LLM inference time scales with input token count (time to first token) and output token count (generation time). Sequential tool calls add their latencies together; parallel tool calls cost only as much as the slowest call in the batch.
The highest-leverage optimizations improve all three simultaneously: fewer loop iterations means lower cost, lower latency, AND fewer chances to go off-track.
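A rough back-of-the-envelope model makes the cost and latency claims concrete. The sketch below is illustrative only: the function names and every number (2,000-token base prompt, 500 tokens of history added per step, 200 output tokens per call, tool durations in milliseconds) are assumptions chosen for the arithmetic, not measurements.

```python
from typing import List, Tuple


def total_tokens(calls: List[Tuple[int, int]]) -> int:
    """Token cost = input tokens + output tokens, summed across all LLM calls."""
    return sum(inp + out for inp, out in calls)


def growing_context_calls(
    base_input: int, growth_per_step: int, output_per_call: int, iterations: int
) -> List[Tuple[int, int]]:
    """Model a loop in which each call resends the history accumulated so far."""
    return [
        (base_input + i * growth_per_step, output_per_call)
        for i in range(iterations)
    ]


def tool_latency_ms(durations_ms: List[int], parallel: bool) -> int:
    """Sequential calls add up; parallel calls cost as much as the slowest one."""
    if parallel:
        return max(durations_ms, default=0)
    return sum(durations_ms)


ten_steps = total_tokens(growing_context_calls(2000, 500, 200, iterations=10))
five_steps = total_tokens(growing_context_calls(2000, 500, 200, iterations=5))
print(ten_steps, five_steps)  # 44500 16000

tools_ms = [400, 1200, 700]
print(tool_latency_ms(tools_ms, parallel=False))  # 2300
print(tool_latency_ms(tools_ms, parallel=True))   # 1200
```

Under these assumptions, halving the iteration count cuts token cost by well over half (44,500 to 16,000), because the later iterations are the ones carrying the largest context.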
Use this skill when:
| # | Category | Prefix | Impact | Description |
|---|----------|--------|--------|-------------|
| 1 | System Prompt Engineering | prompt | CRITICAL | Dense, structured prompts that earn their tokens on every LLM call |
| 2 | Context Window Management | context | CRITICAL | Control what enters context — less is more when it's more relevant |
| 3 | Tool Design | tool | CRITICAL | Tools are the agent's hands — interface, granularity, and error design |
| 4 | Agent Loop Architecture | loop | HIGH | Loop control, iteration reduction, parallel calls, ReAct vs Plan-and-Execute |
| 5 | Model Selection & Routing | routing | HIGH | Right model for each step — plan with large, execute with medium, classify with small |
| 6 | Multi-Agent Systems | multi | HIGH | Decomposition, coordination patterns, and inter-agent communication |
| 7 | Reliability Engineering | reliability | CRITICAL | Failure modes, guardrails, and preventing runaway agents |
| 8 | Evaluation & Measurement | eval | HIGH | Metrics, benchmarking, and data-driven optimization |
| 9 | Token Optimization | token | HIGH | Concrete techniques for reducing token consumption |
Detailed patterns and examples are in references/. Each file follows the format:
{prefix}-{topic}.md
Access them when you need specific implementation patterns for a category.
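As a small sketch of the naming convention, the helper below builds and parses reference paths from the prefixes in the category table. The function names and the topic `error-design` are hypothetical; the actual files in references/ are not listed here, so treat the example topic as a placeholder.

```python
from pathlib import PurePosixPath
from typing import Tuple

# Prefixes taken from the category table above.
PREFIXES = {
    "prompt", "context", "tool", "loop", "routing",
    "multi", "reliability", "eval", "token",
}


def reference_path(prefix: str, topic: str) -> PurePosixPath:
    """Build references/{prefix}-{topic}.md, rejecting unknown prefixes."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown category prefix: {prefix!r}")
    return PurePosixPath("references") / f"{prefix}-{topic}.md"


def split_reference(filename: str) -> Tuple[str, str]:
    """Split a reference filename back into (prefix, topic)."""
    stem = PurePosixPath(filename).stem
    prefix, _, topic = stem.partition("-")  # the topic itself may contain hyphens
    if prefix not in PREFIXES or not topic:
        raise ValueError(f"not a reference file name: {filename!r}")
    return prefix, topic


path = reference_path("tool", "error-design")  # hypothetical topic
print(path)
print(split_reference(str(path)))
```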
System prompt:
Tool design:
Context management:
Loop architecture:
Reliability:
Evaluation: