skills/prompt-engineering/SKILL.md
Use this skill when crafting, reviewing, or improving prompts for LLM pipelines — including task prompts, system prompts, and LLM-as-Judge prompts. Triggers include: requests to write or refine a prompt, diagnose why an LLM produces inconsistent or incorrect outputs, bridge the gap between intent and model behavior, reduce ambiguity in instructions, add few-shot examples, structure complex prompts, or improve output formatting. Also use when the user needs help distinguishing specification failures (unclear instructions) from generalization failures (model limitations), or when iterating on prompts based on observed failure modes. Do NOT use for general coding tasks, document creation, or non-LLM writing.
Effective prompt engineering is fundamentally about closing two gaps between human intent and model behavior. Understanding which gap you're dealing with determines whether prompt refinement will actually solve your problem.
The first gap is one of specification: it separates what you mean from what you actually wrote in the prompt. Your intent (the task you want the LLM to perform) is often only loosely captured by the words you write. Specifying tasks precisely in natural language is surprisingly hard.
Even prompts that seem clear often leave crucial details unstated. For example:
"Extract the sender's name and summarize the key requests in this email."
This sounds specific, but critical questions are left unanswered:
Without complete instructions, the model is forced to guess your true intent, producing inconsistent outputs. Underspecified prompts are usually a direct result of not looking at real data — you don't know what edge cases and ambiguities exist until you see them.
Key insight: Prompt clarity often matters as much as task complexity. Many teams rush to build evaluators for preferences they never specified in the prompt (like concise responses or a specific structure). The better approach is to first include such instructions explicitly, and only create an evaluator if the LLM still fails to follow them.
The second gap, the Gulf of Generalization, separates your data from the model's actual behavior across diverse inputs. Even when prompts are carefully written, LLMs may behave inconsistently on different inputs.
Example: An email processing pipeline might encounter an email mentioning a public figure like "Elon Musk" in the body. The model might mistakenly extract that name as the sender, even though it's unrelated to the actual email metadata. This is not a prompting error; it is a generalization failure, where the model misapplies clear instructions to an unusual input.
The Gulf of Generalization will always exist to some degree. No model will ever be perfectly accurate on all inputs.
Fix specification first, then measure generalization. There are two reasons:
Efficiency: Many specification failures can be resolved rapidly by simply adding clarity or detail to an existing prompt. It's wasteful to build an automated evaluator for a failure mode that a prompt edit would fix.
Measurement validity: You want evaluations to reflect the LLM's ability to generalize from clear instructions, not its capacity to decipher your ambiguous intent. Evaluating poorly specified tasks essentially measures how well the LLM can "read your mind," which isn't reliable.
Decision framework when you see a failure:
A well-structured prompt typically includes several key pieces. Not every prompt needs all of them, but knowing the full toolkit helps you decide what's needed.
Clearly define the persona or role the LLM should adopt and its overall goal. This sets the stage for desired behavior and helps guide tone and reasoning style, especially for open-ended tasks.
You are an expert technical writer tasked with explaining complex AI concepts to a non-technical audience.
You are a careful tax advisor reviewing client filings for potential issues.
This is the core component. Provide clear, specific, and unambiguous directives. Modern models interpret instructions literally, so be explicit about what to do and what not to do.
Use bullet points or numbered lists for multiple instructions. For complex instruction sets, break them into sub-categories.
### Task
Summarize the following research paper abstract.
### Constraints
- The summary must be exactly three sentences long.
- Avoid using technical jargon above a high-school reading level.
- Do not include any personal opinions or interpretations.
### Tone and Style
- Use active voice.
- Write for a general audience.
The relevant background information, data, or text the LLM needs. This could be a customer email, a document to summarize, a code snippet to debug, or user dialogue history.
When providing multiple documents or long context, clear delimiters are crucial (see Component 7).
<customer_email>
[Insert the full text of the customer email here]
</customer_email>
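As a minimal sketch of delimiting context, the helper below wraps each piece of input in a named XML-style tag before it goes into the prompt. The function names `wrap_in_tag` and `build_context`, and the `order_history` tag, are illustrative assumptions and not part of any library.

```python
from __future__ import annotations


def wrap_in_tag(tag: str, content: str) -> str:
    """Wrap content in an XML-style tag so its start and end are unambiguous."""
    return f"<{tag}>\n{content.strip()}\n</{tag}>"


def build_context(documents: dict[str, str]) -> str:
    """Join several delimited documents, one tag per document, in insertion order."""
    return "\n\n".join(wrap_in_tag(tag, text) for tag, text in documents.items())


# Hypothetical inputs, for illustration only.
context = build_context({
    "customer_email": "Hi, my order arrived damaged. Can I get a refund?",
    "order_history": "Order 1: delivered. Order 2: delivered, reported damaged.",
})
print(context)
```

Giving each document its own tag lets the instructions refer to it by name (for example, "summarize the text inside customer_email").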
Provide one or more examples of desired input-output pairs. This is highly effective for guiding the model towards the correct format, style, and level of detail. Examples can also clarify nuanced instructions or demonstrate complex tool usage.
Critical rule: Ensure that any important behavior demonstrated in your examples is also explicitly stated in your rules/instructions. Examples illustrate; rules specify.
### Example
Input email:
"Hi team, can we move the Thursday standup to 2pm? Also, please review the Q3 deck before Friday."
Output:
{
  "sender": "Unknown (no signature)",
  "requests": [
    "Reschedule Thursday standup to 2pm",
    "Review Q3 deck before Friday"
  ],
  "urgency": "medium"
}
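One way to keep few-shot examples consistent is to store them as data and render them into the prompt. The sketch below assumes a hypothetical `EXAMPLES` list and `format_examples` helper; the example content is copied from the email example above.

```python
import json

# Labeled examples stored as data; in practice, draw these from real inputs.
EXAMPLES = [
    {
        "email": (
            "Hi team, can we move the Thursday standup to 2pm? "
            "Also, please review the Q3 deck before Friday."
        ),
        "output": {
            "sender": "Unknown (no signature)",
            "requests": [
                "Reschedule Thursday standup to 2pm",
                "Review Q3 deck before Friday",
            ],
            "urgency": "medium",
        },
    },
]


def format_examples(examples):
    """Render each example as an input/output block for inclusion in a prompt."""
    blocks = []
    for ex in examples:
        email = ex["email"]
        # json.dumps guarantees the demonstrated output is valid JSON.
        output = json.dumps(ex["output"], indent=2)
        blocks.append(f'### Example\nInput email:\n"{email}"\nOutput:\n{output}')
    return "\n\n".join(blocks)


few_shot = format_examples(EXAMPLES)
print(few_shot)
```

Serializing the expected output with `json.dumps` avoids hand-written examples that silently contain invalid JSON, which the model might then imitate.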
For complex problems, instruct the model to think step-by-step or outline a specific reasoning process. This encourages the model to break down the problem and often leads to more accurate outputs.
Before generating the summary, first identify the main hypothesis, then list the key supporting evidence, and finally explain the primary conclusion. Then, write the summary.
Explicitly define the desired structure, format, or constraints for the response. This is critical for programmatic use of the output.
Respond using only JSON format with the following keys:
- sender_name (string)
- main_issue (string)
- suggested_action_items (array of strings)
Ensure your response is a single paragraph and ends with a question to the user.
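When the output is consumed by code, it helps to check that the response actually follows the requested format. The following is a minimal sketch for the JSON keys in the first example above; `parse_response` is a hypothetical helper name, and the strictness (rejecting extra keys) is an assumption you may want to relax.

```python
import json

# Keys and types taken from the JSON format instruction above.
EXPECTED = {
    "sender_name": str,
    "main_issue": str,
    "suggested_action_items": list,
}


def parse_response(raw: str) -> dict:
    """Parse a model response; raise ValueError if it breaks the requested format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"response is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if set(data) != set(EXPECTED):
        raise ValueError(f"expected keys {sorted(EXPECTED)}, got {sorted(data)}")
    for key, expected_type in EXPECTED.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} must be of type {expected_type.__name__}")
    if not all(isinstance(item, str) for item in data["suggested_action_items"]):
        raise ValueError("suggested_action_items must contain only strings")
    return data
```

Responses that fail this check are useful to log: repeated format violations usually point back to an unclear format instruction.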
Use clear delimiters (Markdown headers, triple backticks, XML tags) to separate different parts of your prompt. This helps the model understand distinct components, especially in long or complex prompts.
Recommended organization for complex prompts:
For cache efficiency: Place static instructions before any user-provided or changing data. Where the serving stack supports prefix caching, this lets more of the KV cache be reused across requests, which can reduce inference cost.
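The ordering advice can be sketched as a prompt builder that keeps everything static in one constant and appends the changing data last. `STATIC_PREFIX` and `build_prompt` are hypothetical names, and whether a shared prefix is actually cached depends on the provider or serving stack, which this sketch does not model.

```python
import os

# Static part: identical for every request, so it forms a shared prefix.
STATIC_PREFIX = (
    "You are an assistant that extracts requests from customer emails.\n"
    "Respond using only JSON with the keys sender_name and requests."
)


def build_prompt(email_text: str) -> str:
    """Append the per-request email after the static instructions."""
    return f"{STATIC_PREFIX}\n\n<customer_email>\n{email_text}\n</customer_email>"


prompt_a = build_prompt("Please cancel my order.")
prompt_b = build_prompt("Where is my invoice?")

# The two prompts share at least the whole static prefix, character for character.
shared = os.path.commonprefix([prompt_a, prompt_b])
print(shared.startswith(STATIC_PREFIX))  # prints True
```

If the email were placed first instead, the two prompts would diverge at the first character and share no reusable prefix.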
Finding the perfect prompt is rarely immediate. It's an iterative cycle:
Avoid automated prompt-writing and optimization tools in the early stages of development. Writing the prompt yourself forces you to externalize your specification and clarify your thinking. People who delegate prompt writing to a black box too aggressively struggle to fully understand their failure modes. After you have experience looking at your data and understanding failures, you can introduce these tools — but do so carefully.
When a prompt isn't working well, try these low-effort, high-impact changes first:
Clarify ambiguous wording. If the model gets confused about phrasing (e.g., "West Berkeley" vs. "Berkeley West"), update the prompt to be more explicit.
Add a few examples. Include 2–3 representative input/output pairs targeting observed failure cases.
Use role-based guidance. A persona like "You are a careful tax advisor..." can guide tone and reasoning, especially for open-ended tasks.
Ask for step-by-step reasoning. For tasks involving logic or multiple steps, explicitly asking the model to "think step by step" often improves correctness and completeness.
Specify what NOT to do. Often, failures come from the model doing something you didn't want but also didn't explicitly prohibit.
Break complex tasks into subtasks. Instead of one massive prompt, decompose into sequential steps (extract → filter → summarize → format).
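The decomposition in the last item can be sketched as a chain of small prompts, where each step's output becomes the next step's input. Everything here is an assumption for illustration: the step wording, the `run_chain` helper, and the `call_llm` parameter, which stands in for whatever function sends a prompt to your model and returns its text.

```python
from typing import Callable, List, Tuple

# Each step is (name, prompt template). The wording is illustrative only.
STEPS: List[Tuple[str, str]] = [
    ("extract", "List every request in the text, one per line.\n\n<input>\n{input}\n</input>"),
    ("filter", "Keep only the requests that need a reply, one per line.\n\n<input>\n{input}\n</input>"),
    ("summarize", "Summarize the remaining requests in two sentences.\n\n<input>\n{input}\n</input>"),
    ("format", "Rewrite the summary as a short bulleted list.\n\n<input>\n{input}\n</input>"),
]


def run_chain(
    text: str,
    call_llm: Callable[[str], str],
    steps: List[Tuple[str, str]] = STEPS,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run the steps in order and return the final output plus a per-step trace."""
    trace = []
    current = text
    for name, template in steps:
        prompt = template.format(input=current)
        current = call_llm(prompt)
        # Keeping intermediate outputs makes it possible to see which step failed.
        trace.append((name, current))
    return current, trace
```

The trace is the main benefit of this design: when the final output is wrong, you can inspect each intermediate result and edit only the prompt for the step that broke.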
When building automated evaluators that use an LLM to judge outputs, the same principles apply — but with additional structure. Each evaluator should target a single failure mode with a binary Pass/Fail judgment.
Clear task and evaluation criterion. Focus on one well-scoped failure mode. Vague criteria lead to unreliable judgments. Instead of asking whether an email is "good," ask whether "the tone is appropriate for a luxury buyer persona."
Precise Pass/Fail definitions. Define exactly what counts as Pass (failure absent) and Fail (failure present), grounded in your observed failure descriptions.
Few-shot examples. Include labeled outputs that clearly Pass and clearly Fail. Draw these from human-labeled traces when possible. If using finer-grained scales (e.g., 1–3 severity), include examples for every point on the scale.
Structured output format. The judge should respond in a consistent, machine-readable format — typically JSON with reasoning (1–2 sentence explanation) and answer ("Pass" or "Fail").
You are an expert evaluator assessing outputs from a real estate assistant chatbot.
Your Task: Determine if the assistant-generated email to a client uses a tone appropriate for the specified client persona.
Evaluation Criterion: Tone Appropriateness
Definition of Pass/Fail:
- Fail: The email's tone, language, or level of formality is inconsistent with or unsuitable for the described client persona.
- Pass: The email's tone, language, and formality align well with the client persona's expectations.
Client Personas Overview:
- Luxury Buyers: Expect polished, highly professional, and deferential language. Avoid slang or excessive casualness.
- First-Time Homebuyers: Benefit from a friendly, reassuring, and patient tone. Avoid overly complex jargon.
- Investors: Prefer concise, data-driven, and direct communication. Avoid effusiveness.
Output Format: Return your evaluation as a JSON object with two keys:
1. reasoning: A brief explanation (1-2 sentences) for your decision.
2. answer: Either "Pass" or "Fail".
Examples:
---
Input:
Client Persona: Luxury Buyer
Generated Email: "Hey there! Got some awesome listings for you. Super views, totally posh. Wanna check 'em out?"
Evaluation: {"reasoning": "Uses excessive slang and an overly casual tone unsuitable for a Luxury Buyer persona.", "answer": "Fail"}
---
Input:
Client Persona: First-Time Homebuyer
Generated Email: "Good morning! I've found a few properties that seem like a great fit for getting started in the market, keeping your budget in mind."
Evaluation: {"reasoning": "The tone is friendly, reassuring, and avoids jargon — appropriate for a first-time homebuyer.", "answer": "Pass"}
---
Now evaluate the following:
Client Persona: {persona}
Generated Email: {email}
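A judge prompt like the one above still needs code to fill in the placeholders and read the verdict. The sketch below uses a shortened stand-in template; `render_judge_prompt` and `parse_verdict` are hypothetical helper names. Because the template contains literal JSON braces in its examples, the sketch substitutes only the two named placeholders rather than using `str.format`.

```python
import json
import re

# Shortened stand-in for the full judge prompt; the literal JSON braces
# in the example are why str.format is avoided here.
JUDGE_TEMPLATE = """Determine if the email's tone fits the client persona.
Return a JSON object with two keys: reasoning and answer ("Pass" or "Fail").

Example evaluation: {"reasoning": "Too casual for a Luxury Buyer.", "answer": "Fail"}

Now evaluate the following:
Client Persona: {persona}
Generated Email: {email}"""


def render_judge_prompt(persona: str, email: str) -> str:
    """Fill {persona} and {email} in a single pass, leaving other braces alone."""
    values = {"persona": persona, "email": email}
    return re.sub(r"\{(persona|email)\}", lambda m: values[m.group(1)], JUDGE_TEMPLATE)


def parse_verdict(raw: str) -> bool:
    """Return True for Pass and False for Fail; raise ValueError otherwise."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge response must be a JSON object")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    answer = data.get("answer")
    if answer not in ("Pass", "Fail"):
        raise ValueError(f"answer must be 'Pass' or 'Fail', got {answer!r}")
    return answer == "Pass"
```

Rejecting anything other than the two exact labels keeps malformed judge outputs from being silently counted as passes or failures.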
When asked to help with a prompt, follow this process:
When writing prompts from scratch, start with Components 1 (Role), 2 (Instructions), and 6 (Output Format) as the minimum viable prompt, then layer in Context, Examples, Reasoning Steps, and Delimiters as complexity demands.
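The layering described above can be sketched as a builder that requires the three minimum components and accepts the rest as optional. `build_prompt` is a hypothetical helper, and the section order and header names are assumptions chosen so that the changing context comes last.

```python
from typing import Optional, Sequence


def build_prompt(
    role: str,
    instructions: Sequence[str],
    output_format: str,
    examples: Optional[str] = None,
    reasoning_steps: Optional[str] = None,
    context: Optional[str] = None,
) -> str:
    """Assemble a prompt from the required components plus any optional ones."""
    sections = [
        role,
        "### Instructions\n" + "\n".join(f"- {item}" for item in instructions),
    ]
    if examples:
        sections.append("### Examples\n" + examples)
    if reasoning_steps:
        sections.append("### Reasoning Steps\n" + reasoning_steps)
    sections.append("### Output Format\n" + output_format)
    if context:
        # Changing data goes last, after all static sections.
        sections.append("<context>\n" + context + "\n</context>")
    return "\n\n".join(sections)


minimal = build_prompt(
    role="You are a careful tax advisor reviewing client filings for potential issues.",
    instructions=["Flag missing forms.", "Do not give legal advice."],
    output_format="Respond using only JSON with the key issues (array of strings).",
)
print(minimal)
```

Starting from the minimal call and adding one optional argument at a time makes it easier to attribute a change in model behavior to a specific component.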