dot_claude/skills/prompt-engineering/SKILL.md
Use when designing, testing, or deploying LLM prompts for applications - systematic prompt design methodology (pattern selection, structured output, evaluation, versioning) ensuring every prompt is tested against ground truth before production | LLMプロンプトの設計、テスト、デプロイ時に使用 - 体系的なプロンプト設計手法(パターン選択、構造化出力、評価、バージョン管理)により、すべてのプロンプトが本番前にグランドトゥルースに対してテスト済みであることを保証
npx skillsauth add lv416e/dotfiles prompt-engineeringInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Untested prompts in production are bugs you haven't found yet. Vibes-based prompt tuning is not engineering.
Core principle: EVERY prompt is versioned, tested, and evaluated against ground truth before deployment.
Violating the letter of this process is violating the spirit of LLM engineering.
EVERY PROMPT IS VERSIONED, TESTED, AND EVALUATED AGAINST GROUND TRUTH
If you haven't evaluated it on a test set, it's not ready for production. "It looked good in the playground" is not evaluation.
Use for ANY LLM integration work:
Use this ESPECIALLY when:
Don't skip when:
You MUST complete each phase before proceeding to the next.
BEFORE writing ANY prompt:
Define the Task Precisely
Select the Right Pattern
| Pattern | When to Use | Example | |---------|------------|---------| | Zero-shot | Simple, well-defined tasks | Classification, extraction | | Few-shot | Task needs examples to clarify format/behavior | Structured data extraction, style matching | | Chain-of-thought | Reasoning, analysis, multi-step logic | Math, code review, complex classification | | System/User/Assistant roles | Conversational applications | Chatbots, assistants | | Tool use | Model needs to take actions or access data | API calls, database queries, calculations |
Structure the Prompt
<Good> ```xml <system> You are a code review assistant. Analyze code for bugs, security issues, and style violations.Rules:
Design Output Format
When using Claude models:
XML Tags for Structure
<document>
{{document_content}}
</document>
<instructions>
Summarize the document above in 3 bullet points.
Focus on actionable insights only.
</instructions>
Prefilling for Format Control
Assistant: {"analysis": [
Prompt Caching
Extended Thinking
When building retrieval-augmented generation:
Retrieval Quality First
Context Window Management
<retrieved_documents>
<document index="1" source="{{source_1}}" relevance_score="{{score_1}}">
{{content_1}}
</document>
<document index="2" source="{{source_2}}" relevance_score="{{score_2}}">
{{content_2}}
</document>
</retrieved_documents>
<instructions>
Answer the user's question using ONLY the documents above.
If the answer is not in the documents, say "I don't have enough information."
Cite document numbers for each claim.
</instructions>
Grounding and Attribution
Chunking Strategy
BEFORE deploying ANY prompt:
Build an Evaluation Dataset
Define Metrics
| Task Type | Metrics | |-----------|---------| | Classification | Accuracy, precision, recall, F1 | | Extraction | Exact match, partial match, field-level accuracy | | Generation | LLM-as-judge, human eval, ROUGE/BLEU (limited) | | RAG | Faithfulness, relevance, citation accuracy |
Run Evaluations Systematically
# Every prompt change triggers evaluation
results = evaluate(
prompt=prompt_v2,
dataset=eval_dataset,
metrics=[accuracy, faithfulness, latency],
)
# Compare against previous version
assert results.accuracy >= baseline.accuracy - REGRESSION_THRESHOLD
assert results.faithfulness >= 0.95
Test for Failure Modes
LLM-as-Judge for Generation Quality
Every prompt in production follows these rules:
Version Control
prompts/
├── code-review/
│ ├── v1.0.0.txt # Initial version
│ ├── v1.1.0.txt # Added severity classification
│ ├── v2.0.0.txt # Restructured for tool use
│ ├── eval_dataset.jsonl # Test cases
│ └── CHANGELOG.md # What changed and why
A/B Testing
Cost Optimization
Security
If you catch yourself thinking:
ALL of these mean: STOP. Return to Phase 1.
| Excuse | Reality | |--------|---------| | "Works in the playground" | Playground tests 3-5 cases. Production sees thousands of edge cases. | | "Simple prompt, no eval needed" | Simple prompts fail on edge cases you haven't imagined. Evaluate. | | "We'll add tests later" | Later means after the first production incident. Test now. | | "Same prompt works across models" | Models have different behaviors. Re-evaluate on every model change. | | "Just add more few-shot examples" | More examples without evaluation is guess-and-check. Measure first. | | "Users won't try to break it" | Users will absolutely try to break it. Test adversarial inputs. | | "Cost doesn't matter" | Cost scales with traffic. A 2x token reduction saves thousands. | | "Bigger model fixes everything" | Bigger model with a bad prompt is still bad. Fix the prompt. | | "LLM evaluation is unreliable" | LLM-as-judge with good rubrics correlates well with human eval. Calibrate it. | | "Prompt engineering isn't real engineering" | Untested prompts are untested code. Same discipline applies. |
| Anti-Pattern | Consequence | Correct Approach | |-------------|-------------|-----------------| | Untested prompts in production | Silent failures, inconsistent outputs, user complaints | Evaluation dataset, automated testing | | No evaluation metrics | Can't measure improvement, can't detect regression | Define metrics per task type, track over time | | Prompt injection vulnerabilities | Data leaks, unauthorized actions, system prompt exposure | Input sanitization, output validation, adversarial testing | | Vibes-based tuning | Fixes one case, breaks three others | Systematic evaluation, regression testing | | No versioning | Can't rollback, can't compare, can't reproduce | Version control prompts like code | | Model coupling | Prompt breaks on model update or migration | Test across model versions, abstract model-specific syntax |
| Phase | Key Activities | Success Criteria | |-------|---------------|------------------| | 1. Design | Define task, select pattern, structure prompt, design output | Clear prompt with explicit constraints | | 2. Anthropic | XML tags, prefilling, caching, extended thinking | Model-specific optimizations applied | | 3. RAG | Retrieval testing, context management, grounding | Faithful, cited, hallucination-resistant | | 4. Evaluation | Build eval set, define metrics, test failure modes | Meets accuracy targets, handles edge cases | | 5. Operations | Version, A/B test, optimize cost, secure | Versioned, monitored, cost-efficient, secure |
Before deploying any prompt to production:
Can't check all boxes? You're not ready to deploy.
This skill requires using:
Complementary skills:
No eval dataset → no production deployment
No metrics → no "improvement"
No version control → no prompt changes
Design. Test. Evaluate. Version. Deploy. Monitor. In that order. Always.
development
Use this skill any time a spreadsheet file is the primary input or output. This means any task where the user wants to: open, read, edit, or fix an existing .xlsx, .xlsm, .csv, or .tsv file (e.g., adding columns, computing formulas, formatting, charting, cleaning messy data); create a new spreadsheet from scratch or from other data sources; or convert between tabular file formats. Trigger especially when the user references a spreadsheet file by name or path — even casually (like "the xlsx in my downloads") — and wants something done to it or produced from it. Also trigger for cleaning or restructuring messy tabular data files (malformed rows, misplaced headers, junk data) into proper spreadsheets. The deliverable must be a spreadsheet file. Do NOT trigger when the primary deliverable is a Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration, even if tabular data is involved.
testing
Use when creating new skills, editing existing skills, or verifying skills work before deployment - applies TDD to process documentation by testing with subagents before writing, iterating until bulletproof against rationalization | 新しいスキルの作成、既存スキルの編集、またはデプロイ前にスキルが機能するか検証する際に使用 - プロセスドキュメントにTDDを適用し、記述前にサブエージェントでテストし、合理化に対して堅牢になるまで反復
development
Use when design is complete and you need detailed implementation tasks for engineers with zero codebase context - creates comprehensive implementation plans with exact file paths, complete code examples, and verification steps assuming engineer has minimal domain knowledge | 設計が完了し、コードベースの知識がゼロのエンジニア向けに詳細な実装タスクが必要な場合に使用 - 正確なファイルパス、完全なコード例、検証ステップを含む包括的な実装計画を作成。エンジニアの領域知識が最小限であることを前提
tools
Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.