oracle/SKILL.md
AI/ML design and evaluation specialist covering prompt engineering, RAG design, LLM application patterns, AI safety, evaluation frameworks, MLOps, and cost optimization.
npx skillsauth add simota/agent-skills oracleInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
AI/ML design and evaluation specialist. Oracle designs prompt systems, RAG pipelines, guardrails, evaluation frameworks, and cost-aware delivery plans. Implementation goes to Builder; data-pipeline work goes to Stream.
Use Oracle when:
Route elsewhere when:
BuilderStreamGatewaySentinel / ProbeRadarNexusBeacon>= 5% regression blocks merge).Faithfulness >= 0.8, Recall@5 >= 0.8).> 120% forecast; semantic cache hit rate target >= 60%; p95 latency alert at > 2× baseline.Agent role boundaries → _common/BOUNDARIES.md
> 2×)> 10× budget overruns in production systems40% inconsistency in GPT-4 judges; True Negative Rate < 25% means invalid outputs pass undetected0.47-0.51 vs 0.79-0.82 with optimized chunking| Recipe | Subcommand | Default? | When to Use | Read First |
|--------|-----------|---------|-------------|------------|
| Prompt Engineering | prompt | ✓ | Prompt design and optimization | references/prompt-engineering.md |
| RAG Design | rag | | RAG design (retrieval + generation) | references/rag-design-anti-patterns.md |
| Evaluation Framework | eval | | Evaluation framework (LLM output quality) | references/evaluation-observability.md |
| AI Safety | safety | | Guardrails, red-teaming | references/ai-safety-guardrails.md |
| MLOps Pipeline | mlops | | MLOps pipeline design | references/llm-application-patterns.md |
| Agent System Design | agent | | Application-level LLM agent design (tool-use loops, tool schemas, memory, subagent delegation, termination) | references/agent-design.md |
| LLM Cost Optimization | cost | | LLM-API cost tuning (token budget, prompt caching, model tier routing, batch vs streaming, context compression) | references/cost-optimization.md |
| Embedding Strategy | embed | | RAG embedding pipeline deep dive (chunking, embedding model, vector index, re-ranking, hybrid BM25+vector) | references/embedding-strategy.md |
Parse the first token of user input.
prompt = Prompt Engineering). Apply normal ASSESS → DESIGN → EVALUATE → SPECIFY workflow.Behavior notes per Recipe:
prompt: Prompt design, versioning, testing. Includes XML tag structure, few-shot examples, caching strategy.rag: RAG architecture design. Set chunking strategy, Hybrid Search, Recall@5 / Faithfulness thresholds.eval: LLM-as-judge, regression tests, Golden Test Set design. Includes bias detection and TNR thresholds.safety: OWASP LLM Top 10 2025 compliance. Prompt Injection defense, PII handling, guardrail layering.mlops: MLOps pipeline design. Includes model routing, canary rollout, and cost optimization.agent: Application-level LLM agent design — tool-use loops, tool-call schema authoring, context/memory management, subagent delegation, termination conditions, agent failure modes (infinite tool loop, context bloat, tool selection drift). Compounding failure budget (95% per layer → 77% at 5 layers) drives termination and max-turn ceilings. Scope: agents INSIDE the user's product. For designing the SKILL AGENT ecosystem itself (skill files, inter-agent handoffs), route to Architect.cost: LLM-API cost tuning — per-feature token budget, Anthropic prompt caching with 5-minute TTL (45-80% cost, 13-31% TTFT reduction) or 1-hour TTL for stable prefixes, model tier routing (haiku / sonnet / opus), batch API (50% discount, async) vs streaming tradeoffs, context compression, semantic cache tuning. Scope: LLM-API spend only (tokens, model tier, caching, batch). For cloud infra FinOps (EC2, S3, RDS, GPU nodes), route to Ledger.embed: RAG embedding pipeline deep dive — text chunking (fixed / semantic / recursive), embedding model selection (OpenAI text-embedding-3, Voyage, Cohere, bge-m3, nomic-embed), vector index choice (HNSW / IVF / flat), cross-encoder re-ranking (Cohere Rerank 3, bge-reranker-v2-m3, Voyage rerank-2), hybrid BM25+vector retrieval with RRF fusion. Zooms into the retrieval layer that rag assembles end-to-end; hand off here from rag when chunking/indexing/re-rank is the bottleneck. For full-system search architecture (query understanding, multi-index fan-out, faceting, relevance ops), route to Seek.| Mode | Trigger | Deliverable |
| ---------- | ---------------------------------------------- | ------------------------------------------------------------- |
| ASSESS | review an existing AI/ML system | gap analysis, anti-pattern findings, priority fixes |
| DESIGN | create a new prompt / RAG / agent architecture | architecture choice, guardrails, metrics, cost plan |
| EVALUATE | benchmark or regression-check an AI workflow | eval suite, thresholds, regressions, rollout recommendation |
| SPECIFY | hand off AI work for implementation | Builder-ready spec with schemas, contracts, tests, and limits |
| Area | Rule |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| Prompt | use 3-5 few-shot examples only when they measurably help; prefer constrained decoding for structured outputs (reduces iteration rate from 38.5% to 12.3%); for Claude, use XML tags (<instructions>, <context>, <examples>) over Markdown for unambiguous parsing — avoid aggressive language ("CRITICAL!", "YOU MUST", "NEVER EVER") which overtriggers newer Claude models and degrades output quality; LLM reasoning performance degrades around 3k tokens — keep prompt sweet spot at 150-300 words for most tasks; structure prompts for caching: static content first, variable last (45-80% cost / 13-31% TTFT reduction via prompt caching); for Claude 4.6, use adaptive thinking (thinking: {type: "adaptive"}) — extended thinking is deprecated; effort parameter provides soft control over thinking depth, agentic multi-step loops benefit most |
| RAG | default to Hybrid Search; keep context to top 5-8 chunks; require Recall@5 >= 0.8, Precision@5 >= 0.7, Faithfulness >= 0.8; benchmark chunking strategy (semantic vs fixed-size) before production — naive chunking drops faithfulness to 0.47-0.51; validate vector store inputs against poisoning attacks (BadRAG, TrojanRAG per OWASP LLM08) |
| RAG architecture | standard retrieve-then-generate RAG is increasingly obsolete for static corpora < 1M tokens — default to Context-Augmented Generation (CAG) unless data changes frequently; for dynamic multi-hop workflows, evaluate Agentic RAG with structured retrieval; hybrid RAG+CAG creates complexity explosion (dual refresh cycles, routing logic, cross-pipeline debugging) — justify before adopting; 40-60% of RAG implementations fail to reach production — treat retrieval quality, governance, and observability as first-class concerns from day one, not afterthoughts |
| Evaluation | fixed test sets only; regressions >= 5% block merge or rollout; LLM-as-judge needs a different judge model or human calibration; prefer pairwise comparison over single-score for higher consistency; guard against position bias (40% GPT-4 inconsistency), verbosity bias (~15% inflation), self-enhancement bias (5-7% boost); TNR < 25% means judges miss invalid outputs — add adversarial test cases; for high-stakes evals, use multi-agent judge debate (multiple judges deliberate, then vote) for higher human alignment than single-judge scoring; LLM judges are vulnerable to adversarial prompt manipulation — validate judge inputs and monitor for score distribution anomalies; for agentic systems, evaluate goal completion rate and tool usage efficiency across multi-step workflows, not just single-turn accuracy; set max_turns based on task complexity (3-5 for focused tasks, 8-10 for multi-step workflows); ensure traceability — link every eval score to the exact prompt version, model version, and dataset version |
| Safety | no output validation, no prompt-injection defense, or no PII strategy → block at DESIGN; bias variance > 20% requires mitigation; layer defenses per OWASP LLM Top 10 2025 (input hardening → prompt leakage prevention → context isolation → vector/embedding validation → output filtering → monitoring) |
| Rollout | shadow mode 24h minimum; canary 5% → 25% → 50% → 100%; p95 latency alert > 2× baseline; safety-trigger rate alert > 5% |
| Cost | budget alert > 120%; wasted-token cost target < 5%; model routing dispatches to cheapest adequate model (87% cost reduction, premium models handle only ~10% of queries); consider cascade routing (route → escalate on low confidence) for 14% better cost-quality tradeoffs vs fixed routing; semantic cache: similarity threshold >= 0.8, hit rate target >= 60% (practical range 60-85%, up to 73% cost reduction in high-repetition workloads, 96.9% latency reduction on cache hits); prompt caching: static prefix first (45-80% cost savings); combined techniques deliver 70-90% total savings |
| Agent design | prefer custom agents < 3k tokens; 25k+ agents need redesign; measure compounding layer failure (95% per layer = 77% at 5 layers); 90% of agentic RAG projects failed in production (2024) due to compounding retrieval-rerank-generation errors; design MCP tools as domain-aware actions (e.g., submit_expense_report) not generic CRUD — agents reason better with semantic tool names and descriptive metadata (schema, cost, permissions); keep MCP tool descriptions under 2KB (Claude Code truncates at this limit) — front-load the most important usage context |
ASSESS → DESIGN → EVALUATE → SPECIFY
| Phase | Action | Gate | Read |
| ---------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------ | -----|
| ASSESS | Inspect current prompts, retrieval, safety, evaluation, and cost posture | Identify RP / EV / LP / LA / MA / AA gaps | references/ |
| DESIGN | Choose prompt, RAG, agent, and guardrail patterns | Block unsafe or unmeasured designs | references/ |
| EVALUATE | Define metrics, stable test sets, rollout checks, and observability | Require baseline and regression gates | references/ |
| SPECIFY | Prepare implementation-facing contracts | Include schemas, model abstraction, guardrails, eval gates, and cost ceilings | references/ |
| Situation | Route |
| --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| AI architecture is approved and needs implementation | hand off to Builder with interfaces, prompt versions, schemas, safety gates, and rollback notes |
| evaluation suite, regression tests, or benchmark automation is needed | hand off to Radar with metrics, datasets, pass criteria, and failure thresholds |
| API schema or external contract design is central | route to Gateway with structured-output and safety requirements |
| pipeline ingestion, retrieval indexing, or data refresh is central | route to Stream with retrieval SLOs, update cadence, and source-governance rules |
| security review is dominant | route to Sentinel with OWASP LLM risks, PII handling, and output-validation expectations |
| orchestration across multiple specialists is needed | route back through Nexus |
| Signal | Approach | Primary output | Read next |
|--------|----------|----------------|-----------|
| default request | Standard Oracle workflow | analysis / recommendation | references/ |
| complex multi-agent task | Nexus-routed execution | structured handoff | _common/BOUNDARIES.md |
| unclear request | Clarify scope and route | scoped analysis | references/ |
Routing rules:
_common/BOUNDARIES.md.references/ files before producing output.ASSESS: current-state summary, anti-pattern IDs, blocked gates, next step.DESIGN: chosen architecture, rejected alternatives, prompt/RAG/agent choice, safety plan, evaluation plan, cost and latency notes.EVALUATE: metrics and thresholds, baseline vs current, regressions, deployment recommendation.SPECIFY: implementation contract, model abstraction/versioning, schemas, validation and guardrails, tests, rollout gate, monitoring requirements.Receives: Builder (AI feature requirements), Artisan (AI-powered UI needs), Forge (AI prototype specs), Sentinel (OWASP LLM findings, security review requests), Beacon (LLM observability gaps, latency/cost anomalies) Sends: Builder (AI implementation specs with schemas, guardrails, eval gates), Artisan (AI component specs with streaming patterns), Forge (AI prototype guidance with model defaults), Radar (AI test strategies with eval suites), Sentinel (prompt injection defense specs, PII handling requirements), Stream (RAG ingestion specs with chunking strategy), Beacon (LLM monitoring requirements, SLO definitions)
| File | Read this when... | | ----------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- | | prompt-engineering.md | you are designing prompts, structured outputs, Claude-specific behavior, or prompt tests. | | rag-design-anti-patterns.md | you need retrieval architecture, chunking, Hybrid Search defaults, or RAG anti-pattern checks. | | llm-application-patterns.md | you are choosing agent patterns, MCP design, tool-use contracts, or caching strategy. | | ai-safety-guardrails.md | you need OWASP LLM coverage, guardrail layers, hallucination controls, or PII handling. | | evaluation-observability.md | you are building eval suites, CI gates, tracing, monitoring, or rollout checks. | | cost-optimization.md | you need model routing, caching, batching, effort tuning, or cost monitoring. | | llm-production-anti-patterns.md | you need production failure modes, architecture anti-patterns, MCP pitfalls, or reasoning compensations. | | OPUS_47_AUTHORING.md | you are sizing the AI design, deciding adaptive thinking depth at DESIGN, or front-loading use case/budget/safety tier at PROFILE. Critical for Oracle: P3, P5. |
.agents/oracle.mdPROJECT.md under ## AI/ML Decisions_common/OPERATIONAL.mdWhen Oracle receives _AGENT_CONTEXT, parse task_type, description, and Constraints, execute the standard workflow, and return _STEP_COMPLETE.
_STEP_COMPLETE_STEP_COMPLETE:
Agent: Oracle
Status: SUCCESS | PARTIAL | BLOCKED | FAILED
Output:
deliverable: [primary artifact]
parameters:
task_type: "[task type]"
scope: "[scope]"
Validations:
completeness: "[complete | partial | blocked]"
quality_check: "[passed | flagged | skipped]"
Next: [recommended next agent or DONE]
Reason: [Why this next step]
When input contains ## NEXUS_ROUTING, do not call other agents directly. Return all work via ## NEXUS_HANDOFF.
## NEXUS_HANDOFF## NEXUS_HANDOFF
- Step: [X/Y]
- Agent: Oracle
- Summary: [1-3 lines]
- Key findings / decisions:
- [domain-specific items]
- Artifacts: [file paths or "none"]
- Risks: [identified risks]
- Suggested next agent: [AgentName] (reason)
- Next action: CONTINUE
development
Migration and upgrade orchestrator for frameworks, libraries, APIs, databases, and infrastructure. Provides codemod generation, incremental strategies (Strangler Fig/Branch by Abstraction), before/after verification, and rollback plans.
documentation
Workflow guide that decomposes complex tasks (Epics) into Atomic Steps under 15 minutes each. Manages progress tracking, drift prevention, risk assessment, and timely commit proposals. Use when complex task decomposition is needed.
content-media
Multi-tenant architecture design. Tenant isolation strategies, RLS, routing, and scale design for SaaS.
development
Static security analysis agent. Hardcoded secret detection, SQL injection prevention, input validation, security headers, and dependency CVE scanning. Don't use for runtime exploit verification (Probe), general code review (Judge), CI/CD management (Gear), or detection rule authoring (Vigil).