plugins/backend-toolkit/skills/ai-llm-backend/SKILL.md
Build LLM features on the backend — deterministic agent loops (round-trip every tool call by id), RAG over a vector store, token/cost accounting, streaming, eval harness, and prompt-injection defense (treat all model context as untrusted). Use when adding an AI feature, building RAG, or wiring an agent loop. Not for the AI streaming UI on the frontend (use frontend-toolkit's AI integration) or general boundary input parsing (use data-validation).
npx skillsauth add jaykim88/claude-ai-engineering ai-llm-backendInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Build production LLM features that are deterministic where they must be, cost-controlled, observable, and safe against prompt injection — rather than a fragile prompt glued to an API call.
Universal — agent-loop discipline, RAG architecture, token accounting, streaming, eval, and treating model context as untrusted are LLM-backend principles independent of the model vendor; pgvector/Postgres is the default vector store.
Distinguish workflows from agents
Make the agent loop deterministic and bounded
tool_use_id; OpenAI calls it tool_call_id)RAG: keep the vector store in the existing database
Account for tokens and cost per call — and survive provider limits
resilience-patterns); cap parallel in-flight calls per key; for high-availability paths, define a model-fallback chain (primary → secondary → cached/degraded)caching-strategy (this is what makes evals cheap to re-run)Stream responses
ReadableStream for token-by-token output (pairs with frontend-toolkit AI streaming)Treat ALL model-context content as untrusted (prompt injection is structural)
user role, never concatenated into the system prompt or a tool description. Same for retrieved docs — wrap each as a user message with a clear "untrusted retrieved content" boundarydata-validation); a Human-in-the-loop gate for high-stakes actions; least-privilege toolsBuild an eval harness
Validate (validation loop)
| ❌ Anti-pattern | ✅ Correct | |---|---| | Agent loop with no iteration cap | Bounded loop + tool-result compaction | | Mismatched/ignored tool-call id | Round-trip every tool call by id | | Trusting retrieved docs / tool output as safe | Treat all context as untrusted; gate privileged actions | | No token/cost logging | Per-call token + cost accounting + budgets | | Shipping prompt changes with no eval | Eval harness gates prompt/model changes | | Standing up a second vector DB when your DB can store vectors | Vector store in the existing database | | Swapping embedding models without a reindex plan | Pin the embedding model id with each vector; reindex (or dual-write) before swap | | User text concatenated into the system prompt | Channel separation: user content in the user role only | | No backoff / fallback on provider 429 / 503 | Exponential backoff + jitter + parallel-call cap; multi-model fallback for critical paths |
| Tier | Examples | Action SLA | |---|---|---| | Critical | Prompt injection can trigger a privileged action (delete data, send money); unbounded agent loop / cost; raw model output executed; embedding model swapped with no reindex (RAG silently returns garbage) | Block release; fix immediately | | Major | No token/cost accounting; no eval harness; tool-call-id mishandling causing failures; user content mixed into the system prompt (collapses channel-separation defense) | Fix this sprint | | Minor | Suboptimal chunking; ANN index params untuned; missing stream cancellation; no multi-model fallback | Schedule within 2 sprints |
feat(ai): RAG over <corpus> with pgvector / feat(ai): eval harness for <feature>tool_use_id on every tool_result; cap iterationspgvector extension, vector column, HNSW index (USING hnsw (embedding vector_cosine_ops)); embeddings via the model providerReadableStream; pairs with frontend-toolkit ai-llm UIusage (input/output tokens) per call to observabilitypgvector-python; LangChain/LlamaIndex optional (prefer thin)pgxdata-validation — tool inputs and model outputs are untrusted — parse themobservability-setup — token/cost/latency are first-class metrics for AI featurescaching-strategy — cache embeddings and deterministic completionstool_use_id), cap tool-result tokens, stream with explicit cost accounting. Two operational landmines: changing the embedding model invalidates all stored vectors (plan the reindex), and provider rate-limit / outage handling needs explicit backoff + multi-model fallback for critical paths.development
Design webhooks correctly on both sides — sending (HMAC signing, retries with backoff, at-least-once) and receiving (verify signature on raw body, enqueue + 200 fast, dedupe on event id). Use when adding webhook delivery or consuming a provider's webhooks. Not for internal service-to-service events (use async-messaging) or general outbound-call retry policy (use resilience-patterns).
testing
Use transactions and isolation levels correctly — keep them short, no network calls inside, explicit isolation, retry on serialization conflicts, and choose optimistic vs pessimistic locking. Use when a write spans multiple tables, when concurrent updates corrupt data, or when designing money/inventory flows. Not for cross-service event delivery (use async-messaging Outbox) or schema-level constraints (use schema-design).
development
Backend testing pyramid — unit for pure logic, integration against a real DB (Testcontainers), and consumer-driven contract testing (Pact) for service boundaries. Use before a feature, after a bug fix, or when services break each other on deploy. Not for load testing (use performance-profiling) or security testing (use backend-security-audit).
data-ai
Design a relational schema — normalize to 3NF then denormalize with justification, choose the right Postgres index type per data shape, enforce constraints at the DB. Use when modeling a new domain, when queries are slow, or before a migration. Not for diagnosing slow queries (use query-optimization) or shipping the change without downtime (use migration-strategy).