Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

jaykim88/ai-llm-backend

Name: ai-llm-backend
Author: jaykim88

plugins/backend-toolkit/skills/ai-llm-backend/SKILL.md

npx skillsauth add jaykim88/claude-ai-engineering ai-llm-backend

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

AI / LLM Backend

Purpose

Build production LLM features that are deterministic where they must be, cost-controlled, observable, and safe against prompt injection — rather than a fragile prompt glued to an API call.

Universal — agent-loop discipline, RAG architecture, token accounting, streaming, eval, and treating model context as untrusted are LLM-backend principles independent of the model vendor; pgvector/Postgres is the default vector store.

Procedure

Distinguish workflows from agents
- Workflow — predefined LLM call sequence (classify → extract → format); deterministic, cheaper, debuggable; prefer this
- Agent — model dynamically chooses tools/steps in a loop; powerful but less predictable; use only when the path genuinely can't be predefined
Make the agent loop deterministic and bounded
- Round-trip every tool call correctly: each tool-use block gets a matching tool-result carrying the SAME tool-call id — mismatches corrupt the conversation (Anthropic names this field tool_use_id; OpenAI calls it tool_call_id)
- Cap loop iterations (no infinite tool-calling); cap tool-result token size (compaction)
- Invest in the tool interface (clear schemas + descriptions) as much as the prompt
RAG: keep the vector store in the existing database
- One datastore avoids operating a second system; an embeddings column alongside your data is enough for most workloads
- Use an approximate-nearest-neighbour (ANN) index tuned for the recall/latency balance you need
- Chunk deliberately, store source metadata for citations, retrieve top-k then re-rank
- Pin the embedding model id with each stored vector — changing the embedding model invalidates every existing vector (different model = different vector space). Plan a reindex (or dual-write embeddings during a window) before swapping; this is a one-way migration and the #1 RAG operational landmine
Account for tokens and cost per call — and survive provider limits
- Log input/output tokens + model + cost for every LLM call
- Set per-user / per-session budgets; alert on spikes (a runaway agent loop or prompt-injection can explode cost)
- Handle 429/503 from the provider with exponential backoff + jitter (see resilience-patterns); cap parallel in-flight calls per key; for high-availability paths, define a model-fallback chain (primary → secondary → cached/degraded)
- For deterministic prompts (temperature = 0, same input) cache the response — see caching-strategy (this is what makes evals cheap to re-run)
Stream responses
- SSE / ReadableStream for token-by-token output (pairs with frontend-toolkit AI streaming)
- Handle mid-stream cancellation (client disconnect → stop generation → stop billing)
Treat ALL model-context content as untrusted (prompt injection is structural)
- User input, retrieved documents, tool results — all can carry injection; you can't fully "patch" it
- Channel separation is the structural defense: keep user-supplied content in the user role, never concatenated into the system prompt or a tool description. Same for retrieved docs — wrap each as a user message with a clear "untrusted retrieved content" boundary
- Defenses: never let the model's raw output trigger privileged actions without a gate; validate/parse tool arguments (see data-validation); a Human-in-the-loop gate for high-stakes actions; least-privilege tools
Build an eval harness
- A fixed test set of inputs + expected properties; score outputs (exact, rubric, LLM-judge)
- Run on prompt/model changes — regressions in LLM features are invisible without evals
Validate (validation loop)
- Run the eval set; if quality drops below threshold on a prompt/model change → revert or fix and re-run
- Inject a prompt-injection payload via a retrieved doc → verify it can't trigger a privileged action
- Force a tool error → verify the loop handles it (tool-call id still round-tripped, doesn't hang)

Anti-patterns

| ❌ Anti-pattern | ✅ Correct | |---|---| | Agent loop with no iteration cap | Bounded loop + tool-result compaction | | Mismatched/ignored tool-call id | Round-trip every tool call by id | | Trusting retrieved docs / tool output as safe | Treat all context as untrusted; gate privileged actions | | No token/cost logging | Per-call token + cost accounting + budgets | | Shipping prompt changes with no eval | Eval harness gates prompt/model changes | | Standing up a second vector DB when your DB can store vectors | Vector store in the existing database | | Swapping embedding models without a reindex plan | Pin the embedding model id with each vector; reindex (or dual-write) before swap | | User text concatenated into the system prompt | Channel separation: user content in the user role only | | No backoff / fallback on provider 429 / 503 | Exponential backoff + jitter + parallel-call cap; multi-model fallback for critical paths |

Severity tiers

| Tier | Examples | Action SLA | |---|---|---| | Critical | Prompt injection can trigger a privileged action (delete data, send money); unbounded agent loop / cost; raw model output executed; embedding model swapped with no reindex (RAG silently returns garbage) | Block release; fix immediately | | Major | No token/cost accounting; no eval harness; tool-call-id mishandling causing failures; user content mixed into the system prompt (collapses channel-separation defense) | Fix this sprint | | Minor | Suboptimal chunking; ANN index params untuned; missing stream cancellation; no multi-model fallback | Schedule within 2 sprints |

Completion Criteria

[ ] Agent loops bounded + tool calls round-tripped by id
[ ] RAG uses a vector store with an ANN index + source metadata
[ ] Token/cost logged per call + budgets set
[ ] Responses stream with cancellation handling
[ ] All model context treated as untrusted; privileged actions gated
[ ] Eval harness gates prompt/model changes
[ ] Embedding model id pinned per vector; reindex plan in place before any swap
[ ] Provider rate-limit handling (429/503 backoff + parallel cap); multi-model fallback for critical paths

Output

AI feature code: agent loop / RAG pipeline / streaming endpoint
Eval harness: test set + scoring + CI integration
Cost dashboard: per-feature token/cost metrics
Commit format: feat(ai): RAG over <corpus> with pgvector / feat(ai): eval harness for <feature>

Implementation

TypeScript + Postgres(pgvector) + Anthropic SDK (default)

Agent loop: Anthropic SDK tool-use; match tool_use_id on every tool_result; cap iterations
RAG: pgvector extension, vector column, HNSW index (USING hnsw (embedding vector_cosine_ops)); embeddings via the model provider
Streaming: SSE from a NestJS endpoint or ReadableStream; pairs with frontend-toolkit ai-llm UI
Cost: log usage (input/output tokens) per call to observability
Eval: a test suite of prompts + assertions (run in CI)

Other stacks

Python / FastAPI: Anthropic/OpenAI SDK; pgvector via SQLAlchemy or pgvector-python; LangChain/LlamaIndex optional (prefer thin)
Go: provider SDKs; pgvector via pgx
Universal: agent-loop discipline, prompt-injection-is-untrusted, token accounting, and eval are vendor-agnostic; pgvector is Postgres (alternatives: Qdrant/Weaviate, but prefer one datastore)

Related skills

data-validation — tool inputs and model outputs are untrusted — parse them
observability-setup — token/cost/latency are first-class metrics for AI features
caching-strategy — cache embeddings and deterministic completions

Reference

Key insight encoded: Distinguish workflows (predefined, prefer) from agents (dynamic); treat ALL model-context content as untrusted (prompt injection is structural, not patchable) and defend with channel separation — user content stays in the user role, never concatenated into the system prompt; make the loop deterministic — round-trip every tool call by id (tool_use_id), cap tool-result tokens, stream with explicit cost accounting. Two operational landmines: changing the embedding model invalidates all stored vectors (plan the reindex), and provider rate-limit / outage handling needs explicit backoff + multi-model fallback for critical paths.

jaykim88/ai-llm-backend

plugins/backend-toolkit/skills/ai-llm-backend/SKILL.md

Build LLM features on the backend — deterministic agent loops (round-trip every tool call by id), RAG over a vector store, token/cost accounting, streaming, eval harness, and prompt-injection defense (treat all model context as untrusted). Use when adding an AI feature, building RAG, or wiring an agent loop. Not for the AI streaming UI on the frontend (use frontend-toolkit's AI integration) or general boundary input parsing (use data-validation).

tools

Updated Jun 9, 2026

$ install --global

skillsauth

npx skillsauth add jaykim88/claude-ai-engineering ai-llm-backend

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 9, 2026, 8:27 AM301.1s1 file scanned

SKILL.md

name:: ai-llm-backend
description:: Build LLM features on the backend — deterministic agent loops (round-trip every tool call by id), RAG over a vector store, token/cost accounting, streaming, eval harness, and prompt-injection defense (treat all model context as untrusted). Use when adding an AI feature, building RAG, or wiring an agent loop. Not for the AI streaming UI on the frontend (use frontend-toolkit's AI integration) or general boundary input parsing (use data-validation).
license:: MIT

AI / LLM Backend

Purpose

Build production LLM features that are deterministic where they must be, cost-controlled, observable, and safe against prompt injection — rather than a fragile prompt glued to an API call.

Procedure

Distinguish workflows from agents
- Workflow — predefined LLM call sequence (classify → extract → format); deterministic, cheaper, debuggable; prefer this
- Agent — model dynamically chooses tools/steps in a loop; powerful but less predictable; use only when the path genuinely can't be predefined
Make the agent loop deterministic and bounded
- Round-trip every tool call correctly: each tool-use block gets a matching tool-result carrying the SAME tool-call id — mismatches corrupt the conversation (Anthropic names this field tool_use_id; OpenAI calls it tool_call_id)
- Cap loop iterations (no infinite tool-calling); cap tool-result token size (compaction)
- Invest in the tool interface (clear schemas + descriptions) as much as the prompt
RAG: keep the vector store in the existing database
- One datastore avoids operating a second system; an embeddings column alongside your data is enough for most workloads
- Use an approximate-nearest-neighbour (ANN) index tuned for the recall/latency balance you need
- Chunk deliberately, store source metadata for citations, retrieve top-k then re-rank
- Pin the embedding model id with each stored vector — changing the embedding model invalidates every existing vector (different model = different vector space). Plan a reindex (or dual-write embeddings during a window) before swapping; this is a one-way migration and the #1 RAG operational landmine
Account for tokens and cost per call — and survive provider limits
- Log input/output tokens + model + cost for every LLM call
- Set per-user / per-session budgets; alert on spikes (a runaway agent loop or prompt-injection can explode cost)
- Handle 429/503 from the provider with exponential backoff + jitter (see resilience-patterns); cap parallel in-flight calls per key; for high-availability paths, define a model-fallback chain (primary → secondary → cached/degraded)
- For deterministic prompts (temperature = 0, same input) cache the response — see caching-strategy (this is what makes evals cheap to re-run)
Stream responses
- SSE / ReadableStream for token-by-token output (pairs with frontend-toolkit AI streaming)
- Handle mid-stream cancellation (client disconnect → stop generation → stop billing)
Treat ALL model-context content as untrusted (prompt injection is structural)
- User input, retrieved documents, tool results — all can carry injection; you can't fully "patch" it
- Channel separation is the structural defense: keep user-supplied content in the user role, never concatenated into the system prompt or a tool description. Same for retrieved docs — wrap each as a user message with a clear "untrusted retrieved content" boundary
- Defenses: never let the model's raw output trigger privileged actions without a gate; validate/parse tool arguments (see data-validation); a Human-in-the-loop gate for high-stakes actions; least-privilege tools
Build an eval harness
- A fixed test set of inputs + expected properties; score outputs (exact, rubric, LLM-judge)
- Run on prompt/model changes — regressions in LLM features are invisible without evals
Validate (validation loop)
- Run the eval set; if quality drops below threshold on a prompt/model change → revert or fix and re-run
- Inject a prompt-injection payload via a retrieved doc → verify it can't trigger a privileged action
- Force a tool error → verify the loop handles it (tool-call id still round-tripped, doesn't hang)

Anti-patterns

Severity tiers

Completion Criteria

[ ] Agent loops bounded + tool calls round-tripped by id
[ ] RAG uses a vector store with an ANN index + source metadata
[ ] Token/cost logged per call + budgets set
[ ] Responses stream with cancellation handling
[ ] All model context treated as untrusted; privileged actions gated
[ ] Eval harness gates prompt/model changes
[ ] Embedding model id pinned per vector; reindex plan in place before any swap
[ ] Provider rate-limit handling (429/503 backoff + parallel cap); multi-model fallback for critical paths

Output

AI feature code: agent loop / RAG pipeline / streaming endpoint
Eval harness: test set + scoring + CI integration
Cost dashboard: per-feature token/cost metrics
Commit format: feat(ai): RAG over <corpus> with pgvector / feat(ai): eval harness for <feature>

Implementation

TypeScript + Postgres(pgvector) + Anthropic SDK (default)

Agent loop: Anthropic SDK tool-use; match tool_use_id on every tool_result; cap iterations
RAG: pgvector extension, vector column, HNSW index (USING hnsw (embedding vector_cosine_ops)); embeddings via the model provider
Streaming: SSE from a NestJS endpoint or ReadableStream; pairs with frontend-toolkit ai-llm UI
Cost: log usage (input/output tokens) per call to observability
Eval: a test suite of prompts + assertions (run in CI)

Other stacks

Python / FastAPI: Anthropic/OpenAI SDK; pgvector via SQLAlchemy or pgvector-python; LangChain/LlamaIndex optional (prefer thin)
Go: provider SDKs; pgvector via pgx
Universal: agent-loop discipline, prompt-injection-is-untrusted, token accounting, and eval are vendor-agnostic; pgvector is Postgres (alternatives: Qdrant/Weaviate, but prefer one datastore)

Related skills

data-validation — tool inputs and model outputs are untrusted — parse them
observability-setup — token/cost/latency are first-class metrics for AI features
caching-strategy — cache embeddings and deterministic completions

Reference

Key insight encoded: Distinguish workflows (predefined, prefer) from agents (dynamic); treat ALL model-context content as untrusted (prompt injection is structural, not patchable) and defend with channel separation — user content stays in the user role, never concatenated into the system prompt; make the loop deterministic — round-trip every tool call by id (tool_use_id), cap tool-result tokens, stream with explicit cost accounting. Two operational landmines: changing the embedding model invalidates all stored vectors (plan the reindex), and provider rate-limit / outage handling needs explicit backoff + multi-model fallback for critical paths.

Related Skills

jaykim88/webhook-design

development

VerifiedTrustedCommunity

Design webhooks correctly on both sides — sending (HMAC signing, retries with backoff, at-least-once) and receiving (verify signature on raw body, enqueue + 200 fast, dedupe on event id). Use when adding webhook delivery or consuming a provider's webhooks. Not for internal service-to-service events (use async-messaging) or general outbound-call retry policy (use resilience-patterns).

SKILL.mdUpdated Jun 9, 2026

jaykim88/webhook-design

jaykim88/transaction-management

testing

VerifiedTrustedCommunity

Use transactions and isolation levels correctly — keep them short, no network calls inside, explicit isolation, retry on serialization conflicts, and choose optimistic vs pessimistic locking. Use when a write spans multiple tables, when concurrent updates corrupt data, or when designing money/inventory flows. Not for cross-service event delivery (use async-messaging Outbox) or schema-level constraints (use schema-design).

SKILL.mdUpdated Jun 9, 2026

jaykim88/transaction-management

jaykim88/test-strategy

development

VerifiedTrustedCommunity

Backend testing pyramid — unit for pure logic, integration against a real DB (Testcontainers), and consumer-driven contract testing (Pact) for service boundaries. Use before a feature, after a bug fix, or when services break each other on deploy. Not for load testing (use performance-profiling) or security testing (use backend-security-audit).

SKILL.mdUpdated Jun 9, 2026

jaykim88/test-strategy

jaykim88/schema-design

data-ai

VerifiedTrustedCommunity

Design a relational schema — normalize to 3NF then denormalize with justification, choose the right Postgres index type per data shape, enforce constraints at the DB. Use when modeling a new domain, when queries are slow, or before a migration. Not for diagnosing slow queries (use query-optimization) or shipping the change without downtime (use migration-strategy).

SKILL.mdUpdated Jun 9, 2026

jaykim88/schema-design

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/jaykim88/claude-ai-engineering.git

# Copy into Claude Code skills folder (global)
cp -r claude-ai-engineering/plugins/backend-toolkit/skills/ai-llm-backend ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

jaykim88/claude-ai-engineering

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT