dist/plugins/api-ai-modal/skills/api-ai-modal/SKILL.md
Serverless GPU compute platform for AI model deployment — web endpoints, GPU functions, model serving, and TypeScript client patterns
Quick Guide: Modal is a serverless GPU compute platform where you define Python functions with decorators and Modal handles containers, scaling, and GPU provisioning. TypeScript apps interact with Modal via HTTP endpoints (calling @modal.fastapi_endpoint or @modal.asgi_app functions) or the modal npm SDK (calling functions directly via gRPC). Define container images, secrets, and volumes as code -- no YAML config files. Use modal deploy for production, modal serve for dev.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use @modal.fastapi_endpoint (not the old @modal.web_endpoint) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use modal.Volume for model weight caching -- @modal.build is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use modal.Secret.from_name() and access via os.environ)
(You MUST bind to 0.0.0.0 (not 127.0.0.1) when using @modal.web_server)
</critical_requirements>
Auto-detection: Modal, modal, modal.App, modal.Image, modal.Volume, modal.Secret, modal.gpu, modal.fastapi_endpoint, modal.asgi_app, modal.web_server, modal.Cron, modal.Period, modal deploy, modal serve, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, ModalClient
When to use:
- … modal npm SDK
Key patterns covered:
- … @modal.fastapi_endpoint, @modal.asgi_app, @modal.web_server) for HTTP access
- … modal npm SDK)
When NOT to use:
Modal eliminates infrastructure management for GPU workloads. Everything is code -- container images, GPU allocation, secrets, volumes, scaling rules. There are no YAML configs, Dockerfiles, or Kubernetes manifests.
Core principles:
- … modal npm SDK for direct function invocation without HTTP overhead.
- modal deploy creates a named, persistent deployment with stable URLs. modal serve creates ephemeral dev endpoints.
When to use Modal:
When NOT to use:
The most common pattern: define a Python endpoint on Modal, call it from TypeScript via fetch.
# inference.py
import modal

app = modal.App("my-inference-api")
# FastAPI must be installed in the image for @modal.fastapi_endpoint
image = modal.Image.debian_slim().uv_pip_install(
    ["transformers", "torch", "fastapi[standard]"]
)


@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST")
def predict(payload: dict):
    # GPU-accelerated inference; run_model is a placeholder for your own model call
    text = payload["text"]
    result = run_model(text)
    return {"prediction": result}
// lib/modal-client.ts
const MODAL_ENDPOINT =
  "https://your-workspace--my-inference-api-predict.modal.run";
const REQUEST_TIMEOUT_MS = 30_000;

interface PredictionRequest {
  text: string;
}

interface PredictionResponse {
  prediction: string;
}

async function predict(input: PredictionRequest): Promise<PredictionResponse> {
  const response = await fetch(MODAL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(input),
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  });

  if (!response.ok) {
    throw new Error(
      `Modal API error [${response.status}]: ${await response.text()}`,
    );
  }

  return response.json() as Promise<PredictionResponse>;
}

export { predict };
Why good: Clean separation -- Python handles GPU compute, TypeScript handles the app. Named constants for config, typed request/response, timeout handling, proper error checking.
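The cast on the last line of predict trusts the endpoint to return the expected shape. Where that matters, a small runtime guard can check the payload before it reaches the rest of the app. The following is a minimal sketch rather than part of the original client: isPredictionResponse and parsePredictionResponse are hypothetical helper names, and the response shape is assumed to be the { prediction: string } interface shown above.

```typescript
// lib/modal-response.ts
// Hypothetical helpers (not part of Modal's SDK): validate the JSON body at runtime.
interface PredictionResponse {
  prediction: string;
}

// Type guard: true only for non-null objects with a string `prediction` field.
function isPredictionResponse(value: unknown): value is PredictionResponse {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  return typeof (value as { prediction?: unknown }).prediction === "string";
}

// Throws instead of letting a malformed payload propagate through the app.
function parsePredictionResponse(value: unknown): PredictionResponse {
  if (!isPredictionResponse(value)) {
    throw new Error("Unexpected response shape from Modal endpoint");
  }
  return value;
}

export { isPredictionResponse, parsePredictionResponse };
```

With this in place, the final line of predict could read `return parsePredictionResponse(await response.json());` instead of using a cast.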
// BAD: No timeout, no error handling, hardcoded URL in call site
const res = await fetch("https://workspace--app-fn.modal.run", {
  method: "POST",
  body: JSON.stringify({ text: "hello" }),
});
const data = await res.json();
Why bad: No Content-Type header (may fail), no timeout (hangs on cold start), no error checking, URL scattered across codebase
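One way to keep endpoint URLs out of call sites is to build them in a single place. The sketch below assumes the URL pattern https://<workspace>--<app-name>-<function-name>.modal.run and that modal serve appends -dev to the label. buildModalEndpointUrl and ModalEndpointParts are hypothetical names, and the helper assumes the names passed in are already in the form that appears in the deployed URL.

```typescript
// lib/modal-url.ts
// Hypothetical helper (not part of Modal's SDK): build endpoint URLs in one place.
const MODAL_DOMAIN = "modal.run";
const DEV_SUFFIX = "-dev"; // assumption: suffix used by `modal serve` endpoints

interface ModalEndpointParts {
  workspace: string;
  appName: string;
  functionName: string;
  /** Set to true for ephemeral `modal serve` endpoints. */
  isDev?: boolean;
}

function buildModalEndpointUrl(parts: ModalEndpointParts): string {
  const { workspace, appName, functionName, isDev = false } = parts;
  // Label format: <workspace>--<app-name>-<function-name>
  const label = `${workspace}--${appName}-${functionName}`;
  return `https://${label}${isDev ? DEV_SUFFIX : ""}.${MODAL_DOMAIN}`;
}

export { buildModalEndpointUrl };
export type { ModalEndpointParts };
```

Checking the generated URL against the one printed by modal deploy or modal serve is still worthwhile, since this helper only mirrors the naming pattern.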
Modal supports proxy auth tokens that protect endpoints without spinning up containers for unauthorized requests.
@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST", requires_proxy_auth=True)
def predict_secure(payload: dict):
    return {"prediction": run_model(payload["text"])}
// lib/modal-client.ts
const MODAL_KEY = process.env.MODAL_PROXY_KEY;
const MODAL_SECRET = process.env.MODAL_PROXY_SECRET;

async function predictSecure(
  input: PredictionRequest,
): Promise<PredictionResponse> {
  if (!MODAL_KEY || !MODAL_SECRET) {
    throw new Error("Missing MODAL_PROXY_KEY or MODAL_PROXY_SECRET env vars");
  }

  const response = await fetch(MODAL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Modal-Key": MODAL_KEY,
      "Modal-Secret": MODAL_SECRET,
    },
    body: JSON.stringify(input),
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  });

  if (response.status === 401) {
    throw new Error(
      "Modal proxy auth failed -- check MODAL_PROXY_KEY and MODAL_PROXY_SECRET",
    );
  }

  if (!response.ok) {
    throw new Error(
      `Modal API error [${response.status}]: ${await response.text()}`,
    );
  }

  return response.json() as Promise<PredictionResponse>;
}
Why good: Auth handled at Modal's proxy layer (no container spin-up for bad requests), env vars for credentials, explicit 401 handling
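If several functions call proxy-protected endpoints, the header construction can be pulled into one helper that is easy to unit test. This is a sketch under assumptions: buildProxyAuthHeaders and EnvSource are hypothetical names, and the environment is passed in as a plain record (for example process.env) so the helper does not read global state itself.

```typescript
// lib/modal-auth.ts
// Hypothetical helper (not part of Modal's SDK): build proxy auth headers from env vars.
type EnvSource = Record<string, string | undefined>;

const PROXY_KEY_VAR = "MODAL_PROXY_KEY";
const PROXY_SECRET_VAR = "MODAL_PROXY_SECRET";

function buildProxyAuthHeaders(env: EnvSource): Record<string, string> {
  const key = env[PROXY_KEY_VAR];
  const secret = env[PROXY_SECRET_VAR];
  // Fail fast before any request is sent if either credential is missing or empty
  if (!key || !secret) {
    throw new Error(`Missing ${PROXY_KEY_VAR} or ${PROXY_SECRET_VAR} env vars`);
  }
  return {
    "Content-Type": "application/json",
    "Modal-Key": key,
    "Modal-Secret": secret,
  };
}

export { buildProxyAuthHeaders };
export type { EnvSource };
```

In predictSecure, `headers: buildProxyAuthHeaders(process.env)` would then replace the inline header object and the manual env check.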
For TypeScript apps that need to call Modal functions without HTTP overhead. Requires Node 22+.
import { ModalClient } from "modal";

const modal = new ModalClient(); // Create once, reuse
const fn = await modal.functions.fromName("my-inference-api", "predict");

const text = "hello"; // example input
const result = await fn.remote([text]); // call and wait for the result
const call = await fn.spawn([text]); // start the call without waiting (fire-and-forget)
const later = await call.get(); // retrieve the result later
Why good: No HTTP serialization overhead, typed SDK, supports async spawn for long-running jobs
See examples/core.md for complete TypeScript SDK patterns including error handling and fire-and-forget job IDs.
When to use: Backend-to-Modal calls where you control the Node.js runtime (Node 22+). Not for browser or edge runtimes.
Modal functions define their compute environment inline.
import modal

app = modal.App("gpu-inference")

# Container image with ML dependencies
inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install(["torch==2.5.0", "transformers==4.47.0", "accelerate"])
    .apt_install(["libgl1"])
)

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"


@app.function(
    image=inference_image,
    gpu="A100",  # GPU type: "T4", "A10G", "A100", "H100", etc.
    secrets=[modal.Secret.from_name("huggingface-secret")],
    volumes={"/models": modal.Volume.from_name("model-cache", create_if_missing=True)},
    min_containers=1,  # Keep warm to avoid cold starts
    scaledown_window=300,  # Seconds before scaling to zero
)
def generate(prompt: str) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load from volume cache
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/models")
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, cache_dir="/models")
    # ... generate and return
Why good: Pinned dependency versions, volume-based model caching (avoids re-download), min_containers for warm starts, secrets for HF token
# BAD: No version pinning, no volume cache, model re-downloads every cold start
@app.function(gpu="A100")
def generate(prompt: str):
    from transformers import pipeline

    pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
    return pipe(prompt)[0]["generated_text"]
Why bad: Unpinned deps break reproducibility, no volume means multi-GB model download on every cold start (30-60s+ delay), no secret for gated models
For endpoints that need routing, middleware, or multiple routes.
import modal
from fastapi import FastAPI

app = modal.App("my-api")
web_app = FastAPI()


@web_app.post("/predict")
async def predict(payload: dict):
    return {"result": "prediction"}


@web_app.get("/health")
async def health():
    return {"status": "ok"}


@app.function(image=modal.Image.debian_slim().uv_pip_install(["fastapi"]))
@modal.asgi_app()
def serve():
    return web_app
Why good: Full FastAPI capabilities (routing, middleware, validation), multiple endpoints under one function
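On the TypeScript side, a multi-route ASGI app is reached through one base URL plus a path per route. The sketch below assumes the base URL follows the <workspace>--<app-name>-<function-name> pattern for the serve function above, with your-workspace as a placeholder. ROUTES, RouteName, and routeUrl are hypothetical names.

```typescript
// lib/modal-routes.ts
// Assumption: base URL of the deployed `serve` function; the workspace name is a placeholder.
const MODAL_API_BASE = "https://your-workspace--my-api-serve.modal.run";

// Paths mirror the FastAPI routes defined in the Python app.
const ROUTES = {
  predict: "/predict",
  health: "/health",
} as const;

type RouteName = keyof typeof ROUTES;

function routeUrl(route: RouteName, base: string = MODAL_API_BASE): string {
  // Strip trailing slashes so the joined URL never contains "//" before the path
  return `${base.replace(/\/+$/, "")}${ROUTES[route]}`;
}

export { ROUTES, routeUrl };
export type { RouteName };
```

Keeping the route table next to the base URL means a renamed FastAPI route only needs to change in one place on the client.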
# Creating secrets via CLI
# modal secret create my-api-keys API_KEY=sk-xxx DB_URL=postgres://...
@app.function(
    secrets=[
        modal.Secret.from_name("my-api-keys"),
        modal.Secret.from_name("huggingface-secret"),
    ]
)
def my_function():
    import os

    api_key = os.environ["API_KEY"]  # Injected by Modal
    hf_token = os.environ["HF_TOKEN"]
Why good: Secrets created via dashboard or CLI, referenced by name in code, accessed as standard env vars, multiple secrets composable
@app.function(
    schedule=modal.Cron("0 2 * * *"),  # 2 AM daily
    image=inference_image,
    gpu="A10G",
    volumes={"/data": modal.Volume.from_name("training-data")},
)
def nightly_batch_inference():
    # Process accumulated data
    # Write results to volume
    pass


@app.function(schedule=modal.Period(hours=6))
def periodic_health_check():
    # Check model freshness, data quality, etc.
    pass
Why good: modal.Cron for precise scheduling, modal.Period for intervals. Scheduled functions cannot accept arguments -- use volumes or secrets for input data.
<decision_framework>
Does your TypeScript app need to call Modal?
+-- Via HTTP (most common)
| +-- Single endpoint? -> @modal.fastapi_endpoint
| +-- Multiple routes? -> @modal.asgi_app with FastAPI
| +-- Non-Python server (vLLM, TGI)? -> @modal.web_server(port=8000)
| +-- Need auth? -> Add requires_proxy_auth=True
+-- Via SDK (direct gRPC)
| +-- Node 22+ backend? -> npm install modal, use ModalClient
| +-- Browser/edge? -> Use HTTP endpoints instead
+-- Async job?
+-- Fire-and-forget? -> SDK spawn() + later get()
+-- Webhook callback? -> Modal calls your endpoint on completion
What are you serving?
+-- Simple function -> @modal.fastapi_endpoint (auto-wraps in FastAPI)
+-- Full web app -> @modal.asgi_app (FastAPI, Starlette, FastHTML)
+-- Legacy sync app -> @modal.wsgi_app (Flask, Django)
+-- Custom server binary -> @modal.web_server(port=8000) (vLLM, TGI, Ollama)
How should TypeScript call Modal?
+-- Browser/edge runtime? -> HTTP (fetch)
+-- Server-side Node 22+? -> Either works
| +-- Need simplicity? -> HTTP
| +-- Need speed (no serialization overhead)? -> SDK
| +-- Need async spawn? -> SDK
+-- Multiple providers? -> HTTP (vendor-agnostic)
</decision_framework>
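The "What are you serving?" and "How should TypeScript call Modal?" trees can also be written down as small pure functions, which is handy for configuration code or documentation tests. This is only a sketch: the type names, field names, and the precedence between branches (runtime first, then provider neutrality, then SDK-only needs) are assumptions made for illustration, not rules stated by Modal.

```typescript
// lib/modal-decisions.ts
// Hypothetical encoding of the decision trees above; names and precedence are assumptions.
const MIN_SDK_NODE_MAJOR = 22;

const DECORATOR_BY_WORKLOAD = {
  "simple-function": "@modal.fastapi_endpoint",
  "full-web-app": "@modal.asgi_app",
  "legacy-sync-app": "@modal.wsgi_app",
  "custom-server": "@modal.web_server",
} as const;

type Workload = keyof typeof DECORATOR_BY_WORKLOAD;
type Transport = "http" | "sdk";

interface CallContext {
  runtime: "browser" | "edge" | "node";
  nodeMajorVersion?: number;
  needsAsyncSpawn?: boolean;
  needsLowOverhead?: boolean;
  multipleProviders?: boolean;
}

function chooseDecorator(workload: Workload): string {
  return DECORATOR_BY_WORKLOAD[workload];
}

function chooseTransport(ctx: CallContext): Transport {
  // Browser and edge runtimes cannot use the SDK
  if (ctx.runtime !== "node") return "http";
  // The SDK needs Node 22+; an unknown version is treated as too old
  if ((ctx.nodeMajorVersion ?? 0) < MIN_SDK_NODE_MAJOR) return "http";
  // Vendor-agnostic setups stay on plain HTTP
  if (ctx.multipleProviders) return "http";
  // Async spawn and low call overhead are the SDK-only advantages
  if (ctx.needsAsyncSpawn || ctx.needsLowOverhead) return "sdk";
  // Default to the simpler option
  return "http";
}

export { chooseDecorator, chooseTransport };
export type { CallContext, Transport, Workload };
```

For example, `chooseTransport({ runtime: "node", nodeMajorVersion: 22, needsAsyncSpawn: true })` selects the SDK, while any browser or edge context falls back to HTTP.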
<red_flags>
High Priority Issues:
- Using @modal.web_endpoint instead of @modal.fastapi_endpoint (renamed in Modal 1.0)
- Using @modal.build for downloading model weights (deprecated -- use modal.Volume instead)
- Using .lookup() for object references (deprecated -- use .from_name())
- Hardcoding secrets in Modal code (use modal.Secret.from_name() + os.environ)
- Binding @modal.web_server to 127.0.0.1 instead of 0.0.0.0 (endpoint unreachable)
Medium Priority Issues:
- Unpinned package versions in uv_pip_install() (breaks reproducibility)
- No min_containers for latency-sensitive endpoints (2-4s cold starts)
- No signal: AbortSignal.timeout() on TypeScript fetch calls (hangs on cold starts)
- Using modal.Period when you need exact times (use modal.Cron -- Period resets on redeploy)
Common Mistakes:
- Confusing modal serve (ephemeral dev) with modal deploy (persistent production)
- Sending a messages parameter with @modal.fastapi_endpoint (it is not OpenAI -- it is a plain HTTP endpoint)
- Omitting the Content-Type: application/json header from TypeScript (FastAPI endpoints may reject the request)
- Using the modal npm SDK in browser or edge runtimes (requires Node 22+, native modules)
Gotchas & Edge Cases:
- modal serve URLs get a -dev suffix to avoid production conflicts
- Endpoint URL format: https://<workspace>--<app-name>-<function-name>.modal.run
- … version=2)
- modal.Cron maintains schedule across redeploys; modal.Period resets
- … str, int, bool, bytes
- … Image.add_local_python_source() (automounting removed in 1.0)
</red_flags>
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use @modal.fastapi_endpoint (not the old @modal.web_endpoint) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use modal.Volume for model weight caching -- @modal.build is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use modal.Secret.from_name() and access via os.environ)
(You MUST bind to 0.0.0.0 (not 127.0.0.1) when using @modal.web_server)
Failure to follow these rules will produce broken deployments, security vulnerabilities, or unreachable endpoints.
</critical_reminders>