src/skills/ai-infrastructure-modal/SKILL.md
Serverless GPU compute platform for AI model deployment — web endpoints, GPU functions, model serving, and TypeScript client patterns
Quick Guide: Modal is a serverless GPU compute platform where you define Python functions with decorators and Modal handles containers, scaling, and GPU provisioning. TypeScript apps interact with Modal via HTTP endpoints (calling @modal.fastapi_endpoint or @modal.asgi_app functions) or the modal npm SDK (calling functions directly via gRPC). Define container images, secrets, and volumes as code -- no YAML config files. Use modal deploy for production, modal serve for dev.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use @modal.fastapi_endpoint (not the old @modal.web_endpoint) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use modal.Volume for model weight caching -- @modal.build is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use modal.Secret.from_name() and access via os.environ)
(You MUST bind to 0.0.0.0 (not 127.0.0.1) when using @modal.web_server)
</critical_requirements>
Auto-detection: Modal, modal, modal.App, modal.Image, modal.Volume, modal.Secret, modal.gpu, modal.fastapi_endpoint, modal.asgi_app, modal.web_server, modal.Cron, modal.Period, modal deploy, modal serve, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, ModalClient
When to use:
- modal npm SDK
Key patterns covered:
- @modal.fastapi_endpoint, @modal.asgi_app, @modal.web_server) for HTTP access
- modal npm SDK)
When NOT to use:
Modal eliminates infrastructure management for GPU workloads. Everything is code -- container images, GPU allocation, secrets, volumes, scaling rules. There are no YAML configs, Dockerfiles, or Kubernetes manifests.
Core principles:
- modal npm SDK for direct function invocation without HTTP overhead.
- modal deploy creates a named, persistent deployment with stable URLs. modal serve creates ephemeral dev endpoints.
The most common pattern: define a Python endpoint on Modal, call it from TypeScript via fetch.
# inference.py
import modal

app = modal.App("my-inference-api")
image = modal.Image.debian_slim().uv_pip_install(["fastapi[standard]", "transformers", "torch"])

@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST")
def predict(payload: dict):
    # GPU-accelerated inference
    text = payload["text"]
    result = run_model(text)  # run_model is a placeholder for your own inference code
    return {"prediction": result}
const response = await fetch(MODAL_ENDPOINT, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(input),
  signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS), // Essential for cold starts
});
Key requirements: Named constant for URL (not hardcoded at call sites), Content-Type: application/json header (FastAPI rejects without it), AbortSignal.timeout() to handle cold start delays, typed request/response interfaces.
See examples/core.md for a complete TypeScript client with error handling and typed interfaces.
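The key requirements above can be sketched as a small typed client. This is a minimal illustration, not the client from examples/core.md: the endpoint URL, the timeout value, and the names PredictRequest, PredictResponse, buildPredictInit, and callPredict are all assumptions made for the example.

```typescript
// Hypothetical endpoint URL and timeout; substitute your own deployment's values.
export const MODAL_ENDPOINT =
  "https://example-workspace--my-inference-api-predict.modal.run";
export const REQUEST_TIMEOUT_MS = 60_000;

export interface PredictRequest {
  text: string;
}

export interface PredictResponse {
  prediction: unknown; // Shape depends on what the Python function returns
}

interface JsonPostInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the method, headers, and JSON body for the POST request.
export function buildPredictInit(input: PredictRequest): JsonPostInit {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(input),
  };
}

// Sends the request; the abort signal caps how long a cold start may take.
export async function callPredict(
  input: PredictRequest,
): Promise<PredictResponse> {
  const response = await fetch(MODAL_ENDPOINT, {
    ...buildPredictInit(input),
    signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
  });
  if (!response.ok) {
    throw new Error(`Modal endpoint returned HTTP ${response.status}`);
  }
  return (await response.json()) as PredictResponse;
}
```

Keeping the request construction in a pure function makes the headers and body easy to unit test without touching the network.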
Modal supports proxy auth tokens that protect endpoints without spinning up containers for unauthorized requests.
@app.function(image=image, gpu="A10G")
@modal.fastapi_endpoint(method="POST", requires_proxy_auth=True)
def predict_secure(payload: dict):
    return {"prediction": run_model(payload["text"])}
headers: {
  "Content-Type": "application/json",
  "Modal-Key": process.env.MODAL_PROXY_KEY, // Proxy auth token
  "Modal-Secret": process.env.MODAL_PROXY_SECRET,
},
Why good: Auth handled at Modal's proxy layer (no container spin-up for bad requests), env vars for credentials. Add explicit 401 handling in your error logic.
See examples/core.md for a complete authenticated TypeScript client with error handling.
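One way to handle the credentials and the 401 case is sketched below. Environment variables are typed string | undefined in TypeScript, so it helps to validate them before building headers. The names ProxyAuthCredentials, buildProxyAuthHeaders, and assertAuthorized are hypothetical helpers for this example; only the Modal-Key and Modal-Secret header names come from the snippet above.

```typescript
export interface ProxyAuthCredentials {
  tokenId: string | undefined;
  tokenSecret: string | undefined;
}

export const HTTP_UNAUTHORIZED = 401;

// Builds request headers, failing fast when a credential is missing
// instead of sending a request that the proxy would reject.
export function buildProxyAuthHeaders(
  credentials: ProxyAuthCredentials,
): Record<string, string> {
  const { tokenId, tokenSecret } = credentials;
  if (!tokenId || !tokenSecret) {
    throw new Error("Modal proxy auth credentials are not set");
  }
  return {
    "Content-Type": "application/json",
    "Modal-Key": tokenId,
    "Modal-Secret": tokenSecret,
  };
}

// Turns a 401 status into an explicit, descriptive error.
export function assertAuthorized(status: number): void {
  if (status === HTTP_UNAUTHORIZED) {
    throw new Error("Modal rejected the proxy auth token (HTTP 401)");
  }
}
```

In a Node backend this would typically be called with { tokenId: process.env.MODAL_PROXY_KEY, tokenSecret: process.env.MODAL_PROXY_SECRET }, followed by assertAuthorized(response.status) before parsing the body.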
For TypeScript apps that need to call Modal functions without HTTP overhead. Requires Node 22+.
import { ModalClient } from "modal";
const modal = new ModalClient(); // Create once, reuse
const fn = await modal.functions.fromName("my-inference-api", "predict");
const result = await fn.remote([text]); // sync call
const call = await fn.spawn([text]); // async (fire-and-forget)
const later = await call.get(); // retrieve result later
Why good: No HTTP serialization overhead, typed SDK, supports async spawn for long-running jobs
See examples/core.md for complete TypeScript SDK patterns including error handling and fire-and-forget job IDs.
When to use: Backend-to-Modal calls where you control the Node.js runtime (Node 22+). Not for browser or edge runtimes.
Modal functions define their compute environment inline.
import modal

app = modal.App("gpu-inference")

# Container image with ML dependencies
inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install(["torch==2.5.0", "transformers==4.47.0", "accelerate"])
    .apt_install(["libgl1"])
)

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

@app.function(
    image=inference_image,
    gpu="A100",  # GPU type: "T4", "A10G", "A100", "H100", etc.
    secrets=[modal.Secret.from_name("huggingface-secret")],
    volumes={"/models": modal.Volume.from_name("model-cache", create_if_missing=True)},
    min_containers=1,  # Keep warm to avoid cold starts
    scaledown_window=300,  # Seconds before scaling to zero
)
def generate(prompt: str) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load from volume cache
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/models")
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, cache_dir="/models")
    # ... generate and return
Why good: Pinned dependency versions, volume-based model caching (avoids re-download), min_containers for warm starts, secrets for HF token
# BAD: No version pinning, no volume cache, model re-downloads every cold start
@app.function(gpu="A100")
def generate(prompt: str):
    from transformers import pipeline

    pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
    return pipe(prompt)[0]["generated_text"]
Why bad: Unpinned deps break reproducibility, no volume means multi-GB model download on every cold start (30-60s+ delay), no secret for gated models
For endpoints that need routing, middleware, or multiple routes.
import modal
from fastapi import FastAPI

app = modal.App("my-api")
web_app = FastAPI()

@web_app.post("/predict")
async def predict(payload: dict):
    return {"result": "prediction"}

@web_app.get("/health")
async def health():
    return {"status": "ok"}

@app.function(image=modal.Image.debian_slim().uv_pip_install(["fastapi"]))
@modal.asgi_app()
def serve():
    return web_app
Why good: Full FastAPI capabilities (routing, middleware, validation), multiple endpoints under one function
# Creating secrets via CLI
# modal secret create my-api-keys API_KEY=sk-xxx DB_URL=postgres://...

@app.function(
    secrets=[
        modal.Secret.from_name("my-api-keys"),
        modal.Secret.from_name("huggingface-secret"),
    ]
)
def my_function():
    import os

    api_key = os.environ["API_KEY"]  # Injected by Modal
    hf_token = os.environ["HF_TOKEN"]
Why good: Secrets created via dashboard or CLI, referenced by name in code, accessed as standard env vars, multiple secrets composable
@app.function(
    schedule=modal.Cron("0 2 * * *"),  # 2 AM daily
    image=inference_image,
    gpu="A10G",
    volumes={"/data": modal.Volume.from_name("training-data")},
)
def nightly_batch_inference():
    # Process accumulated data
    # Write results to volume
    pass

@app.function(schedule=modal.Period(hours=6))
def periodic_health_check():
    # Check model freshness, data quality, etc.
    pass
Why good: modal.Cron for precise scheduling, modal.Period for intervals. Scheduled functions cannot accept arguments -- use volumes or secrets for input data.
<decision_framework>
Does your TypeScript app need to call Modal?
+-- Via HTTP (most common)
|   +-- Single endpoint? -> @modal.fastapi_endpoint
|   +-- Multiple routes? -> @modal.asgi_app with FastAPI
|   +-- Non-Python server (vLLM, TGI)? -> @modal.web_server(port=8000)
|   +-- Need auth? -> Add requires_proxy_auth=True
+-- Via SDK (direct gRPC)
|   +-- Node 22+ backend? -> npm install modal, use ModalClient
|   +-- Browser/edge? -> Use HTTP endpoints instead
+-- Async job?
    +-- Fire-and-forget? -> SDK spawn() + later get()
    +-- Webhook callback? -> Modal calls your endpoint on completion
What are you serving?
+-- Simple function -> @modal.fastapi_endpoint (auto-wraps in FastAPI)
+-- Full web app -> @modal.asgi_app (FastAPI, Starlette, FastHTML)
+-- Legacy sync app -> @modal.wsgi_app (Flask, Django)
+-- Custom server binary -> @modal.web_server(port=8000) (vLLM, TGI, Ollama)
How should TypeScript call Modal?
+-- Browser/edge runtime? -> HTTP (fetch)
+-- Server-side Node 22+? -> Either works
|   +-- Need simplicity? -> HTTP
|   +-- Need speed (no serialization overhead)? -> SDK
|   +-- Need async spawn? -> SDK
+-- Multiple providers? -> HTTP (vendor-agnostic)
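The last tree can also be written as a small pure function, which may be handy when the choice is made from configuration. This is only a sketch: the types, field names, and the precedence between branches (vendor-agnostic HTTP wins over SDK features, and HTTP is the default when nothing else is requested) are assumptions, since the tree itself does not define an order.

```typescript
export type Runtime = "browser" | "edge" | "node";
export type Transport = "http" | "sdk";

export interface TransportNeeds {
  runtime: Runtime;
  nodeMajorVersion?: number; // Only meaningful when runtime is "node"
  needsAsyncSpawn?: boolean;
  needsLowOverhead?: boolean;
  multipleProviders?: boolean;
}

export const MIN_SDK_NODE_MAJOR = 22;

export function chooseTransport(needs: TransportNeeds): Transport {
  // Browser and edge runtimes cannot use the SDK.
  if (needs.runtime !== "node") return "http";
  // The SDK requires Node 22+; an unknown version is treated as too old.
  if ((needs.nodeMajorVersion ?? 0) < MIN_SDK_NODE_MAJOR) return "http";
  // Vendor-agnostic setups stay on plain HTTP (assumed precedence).
  if (needs.multipleProviders) return "http";
  // SDK-only benefits: async spawn and no HTTP serialization overhead.
  if (needs.needsAsyncSpawn || needs.needsLowOverhead) return "sdk";
  // Otherwise prefer the simpler option.
  return "http";
}
```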
</decision_framework>
<red_flags>
High Priority Issues:
- Using @modal.web_endpoint instead of @modal.fastapi_endpoint (renamed in Modal 1.0)
- Using @modal.build for downloading model weights (deprecated -- use modal.Volume instead)
- Using .lookup() for object references (deprecated -- use .from_name())
- Hardcoding secrets in Modal code (use modal.Secret.from_name() + os.environ)
- Binding @modal.web_server to 127.0.0.1 instead of 0.0.0.0 (endpoint unreachable)
Medium Priority Issues:
- Unpinned package versions in uv_pip_install() (breaks reproducibility)
- No min_containers for latency-sensitive endpoints (2-4s cold starts)
- No signal: AbortSignal.timeout() on TypeScript fetch calls (hangs on cold starts)
- Using modal.Period when you need exact times (use modal.Cron -- Period resets on redeploy)
Common Mistakes:
- Confusing modal serve (ephemeral dev) with modal deploy (persistent production)
- Using a messages parameter with @modal.fastapi_endpoint (it is not OpenAI -- it is a plain HTTP endpoint)
- Omitting the Content-Type: application/json header from TypeScript (FastAPI endpoints may reject the request)
- Using the modal npm SDK in browser or edge runtimes (requires Node 22+, native modules)
Gotchas & Edge Cases:
- modal serve URLs get a -dev suffix to avoid production conflicts
- https://<workspace>--<app-name>-<function-name>.modal.run
- version=2)
- modal.Cron maintains schedule across redeploys; modal.Period resets
- str, int, bool, bytes
- Image.add_local_python_source() (automounting removed in 1.0)
</red_flags>
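The URL pattern and the -dev suffix from the gotchas above can be captured in a small helper, so endpoint URLs are derived from named parts rather than pasted at call sites. This is a sketch under assumptions: buildModalEndpointUrl and ModalEndpointParts are hypothetical names, the replacement of underscores with hyphens and the position of the -dev suffix are assumed here, and length limits on long labels are not handled. Always confirm the result against the URL printed by modal deploy or modal serve.

```typescript
export interface ModalEndpointParts {
  workspace: string;
  appName: string;
  functionName: string;
  dev?: boolean; // true for URLs created by `modal serve`
}

export const MODAL_DOMAIN = "modal.run";
export const DEV_SUFFIX = "-dev";

export function buildModalEndpointUrl(parts: ModalEndpointParts): string {
  // Assumption: underscores in names appear as hyphens in the URL label.
  const label = `${parts.appName}-${parts.functionName}`.replace(/_/g, "-");
  const suffix = parts.dev ? DEV_SUFFIX : "";
  return `https://${parts.workspace}--${label}${suffix}.${MODAL_DOMAIN}`;
}
```

For example, a workspace named acme with the my-inference-api app and the predict function would yield https://acme--my-inference-api-predict.modal.run under these assumptions.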
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering,
import type, named constants)
(You MUST define Modal functions in Python -- the TypeScript SDK can call functions and manage resources but cannot define them)
(You MUST use @modal.fastapi_endpoint (not the old @modal.web_endpoint) for simple web endpoints -- renamed in Modal 1.0)
(You MUST use modal.Volume for model weight caching -- @modal.build is deprecated in Modal 1.0)
(You MUST never hardcode secrets in Modal code -- use modal.Secret.from_name() and access via os.environ)
(You MUST bind to 0.0.0.0 (not 127.0.0.1) when using @modal.web_server)
Failure to follow these rules will produce broken deployments, security vulnerabilities, or unreachable endpoints.
</critical_reminders>