plugins/modal-compute/skills/modal-compute-knowledge/SKILL.md
Modal.com serverless compute, GPU, autoscaling, batching, and deployment knowledge. PROACTIVELY activate for: (1) Modal app/function definitions (modal.App, @app.function, @app.cls), (2) running ML/AI workloads serverlessly, (3) GPU configuration (T4, L4, A10G, A100, H100, H200), (4) Modal volumes and shared state, (5) Modal Image builder (apt, pip, run_commands), (6) Modal secrets and environment management, (7) web endpoints (FastAPI, asgi_app, wsgi_app), (8) cron schedules and continuous deployments, (9) Modal sandboxes, (10) cost optimization on Modal (autoscaling, spot, idle timeout), (11) Modal CLI (modal deploy, modal run, modal serve). Provides: app templates, GPU sizing matrix, Image builder recipes, secret patterns, web endpoint examples, and cost-optimization guidance.
Comprehensive Modal.com platform knowledge covering all features, pricing, and best practices. Activate this skill when users need detailed information about Modal's serverless cloud platform.
Activate this skill when users ask about:
Modal is a serverless cloud platform for running Python code, optimized for AI/ML workloads with:
import modal

app = modal.App("app-name")

@app.function()
def basic_function(arg: str) -> str:
    return f"Result: {arg}"

@app.local_entrypoint()
def main():
    result = basic_function.remote("test")
    print(result)
| Parameter | Type | Description |
|-----------|------|-------------|
| image | Image | Container image configuration |
| gpu | str/list | GPU type(s): "T4", "A100", ["H100", "A100"] |
| cpu | float | CPU cores (0.125 to 64) |
| memory | int | Memory in MB (128 to 262144) |
| timeout | int | Max execution seconds |
| retries | int | Retry attempts on failure |
| secrets | list | Secrets to inject |
| volumes | dict | Volume mount points |
| schedule | Cron/Period | Scheduled execution |
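| concurrency_limit | int | Max concurrent containers (renamed max_containers in newer Modal releases) |
| container_idle_timeout | int | Seconds to keep an idle container warm (renamed scaledown_window in newer Modal releases) |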
| include_source | bool | Auto-sync source code |
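The cpu and memory limits in the table can be checked before a function is defined. The sketch below is plain Python and does not call Modal. The helper names `gib_to_mb` and `check_resources` are hypothetical. The upper memory bound of 262144 equals 256 × 1024, which suggests the unit is 1024-based, and the conversion assumes that.

```python
CPU_RANGE = (0.125, 64)          # cores, from the parameter table
MEMORY_RANGE_MB = (128, 262144)  # MB, from the parameter table


def gib_to_mb(gib: float) -> int:
    """Convert GiB to the MB value the memory parameter expects (assumes 1 GiB = 1024 MB)."""
    return int(gib * 1024)


def check_resources(cpu: float, memory_mb: int) -> None:
    """Raise ValueError if cpu or memory falls outside the documented ranges."""
    if not CPU_RANGE[0] <= cpu <= CPU_RANGE[1]:
        raise ValueError(f"cpu={cpu} is outside {CPU_RANGE}")
    if not MEMORY_RANGE_MB[0] <= memory_mb <= MEMORY_RANGE_MB[1]:
        raise ValueError(f"memory={memory_mb} MB is outside {MEMORY_RANGE_MB}")


# 16 GiB of memory and 2 cores pass the check
check_resources(cpu=2, memory_mb=gib_to_mb(16))
```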
| GPU | Memory | Use Case | ~Cost/hr |
|-----|--------|----------|----------|
| T4 | 16 GB | Small inference | $0.59 |
| L4 | 24 GB | Medium inference | $0.80 |
| A10G | 24 GB | Inference/fine-tuning | $1.10 |
| L40S | 48 GB | Heavy inference | $1.50 |
| A100-40GB | 40 GB | Training | $2.00 |
| A100-80GB | 80 GB | Large models | $3.00 |
| H100 | 80 GB | Cutting-edge | $5.00 |
| H200 | 141 GB | Largest models | $5.00 |
| B200 | 180+ GB | Latest gen | $6.25 |
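One way to use the GPU table is to pick the lowest-priced card that fits a model's memory footprint. The sketch below copies the table's approximate figures into plain Python. `cheapest_gpu` is a hypothetical helper, not part of Modal, and the prices should be treated as rough estimates that may have changed.

```python
# Figures copied from the GPU table; prices are approximate.
GPUS = [
    # (name, memory_gb, approx_usd_per_hour)
    ("T4", 16, 0.59),
    ("L4", 24, 0.80),
    ("A10G", 24, 1.10),
    ("L40S", 48, 1.50),
    ("A100-40GB", 40, 2.00),
    ("A100-80GB", 80, 3.00),
    ("H100", 80, 5.00),
    ("H200", 141, 5.00),
    ("B200", 180, 6.25),  # the table lists "180+ GB"; 180 is used as a lower bound
]


def cheapest_gpu(min_memory_gb: float) -> str:
    """Return the lowest-priced GPU in the table with at least min_memory_gb of memory."""
    candidates = [g for g in GPUS if g[1] >= min_memory_gb]
    if not candidates:
        raise ValueError(f"No single GPU in the table has {min_memory_gb} GB")
    name, _, _ = min(candidates, key=lambda g: g[2])
    return name


print(cheapest_gpu(20))  # a model needing about 20 GB
```

Memory is only one constraint. Throughput and availability may justify a more expensive card.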
# Single GPU
@app.function(gpu="A100")
# Specific memory variant
@app.function(gpu="A100-80GB")
# Multi-GPU
@app.function(gpu="H100:4")
# Fallbacks (tries in order)
@app.function(gpu=["H100", "A100", "any"])
# "any" = L4, A10G, or T4
@app.function(gpu="any")
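The `"H100:4"` form packs a GPU type and a count into one string. The sketch below shows how such a spec breaks down and what it might cost, assuming the hourly rate is per GPU and scales linearly with the count. `parse_gpu_spec` and `estimate_cost` are hypothetical helpers written for illustration, not Modal functions.

```python
def parse_gpu_spec(spec: str) -> tuple[str, int]:
    """Split a spec such as "H100:4" into ("H100", 4); a bare name means one GPU."""
    name, sep, count = spec.partition(":")
    return name, int(count) if sep else 1


def estimate_cost(spec: str, hours: float, usd_per_gpu_hour: float) -> float:
    """Rough cost estimate: GPU count x hours x hourly rate (assumed to be per GPU)."""
    _, count = parse_gpu_spec(spec)
    return count * hours * usd_per_gpu_hour


# Four H100s for two hours at the table's approximate $5.00/hr
print(estimate_cost("H100:4", hours=2, usd_per_gpu_hour=5.00))
```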
# Debian slim (recommended)
modal.Image.debian_slim(python_version="3.11")
# From Dockerfile
modal.Image.from_dockerfile("./Dockerfile")
# From Docker registry
modal.Image.from_registry("nvidia/cuda:12.1.0-base-ubuntu22.04")
# pip (standard)
image.pip_install("torch", "transformers")
# uv (FASTER - 10-100x)
image.uv_pip_install("torch", "transformers")
# System packages
image.apt_install("ffmpeg", "libsm6")
# Shell commands
image.run_commands("apt-get update", "make install")
# Single file
image.add_local_file("./config.json", "/app/config.json")
# Directory
image.add_local_dir("./models", "/app/models")
# Python source
image.add_local_python_source("my_module")
# Environment variables
image.env({"VAR": "value"})
def download_model():
    from huggingface_hub import snapshot_download
    snapshot_download("model-name")

image.run_function(download_model, secrets=[...])
# Create/reference volume
vol = modal.Volume.from_name("my-vol", create_if_missing=True)
# Mount in function
@app.function(volumes={"/data": vol})
def func():
    # Read/write to /data
    vol.commit()  # Persist changes
# From dashboard (recommended)
modal.Secret.from_name("secret-name")
# From dictionary
modal.Secret.from_dict({"KEY": "value"})
# From local env
modal.Secret.from_local_environ(["KEY1", "KEY2"])
# From .env file
modal.Secret.from_dotenv()
# Usage
@app.function(secrets=[modal.Secret.from_name("api-keys")])
def func():
    import os
    key = os.environ["API_KEY"]
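Secrets reach the function as environment variables, so a missing key shows up as a bare `KeyError` at the point of use. A small guard can give a clearer error at startup. The sketch below is plain Python, and `require_env` is a hypothetical helper, not something Modal provides.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named environment variables, or raise with every missing name listed."""
    missing = [n for n in names if n not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}
```

Inside a function with secrets attached, calling `require_env("API_KEY")` first would fail fast if the secret does not define that key.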
# Distributed dict
d = modal.Dict.from_name("cache", create_if_missing=True)
d["key"] = "value"
d.put("key", "value", ttl=3600)
# Distributed queue
q = modal.Queue.from_name("jobs", create_if_missing=True)
q.put("task")
item = q.get()
@app.function()
@modal.fastapi_endpoint()
def hello(name: str = "World"):
    return {"message": f"Hello, {name}!"}

from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
def predict(text: str):
    return {"result": process(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app

from flask import Flask

flask_app = Flask(__name__)

@app.function()
@modal.wsgi_app()
def flask_endpoint():
    return flask_app
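import subprocess

@app.function()
@modal.web_server(port=8000)
def custom_server():
    # Start the server in the background so the function returns and Modal can route traffic to the port
    subprocess.Popen(["python", "-m", "http.server", "8000"])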
@modal.asgi_app(custom_domains=["api.example.com"])
# Daily at 8 AM UTC
@app.function(schedule=modal.Cron("0 8 * * *"))
# With timezone
@app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York"))
@app.function(schedule=modal.Period(hours=5))
@app.function(schedule=modal.Period(days=1))
Note: Scheduled functions only run with modal deploy, not modal run.
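The strings passed to `modal.Cron` use the standard five-field cron layout. The sketch below labels each field. `describe_cron` is a hypothetical helper for reading expressions and does not validate ranges or call Modal.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr: str) -> dict[str, str]:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"Expected 5 fields, got {len(parts)}: {expr!r}")
    return dict(zip(CRON_FIELDS, parts))


# "0 8 * * *": minute 0 of hour 8, every day of every month
print(describe_cron("0 8 * * *"))
```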
# Parallel execution (up to 1000 concurrent)
results = list(func.map(items))
# Unordered (faster)
results = list(func.map(items, order_outputs=False))
# Spread args
pairs = [(1, 2), (3, 4)]
results = list(add.starmap(pairs))
# Async job (returns immediately)
call = func.spawn(data)
result = call.get() # Get result later
# Spawn many
calls = [func.spawn(item) for item in items]
results = [call.get() for call in calls]
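The difference between ordered and unordered results can be seen locally with the standard library. The sketch below is an analogy using threads, not Modal's implementation: `pool.map` plays the role of the default ordered `map`, and `as_completed` plays the role of `order_outputs=False`, yielding each result as soon as it is ready.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x: int) -> int:
    time.sleep(0.01 * (5 - x))  # smaller inputs take longer to finish
    return x * x


items = [1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Ordered: results come back in input order, waiting on slow items
    ordered = list(pool.map(slow_square, items))

    # Unordered: results come back in completion order
    futures = [pool.submit(slow_square, x) for x in items]
    unordered = [f.result() for f in as_completed(futures)]

print(ordered)
print(unordered)  # same values, possibly in a different order
```

Unordered collection helps most when task durations vary widely, since a slow early item no longer holds back results that are already finished.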
@app.cls(gpu="A100", container_idle_timeout=300)
class Server:
    @modal.enter()
    def load(self):
        self.model = load_model()

    @modal.method()
    def predict(self, text):
        return self.model(text)

    @modal.exit()
    def cleanup(self):
        del self.model
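# Input concurrency is configured on the class (or function), not on individual methods
@app.cls()
@modal.concurrent(max_inputs=100, target_inputs=80)
class ConcurrentServer:  # hypothetical class name
    @modal.method()
    def batched(self, item):
        pass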
modal run app.py # Run function
modal serve app.py # Hot-reload dev server
modal shell app.py # Interactive shell
modal shell app.py --gpu A100 # Shell with GPU
modal deploy app.py # Deploy
modal app list # List apps
modal app logs app-name # View logs
modal app stop app-name # Stop app
# Volumes
modal volume create name
modal volume list
modal volume put name local remote
modal volume get name remote local
# Secrets
modal secret create name KEY=value
modal secret list
# Environments
modal environment create staging
| Plan | Price | Containers | GPU Concurrency |
|------|-------|------------|-----------------|
| Starter | Free ($30 credits) | 100 | 10 |
| Team | $250/month | 1000 | 50 |
| Enterprise | Custom | Unlimited | Custom |
- Use @modal.enter() for model loading
- Use uv_pip_install for faster builds
- Use order_outputs=False when order doesn't matter
- Tune container_idle_timeout to balance cost/latency
- Test with modal run before modal deploy

@app.cls(gpu="A100", container_idle_timeout=300)
class LLM:
    @modal.enter()
    def load(self):
        from vllm import LLM
        self.llm = LLM(model="...")

    @modal.method()
    def generate(self, prompt):
        return self.llm.generate([prompt])
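The cost side of the idle-timeout trade-off can be estimated with simple arithmetic. The sketch below assumes an idle warm container is billed at the same hourly rate as an active one and uses the approximate A100-40GB price from the GPU table. `idle_cost_usd` is a hypothetical helper.

```python
def idle_cost_usd(hourly_rate_usd: float, idle_seconds: int) -> float:
    """Cost of keeping one container warm for idle_seconds (assumes idle time is billed at the full rate)."""
    return hourly_rate_usd * idle_seconds / 3600


# One 300-second idle window on a GPU priced at roughly $2.00/hr
per_window = idle_cost_usd(2.00, 300)
print(f"${per_window:.3f} per idle window")

# If traffic arrives in 100 separate bursts per day, each followed by a full idle window
print(f"${per_window * 100:.2f} per day in idle time")
```

A longer timeout reduces cold starts for bursty traffic but multiplies this figure, so the choice depends on how much cold-start latency the workload can tolerate.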
@app.function(volumes={"/data": vol})
def process(file):
    # Process file
    vol.commit()

# Parallel
results = list(process.map(files))
@app.function(
    schedule=modal.Cron("0 6 * * *"),
    secrets=[modal.Secret.from_name("db")],
)
def daily_etl():
    extract()
    transform()
    load()
| Task | Code |
|------|------|
| Create app | app = modal.App("name") |
| Basic function | @app.function() |
| With GPU | @app.function(gpu="A100") |
| With image | @app.function(image=img) |
| Web endpoint | @modal.asgi_app() |
| Scheduled | schedule=modal.Cron("...") |
| Mount volume | volumes={"/path": vol} |
| Use secret | secrets=[modal.Secret.from_name("x")] |
| Parallel map | func.map(items) |
| Async spawn | func.spawn(arg) |
| Class pattern | @app.cls() with @modal.enter() |