Replicate SDK Patterns

Quick Guide: Use the replicate npm package to run open-source ML models on serverless GPUs. Use replicate.run() for synchronous execution that returns output directly, replicate.stream() for SSE-based streaming, or replicate.predictions.create() for async background jobs with webhook notifications. Models are referenced as owner/model (uses latest version) or owner/model:version (pinned). File outputs are FileOutput objects implementing ReadableStream. Cold starts are expected for infrequently-used models -- use deployments with min_instances to keep models warm.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN)

(You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs)

(You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads)

(You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications)

(You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest, which can change)

</critical_requirements>

Auto-detection: Replicate, replicate, replicate.run, replicate.stream, replicate.predictions, replicate.deployments, replicate.trainings, replicate.models, FileOutput, validateWebhook, REPLICATE_API_TOKEN, serverless GPU, cold start, webhook_events_filter

When to use:

Running open-source ML models (Llama, Stable Diffusion, Whisper, etc.) without managing GPU infrastructure
Generating images, transcribing audio, running LLMs, or any ML inference via API
Streaming LLM output in real-time with server-sent events
Processing predictions asynchronously with webhook notifications
Fine-tuning models with custom training data
Running models on dedicated hardware with custom scaling via deployments

Key patterns covered:

Client initialization and configuration (auth, user agent, file encoding)
Running predictions (replicate.run(), replicate.predictions.create(), replicate.wait())
Streaming output (replicate.stream() with SSE events)
Model versioning (owner/model vs owner/model:version)
File input/output handling (FileOutput, file uploads, Buffer inputs)
Webhooks (setup, event filtering, signature validation)
Deployments (custom hardware, scaling, keeping models warm)
Training / fine-tuning

When NOT to use:

You need a unified multi-provider LLM SDK (OpenAI, Anthropic, Google) -- use a provider-agnostic SDK
You want to run models locally -- Replicate is a cloud-only serverless platform
You need sub-second latency guarantees without deployments -- cold starts can take minutes

Examples Index

Core: Setup, Predictions & Files -- Client init, run(), predictions.create(), wait(), file I/O, error handling
Streaming & Webhooks -- stream(), SSE events, webhook setup, signature validation
Deployments & Training -- Custom hardware, scaling, fine-tuning, model management
Quick API Reference -- Method signatures, constructor options, error types, model reference format

Philosophy

Replicate provides serverless GPU infrastructure for running open-source ML models. You send inputs, Replicate allocates GPU hardware, runs the model, and returns outputs. No Docker, no CUDA drivers, no GPU provisioning.

Core principles:

Serverless execution -- Models run on-demand on Replicate's infrastructure. You pay only for compute time. Cold starts are a trade-off for not maintaining always-on GPUs.
Model marketplace -- Thousands of community and official models available at replicate.com/explore. Run any public model with just its identifier.
Version pinning for reproducibility -- Models are versioned with SHA-256 hashes. Pin to a version in production (owner/model:abc123...) to guarantee identical behavior across deploys.
Three execution modes -- replicate.run() for synchronous wait, replicate.stream() for real-time SSE output, replicate.predictions.create() for fire-and-forget with webhooks.
File-first I/O -- Many models accept and produce files (images, audio, video). The SDK handles file uploads automatically and returns FileOutput objects for file outputs.

</philosophy>

Core Patterns

Pattern 1: Client Setup

Initialize the Replicate client. It auto-reads REPLICATE_API_TOKEN from the environment.

// lib/replicate.ts -- basic setup
import Replicate from "replicate";

const replicate = new Replicate();

export { replicate };

// lib/replicate.ts -- explicit auth + custom user agent
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN, // Auto-reads from env if omitted
  userAgent: "my-app/1.0.0",
});

export { replicate };

Why good: Minimal setup, env var auto-detected, explicit auth optional but useful for clarity

// BAD: Hardcoded token
const replicate = new Replicate({
  auth: "r8_abc123...",
});

Why bad: Hardcoded API token is a security risk, will leak in version control

See: examples/core.md for full constructor options, error handling patterns

Pattern 2: Running Predictions

Use replicate.run() for synchronous execution. Returns the model output directly.

// Run an image generation model
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: {
    prompt: "a serene mountain landscape at sunset",
  },
});

// output is a FileOutput object for image models
console.log(output.url()); // URL of generated image

// Run an LLM -- output is a string for text models
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Explain TypeScript generics in 3 sentences.",
    max_tokens: 512,
  },
});

console.log(output); // Text response

Why good: Simple API, returns output directly, destructuring works for array outputs (images)

// BAD: Not pinning version in production
const output = await replicate.run("community-user/experimental-model", {
  input: { prompt: "hello" },
});

Why bad: Community models without version pinning can change behavior unexpectedly when authors push updates

See: examples/core.md for version pinning, predictions.create() + wait(), and progress callbacks

Pattern 3: Streaming

Use replicate.stream() for real-time SSE output from language models.

const stream = replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Write a short poem about TypeScript.",
    max_tokens: 512,
  },
});

for await (const event of stream) {
  if (event.event === "output") {
    process.stdout.write(event.data);
  }
}

Why good: Progressive output for better UX, event-based with typed event and data fields

// BAD: Using replicate.run() for user-facing LLM output
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Write a long essay..." },
});
// User waits for entire generation to complete before seeing anything

Why bad: No progressive feedback, user sees a blank screen for seconds

See: examples/streaming-webhooks.md for event types, error handling, cancellation

Pattern 4: Model Versioning

Models are referenced as owner/model (latest version) or owner/model:sha256hash (pinned version).

// Development: use latest version for convenience
const output = await replicate.run("stability-ai/sdxl", {
  input: { prompt: "a cat" },
});

// Production: pin to a specific version for reproducibility
const VERSION_HASH =
  "39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b";
const output = await replicate.run(`stability-ai/sdxl:${VERSION_HASH}`, {
  input: { prompt: "a cat" },
});

Why good: Pinned version guarantees identical behavior, hash is immutable

See: examples/core.md for listing model versions, getting version details

Pattern 5: File Handling

Models that output files return FileOutput objects implementing ReadableStream.

import { writeFile } from "node:fs/promises";

const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "a sunset over mountains" },
});

// FileOutput has .url() and .blob() methods
console.log(output.url()); // Underlying URL

// Save to disk
const blob = await output.blob();
const buffer = Buffer.from(await blob.arrayBuffer());
await writeFile("./output.png", buffer);

// File inputs: pass URLs, Buffers, or ReadStreams
import { readFile } from "node:fs/promises";

const imageBuffer = await readFile("./input.png");

const output = await replicate.run("some-user/image-model", {
  input: {
    image: imageBuffer, // Auto-uploaded (max 100 MiB)
  },
});

Why good: FileOutput is a ReadableStream, works with Node.js stream APIs, .url() for the underlying URL

// BAD: Treating file output as a plain URL string
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "hello" },
});
const url = output; // WRONG: output is a FileOutput object, not a string

Why bad: FileOutput is an object, not a string -- use .url() to get the URL

See: examples/core.md for file uploads, large file handling, encoding strategies

Pattern 6: Async Predictions with Webhooks

Use replicate.predictions.create() for background jobs with webhook notifications.

const prediction = await replicate.predictions.create({
  model: "owner/model", // OR version: "sha256hash" for pinned version
  input: { prompt: "a painting of a cat" },
  webhook: "https://my.app/webhooks/replicate",
  webhook_events_filter: ["completed"],
});

console.log(prediction.id); // Use to track status
console.log(prediction.status); // "starting"

// Webhook signature validation (CRITICAL for security)
import { validateWebhook } from "replicate";

async function handleWebhook(request: Request): Promise<Response> {
  const secret = process.env.REPLICATE_WEBHOOK_SIGNING_SECRET;
  const isValid = await validateWebhook(request, secret);

  if (!isValid) {
    return new Response("Invalid signature", { status: 401 });
  }

  const prediction = await request.json();
  // Process prediction.output safely
  return new Response("OK", { status: 200 });
}

Why good: Decoupled processing, secure signature validation, filtered events reduce noise

See: examples/streaming-webhooks.md for webhook event types, polling alternative

Pattern 7: Deployments

Deployments give you a private, fixed endpoint with custom hardware and scaling.

// Create a prediction on a deployment (no cold start if min_instances > 0)
const prediction = await replicate.deployments.predictions.create(
  "my-org/my-deployment",
  {
    input: { prompt: "hello world" },
  },
);

const result = await replicate.wait(prediction);
console.log(result.output);

Why good: Predictable latency with min_instances, private endpoint, custom hardware selection

See: examples/deployments-training.md for creating/managing deployments, training API

Pattern 8: Error Handling

Catch API errors with status codes. The SDK auto-retries on 429 and 5xx errors (5 retries by default with exponential backoff).

try {
  const output = await replicate.run("owner/model", {
    input: { prompt: "hello" },
  });
} catch (error) {
  if (error instanceof Error) {
    console.error(`Replicate error: ${error.message}`);

    // Check for specific HTTP status codes in the error
    if ("status" in error) {
      const status = (error as { status: number }).status;
      if (status === 401) {
        throw new Error("Invalid API token. Check REPLICATE_API_TOKEN.");
      }
      if (status === 422) {
        console.error("Invalid input parameters");
      }
      if (status === 429) {
        console.error(
          "Rate limited -- SDK auto-retries (5 attempts) exhausted",
        );
      }
    }
  }
  throw error;
}

Why good: Checks error type, handles specific status codes, re-throws unexpected errors

See: examples/core.md for full error handling example with status code handling

</patterns>

Performance Optimization

Cold Start Mitigation

Frequent model with varying load   -> Use deployments with min_instances >= 1
One-off batch jobs                  -> Use predictions.create() with webhooks (no waiting)
Popular public models               -> Usually warm, replicate.run() is fine
Custom/niche models                 -> Expect 30s-5min cold start on first run

Key Optimization Patterns

Use deployments for latency-sensitive applications -- set min_instances: 1 to eliminate cold starts
Use webhooks instead of polling for async jobs -- reduces API calls and latency
Batch file inputs as URLs instead of uploading buffers -- avoids 100 MiB upload limit and is faster
Pin model versions in production -- avoids unexpected behavior changes and enables caching
Use replicate.stream() for LLMs -- progressive output feels faster than waiting for full completion
Cancel unneeded predictions with replicate.predictions.cancel() -- stops billing immediately

</performance>

<decision_framework>

Decision Framework

Which Execution Method to Use

Is this a user-facing LLM response?
+-- YES -> Use replicate.stream() for real-time SSE output
+-- NO -> Do you need the result immediately?
    +-- YES -> Use replicate.run() (blocks until complete)
    +-- NO -> Use replicate.predictions.create() + webhook
        +-- Need to poll instead? -> Use replicate.wait(prediction)

Model Reference Format

Are you in development/prototyping?
+-- YES -> Use owner/model (latest version, convenient)
+-- NO -> Are you in production?
    +-- YES -> Use owner/model:version_hash (pinned, reproducible)
    +-- Does the model change frequently?
        +-- YES -> Pin version, test updates explicitly
        +-- NO -> Either format works, prefer pinned

Deployments vs Direct API

Do you need consistent low latency?
+-- YES -> Create a deployment with min_instances >= 1
+-- NO -> Do you need custom hardware (A100, H100)?
    +-- YES -> Create a deployment with specific hardware
    +-- NO -> Use replicate.run() / replicate.stream() directly
        (Replicate auto-allocates hardware)

When to Use This SDK vs Other AI SDKs

Are you running open-source models on serverless GPUs?
+-- YES -> Use Replicate SDK
+-- NO -> Are you calling proprietary APIs (OpenAI, Anthropic)?
    +-- YES -> Not this skill's scope -- use provider-specific SDKs
    +-- NO -> Do you need to switch between multiple providers?
        +-- YES -> Not this skill's scope -- use a unified provider SDK
        +-- NO -> Do you want to self-host models?
            +-- YES -> Not this skill's scope -- consider Cog or vLLM
            +-- NO -> Replicate SDK is appropriate

</decision_framework>

<red_flags>

RED FLAGS

High Priority Issues:

Hardcoding REPLICATE_API_TOKEN in source code (security breach risk)
Treating FileOutput as a string (it is a ReadableStream object -- use .url() or .blob())
Not validating webhook signatures with validateWebhook() (allows forged webhook payloads)
Using replicate.run() for long-running models in request handlers (blocks the response, can timeout)

Medium Priority Issues:

Not pinning model versions in production (owner/model uses latest, which can change without notice)
Relying solely on default retry behavior for production (5 retries with exponential backoff may be too aggressive for some use cases)
Uploading large files as Buffer instead of hosting them at a URL (100 MiB limit on uploads)
Ignoring cold start latency for infrequently-used models (first request can take minutes)

Common Mistakes:

Confusing replicate.run() (returns output directly) with replicate.predictions.create() (returns a prediction object with status/id)
Destructuring image output incorrectly: const output = await replicate.run(...) instead of const [output] = await replicate.run(...) (image models return arrays)
Using replicate.stream() with models that do not support streaming (only language models with SSE support)
Forgetting that replicate.predictions.create() accepts either a version hash or a model string (owner/model) -- use version for pinned reproducibility, model for latest-version convenience
Not consuming the async iterator from replicate.stream() (events are lost)

Gotchas & Edge Cases:

Prediction inputs and outputs are automatically deleted after one hour -- persist outputs via webhooks or download immediately
The SDK auto-retries on 429 (rate limit) and 5xx errors -- 5 retries by default with exponential backoff. GET requests retry on 429 and 5xx; non-GET requests retry only on 429
replicate.stream() returns ServerSentEvent objects with .event ("output", "error", "done") and .data (string) properties
File uploads are limited to 100 MiB -- for larger files, host them at a URL and pass the URL as input
Browser usage is not supported -- the SDK requires a server-side environment (Node.js 18+, Bun, Deno, Cloudflare Workers)
webhook_events_filter accepts ["start", "output", "logs", "completed"] -- use ["completed"] unless you need intermediate status updates
The Prefer: wait header enables sync mode on the HTTP API (up to 60s), but replicate.run() already handles this automatically
Community models may disappear or change without warning -- pin versions and maintain fallbacks for critical workflows
replicate.wait() polls the API until the prediction completes -- use webhooks for production to avoid polling overhead
FileOutput.url() returns the underlying URL, but these URLs are temporary -- download or persist the file before it expires

</red_flags>

<critical_reminders>

CRITICAL REMINDERS

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN)

(You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs)

(You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads)

(You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications)

(You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest, which can change)

Failure to follow these rules will produce insecure, unreliable, or unpredictable AI integrations.

</critical_reminders>

Replicate SDK Patterns

Quick Guide: Use the replicate npm package to run open-source ML models on serverless GPUs. Use replicate.run() for synchronous execution that returns output directly, replicate.stream() for SSE-based streaming, or replicate.predictions.create() for async background jobs with webhook notifications. Models are referenced as owner/model (uses latest version) or owner/model:version (pinned). File outputs are FileOutput objects implementing ReadableStream. Cold starts are expected for infrequently-used models -- use deployments with min_instances to keep models warm.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN)

(You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs)

(You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads)

(You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications)

(You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest, which can change)

</critical_requirements>

When to use:

Running open-source ML models (Llama, Stable Diffusion, Whisper, etc.) without managing GPU infrastructure
Generating images, transcribing audio, running LLMs, or any ML inference via API
Streaming LLM output in real-time with server-sent events
Processing predictions asynchronously with webhook notifications
Fine-tuning models with custom training data
Running models on dedicated hardware with custom scaling via deployments

Key patterns covered:

Client initialization and configuration (auth, user agent, file encoding)
Running predictions (replicate.run(), replicate.predictions.create(), replicate.wait())
Streaming output (replicate.stream() with SSE events)
Model versioning (owner/model vs owner/model:version)
File input/output handling (FileOutput, file uploads, Buffer inputs)
Webhooks (setup, event filtering, signature validation)
Deployments (custom hardware, scaling, keeping models warm)
Training / fine-tuning

When NOT to use:

You need a unified multi-provider LLM SDK (OpenAI, Anthropic, Google) -- use a provider-agnostic SDK
You want to run models locally -- Replicate is a cloud-only serverless platform
You need sub-second latency guarantees without deployments -- cold starts can take minutes

Examples Index

Core: Setup, Predictions & Files -- Client init, run(), predictions.create(), wait(), file I/O, error handling
Streaming & Webhooks -- stream(), SSE events, webhook setup, signature validation
Deployments & Training -- Custom hardware, scaling, fine-tuning, model management
Quick API Reference -- Method signatures, constructor options, error types, model reference format

Philosophy

Core principles:

Serverless execution -- Models run on-demand on Replicate's infrastructure. You pay only for compute time. Cold starts are a trade-off for not maintaining always-on GPUs.
Model marketplace -- Thousands of community and official models available at replicate.com/explore. Run any public model with just its identifier.
Version pinning for reproducibility -- Models are versioned with SHA-256 hashes. Pin to a version in production (owner/model:abc123...) to guarantee identical behavior across deploys.
Three execution modes -- replicate.run() for synchronous wait, replicate.stream() for real-time SSE output, replicate.predictions.create() for fire-and-forget with webhooks.
File-first I/O -- Many models accept and produce files (images, audio, video). The SDK handles file uploads automatically and returns FileOutput objects for file outputs.

</philosophy>

Core Patterns

Pattern 1: Client Setup

Initialize the Replicate client. It auto-reads REPLICATE_API_TOKEN from the environment.

// lib/replicate.ts -- basic setup
import Replicate from "replicate";

const replicate = new Replicate();

export { replicate };

// lib/replicate.ts -- explicit auth + custom user agent
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN, // Auto-reads from env if omitted
  userAgent: "my-app/1.0.0",
});

export { replicate };

Why good: Minimal setup, env var auto-detected, explicit auth optional but useful for clarity

// BAD: Hardcoded token
const replicate = new Replicate({
  auth: "r8_abc123...",
});

Why bad: Hardcoded API token is a security risk, will leak in version control

See: examples/core.md for full constructor options, error handling patterns

Pattern 2: Running Predictions

Use replicate.run() for synchronous execution. Returns the model output directly.

// Run an image generation model
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: {
    prompt: "a serene mountain landscape at sunset",
  },
});

// output is a FileOutput object for image models
console.log(output.url()); // URL of generated image

// Run an LLM -- output is a string for text models
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Explain TypeScript generics in 3 sentences.",
    max_tokens: 512,
  },
});

console.log(output); // Text response

Why good: Simple API, returns output directly, destructuring works for array outputs (images)

// BAD: Not pinning version in production
const output = await replicate.run("community-user/experimental-model", {
  input: { prompt: "hello" },
});

Why bad: Community models without version pinning can change behavior unexpectedly when authors push updates

See: examples/core.md for version pinning, predictions.create() + wait(), and progress callbacks

Pattern 3: Streaming

Use replicate.stream() for real-time SSE output from language models.

const stream = replicate.stream("meta/meta-llama-3-70b-instruct", {
  input: {
    prompt: "Write a short poem about TypeScript.",
    max_tokens: 512,
  },
});

for await (const event of stream) {
  if (event.event === "output") {
    process.stdout.write(event.data);
  }
}

Why good: Progressive output for better UX, event-based with typed event and data fields

// BAD: Using replicate.run() for user-facing LLM output
const output = await replicate.run("meta/meta-llama-3-70b-instruct", {
  input: { prompt: "Write a long essay..." },
});
// User waits for entire generation to complete before seeing anything

Why bad: No progressive feedback, user sees a blank screen for seconds

See: examples/streaming-webhooks.md for event types, error handling, cancellation

Pattern 4: Model Versioning

Models are referenced as owner/model (latest version) or owner/model:sha256hash (pinned version).

// Development: use latest version for convenience
const output = await replicate.run("stability-ai/sdxl", {
  input: { prompt: "a cat" },
});

// Production: pin to a specific version for reproducibility
const VERSION_HASH =
  "39ed52f2a78e934b3ba6e2a89f5b1c712de7dfea535525255b1aa35c5565e08b";
const output = await replicate.run(`stability-ai/sdxl:${VERSION_HASH}`, {
  input: { prompt: "a cat" },
});

Why good: Pinned version guarantees identical behavior, hash is immutable

See: examples/core.md for listing model versions, getting version details

Pattern 5: File Handling

Models that output files return FileOutput objects implementing ReadableStream.

import { writeFile } from "node:fs/promises";

const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "a sunset over mountains" },
});

// FileOutput has .url() and .blob() methods
console.log(output.url()); // Underlying URL

// Save to disk
const blob = await output.blob();
const buffer = Buffer.from(await blob.arrayBuffer());
await writeFile("./output.png", buffer);

// File inputs: pass URLs, Buffers, or ReadStreams
import { readFile } from "node:fs/promises";

const imageBuffer = await readFile("./input.png");

const output = await replicate.run("some-user/image-model", {
  input: {
    image: imageBuffer, // Auto-uploaded (max 100 MiB)
  },
});

Why good: FileOutput is a ReadableStream, works with Node.js stream APIs, .url() for the underlying URL

// BAD: Treating file output as a plain URL string
const [output] = await replicate.run("black-forest-labs/flux-schnell", {
  input: { prompt: "hello" },
});
const url = output; // WRONG: output is a FileOutput object, not a string

Why bad: FileOutput is an object, not a string -- use .url() to get the URL

See: examples/core.md for file uploads, large file handling, encoding strategies

Pattern 6: Async Predictions with Webhooks

Use replicate.predictions.create() for background jobs with webhook notifications.

const prediction = await replicate.predictions.create({
  model: "owner/model", // OR version: "sha256hash" for pinned version
  input: { prompt: "a painting of a cat" },
  webhook: "https://my.app/webhooks/replicate",
  webhook_events_filter: ["completed"],
});

console.log(prediction.id); // Use to track status
console.log(prediction.status); // "starting"

// Webhook signature validation (CRITICAL for security)
import { validateWebhook } from "replicate";

async function handleWebhook(request: Request): Promise<Response> {
  const secret = process.env.REPLICATE_WEBHOOK_SIGNING_SECRET;
  const isValid = await validateWebhook(request, secret);

  if (!isValid) {
    return new Response("Invalid signature", { status: 401 });
  }

  const prediction = await request.json();
  // Process prediction.output safely
  return new Response("OK", { status: 200 });
}

Why good: Decoupled processing, secure signature validation, filtered events reduce noise

See: examples/streaming-webhooks.md for webhook event types, polling alternative

Pattern 7: Deployments

Deployments give you a private, fixed endpoint with custom hardware and scaling.

// Create a prediction on a deployment (no cold start if min_instances > 0)
const prediction = await replicate.deployments.predictions.create(
  "my-org/my-deployment",
  {
    input: { prompt: "hello world" },
  },
);

const result = await replicate.wait(prediction);
console.log(result.output);

Why good: Predictable latency with min_instances, private endpoint, custom hardware selection

See: examples/deployments-training.md for creating/managing deployments, training API

Pattern 8: Error Handling

Catch API errors with status codes. The SDK auto-retries on 429 and 5xx errors (5 retries by default with exponential backoff).

try {
  const output = await replicate.run("owner/model", {
    input: { prompt: "hello" },
  });
} catch (error) {
  if (error instanceof Error) {
    console.error(`Replicate error: ${error.message}`);

    // Check for specific HTTP status codes in the error
    if ("status" in error) {
      const status = (error as { status: number }).status;
      if (status === 401) {
        throw new Error("Invalid API token. Check REPLICATE_API_TOKEN.");
      }
      if (status === 422) {
        console.error("Invalid input parameters");
      }
      if (status === 429) {
        console.error(
          "Rate limited -- SDK auto-retries (5 attempts) exhausted",
        );
      }
    }
  }
  throw error;
}

Why good: Checks error type, handles specific status codes, re-throws unexpected errors

See: examples/core.md for full error handling example with status code handling

</patterns>

Performance Optimization

Cold Start Mitigation

Frequent model with varying load   -> Use deployments with min_instances >= 1
One-off batch jobs                  -> Use predictions.create() with webhooks (no waiting)
Popular public models               -> Usually warm, replicate.run() is fine
Custom/niche models                 -> Expect 30s-5min cold start on first run

Key Optimization Patterns

Use deployments for latency-sensitive applications -- set min_instances: 1 to eliminate cold starts
Use webhooks instead of polling for async jobs -- reduces API calls and latency
Batch file inputs as URLs instead of uploading buffers -- avoids 100 MiB upload limit and is faster
Pin model versions in production -- avoids unexpected behavior changes and enables caching
Use replicate.stream() for LLMs -- progressive output feels faster than waiting for full completion
Cancel unneeded predictions with replicate.predictions.cancel() -- stops billing immediately

</performance>

<decision_framework>

Decision Framework

Which Execution Method to Use

Is this a user-facing LLM response?
+-- YES -> Use replicate.stream() for real-time SSE output
+-- NO -> Do you need the result immediately?
    +-- YES -> Use replicate.run() (blocks until complete)
    +-- NO -> Use replicate.predictions.create() + webhook
        +-- Need to poll instead? -> Use replicate.wait(prediction)

Model Reference Format

Are you in development/prototyping?
+-- YES -> Use owner/model (latest version, convenient)
+-- NO -> Are you in production?
    +-- YES -> Use owner/model:version_hash (pinned, reproducible)
    +-- Does the model change frequently?
        +-- YES -> Pin version, test updates explicitly
        +-- NO -> Either format works, prefer pinned

Deployments vs Direct API

Do you need consistent low latency?
+-- YES -> Create a deployment with min_instances >= 1
+-- NO -> Do you need custom hardware (A100, H100)?
    +-- YES -> Create a deployment with specific hardware
    +-- NO -> Use replicate.run() / replicate.stream() directly
        (Replicate auto-allocates hardware)

When to Use This SDK vs Other AI SDKs

Are you running open-source models on serverless GPUs?
+-- YES -> Use Replicate SDK
+-- NO -> Are you calling proprietary APIs (OpenAI, Anthropic)?
    +-- YES -> Not this skill's scope -- use provider-specific SDKs
    +-- NO -> Do you need to switch between multiple providers?
        +-- YES -> Not this skill's scope -- use a unified provider SDK
        +-- NO -> Do you want to self-host models?
            +-- YES -> Not this skill's scope -- consider Cog or vLLM
            +-- NO -> Replicate SDK is appropriate

</decision_framework>

<red_flags>

RED FLAGS

High Priority Issues:

Hardcoding REPLICATE_API_TOKEN in source code (security breach risk)
Treating FileOutput as a string (it is a ReadableStream object -- use .url() or .blob())
Not validating webhook signatures with validateWebhook() (allows forged webhook payloads)
Using replicate.run() for long-running models in request handlers (blocks the response, can timeout)

Medium Priority Issues:

Not pinning model versions in production (owner/model uses latest, which can change without notice)
Relying solely on default retry behavior for production (5 retries with exponential backoff may be too aggressive for some use cases)
Uploading large files as Buffer instead of hosting them at a URL (100 MiB limit on uploads)
Ignoring cold start latency for infrequently-used models (first request can take minutes)

Common Mistakes:

Confusing replicate.run() (returns output directly) with replicate.predictions.create() (returns a prediction object with status/id)
Destructuring image output incorrectly: const output = await replicate.run(...) instead of const [output] = await replicate.run(...) (image models return arrays)
Using replicate.stream() with models that do not support streaming (only language models with SSE support)
Forgetting that replicate.predictions.create() accepts either a version hash or a model string (owner/model) -- use version for pinned reproducibility, model for latest-version convenience
Not consuming the async iterator from replicate.stream() (events are lost)

Gotchas & Edge Cases:

Prediction inputs and outputs are automatically deleted after one hour -- persist outputs via webhooks or download immediately
The SDK auto-retries on 429 (rate limit) and 5xx errors -- 5 retries by default with exponential backoff. GET requests retry on 429 and 5xx; non-GET requests retry only on 429
replicate.stream() returns ServerSentEvent objects with .event ("output", "error", "done") and .data (string) properties
File uploads are limited to 100 MiB -- for larger files, host them at a URL and pass the URL as input
Browser usage is not supported -- the SDK requires a server-side environment (Node.js 18+, Bun, Deno, Cloudflare Workers)
webhook_events_filter accepts ["start", "output", "logs", "completed"] -- use ["completed"] unless you need intermediate status updates
The Prefer: wait header enables sync mode on the HTTP API (up to 60s), but replicate.run() already handles this automatically
Community models may disappear or change without warning -- pin versions and maintain fallbacks for critical workflows
replicate.wait() polls the API until the prediction completes -- use webhooks for production to avoid polling overhead
FileOutput.url() returns the underlying URL, but these URLs are temporary -- download or persist the file before it expires

</red_flags>

<critical_reminders>

CRITICAL REMINDERS

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST never hardcode API tokens -- always use environment variables via process.env.REPLICATE_API_TOKEN)

(You MUST handle FileOutput objects for models that return files -- do not assume outputs are plain strings or URLs)

(You MUST validate webhooks using validateWebhook() from the replicate package -- never trust unverified webhook payloads)

(You MUST account for cold starts when running infrequently-used models -- use deployments for latency-sensitive applications)

(You MUST specify model versions (owner/model:version) in production to ensure reproducible results -- unversioned references use the latest, which can change)

Failure to follow these rules will produce insecure, unreliable, or unpredictable AI integrations.

</critical_reminders>

Adoption

agents-inc/ai-infrastructure-replicate

$ install --global

Security Scan Results

SKILL.md

Replicate SDK Patterns

CRITICAL: Before Using This Skill

Examples Index

Philosophy

Core Patterns

Pattern 1: Client Setup

Pattern 2: Running Predictions

Pattern 3: Streaming

Pattern 4: Model Versioning

Pattern 5: File Handling

Pattern 6: Async Predictions with Webhooks

Pattern 7: Deployments

Pattern 8: Error Handling

Performance Optimization

Cold Start Mitigation

Key Optimization Patterns

Decision Framework

Which Execution Method to Use

Model Reference Format

Deployments vs Direct API

When to Use This SDK vs Other AI SDKs

RED FLAGS

CRITICAL REMINDERS

Related Skills

agents-inc/api-search-xquik

agents-inc/api-search-xquik

agents-inc/web-maps-mapbox

agents-inc/web-maps-leaflet

agents-inc/ai-infrastructure-replicate

$ install --global

Security Scan Results

SKILL.md

Replicate SDK Patterns

CRITICAL: Before Using This Skill

Examples Index

Philosophy

Core Patterns

Pattern 1: Client Setup

Pattern 2: Running Predictions

Pattern 3: Streaming

Pattern 4: Model Versioning

Pattern 5: File Handling

Pattern 6: Async Predictions with Webhooks

Pattern 7: Deployments

Pattern 8: Error Handling

Performance Optimization

Cold Start Mitigation

Key Optimization Patterns

Decision Framework

Which Execution Method to Use

Model Reference Format

Deployments vs Direct API

When to Use This SDK vs Other AI SDKs

RED FLAGS

CRITICAL REMINDERS

Related Skills

agents-inc/api-search-xquik

agents-inc/api-search-xquik

agents-inc/web-maps-mapbox

agents-inc/web-maps-leaflet