Replicate prediction is slow — how to speed it up?

Cold starts are the biggest source of latency (30 s to 3 min for large models). To speed things up: (1) use Replicate Deployments to keep a warm instance; (2) for image generation, pick a smaller model such as SDXL-Lightning, which is faster than SDXL; (3) check the model page for latency benchmarks in the Examples tab; (4) for long-running predictions, create the prediction asynchronously and poll its status (or register a webhook) instead of holding a request open until it finishes.

Replicate API returning 404 model not found — how to fix?

A 404 usually means the version hash in your request is outdated or mistyped. Get the current hash from replicate.com/owner/model → Versions tab and copy the full hash. Then POST to https://api.replicate.com/v1/predictions with the body { "version": "OWNER/MODEL:HASH", "input": {...} }. Where the model supports it, you can also omit the hash and let Replicate run the latest version automatically.

How do I fix Replicate cold start latency?

Cold starts happen when a model has not been used recently and its container needs to boot. Fix options: (1) Replicate Deployments: keep at least one warm instance, which removes cold starts for production traffic; (2) warm-up ping: call the model with minimal input every few minutes (not cost-effective for low-traffic use cases); (3) use models with shorter cold start times (smaller parameter count, faster init); (4) accept cold starts and use async prediction polling for non-real-time use cases.

Replicate Not Working?

Replicate prediction stuck (cold start), API 401 error, model version 404, billing failed, or webhook not firing? Check live status and fix it fast.

What's wrong? Diagnose fast

Prediction slow / cold start

Cold start = container boot after idle. Small models: 5–30s. Large models (70B+): 1–3 min. Solution: Replicate Deployments keeps a warm instance. Accept cold starts for low-traffic use cases.

API 401 error

Token must start with r8_. Header: Authorization: Bearer r8_TOKEN. Generate at replicate.com/account/api-tokens. Check token not revoked. Org tokens require org membership.

Model 404 / not found

Version hash in your request is outdated or mistyped. Get the latest hash from replicate.com/owner/model → Versions tab. Or, where the model supports it, omit the version hash to use the latest version automatically.

Billing / 402 error

402 = payment failed or credits exhausted. Check replicate.com/account/billing. Pay-as-you-go — no subscription. Add payment method under Billing. Replicate charges per GPU-second.

Webhook not firing

Webhook URL must be publicly reachable from Replicate. Use ngrok or similar for local testing. Webhook receives POST with prediction object. Verify URL returns 200 — Replicate retries on non-200.
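
As a rough sketch of how a webhook is attached to a prediction, the helper below builds a request body that uses the `webhook` and `webhook_events_filter` fields of the predictions endpoint. The function name `build_webhook_body` and the example URL are hypothetical, and the naive quoting assumes the prompt and URL contain no double quotes or backslashes.

```shell
# Build a JSON body that asks Replicate to POST the finished prediction
# to a webhook URL. Helper name and arguments are illustrative only.
build_webhook_body() {
  local version="$1" prompt="$2" webhook_url="$3"
  printf '{"version": "%s", "input": {"prompt": "%s"}, "webhook": "%s", "webhook_events_filter": ["completed"]}\n' \
    "$version" "$prompt" "$webhook_url"
}

# Usage (not run here; needs a real token, a real version and a public URL):
#   curl -X POST https://api.replicate.com/v1/predictions \
#     -H "Authorization: Bearer r8_YOUR_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(build_webhook_body "OWNER/MODEL:HASH" "a cat" "https://example.com/hook")"
```

Filtering on `completed` means the endpoint is called once, when the prediction reaches a terminal state, which keeps a local ngrok test quiet.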

Cog / custom model deploy issues

For Cog model pushes: confirm cog.yaml has correct python_version and run commands. Run cog build locally before push. Check replicate.com/deployments for build logs. GPU type selection affects cost.
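
The symptom cards above can be folded into a small triage helper keyed on the HTTP status code. This is only a sketch: the function name `diagnose_status` is made up, and the messages simply paraphrase the cards.

```shell
# Map an HTTP status code from the Replicate API to the first thing to check.
# The messages paraphrase the cards above; the function name is hypothetical.
diagnose_status() {
  case "$1" in
    401) echo "auth: check the r8_ token and the Authorization: Bearer header" ;;
    402) echo "billing: check replicate.com/account/billing" ;;
    404) echo "model: check owner/model and the version hash" ;;
    2??) echo "ok" ;;
    *)   echo "other: check live status, then retry" ;;
  esac
}

# Usage sketch: capture only the status code of a request, then triage it.
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#     -H "Authorization: Bearer r8_YOUR_TOKEN" \
#     https://api.replicate.com/v1/models/stability-ai/sdxl/versions)
#   diagnose_status "$code"
```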

Understanding Replicate cold starts

Model type	Cold start time	Solution
Small image models (SD 1.5, SDXL)	5–15s	Accept or use Deployment
SDXL-Lightning, SD Turbo	3–10s	Fast enough for most use cases
Llama 3 8B (text)	15–30s	Use Deployment for production
Llama 3 70B (text)	60–180s	Deployment or async polling required
Video gen (Mochi, Kling)	30–120s	Async polling, long timeout
Custom Cog model	Varies (image size)	Optimize image size, Deployment

Replicate API quick reference

Run a prediction (async)

# Create prediction
curl -X POST https://api.replicate.com/v1/predictions \
  -H "Authorization: Bearer r8_YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "version": "stability-ai/sdxl:39ed52f2319f9f807d0e...4d",
    "input": {"prompt": "a photorealistic cat on a mountain"}
  }'
# Returns: { "id": "abc123", "status": "starting", ... }

# Poll until done
curl https://api.replicate.com/v1/predictions/abc123 \
  -H "Authorization: Bearer r8_YOUR_TOKEN"
# status: "starting" | "processing" | "succeeded" | "failed" | "canceled"
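
The manual poll above can be wrapped in a loop. The sketch below is one possible shape, not an official client: `FETCH_CMD` and `POLL_INTERVAL` are hypothetical hooks introduced here, `REPLICATE_API_TOKEN` is an assumed environment variable name, and the status is extracted with `grep` to stay dependency-free (a JSON tool such as jq would be more robust, since this picks the first "status" key it sees).

```shell
# Extract the first "status" value from a prediction JSON document on stdin.
prediction_status() {
  grep -o '"status": *"[^"]*"' | head -n 1 | cut -d'"' -f4
}

# Poll until the prediction reaches a terminal state or attempts run out.
# FETCH_CMD (hypothetical hook) names a command that prints the prediction
# JSON for a given id. POLL_INTERVAL defaults to 5 seconds.
poll_prediction() {
  local id="$1" max_attempts="${2:-60}" attempt=0 status
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$($FETCH_CMD "$id" | prediction_status)
    case "$status" in
      succeeded) echo "$status"; return 0 ;;
      failed|canceled) echo "$status"; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "timeout"; return 2
}

# A real fetcher could look like this (requires REPLICATE_API_TOKEN):
fetch_prediction() {
  curl -s "https://api.replicate.com/v1/predictions/$1" \
    -H "Authorization: Bearer $REPLICATE_API_TOKEN"
}
# Usage: FETCH_CMD=fetch_prediction poll_prediction abc123
```

With the defaults (60 attempts, 5 seconds apart) the loop gives up after about five minutes, which leaves room for the longest cold starts listed in the table above.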

Get latest model version

curl https://api.replicate.com/v1/models/stability-ai/sdxl/versions \
  -H "Authorization: Bearer r8_YOUR_TOKEN"
# Returns a paginated list of versions (under "results"), sorted newest-first
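
To pull the newest hash out of that response without extra tooling, something like the following may be enough. It assumes version ids are 64-character lowercase hex strings and that the newest version is listed first; the function name `latest_version_id` is hypothetical.

```shell
# Print the first 64-character hex "id" found in a versions response on stdin.
# Assumes the newest version comes first in the list.
latest_version_id() {
  grep -oE '"id": *"[0-9a-f]{64}"' | head -n 1 | cut -d'"' -f4
}

# Usage (requires REPLICATE_API_TOKEN, an assumed environment variable):
#   curl -s https://api.replicate.com/v1/models/stability-ai/sdxl/versions \
#     -H "Authorization: Bearer $REPLICATE_API_TOKEN" | latest_version_id
```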

Step-by-step fix

1. Check live Replicate status

Visit prismix.dev/service/replicate. If Replicate is operational and your prediction is slow, the cause is most likely a cold start, not a platform issue. Large models (70B+) can take up to 3 minutes to warm up.
2. Fix slow predictions / cold starts

For production traffic: use Replicate Deployments (replicate.com/deployments) to keep at least one warm instance. For development or low traffic: accept cold starts and poll asynchronously at a 5-second interval. For time-sensitive apps: choose smaller, faster models (SDXL-Lightning instead of SDXL, Llama 3 8B instead of 70B).
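
Once a deployment exists, predictions go to the deployment's own endpoint rather than to a model version. A minimal sketch, assuming a deployment named `acme/my-sdxl` (a made-up name) and a hypothetical helper `deployment_url`:

```shell
# Build the predictions URL for a deployment. Owner and name are
# placeholders for a deployment created at replicate.com/deployments.
deployment_url() {
  printf 'https://api.replicate.com/v1/deployments/%s/%s/predictions\n' "$1" "$2"
}

# Usage (not run here; the deployment "acme/my-sdxl" is made up):
#   curl -s -X POST "$(deployment_url acme my-sdxl)" \
#     -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d '{"input": {"prompt": "a photorealistic cat on a mountain"}}'
```

The body carries only `input`, because the deployment already pins the model version.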
3. Fix API 401 authentication

Confirm that (1) the token starts with r8_; (2) the header is Authorization: Bearer r8_TOKEN; (3) the token has not been revoked. If in doubt, generate a fresh token at replicate.com/account/api-tokens. If you are using an organization's model, confirm you have been added to the org.
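
The first two checks can be scripted before any request is sent. This sketch only validates the format; it cannot tell whether a token has been revoked. The helper names and the `REPLICATE_API_TOKEN` variable name are assumptions.

```shell
# Return 0 if the token has the r8_ prefix described above, 1 otherwise.
# Format check only: a revoked token still passes.
token_looks_valid() {
  case "$1" in
    r8_?*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Build the header in the form the API expects.
auth_header() {
  printf 'Authorization: Bearer %s\n' "$1"
}

# Usage sketch (prints the HTTP status code; 401 means the token was rejected):
#   token_looks_valid "$REPLICATE_API_TOKEN" || echo "token lacks the r8_ prefix" >&2
#   curl -s -o /dev/null -w '%{http_code}\n' \
#     -H "$(auth_header "$REPLICATE_API_TOKEN")" \
#     https://api.replicate.com/v1/predictions
```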
4. Fix model 404 / version not found

The version hash embedded in your code is probably outdated or mistyped. Go to replicate.com/owner/model → Versions tab, copy the latest version hash, and update your code. Alternative: where the model supports it, omit the version hash in your API call so that Replicate runs the latest version automatically (less deterministic but always current).
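
The two ways of addressing a model can be told apart by the shape of the reference. The sketch below picks an endpoint accordingly. It assumes the model-scoped endpoint is available for the model in question, which may not hold for every community model, and `predictions_endpoint` is a hypothetical name.

```shell
# Choose the endpoint for a model reference.
#   "owner/model:hash" -> generic predictions endpoint (version goes in the body)
#   "owner/model"      -> model-scoped endpoint, which runs the latest version
# Assumption: the model-scoped endpoint is available for the model in question.
predictions_endpoint() {
  case "$1" in
    */*:*) echo "https://api.replicate.com/v1/predictions" ;;
    */*)   echo "https://api.replicate.com/v1/models/$1/predictions" ;;
    *)     echo "invalid model reference: $1" >&2; return 1 ;;
  esac
}

# Usage: curl -s -X POST "$(predictions_endpoint "OWNER/MODEL")" ...
```

Pinning a hash keeps results reproducible; the hash-free form trades that for never going stale.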
5. Fix billing errors

Check replicate.com/account/billing. Replicate is pay-as-you-go and charges per GPU-second of compute. If a payment failed, update the payment method under Billing. If you hit your spending limit, increase it in account settings. The estimated cost per run is shown on each model page under "Run time and cost".
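
Per-second billing makes a rough estimate a single multiplication: run time in seconds times the price per second. The numbers in this sketch are made-up placeholders, not Replicate prices; substitute the figures from the model page.

```shell
# Estimate the cost of one run: seconds of GPU time multiplied by the
# per-second price. Both inputs in the example are hypothetical.
estimate_cost() {
  LC_ALL=C awk -v secs="$1" -v rate="$2" 'BEGIN { printf "%.4f\n", secs * rate }'
}

# Example with hypothetical numbers: 12 s at $0.001 per second.
estimate_cost 12 0.001   # prints 0.0120
```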
