skills/cloud-run/SKILL.md
Use when deploying containers to Google Cloud Run, editing service.yaml, using gcloud run, configuring Cloud Run Jobs, scaling, concurrency, traffic splitting, cold starts, networking, or serverless Dockerfiles.
Cloud Run is a fully managed serverless platform for running containerized applications. It automatically scales from zero to N instances based on incoming requests. By default it bills only while an instance is handling requests; instances kept warm with --min-instances are billed while idle as well.
gcloud builds submit --tag gcr.io/PROJECT/IMAGE:TAG
gcloud run deploy SERVICE --image=IMAGE_URL --region=REGION
gcloud run services update-traffic SERVICE --to-latest

| Setting | Flag | Recommendation |
|---|---|---|
| CPU | --cpu=N | 1-8 vCPUs; start with 1 |
| Memory | --memory=NGi | 256Mi-32Gi; match to workload |
| Concurrency | --concurrency=N | 80 default; lower for memory-heavy handlers |
| Min instances | --min-instances=N | 1+ for production to avoid cold starts |
| Max instances | --max-instances=N | Set a ceiling to control costs |
| Timeout | --timeout=N | Up to 3600s for services, 86400s for jobs |
| CPU allocation | --no-cpu-throttling | Keeps CPU allocated outside requests; use for WebSockets, background tasks |

| Feature | Services | Jobs |
|---------|----------|------|
| Purpose | HTTP request handling | Batch/scheduled tasks |
| Scaling | Auto-scales with traffic | Runs to completion |
| Timeout | Up to 60 minutes | Up to 24 hours |
| Command | gcloud run deploy | gcloud run jobs deploy |
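
The "runs to completion" row above is easier to picture with a sharded job. The sketch below assumes a job deployed with several tasks (for example via the --tasks flag of gcloud run jobs deploy); Cloud Run sets CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT inside each task. The shard_items helper and the file names are hypothetical, for illustration only.

```shell
# shard_items INDEX COUNT ITEM...
# Prints the items assigned to one task: item i goes to task (i mod COUNT).
# The parentheses run the body in a subshell, so its variables stay local.
shard_items() (
  index=$1
  count=$2
  shift 2
  i=0
  for item in "$@"; do
    if [ $((i % count)) -eq "$index" ]; then
      printf '%s\n' "$item"
    fi
    i=$((i + 1))
  done
)

# Inside a Cloud Run job task the platform sets these two variables;
# the defaults let the script run locally as a single task.
TASK_INDEX="${CLOUD_RUN_TASK_INDEX:-0}"
TASK_COUNT="${CLOUD_RUN_TASK_COUNT:-1}"

# Hypothetical work list; replace with real input discovery.
shard_items "$TASK_INDEX" "$TASK_COUNT" a.csv b.csv c.csv d.csv
```

Run locally with no variables set, the script prints all four file names, because the single task owns every shard.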

GPU deployment (NVIDIA L4):
gcloud run deploy SERVICE \
--gpu=1 \
--gpu-type=nvidia-l4 \
--cpu=8 \
--memory=32Gi \
--concurrency=4
GPU instances need at least 4 vCPUs and 16 GiB of memory; 8 vCPUs and 32 GiB are recommended. Set --concurrency explicitly, because Cloud Run does not autoscale on GPU utilization. See references/gpu.md for RTX PRO 6000 Blackwell, driver details, and ML inference patterns.
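A pre-deploy check can catch an undersized GPU shape before gcloud rejects it. This is a minimal sketch that encodes only the minimum and recommended sizes stated above; check_gpu_shape is a hypothetical helper, not part of gcloud.

```shell
# check_gpu_shape CPU MEMORY_GIB
# Succeeds when the shape meets the stated minimum (4 vCPU, 16 GiB) and
# warns on stderr when it is below the recommended 8 vCPU / 32 GiB.
check_gpu_shape() (
  cpu=$1
  mem_gib=$2
  if [ "$cpu" -lt 4 ]; then
    echo "cpu=$cpu is below the 4 vCPU minimum for GPU instances" >&2
    exit 1
  fi
  if [ "$mem_gib" -lt 16 ]; then
    echo "memory=${mem_gib}Gi is below the 16 GiB minimum for GPU instances" >&2
    exit 1
  fi
  if [ "$cpu" -lt 8 ] || [ "$mem_gib" -lt 32 ]; then
    echo "note: below the recommended 8 vCPU / 32 GiB" >&2
  fi
)
```

Usage: `check_gpu_shape 8 32 && gcloud run deploy SERVICE --gpu=1 ...` runs the deploy only when the shape passes.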
Direct VPC Egress — route to AlloyDB/Cloud SQL private IPs without VPC connector overhead:
gcloud run deploy SERVICE \
--vpc-egress=private-ranges-only \
--network=NETWORK \
--subnet=SUBNET
Secret mounting:
--set-secrets=KEY=SECRET_NAME:latest
Env var separator trick — use ^||^ when values contain commas (e.g., JSON arrays in CORS origins):
--set-env-vars='^||^CORS_ORIGINS=["https://app.example.com","https://api.example.com"]||OTHER_KEY=value'
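Building that value by hand is error-prone, because the shell treats || as an operator and strips unquoted double quotes. The sketch below assembles the flag from separately quoted pairs; join_env_vars is a hypothetical helper, and the pairs are the ones from the example above.

```shell
# join_env_vars DELIM KEY=VALUE...
# Prints ^DELIM^ followed by the pairs joined with DELIM, the form gcloud
# accepts when the default comma separator is overridden.
join_env_vars() (
  delim=$1
  shift
  out=""
  for pair in "$@"; do
    if [ -z "$out" ]; then
      out=$pair
    else
      out="$out$delim$pair"
    fi
  done
  printf '^%s^%s\n' "$delim" "$out"
)

# Single quotes keep the JSON brackets, commas and double quotes intact.
ENV_VARS=$(join_env_vars '||' \
  'CORS_ORIGINS=["https://app.example.com","https://api.example.com"]' \
  'OTHER_KEY=value')
echo "--set-env-vars=$ENV_VARS"
```

Pass the result to gcloud as a single quoted argument, for example `gcloud run deploy SERVICE --set-env-vars="$ENV_VARS"`. The helper does not check whether a value itself contains the delimiter, so pick one that cannot occur in the data.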
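CORS origin reconciliation workflow:
1. Read the current values from the deployed service (gcloud run services describe)
2. Write the updated value as a new secret version: gcloud secrets versions add SECRET_NAME --data-file=-

IAP setup summary:
1. Create the OAuth brand: gcloud iap oauth-brands create --application_title=APP --support_email=EMAIL
2. Grant the invoker role to the service account: gcloud projects add-iam-policy-binding PROJECT --member=serviceAccount:[email protected] --role=roles/run.invoker
3. Grant each user access with the same command, using --member=user:EMAIL --role=roles/iap.httpsResourceAccessor
4. Grant roles/iap.httpsResourceAccessor before enabling IAP

See references/iap.md for full IAP configuration.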
<workflow>
Use multi-stage builds (base, builder, runner). Install dependencies in the builder stage and copy only the runtime artifacts to the runner stage. Run as a non-root user. Use tini as PID 1 for proper signal handling.
Use Cloud Build (gcloud builds submit) or a CI pipeline to build and push the image. Prefer Artifact Registry; Container Registry is deprecated. Tag images with the git SHA for traceability.
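As an illustration of SHA tagging, the sketch below builds an Artifact Registry image reference in the LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG form. The image_ref helper and the repository name containers are assumptions made for the example; the region and project reuse placeholders from this document.

```shell
# image_ref REGION PROJECT REPOSITORY IMAGE TAG
# Prints an Artifact Registry Docker image reference.
image_ref() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Use the short commit SHA as the tag; fall back to "dev" when git is
# unavailable or the directory is not a repository.
GIT_SHA=$(git rev-parse --short HEAD 2>/dev/null || echo dev)
IMAGE=$(image_ref us-central1 my-project containers myapp "$GIT_SHA")
echo "$IMAGE"

# To build and push (requires gcloud and an existing repository):
#   gcloud builds submit --tag "$IMAGE"
```

Because the tag changes with every commit, each deployed revision can be traced back to the exact source that produced it.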
Deploy with gcloud run deploy, setting CPU, memory, concurrency, and min/max instances. Use --no-traffic for initial test deployments, then shift traffic with gcloud run services update-traffic, using --to-latest or a percentage-based split.
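The no-traffic-then-shift sequence could be scripted roughly as follows. This is a sketch, not a prescribed rollout procedure: the function names and the canary tag are hypothetical, and GCLOUD defaults to a dry run that only prints each command so the script is safe to try.

```shell
# Dry run by default: each command is printed, not executed.
# Set GCLOUD=gcloud in the environment to run for real.
GCLOUD="${GCLOUD:-echo gcloud}"

# deploy_canary SERVICE IMAGE REGION
# Creates a new revision that receives no traffic but is reachable
# through its own tagged URL for testing.
deploy_canary() {
  $GCLOUD run deploy "$1" --image="$2" --region="$3" --no-traffic --tag=canary
}

# shift_to_canary SERVICE REGION PERCENT
# Sends PERCENT of traffic to the revision tagged "canary".
shift_to_canary() {
  $GCLOUD run services update-traffic "$1" --region="$2" --to-tags="canary=$3"
}

# promote SERVICE REGION
# Routes all traffic to the latest revision.
promote() {
  $GCLOUD run services update-traffic "$1" --region="$2" --to-latest
}
```

A typical order is deploy_canary, test the tagged URL, shift_to_canary with a small percentage, watch error rates, then promote.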
Use --allow-unauthenticated for public APIs. For internal services, use IAM-based auth. Set up IAP (Identity-Aware Proxy) for user-facing apps that need Google login. Use Direct VPC Egress, or a VPC connector where that is not an option, for access to private resources.
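For the IAM-based option, a caller needs roles/run.invoker on the service and must send an identity token. The sketch below shows both halves under stated assumptions: grant_invoker and auth_header are hypothetical helpers, the member string is a placeholder, and GCLOUD defaults to a dry run that prints the command.

```shell
# Dry run by default: the command is printed, not executed.
# Set GCLOUD=gcloud in the environment to run for real.
GCLOUD="${GCLOUD:-echo gcloud}"

# grant_invoker SERVICE REGION MEMBER
# Lets MEMBER (for example user:EMAIL or serviceAccount:EMAIL) call SERVICE.
grant_invoker() {
  $GCLOUD run services add-iam-policy-binding "$1" --region="$2" \
    --member="$3" --role=roles/run.invoker
}

# auth_header TOKEN
# Formats the HTTP header Cloud Run expects on authenticated requests.
auth_header() {
  printf 'Authorization: Bearer %s\n' "$1"
}

# Calling the service (requires gcloud credentials and the service URL):
#   curl -H "$(auth_header "$(gcloud auth print-identity-token)")" "$SERVICE_URL"
```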
Set --min-instances=1 in production. Enable --cpu-boost for faster startup. Lazy-load heavy dependencies in application code. For Python, pre-compile bytecode at build time.
</workflow>
- Set --min-instances=1 for latency-sensitive production services; use --cpu-boost for faster startup
- Set --max-instances to prevent runaway scaling and unexpected billing spikes
- Set --concurrency to match your application's per-instance capacity: too high causes memory pressure, too low wastes resources
- --vpc-egress=private-ranges-only gives direct routing to AlloyDB/Cloud SQL private IPs with lower latency and no connector overhead
- Set --concurrency explicitly for GPU workloads: Cloud Run cannot autoscale on GPU utilization, and the default of 80 will OOM a GPU instance

Before delivering configurations, verify:
- --memory and --cpu are explicitly set in the deploy command
- --min-instances is set for production services
- --max-instances is set to prevent unbounded scaling
- Authentication is configured deliberately (IAM, IAP, or --allow-unauthenticated)

<example>
Minimal Dockerfile and deploy command for a Python web service:
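# Dockerfile
# Builder stage: resolve and install dependencies with uv
FROM python:3.13-slim-bookworm AS builder
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/
WORKDIR /app
# Install dependencies first so this layer stays cached when only source changes
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev --no-install-project
# Then copy the source and install the project itself
COPY src ./src
RUN uv sync --frozen --no-dev

# Runner stage: runtime artifacts only
FROM python:3.13-slim-bookworm AS runner
RUN apt-get update && apt-get install -y --no-install-recommends tini \
&& rm -rf /var/lib/apt/lists/*
# Run as a non-root user
RUN useradd --create-home appuser
USER appuser
COPY --from=builder /app /app
# Put the virtual environment built by uv on PATH
ENV PATH="/app/.venv/bin:$PATH"
# tini runs as PID 1 and forwards signals to uvicorn
ENTRYPOINT ["tini", "--"]
CMD ["uvicorn", "myapp.main:app", "--host", "0.0.0.0", "--port", "8080"]
EXPOSE 8080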
Deploy command:
# Build and push
gcloud builds submit --tag gcr.io/my-project/myapp:latest
# Deploy with production settings
gcloud run deploy myapp \
--image=gcr.io/my-project/myapp:latest \
--region=us-central1 \
--cpu=1 \
--memory=512Mi \
--concurrency=80 \
--min-instances=1 \
--max-instances=10 \
--cpu-boost \
--service-account=[email protected] \
--allow-unauthenticated
</example>
Note: No Gemini CLI extension exists for Cloud Run — this skill provides unique value for Cloud Run deployments, GPU workloads, and production networking patterns not covered by other tooling.
For detailed guides and configuration examples, refer to the documents in references/, including gpu.md and iap.md.