.claude/skills/railway-observability/SKILL.md
Railway.com built-in metrics, monitoring dashboards, alerting (Pro plan), and external OTEL integration with Grafana. Use when setting up monitoring, creating dashboards, configuring alerts, integrating Prometheus/Loki/Tempo, deploying Grafana stack, or analyzing Railway service metrics.
Comprehensive guide for Railway.com observability including built-in metrics, customizable dashboards, alerting (Pro plan), and external OTEL integration with Grafana/Prometheus/Loki/Tempo.
Railway provides multi-tier observability capabilities:
Keywords: metrics, monitoring, observability, dashboard, alerts, Grafana, Prometheus, Loki, Tempo, OTEL, Alloy, Railway
Navigate to your Railway project:
Railway Dashboard → Project → Service → Metrics Tab
What you see:
Add and customize metric widgets:
Metrics Tab → Add Widget → Select Metric Type
Available widgets:
Customization:
Configure alerts for threshold violations:
Service Settings → Alerts → Create Alert Rule
Alert types:
Notification channels:
Export metrics to external systems:
Service Settings → Observability → OTEL Integration
Configure environment variables:
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4318
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Bearer <token>
OTEL_SERVICE_NAME=my-railway-service
See references/otel-integration.md for complete setup.
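The `OTEL_EXPORTER_OTLP_HEADERS` value above is a comma-separated list of `key=value` pairs. As a rough sketch of how such a value might be turned into HTTP headers, assuming values are percent-encoded (so the space in `Bearer <token>` would be written `%20`), the helper below parses it with the standard library only. The function name `parse_otlp_headers`, the `example-token` value, and the `X-Scope` header are hypothetical and used for illustration; OTEL SDKs do this parsing for you.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse an OTEL_EXPORTER_OTLP_HEADERS-style string into a header dict."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in header entry: {pair!r}")
        # Values are assumed to be percent-encoded; decode them here.
        headers[key.strip()] = unquote(value.strip())
    return headers


# Hypothetical value for illustration only.
print(parse_otlp_headers("Authorization=Bearer%20example-token,X-Scope=railway"))
```

Checking the parsed result before deploying can catch a malformed header string early, which otherwise tends to surface only as an authorization failure at the collector.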
Use Railway template for complete observability:
# Option 1: Deploy via Railway Dashboard
# Template ID: 8TLSQD (Grafana Stack)
# Includes: Grafana, Prometheus, Loki, Tempo, Alloy
# Option 2: Deploy via script
.claude/skills/railway-observability/scripts/deploy-grafana-stack.sh
Stack components:
Railway provides instant metrics without configuration.
Navigate to metrics:
Project → Service → Metrics
Available metrics:
Retention: 30 days for all metrics
Create personalized monitoring views.
Add widgets:
Best practices:
Set up proactive monitoring.
Create alert rule:
Service → Settings → Alerts → New Rule
Alert configuration:
Metric: CPU Usage
Condition: Greater than 80%
Duration: 5 minutes
Notification: Slack webhook
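To make the condition-plus-duration rule concrete, here is a minimal sketch of one way a "greater than 80% for 5 minutes" check could be evaluated over a series of samples. How Railway evaluates alert rules internally is not described here, so the logic, the function name `breach_sustained`, and the one-minute sample spacing are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def breach_sustained(samples, threshold, duration):
    """Return True if the latest run of samples above `threshold` spans at least `duration`.

    samples: list of (timestamp, value) tuples sorted by timestamp.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # any dip below the threshold resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration


# Hypothetical CPU samples, one per minute, all above 80%.
start = datetime(2025, 11, 26, 10, 25)
cpu = [(start + timedelta(minutes=i), 85.0) for i in range(6)]
print(breach_sustained(cpu, threshold=80, duration=timedelta(minutes=5)))
```

The duration requirement is what keeps a brief spike from firing the alert: a single sample at or below the threshold restarts the clock.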
Webhook payload example:
{
  "service": "backend-production",
  "metric": "cpu_usage",
  "threshold": 80,
  "current": 87.5,
  "timestamp": "2025-11-26T10:30:00Z"
}
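A custom webhook receiver would need to decode this payload. The sketch below assumes the body arrives exactly in the shape of the example above (the field names are taken from it); the function name `describe_alert` is hypothetical, and a real receiver should also validate the input and authenticate the sender.

```python
import json
from datetime import datetime


def describe_alert(body: str) -> str:
    """Turn an alert webhook body into a one-line human-readable summary."""
    payload = json.loads(body)
    # Replace the trailing "Z" so fromisoformat accepts it on older Python versions.
    fired_at = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    overshoot = payload["current"] - payload["threshold"]
    return (
        f"{payload['service']}: {payload['metric']} at {payload['current']} "
        f"({overshoot:+.1f} vs threshold {payload['threshold']}) "
        f"at {fired_at:%Y-%m-%d %H:%M} UTC"
    )


example_body = json.dumps({
    "service": "backend-production",
    "metric": "cpu_usage",
    "threshold": 80,
    "current": 87.5,
    "timestamp": "2025-11-26T10:30:00Z",
})
print(describe_alert(example_body))
```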
See references/dashboard-widgets.md for all alert types.
Send metrics to external systems.
Configure Alloy collector:
# Use template from templates/alloy-config.river
# Deploy as Railway service
# Configure OTEL endpoints
Environment setup:
# In your Railway service
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
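With `http/protobuf`, the base endpoint is not the final URL: OTLP/HTTP exporters append a per-signal path (`v1/metrics`, `v1/logs`, `v1/traces`) to `OTEL_EXPORTER_OTLP_ENDPOINT`. The following sketch shows that derivation for the `http://alloy:4318` endpoint used above; the helper name `signal_url` is hypothetical, and SDKs perform this step internally.

```python
# Per-signal paths appended to the base OTLP/HTTP endpoint.
SIGNAL_PATHS = {
    "traces": "v1/traces",
    "metrics": "v1/metrics",
    "logs": "v1/logs",
}


def signal_url(base_endpoint: str, signal: str) -> str:
    """Build the full OTLP/HTTP URL for one signal from the base endpoint."""
    return base_endpoint.rstrip("/") + "/" + SIGNAL_PATHS[signal]


for name in SIGNAL_PATHS:
    print(name, signal_url("http://alloy:4318", name))
```

This matters when an exporter is configured in code rather than through environment variables: an endpoint passed directly to an HTTP exporter generally needs the full path, for example `http://alloy:4318/v1/metrics`.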
Verify integration:
# Check Alloy logs
railway logs -s alloy
# Should show: "Successfully received OTLP metrics"
See references/otel-integration.md for complete guide.
Full monitoring solution on Railway.
Deploy stack:
# Run deployment script
cd .claude/skills/railway-observability/scripts
./deploy-grafana-stack.sh
# Or deploy manually via Railway Dashboard
# Template: 8TLSQD (Grafana Stack)
Stack includes:
Access Grafana:
URL: https://<grafana-service>.up.railway.app
Username: admin
Password: (set during deployment)
metrics-reference.md - Complete metrics catalog
dashboard-widgets.md - Widget configuration guide
otel-integration.md - External integration setup
deploy-grafana-stack.sh - Deploy observability stack
alloy-config.river - Grafana Alloy collector config

| Metric | Description | Units | Retention |
|--------|-------------|-------|-----------|
| CPU | % of allocated cores | Percentage | 30 days |
| Memory | RAM usage | MB/GB | 30 days |
| Disk | Storage consumption | GB | 30 days |
| Network I/O | Ingress/egress traffic | MB/s | 30 days |
No configuration required - Metrics collected automatically for all services.
Drag-and-drop widgets:
Multi-replica support:
Time range options:
Threshold alerts:
Notification channels:
# Email
<your-email-address>
# Discord webhook
https://discord.com/api/webhooks/...
# Slack webhook
https://hooks.slack.com/services/...
# Custom webhook
https://your-api.com/alerts
Alert states:
Supported protocols:
Signal types:
Collector options:
One-click deployment:
Railway Dashboard → New Project → Deploy Template → Search "8TLSQD"
Services deployed:
Configuration:
Purpose: Receive OTLP signals from Railway services and forward to Grafana stack.
Configuration (templates/alloy-config.river):
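// OTLP receiver: accepts gRPC (4317) and HTTP (4318) from Railway services
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}
// Prometheus exporter (a prometheus.remote_write "railway" component must be defined elsewhere in this file)
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.railway.receiver]
}
// Loki exporter (a loki.write "railway" component must be defined elsewhere in this file)
otelcol.exporter.loki "default" {
  forward_to = [loki.write.railway.receiver]
}
// Tempo exporter
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo:4317"
    // Assumption: Tempo accepts plaintext gRPC on the private network.
    // Remove this block if Tempo is served over TLS.
    tls {
      insecure = true
    }
  }
}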
Pre-built dashboards:
Custom dashboards:
# Install OTEL SDK
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
# Configure OTEL (in Railway service)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4318
export OTEL_SERVICE_NAME=nodejs-backend
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
export OTEL_TRACES_EXPORTER=otlp
# Auto-instrumentation
node --require @opentelemetry/auto-instrumentations-node/register app.js
View in Grafana:
# Pro Plan: Service → Alerts → New Rule
Name: High Memory Alert
Metric: Memory Usage
Condition: Greater than 512 MB
Duration: 10 minutes
Notification: Slack webhook
Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
Slack notification:
{
  "text": "🚨 High Memory Alert",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Service*: backend-production\n*Memory*: 567 MB (> 512 MB threshold)\n*Duration*: 12 minutes"
      }
    }
  ]
}
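If the alert is relayed through a custom webhook rather than sent to Slack directly, the relay has to build this message itself. The sketch below assembles the same structure as the example above from individual values; the function name `build_slack_alert` and its parameters are hypothetical, and actually posting the result to the webhook URL is left out.

```python
import json


def build_slack_alert(service: str, memory_mb: int, threshold_mb: int, minutes: int) -> dict:
    """Build a Slack message payload matching the High Memory Alert example."""
    text = (
        f"*Service*: {service}\n"
        f"*Memory*: {memory_mb} MB (> {threshold_mb} MB threshold)\n"
        f"*Duration*: {minutes} minutes"
    )
    return {
        "text": "🚨 High Memory Alert",  # fallback text for notifications
        "blocks": [
            {"type": "section", "text": {"type": "mrkdwn", "text": text}},
        ],
    }


payload = build_slack_alert("backend-production", 567, 512, 12)
print(json.dumps(payload, indent=2, ensure_ascii=False))
```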
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
# Configure OTLP exporter (full per-signal URL, since the endpoint is set in code)
exporter = OTLPMetricExporter(
    endpoint="http://alloy:4318/v1/metrics"
)
# Create meter provider that exports every 60 seconds
provider = MeterProvider(metric_readers=[
    PeriodicExportingMetricReader(exporter, export_interval_millis=60000)
])
metrics.set_meter_provider(provider)
# Create custom metrics
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("api_requests_total")
response_time = meter.create_histogram("api_response_time_seconds")
# Record metrics
request_counter.add(1, {"endpoint": "/api/users", "method": "GET"})
response_time.record(0.125, {"endpoint": "/api/users"})
View in Grafana:
Explore → Prometheus → Metrics Browser → api_requests_total
Check service status:
railway status -s <service-name>
Verify metrics enabled:
Requirements:
Debug checklist:
1. Verify Pro plan active
2. Check threshold configuration
3. Confirm duration setting
4. Test webhook URL manually
5. Check Railway dashboard for alert status
Common problems:
Verify Alloy receiving data:
# Check Alloy logs
railway logs -s alloy
# Look for:
# ✅ "OTLP receiver started"
# ✅ "Received X metric points"
# ❌ "Connection refused" = endpoint issue
# ❌ "Unauthorized" = auth issue
See references/otel-integration.md for detailed troubleshooting.
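The log checks above can be automated with a small scan over captured log output. This is only a sketch: the two error phrases are taken from the checklist above, the exact wording of real Alloy log lines may differ, and the function name `diagnose_alloy_logs` and the sample lines are hypothetical.

```python
def diagnose_alloy_logs(lines):
    """Return (diagnosis, line) pairs for log lines that match known failure phrases."""
    findings = []
    for line in lines:
        lowered = line.lower()
        if "connection refused" in lowered:
            findings.append(("endpoint issue", line))
        elif "unauthorized" in lowered:
            findings.append(("auth issue", line))
    return findings


# Hypothetical log lines for illustration only.
sample = [
    "OTLP receiver started",
    "export failed: connection refused",
    "export failed: 401 Unauthorized",
]
for diagnosis, line in diagnose_alloy_logs(sample):
    print(f"{diagnosis}: {line}")
```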
# Railway Dashboard
https://railway.app/project/<project-id>/service/<service-id>/metrics
# Via Railway CLI (subcommands and flags vary by CLI version; confirm with `railway --help`)
railway status -s <service-name>
railway metrics -s <service-name>
| Metric | Warning | Critical |
|--------|---------|----------|
| CPU | 70% | 90% |
| Memory | 75% | 90% |
| Disk | 80% | 95% |
| Network | 80% bandwidth | 95% bandwidth |
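These thresholds translate directly into a lookup. The sketch below classifies a utilization percentage against the warning and critical levels from the table; treating a value exactly at a threshold as a breach is an assumption, and the names `THRESHOLDS` and `severity` are hypothetical.

```python
# (warning, critical) percentages taken from the threshold table.
THRESHOLDS = {
    "cpu": (70, 90),
    "memory": (75, 90),
    "disk": (80, 95),
    "network": (80, 95),
}


def severity(metric: str, percent: float) -> str:
    """Classify a utilization percentage as 'ok', 'warning', or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if percent >= critical:  # assumption: thresholds are inclusive
        return "critical"
    if percent >= warning:
        return "warning"
    return "ok"


print(severity("cpu", 87.5))
print(severity("disk", 95))
```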
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=my-service
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
Template ID: 8TLSQD
Components: Grafana, Prometheus, Loki, Tempo, Alloy
Deployment: Railway Dashboard → New Project → Deploy Template
Cost: ~$20-30/month (depends on usage)
Updated: November 26, 2025
Template ID: 8TLSQD (Grafana Stack)
Retention: 30 days (built-in metrics)