specwright/templates/skills/dev-team/devops/observability-management/SKILL.md
# Observability Management Skill

> Template for Monitoring, Logging, and Alerting Specialists
> Version: 1.0.0
> Created: 2026-01-09

## Skill Purpose

Implement comprehensive observability with monitoring, structured logging, distributed tracing, and proactive alerting to maintain system reliability and quickly diagnose issues.

## When to Activate This Skill

**Activate when:**
- Setting up application monitoring
- Implementing structured logging
- Creating alerting rules
- Debugging producti
Delegation from main agent:
@agent:[AGENT_NAME] "Set up application performance monitoring"
@agent:[AGENT_NAME] "Implement structured logging with request tracing"
@agent:[AGENT_NAME] "Create alerting rules for critical errors"
@agent:[AGENT_NAME] "Add health check endpoints"
Enable App Platform Metrics:
# app.yaml
name: myapp-production
services:
  - name: web
    # ... other config ...
    # Health check endpoint
    health_check:
      http_path: /health
      initial_delay_seconds: 30
      period_seconds: 10
      timeout_seconds: 5
      success_threshold: 1
      failure_threshold: 3
# Built-in metrics
# - CPU usage
# - Memory usage
# - Request rate
# - Response time
# - Error rate
Access Metrics:
# Via CLI
doctl apps list-alerts <app-id>
doctl apps get-metrics <app-id>
# Via API
curl -X GET \
-H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
https://api.digitalocean.com/v2/apps/<app-id>/metrics
Health Check Endpoint:
# config/routes.rb
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  # ... other routes
end

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  # raise: false keeps this working in apps that define no authenticate_user! callback
  skip_before_action :authenticate_user!, raise: false

  def show
    checks = {
      database: database_check,
      redis: redis_check,
      storage: storage_check
    }
    # Only the boolean check results decide the status; the timestamp is reported alongside them
    payload = { checks: checks, timestamp: Time.current.iso8601 }

    if checks.values.all?
      render json: payload.merge(status: 'healthy'), status: :ok
    else
      render json: payload.merge(status: 'unhealthy'), status: :service_unavailable
    end
  end

  private

  def database_check
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue StandardError => e
    Rails.logger.error("Database health check failed: #{e.message}")
    false
  end

  def redis_check
    # Redis.current was removed in redis-rb 5; Redis.new reads REDIS_URL by default
    Redis.new.ping == 'PONG'
  rescue StandardError => e
    Rails.logger.error("Redis health check failed: #{e.message}")
    false
  end

  def storage_check
    # The key does not need to exist; the call only has to reach the storage service
    ActiveStorage::Blob.service.exist?('health_check')
    true
  rescue StandardError => e
    Rails.logger.error("Storage health check failed: #{e.message}")
    false
  end
end
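
The controller above repeats the same rescue-and-return-false pattern for every dependency. As a minimal sketch of that pattern outside Rails, the class below runs a set of named checks and converts exceptions into failed results. `HealthReport` and the lambda checks are hypothetical names used only for illustration.

```ruby
require 'json'
require 'time'

# Runs each named check and converts exceptions into a failed result,
# so one broken dependency cannot crash the health endpoint itself.
class HealthReport
  def initialize(checks)
    @checks = checks # Hash of name => callable returning truthy/falsey
  end

  def run
    results = @checks.transform_values do |check|
      begin
        check.call ? true : false
      rescue StandardError
        false
      end
    end

    {
      status: results.values.all? ? 'healthy' : 'unhealthy',
      checks: results,
      timestamp: Time.now.utc.iso8601
    }
  end
end

report = HealthReport.new(
  database: -> { true },                             # stand-in for SELECT 1
  redis: -> { raise IOError, 'connection refused' }  # simulated outage
).run

puts JSON.generate(report)
```

Keeping the timestamp outside the `checks` hash means only boolean results decide the overall status.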
Structured Logging:
# config/environments/production.rb
require 'json'
require 'socket'

# Log to stdout (12-factor app)
logger = ActiveSupport::Logger.new($stdout)
# Use JSON formatter. It is set on the logger itself because Rails does not
# apply config.log_formatter to a custom config.logger.
logger.formatter = proc do |severity, datetime, _progname, msg|
  {
    timestamp: datetime.iso8601,
    severity: severity,
    message: msg,
    pid: Process.pid,
    hostname: Socket.gethostname
  }.to_json + "\n"
end
config.logger = logger
config.log_level = :info
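
To see what a JSON formatter of this kind emits, here is a minimal sketch that uses only Ruby's standard `Logger`, with a `StringIO` standing in for stdout. `JSON_FORMATTER` is a hypothetical name, and merging hash messages into the top-level object is an assumption about how you might want structured fields to appear.

```ruby
require 'json'
require 'logger'
require 'stringio'
require 'time'

# Formatter that renders every log entry as one JSON object per line.
JSON_FORMATTER = proc do |severity, datetime, _progname, msg|
  # Hash messages are merged in so structured fields stay top-level keys.
  payload = msg.is_a?(Hash) ? msg : { message: msg.to_s }
  { timestamp: datetime.utc.iso8601, severity: severity }.merge(payload).to_json + "\n"
end

io = StringIO.new # stand-in for $stdout so the output can be inspected
logger = Logger.new(io)
logger.formatter = JSON_FORMATTER

logger.info('Order created')
logger.warn(message: 'Payment retried', order_id: 42, attempt: 2)

lines = io.string.lines.map { |line| JSON.parse(line) }
lines.each { |entry| puts entry }
```

One object per line keeps the output easy to parse for log aggregators that split on newlines.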
# config/initializers/lograge.rb
Rails.application.configure do
  config.lograge.enabled = true
  config.lograge.formatter = Lograge::Formatters::Json.new
  # user_id, request_id and ip are only present if a controller adds them
  # to the payload, for example in append_info_to_payload
  config.lograge.custom_options = lambda do |event|
    {
      user_id: event.payload[:user_id],
      request_id: event.payload[:request_id],
      ip: event.payload[:ip],
      params: event.payload[:params].except('controller', 'action'),
      exception: event.payload[:exception]&.first,
      exception_message: event.payload[:exception]&.last
    }
  end
end
Request ID Tracking:
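# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  before_action :set_request_id

  private

  # RequestStore comes from the request_store gem
  def set_request_id
    RequestStore.store[:request_id] = request.uuid
  end
end

# Use in logs. Logger#info accepts a single message, so the fields are
# passed as one JSON document rather than as extra arguments.
Rails.logger.info({
  message: 'Processing payment',
  request_id: RequestStore.store[:request_id],
  user_id: current_user.id,
  amount: params[:amount]
}.to_json)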
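
The same idea can be sketched at the Rack layer, which also lets a request ID supplied by an upstream proxy flow through. This is an illustrative sketch, not a replacement for Rails' built-in `ActionDispatch::RequestId` middleware. `RequestIdMiddleware` and the use of `Thread.current` are assumptions made for the example.

```ruby
require 'securerandom'

# Rack-style middleware: reuses an incoming X-Request-Id header or generates one,
# exposes it to the current thread, and echoes it back in the response headers.
class RequestIdMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    request_id = env['HTTP_X_REQUEST_ID'] || SecureRandom.uuid
    Thread.current[:request_id] = request_id
    status, headers, body = @app.call(env)
    [status, headers.merge('X-Request-Id' => request_id), body]
  ensure
    # Avoid leaking an ID into the next request served by a reused thread
    Thread.current[:request_id] = nil
  end
end

seen = nil
app = lambda do |_env|
  seen = Thread.current[:request_id] # what application code would attach to its logs
  [200, {}, ['ok']]
end

status, headers, _body = RequestIdMiddleware.new(app).call('HTTP_X_REQUEST_ID' => 'abc-123')
puts "#{status} #{headers['X-Request-Id']} (seen inside app: #{seen})"
```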
Custom Metrics:
# app/models/concerns/metric_tracking.rb
# StatsD here is the client from the statsd-instrument gem
module MetricTracking
  extend ActiveSupport::Concern

  included do
    after_create :track_create_metric
    after_update :track_update_metric
  end

  private

  def track_create_metric
    StatsD.increment("#{self.class.name.underscore}.created")
  end

  def track_update_metric
    StatsD.increment("#{self.class.name.underscore}.updated")
  end
end

# Usage in models
class Order < ApplicationRecord
  include MetricTracking

  def complete!
    transaction do
      update!(status: :completed, completed_at: Time.current)
      # StatsD.measure records a timing in milliseconds
      StatsD.measure('order.completion_time', (Time.current - created_at) * 1000)
      StatsD.increment('order.completed')
    end
  end
end
# Gemfile
gem 'newrelic_rpm'
# config/newrelic.yml
common: &default_settings
  license_key: <%= ENV['NEW_RELIC_LICENSE_KEY'] %>
  app_name: <%= ENV['NEW_RELIC_APP_NAME'] || 'My Application' %>
  distributed_tracing:
    enabled: true
  application_logging:
    enabled: true
    forwarding:
      enabled: true
production:
  <<: *default_settings
  monitor_mode: true
  log_level: info

# Custom instrumentation
class OrdersController < ApplicationController
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

  def create
    # ... order creation logic
    NewRelic::Agent.record_custom_event('OrderCreated', {
      order_id: @order.id,
      total: @order.total,
      user_id: current_user.id
    })
  end

  add_transaction_tracer :create, category: :task
end
# Gemfile
gem 'ddtrace'
# config/initializers/datadog.rb
Datadog.configure do |c|
  c.tracing.instrument :rails
  c.tracing.instrument :redis
  c.tracing.instrument :pg
  c.tracing.instrument :http
  c.env = Rails.env
  c.service = ENV['DD_SERVICE'] || 'myapp'
  c.version = ENV['GIT_COMMIT_SHA'] || '1.0.0'
  # ddtrace 1.x setting name
  c.tracing.analytics.enabled = true
end

# Custom metrics (Datadog::Statsd comes from the dogstatsd-ruby gem)
require 'datadog/statsd'

statsd = Datadog::Statsd.new('localhost', 8125)
statsd.increment('orders.created', tags: ['env:production'])
statsd.gauge('users.active', User.where('last_seen_at > ?', 5.minutes.ago).count)
statsd.histogram('payment.amount', payment.amount_cents)
statsd.close
# Gemfile
gem 'sentry-ruby'
gem 'sentry-rails'
# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']
  config.breadcrumbs_logger = [:active_support_logger, :http_logger]
  config.traces_sample_rate = 0.1 # 10% of requests
  config.environment = Rails.env
  config.enabled_environments = %w[production staging]
  config.before_send = lambda do |event, _hint|
    # Filter sensitive data; events raised outside a request have no request data
    data = event.request&.data
    data.delete('password') if data.is_a?(Hash)
    event
  end
end

# Usage
begin
  risky_operation
rescue StandardError => e
  Sentry.capture_exception(e, extra: { user_id: current_user.id })
  raise
end

# Attach extra context to a single message
Sentry.with_scope do |scope|
  scope.set_context('payment', { amount: payment.amount, method: payment.method })
  Sentry.capture_message('Large payment processed', level: :warning)
end
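
Deleting a single top-level `password` key leaves nested or differently named secrets untouched. As a hedged sketch of a broader filter, the helper below walks nested hashes and arrays and masks any key on a deny list. `SENSITIVE_KEYS`, `FILTERED`, and `scrub` are hypothetical names, and the key list is an assumption to adapt to your own payloads.

```ruby
# Keys whose values should never leave the application (illustrative list)
SENSITIVE_KEYS = %w[password password_confirmation token secret credit_card].freeze
FILTERED = '[FILTERED]'

# Returns a copy of the value with sensitive keys masked at any nesting depth.
def scrub(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, inner), clean|
      clean[key] = SENSITIVE_KEYS.include?(key.to_s.downcase) ? FILTERED : scrub(inner)
    end
  when Array
    value.map { |item| scrub(item) }
  else
    value
  end
end

params = {
  'email' => 'user@example.com',
  'password' => 'hunter2',
  'profile' => { 'token' => 'abc', 'cards' => [{ 'credit_card' => '4111', 'label' => 'main' }] }
}

scrubbed = scrub(params)
puts scrubbed
```

Because `scrub` returns a new structure, the original parameters stay intact for the rest of the request.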
# Add remote syslog destination
# DigitalOcean App Platform → Settings → Logs → Add destination
# Search logs via CLI
papertrail --min-time '5 minutes ago' 'error'
papertrail 'user_id:123' --json
# Gemfile
gem 'logtail-rails'
# config/initializers/logtail.rb
if Rails.env.production?
  # The logger writes to an HTTP log device built from the source token
  http_device = Logtail::LogDevices::HTTP.new(ENV['LOGTAIL_SOURCE_TOKEN'])
  Rails.logger = Logtail::Logger.new(http_device)
end
# Structured logging
Rails.logger.info('Order created', {
  order_id: order.id,
  user_id: current_user.id,
  total: order.total,
  items_count: order.items.count
})
# docker-compose.yml for ELK
version: '3.9'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
volumes:
  elasticsearch_data:
# Create alert via CLI
doctl monitoring alert create \
  --type v1/insights/droplet/cpu \
  --compare GreaterThan \
  --value 80 \
  --window 5m \
  --entities droplet_id_1,droplet_id_2 \
  --emails <alert-email>
# App Platform alerts (via UI)
# - High error rate (> 5% for 5 minutes)
# - High response time (> 1s for 5 minutes)
# - Deployment failures
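
A rule such as "error rate > 5% for 5 minutes" can be sketched as a sliding-window check. The example below is an illustration only: `ErrorRateAlert` is a hypothetical class, and it interprets the rule as the error rate across the trailing five-minute window, which is an assumption about how a given alerting backend evaluates it.

```ruby
# Sliding-window evaluator for a rule such as "error rate > 5% for 5 minutes".
class ErrorRateAlert
  def initialize(threshold:, window_seconds:)
    @threshold = threshold
    @window_seconds = window_seconds
    @samples = [] # [timestamp_in_seconds, error?] pairs
  end

  def record(timestamp, error)
    @samples << [timestamp, error]
  end

  # Share of failed requests among those seen within the trailing window
  def error_rate(now)
    recent = @samples.select { |timestamp, _| timestamp > now - @window_seconds }
    return 0.0 if recent.empty?

    recent.count { |_, error| error }.fdiv(recent.size)
  end

  def firing?(now)
    error_rate(now) > @threshold
  end
end

alert = ErrorRateAlert.new(threshold: 0.05, window_seconds: 300)
now = 1_000
90.times { |i| alert.record(now - i, false) } # successful requests
10.times { |i| alert.record(now - i, true) }  # failed requests

puts alert.error_rate(now)     # 10 failures out of 100 requests
puts alert.firing?(now)
puts alert.firing?(now + 600)  # all samples have aged out of the window
```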
# Gemfile
gem 'pagerduty'
# config/initializers/pagerduty.rb
# Events API v2 client (pagerduty gem 3.0 or later)
PAGERDUTY = Pagerduty.build(
  integration_key: ENV['PAGERDUTY_INTEGRATION_KEY'],
  api_version: 2
)

# Trigger incident
class CriticalErrorNotifier
  def self.notify(exception, context = {})
    PAGERDUTY.trigger(
      summary: exception.message,
      source: Socket.gethostname,
      severity: 'critical',
      custom_details: {
        exception: exception.class.name,
        backtrace: exception.backtrace&.first(5),
        context: context
      }
    )
  end
end

# Usage
begin
  critical_operation
rescue StandardError => e
  CriticalErrorNotifier.notify(e, user_id: current_user.id)
  raise
end
# Gemfile
gem 'slack-notifier'
# config/initializers/slack.rb
SLACK_NOTIFIER = Slack::Notifier.new(
  ENV['SLACK_WEBHOOK_URL'],
  channel: '#alerts',
  username: 'Production Bot'
)

# app/services/slack_alert_service.rb
class SlackAlertService
  def self.error(message, details = {})
    SLACK_NOTIFIER.ping(
      text: message,
      attachments: [{
        color: 'danger',
        fields: details.map { |k, v| { title: k.to_s, value: v.to_s, short: true } },
        footer: Socket.gethostname,
        ts: Time.current.to_i
      }]
    )
  end

  def self.deployment(version)
    SLACK_NOTIFIER.ping(
      text: "Deployed version #{version} to production",
      attachments: [{
        color: 'good',
        footer: "Deployed at #{Time.current}"
      }]
    )
  end
end

# Usage in error handler
rescue_from StandardError do |exception|
  SlackAlertService.error(
    "Unhandled exception: #{exception.message}",
    controller: params[:controller],
    action: params[:action],
    user_id: current_user&.id
  )
  # Re-raise the original exception explicitly so normal error handling still runs
  raise exception
end
Metrics to track:
- Request throughput (req/min)
- Average response time (ms)
- Error rate (%)
- Apdex score
- Database query time
- Cache hit rate
- Background job queue depth
- Active users
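
Of the metrics listed above, the Apdex score is the one with a formula worth spelling out: satisfied requests count fully, tolerating requests (between T and 4T) count half, and frustrated requests count zero. A minimal sketch, assuming response times in milliseconds and a hypothetical `apdex` helper:

```ruby
# Apdex = (satisfied + tolerating / 2) / total
#   satisfied:  response time <= T
#   tolerating: T < response time <= 4T
def apdex(response_times_ms, threshold_ms)
  return nil if response_times_ms.empty?

  satisfied = response_times_ms.count { |t| t <= threshold_ms }
  tolerating = response_times_ms.count { |t| t > threshold_ms && t <= 4 * threshold_ms }
  ((satisfied + tolerating / 2.0) / response_times_ms.size).round(2)
end

samples = [120, 300, 450, 800, 1900, 2500]
score = apdex(samples, 500)
puts score # 3 satisfied, 2 tolerating, 1 frustrated => (3 + 1) / 6 = 0.67
```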
Metrics to track:
- CPU utilization (%)
- Memory usage (%)
- Disk I/O (MB/s)
- Network traffic (MB/s)
- Database connections
- Redis memory usage
- Container restart count
Custom metrics:
- Orders per hour
- Revenue per hour
- User signups per day
- Conversion rate
- Average order value
- Payment success rate
[MCP_TOOLS]
<!--
Populated during skill creation based on:
1. User's installed MCP servers
2. User's selection for this skill

Recommended for this skill (examples):
- APM/Monitoring services (New Relic, Datadog, Sentry)
- Log aggregation services (Papertrail, Logtail)
- Alerting services (PagerDuty, Slack)
- Metrics databases (Prometheus, InfluxDB)

Note: Skills work without MCP servers, but functionality may be limited
-->
# New Relic
# Sign up at newrelic.com
# Add gem and configure
# Datadog
# Sign up at datadoghq.com
# Install agent
# DigitalOcean Monitoring
doctl monitoring alert list
# Papertrail
# The papertrail-cli project is published as the "papertrail" gem
gem install papertrail
papertrail --help
# Logtail
# Web-based interface
# PagerDuty CLI
npm install -g pagerduty-cli
pd incident:list
# Slack
# Use webhooks (no CLI needed)
# 1. Latency - How long requests take
# 2. Traffic - How many requests
# 3. Errors - Rate of failed requests
# 4. Saturation - How full the service is
class MetricsMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # A monotonic clock is unaffected by system clock adjustments
    start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, response = @app.call(env)
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time

    # Latency (StatsD.measure from statsd-instrument, in milliseconds)
    StatsD.measure('http.request.duration', duration * 1000, tags: ["status:#{status}"])
    # Traffic
    StatsD.increment('http.request.count', tags: ["status:#{status}"])
    # Errors
    StatsD.increment('http.request.errors') if status.to_i >= 500
    # Saturation tracked separately via infrastructure metrics

    [status, headers, response]
  end
end
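
A middleware like this is easier to verify when the metrics client is injected rather than referenced globally. The sketch below is a self-contained variant under that assumption: `FakeMetrics` is a hypothetical in-memory stand-in for a StatsD client, and `GoldenSignalsMiddleware` is an illustrative name.

```ruby
# In-memory stand-in for a StatsD client; records calls so they can be inspected.
class FakeMetrics
  attr_reader :counters, :timings

  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
  end

  def increment(name, tags: [])
    @counters[name] += 1
  end

  def measure(name, value_ms, tags: [])
    @timings[name] << value_ms
  end
end

class GoldenSignalsMiddleware
  def initialize(app, metrics)
    @app = app
    @metrics = metrics
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000

    tags = ["status:#{status}"]
    @metrics.measure('http.request.duration', elapsed_ms, tags: tags) # latency
    @metrics.increment('http.request.count', tags: tags)              # traffic
    @metrics.increment('http.request.errors') if status.to_i >= 500   # errors
    [status, headers, body]
  end
end

metrics = FakeMetrics.new
ok_app = ->(_env) { [200, {}, ['ok']] }
failing_app = ->(_env) { [503, {}, ['unavailable']] }

GoldenSignalsMiddleware.new(ok_app, metrics).call({})
GoldenSignalsMiddleware.new(failing_app, metrics).call({})

puts metrics.counters
```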
# Rate - Requests per second
# Errors - Number of failed requests
# Duration - Time per request
# Automatically tracked by most APM tools
# Custom implementation:
class REDMetrics
  # duration is expected in milliseconds
  def self.track(controller, action, duration, status)
    tags = ["controller:#{controller}", "action:#{action}"]
    StatsD.increment('request.rate', tags: tags)
    StatsD.increment('request.errors', tags: tags) if status >= 400
    StatsD.measure('request.duration', duration, tags: tags)
  end
end
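
To make the three RED numbers concrete, the sketch below derives them from a batch of recorded requests. It is illustrative only: `Request` and `red_summary` are hypothetical names, errors are counted as 5xx responses (an assumption that differs from the 4xx threshold used above), and duration is summarized as a nearest-rank 95th percentile.

```ruby
Request = Struct.new(:duration_ms, :status)

# Summarizes Rate, Errors, and Duration for requests observed in one window.
def red_summary(requests, window_seconds)
  durations = requests.map(&:duration_ms).sort
  # Nearest-rank percentile: rank = ceil(0.95 * N), computed with integer math
  p95_rank = (durations.size * 95 + 99) / 100
  p95_index = [p95_rank - 1, 0].max

  {
    rate_per_second: requests.size.fdiv(window_seconds),
    error_ratio: requests.empty? ? 0.0 : requests.count { |r| r.status >= 500 }.fdiv(requests.size),
    p95_ms: durations[p95_index]
  }
end

# 20 requests with durations 10, 20, ..., 200 ms; the two slowest fail
requests = (1..20).map { |i| Request.new(i * 10, i > 18 ? 500 : 200) }
summary = red_summary(requests, 10)
puts summary
```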
# Using OpenTelemetry
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'myapp'
  c.use_all # Auto-instrument Rails, Redis, PostgreSQL, HTTP
end

# Custom spans
def process_order(order)
  tracer = OpenTelemetry.tracer_provider.tracer('order-processor')
  tracer.in_span('process_order') do |span|
    span.set_attribute('order.id', order.id)
    # Attribute values must be strings, numbers, or booleans, so convert decimals
    span.set_attribute('order.total', order.total.to_f)
    # Processing logic
    charge_payment(order)
    send_confirmation(order)
  end
end
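
Distributed tracing works because each service forwards a trace context to the next one. The OpenTelemetry SDK handles this automatically, so the sketch below is only meant to show what travels on the wire: it parses and extends a W3C `traceparent` header. `parse_traceparent` and `child_traceparent` are hypothetical helpers, and the all-zero ID checks required by the specification are omitted for brevity.

```ruby
require 'securerandom'

# traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_PATTERN = /\A00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})\z/

def parse_traceparent(header)
  match = TRACEPARENT_PATTERN.match(header.to_s)
  return nil unless match

  {
    trace_id: match[1],
    parent_id: match[2],
    sampled: (match[3].to_i(16) & 1) == 1 # lowest flag bit marks a sampled trace
  }
end

# Builds the header an outgoing request would carry: same trace, new span ID
def child_traceparent(parent)
  flags = parent[:sampled] ? '01' : '00'
  "00-#{parent[:trace_id]}-#{SecureRandom.hex(8)}-#{flags}"
end

incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
context = parse_traceparent(incoming)
outgoing = child_traceparent(context)
puts outgoing
```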
# 1. Check error logs
tail -f log/production.log | grep ERROR
# 2. Check Sentry/error tracker for patterns
# 3. Verify external service status
# 4. Check recent deployments
gh run list --limit 5
# 5. Review metrics for correlation
# 1. Check APM for slow transactions
# 2. Review database query performance
# 3. Check for N+1 queries
# 4. Verify cache hit rates
# 5. Check external API latency
# 1. Check for memory leaks
# 2. Review background job memory usage
# 3. Check for large object allocations
# 4. Verify garbage collection metrics
# 5. Consider scaling horizontally
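
For the allocation and garbage-collection checks above, Ruby's built-in `GC.stat` is a reasonable starting point. The sketch below compares two snapshots around a block of work. `gc_snapshot` is a hypothetical helper, and the selection of counters is an assumption about what is useful to watch.

```ruby
# Captures a few GC counters that help spot allocation-heavy code paths.
def gc_snapshot
  stats = GC.stat
  {
    gc_runs: stats[:count],
    major_gc_runs: stats[:major_gc_count],
    live_slots: stats[:heap_live_slots],
    total_allocated_objects: stats[:total_allocated_objects]
  }
end

before = gc_snapshot
retained = Array.new(50_000) { |i| "row-#{i}" } # simulate a large in-memory result set
after = gc_snapshot

allocated = after[:total_allocated_objects] - before[:total_allocated_objects]
puts "allocated #{allocated} objects while building #{retained.size} strings"
puts "GC runs during the block: #{after[:gc_runs] - before[:gc_runs]}"
```

A steadily growing `live_slots` value across many snapshots is one signal of a leak, while a high allocation count with stable live slots points to churn instead.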
Remember: Observability is not optional - it's the foundation of reliable systems. Invest early in comprehensive monitoring, logging, and alerting to enable fast detection and resolution of issues.