ai/skills/investigate-datadog/SKILL.md
# Investigate Production/Staging Issues Using Datadog API

## Overview

This skill guides investigation of application behavior, performance issues, errors, and metrics in **production or staging environments** using the Datadog API. Use this to analyze logs, metrics, APM traces, slow queries, and correlate events to understand system behavior.

## When to Use This Skill

Activate this skill when investigating production or staging issues:

- Errors, timeouts, or exceptions in production/staging
CRITICAL: All Datadog API interactions MUST be delegated to subagents to avoid context window overload.
The main agent's role is to:
Do NOT attempt to query Datadog directly from the main agent context.
Required environment variables:
- `$DATADOG_API_KEY`: Read-only Datadog API key
- `$DATADOG_API_KEY_ID`: Datadog Application/API key ID

Important: Always use `.fetch` to load these variables to catch configuration errors early:
api_key = ENV.fetch("DATADOG_API_KEY")
app_key = ENV.fetch("DATADOG_API_KEY_ID")
This will raise a clear error if variables are missing, rather than silently returning nil.
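When both variables are needed, it can help to report every missing one in a single error. The following is a minimal sketch of that idea; `load_datadog_credentials` and `REQUIRED_DATADOG_VARS` are hypothetical names, not part of the codebase or the Datadog gem.

```ruby
# Hypothetical helper: loads both Datadog credentials and lists every
# missing variable in one KeyError instead of failing on the first one.
REQUIRED_DATADOG_VARS = %w[DATADOG_API_KEY DATADOG_API_KEY_ID].freeze

def load_datadog_credentials(env = ENV)
  missing = REQUIRED_DATADOG_VARS.reject { |name| env.key?(name) }
  unless missing.empty?
    raise KeyError, "Missing environment variables: #{missing.join(', ')}"
  end

  {
    api_key: env.fetch("DATADOG_API_KEY"),
    application_key: env.fetch("DATADOG_API_KEY_ID")
  }
end
```

Passing a plain hash instead of `ENV` makes the helper easy to exercise without touching the real environment.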
The datadog_api_client gem provides programmatic access:
require "datadog_api_client"
# Configure the API client (always use .fetch)
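# Build the configuration explicitly so `config` holds the Configuration
# object itself rather than the return value of a block
config = DatadogAPIClient::Configuration.new
config.api_key = ENV.fetch("DATADOG_API_KEY")
config.application_key = ENV.fetch("DATADOG_API_KEY_ID")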
api_client = DatadogAPIClient::APIClient.new(config)
# Available APIs
logs_api = DatadogAPIClient::V2::LogsAPI.new(api_client)
metrics_api = DatadogAPIClient::V2::MetricsAPI.new(api_client)
Datadog APM tracks application performance including:
When investigating performance issues, check:
The application uses Glogger for structured logging with hash-based key/value pairs:
# Common patterns in the codebase
Glogger.info("Processing batch", { batch_id: batch.id, size: batch.size })
Glogger.error("Failed to create program", { organization_id: id, errors: errors.full_messages })
Glogger.warn("Bad request exception", { error: e.message, controller: controller_name })
Metrics are sent via lib/metrics/* classes:
# Example patterns from the codebase
DatadogMetric.timing("payout.completion_time", milliseconds, tags: ["vendor:stripe"])
DatadogMetric.gauge("sidekiq.queue_latency", latency_seconds, tags: ["queue:default"])
Common metric namespaces:
- `payout.*` - Payout processing metrics
- `sidekiq.*` - Background job metrics
- `postgresql.*` - Database metrics
- `trace.*` - APM traces

IMPORTANT: Only ask clarifying questions if information is NOT already provided in context.
Analyze the user's prompt for:
Time Range: Extract or infer intelligently
Scope: Determine from context
Known Context: Extract from prompt
DO NOT ask for confirmation about what to query. Use code analysis to determine what's logged and proceed with investigation.
Before querying Datadog, understand what the application logs:
- Glogger calls in those classes and related code
- Metric calls (`DatadogMetric.timing`, `.gauge`, etc.)

This analysis informs precise Datadog queries without guessing.
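The code analysis step can be partly automated. The sketch below scans Ruby source for Glogger calls and collects their level and message. The helper names are hypothetical, the regular expression only covers calls whose message is a double-quoted string literal, and `debug` is an assumed level that does not appear in the examples above.

```ruby
# Hypothetical scanner: extracts the level and message of Glogger calls from
# Ruby source text so queries can target messages that are actually logged.
GLOGGER_CALL = /Glogger\.(info|warn|error|debug)\(\s*"([^"]+)"/

def glogger_calls(source)
  source.scan(GLOGGER_CALL).map do |level, message|
    { level: level, message: message }
  end
end

# Maps each file path to the Glogger calls found in it, skipping files
# that contain none.
def glogger_calls_in(paths)
  paths.each_with_object({}) do |path, found|
    calls = glogger_calls(File.read(path))
    found[path] = calls unless calls.empty?
  end
end
```

A call such as `glogger_calls_in(Dir.glob("app/**/*.rb"))` would cover a conventional Rails layout; the glob is an assumption about where the relevant classes live.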
Each subagent should receive:
Example subagent prompt (this is ONE example - subagents should use creativity):
Use the Datadog API to investigate timeout errors with these parameters:
**Primary Query**:
`@controller:"Api::V2::PayoutsController" @action:"create" @error:*timeout* env:production`
**Time Range**:
Start: 2025-12-08 12:00:00 UTC
End: 2025-12-08 18:00:00 UTC
**Required Output**:
- Total error count and timeline (hourly buckets)
- Error message patterns with frequencies
- Affected organization IDs (if logged)
- Sample logs (3-5 representative examples with full context)
**Additional Context to Gather** (use your judgment):
- Check for correlated database metrics (connections, locks)
- Look for slow query logs in APM around same timeframe
- Check if error rate correlates with traffic spikes
- Identify any patterns in affected customers
Return structured JSON with your findings and any hypotheses formed.
Environment variables:
- Use ENV.fetch("DATADOG_API_KEY")
- Use ENV.fetch("DATADOG_API_KEY_ID")
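The query and time range in a prompt like the one above map onto a log search request. As a sketch, the body below follows the shape of Datadog's Logs Search endpoint (`POST /api/v2/logs/events/search`); the helper name `logs_search_body` and the default page limit are assumptions, and sending the request is left out.

```ruby
require "json"
require "time"

# Hypothetical helper: builds the JSON body for a log search from a query
# string and a time range. `getutc` is used so the caller's Time objects
# are not modified.
def logs_search_body(query:, from:, to:, limit: 100)
  {
    filter: {
      query: query,
      from: from.getutc.iso8601,
      to: to.getutc.iso8601
    },
    sort: "timestamp",
    page: { limit: limit }
  }
end

body = logs_search_body(
  query: '@controller:"Api::V2::PayoutsController" @action:"create" @error:*timeout* env:production',
  from: Time.utc(2025, 12, 8, 12),
  to: Time.utc(2025, 12, 8, 18)
)
puts JSON.pretty_generate(body)
```

Keeping the body construction separate from the HTTP call lets a subagent log or inspect exactly what it is about to send.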
As data comes in, actively work to validate or invalidate hypotheses:
Hypothesis Formation:
Hypothesis Testing:
- `postgresql.locks.count` metric during timeout window

Invalidation Attempts:
Isolation of Variables:
Present findings as investigation direction, not necessarily complete root cause:
## Investigation Results: [Issue Description]
### Summary
[High-level summary of findings]
### Timeline
[Key events in chronological order with evidence]
### Hypotheses Tested
✓ **Validated**: [Hypothesis supported by data]
- Evidence: [Specific log/metric data]
✗ **Invalidated**: [Hypothesis disproven]
- Counter-evidence: [Why this doesn't hold]
? **Unclear**: [Needs more investigation]
- What's needed: [Additional data sources required]
### Evidence
- **Logs**: [Patterns found, sample counts]
- **Metrics**: [Trends, spikes, correlations]
- **APM/Slow Queries**: [Performance data]
- **Correlations**: [Relationships discovered]
### Investigation Direction
[What to look at next - may not be complete root cause]
1. [Specific area to investigate further]
2. [Alternative hypothesis to test]
3. [External systems to check (if AI can't access)]
### Recommendations
[Actionable items based on findings]
1. Query logs for timeout errors in time range
2. Check APM traces for affected endpoints
3. Look for slow queries in Datadog APM
- Database query execution time
- Look for N+1 patterns
- Check for missing indexes
4. Check database metrics (connections, locks, CPU)
5. Correlate with deployment times
6. Test hypothesis: Does removing time period X invalidate the pattern?
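Two of the steps above, building an error timeline and correlating it with deployments, can be sketched in plain Ruby. Both helpers are hypothetical and assume the timestamps have already been parsed into `Time` objects; the six-hour default window is an arbitrary choice.

```ruby
# Hypothetical helper: counts events per UTC hour, returned in
# chronological order.
def hourly_buckets(timestamps)
  timestamps
    .group_by do |time|
      utc = time.getutc
      Time.utc(utc.year, utc.month, utc.day, utc.hour)
    end
    .transform_values(&:count)
    .sort
    .to_h
end

# Hypothetical helper: returns the deployments that happened within
# `window_hours` before the start of a spike.
def deployments_before(spike_start, deployments, window_hours: 6)
  earliest = spike_start - (window_hours * 3600)
  deployments.select do |deployed_at|
    deployed_at >= earliest && deployed_at <= spike_start
  end
end
```

An empty result from `deployments_before` is evidence against the deployment hypothesis, not proof; configuration or dependency changes can land without a deploy.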
1. Query error logs in time window
2. Group by error type and frequency
3. Identify affected resources (customers, endpoints)
4. Check for code changes (git history)
5. Look for correlated metrics (traffic, resource usage)
6. Test: Does error pattern exist before the spike? After?
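Grouping errors by type and comparing the periods before and after a spike could look like the sketch below. The log shape (`{ timestamp: Time, error: String }`) and all helper names are assumptions; normalising digit runs is just one simple way to make messages that differ only by IDs or durations fall into the same group.

```ruby
# Replaces digit runs with a placeholder so messages that differ only by
# numbers are counted as the same pattern.
def error_pattern(message)
  message.gsub(/\d+/, "<n>")
end

# Returns a hash of pattern => count, most frequent first.
def error_counts(logs)
  logs
    .group_by { |log| error_pattern(log[:error]) }
    .transform_values(&:count)
    .sort_by { |_pattern, count| -count }
    .to_h
end

# Splits logs at the spike start and counts patterns on each side.
def counts_around(logs, spike_start)
  before, after = logs.partition { |log| log[:timestamp] < spike_start }
  { before: error_counts(before), after: error_counts(after) }
end
```

A pattern that appears only in `:after` points to something new; one that is present in both but far more frequent afterwards points to an existing problem being amplified.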
1. Query slow query logs in APM
2. Identify problematic queries
3. Check for missing indexes via execution plans
4. Look for N+1 query patterns in logs
5. Check database metrics (CPU, memory, connections)
6. Correlate with application changes
# Field matching
@field:"value" # Exact match
@field:*partial* # Contains
@field:>100 # Greater than
@field:[100 TO 200] # Range
-@field:value # NOT
# Logical operators
@field:value1 OR @field:value2 # OR
@field:value AND @other:value # AND
# Common fields
env:production # Environment (production/staging)
env:staging # Staging environment
status:error # Status level
@controller:"ControllerName" # Rails controller
@action:"action_name" # Rails action
@error:*timeout* # Error patterns
@duration:>1000 # Duration in ms
# Structured fields (from Glogger)
@organization_id:123
@batch_id:"abc"
@customer_id:456
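Queries in this syntax can be assembled from structured input rather than by string concatenation at each call site. This is a minimal sketch with hypothetical helper names: it quotes values that contain characters such as `::`, leaves wildcard patterns and simple words unquoted, and joins terms with spaces, which the search syntax treats as AND.

```ruby
# Formats a single value for a log search term.
def format_query_value(value)
  return value.to_s if value.is_a?(Numeric)

  text = value.to_s
  # Wildcard patterns and simple words are left unquoted.
  return text if text.include?("*") || text.match?(/\A[\w.-]+\z/)

  escaped = text.gsub('"') { '\"' }
  %("#{escaped}")
end

# Builds a query from attribute fields (prefixed with "@") and tags.
def log_query(attributes: {}, tags: {})
  terms = attributes.map { |key, value| "@#{key}:#{format_query_value(value)}" }
  terms += tags.map { |key, value| "#{key}:#{format_query_value(value)}" }
  terms.join(" ")
end

query = log_query(
  attributes: { controller: "Api::V2::PayoutsController", action: "create", error: "*timeout*" },
  tags: { env: "production" }
)
puts query
# @controller:"Api::V2::PayoutsController" @action:create @error:*timeout* env:production
```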
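# Query time series data
# (V1 MetricsAPI#query_metrics takes from, to, and query as positional arguments)
v1_metrics_api = DatadogAPIClient::V1::MetricsAPI.new(api_client)
v1_metrics_api.query_metrics(
  start_time.to_i,
  end_time.to_i,
  "avg:metric.name{tag_key:tag_value}"
)
# Aggregations: avg, sum, min, max, count
# Filtering by tag: {tag_key:tag_value}; grouping by tag: by {tag_key}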
All subagents MUST return:
DO NOT return:
Hypothesis: "Timeouts caused by database lock contention"
Test:
- Query: `postgresql.locks.count` during timeout window
- Result: No spike in lock count during timeouts
- Conclusion: ✗ Hypothesis invalidated
Alternative hypothesis:
- Check connection pool exhaustion
- Query: `postgresql.connections.used`
- Result: Connections at 95% capacity during timeouts
- Conclusion: ✓ Connection exhaustion more likely cause
Hypothesis: "Error spike caused by recent deployment"
Test:
- Check git log for deployments before spike
- Result: No deployments in 3 days before spike
- Conclusion: ✗ Deployment not the cause
Alternative:
- Check for external dependency changes
- Query logs for external API errors
- Result: Third-party API response time degraded
- Conclusion: ✓ External dependency issue confirmed
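Hypothesis tests like the ones above come down to comparing a metric inside the incident windows against its baseline outside them. The sketch below shows that comparison on already-fetched data; the point format (`[Time, value]` pairs), the window format (ranges of `Time`), and the helper names are all assumptions.

```ruby
# Returns the arithmetic mean, or nil when there are no values.
def mean(values)
  return nil if values.empty?

  values.sum.to_f / values.size
end

# Splits metric points into those inside any incident window and those
# outside, then returns the mean of each group.
def compare_windows(points, windows)
  inside, outside = points.partition do |time, _value|
    windows.any? { |window| window.cover?(time) }
  end

  { inside: mean(inside.map(&:last)), outside: mean(outside.map(&:last)) }
end
```

Similar means inside and outside the windows argue against the hypothesis; a large gap supports it but does not rule out a shared underlying cause.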
# If you see nil errors, check configuration:
begin
api_key = ENV.fetch("DATADOG_API_KEY")
rescue KeyError => e
puts "ERROR: #{e.message}"
puts "Please set DATADOG_API_KEY environment variable"
exit 1
end
- `@field:*`
- `console/datadog.rb` in codebase
- `lib/metrics/*.rb` in codebase
- Glogger calls in relevant classes