.claude/skills/error-investigation/SKILL.md
AWS error investigation with multi-layer verification, CloudWatch analysis, and Lambda logging patterns. Use when debugging AWS service failures, investigating production errors, or troubleshooting Lambda functions.
Tech Stack: AWS CLI, CloudWatch Logs, Lambda, boto3, jq
Source: Extracted from CLAUDE.md error investigation principles and AWS diagnostic patterns.
Use the error-investigation skill when:
DO NOT use this skill for:
What's failing?
├─ Lambda function?
│ ├─ Returns 200 but errors? → Check CloudWatch logs (Layer 3)
│ ├─ Timeout? → Check duration metrics + external dependencies
│ ├─ Permission denied? → Check IAM role policies
│ └─ Cold start slow? → Module-level initialization pattern
│
├─ AWS service operation?
│ ├─ DynamoDB write succeeded (200) but no data? → Check rowcount
│ ├─ S3 upload succeeded but file missing? → Check bucket policy
│ ├─ SQS message sent but not received? → Check DLQ
│ └─ Step Function succeeded but workflow incomplete? → Check state outputs
│
├─ External API call?
│ ├─ Timeout? → Check network path (security groups, VPC)
│ ├─ 403 Forbidden? → Check API key, rate limits
│ ├─ 500 Error? → Check API status page, retry logic
│ └─ Silent failure? → Inspect response payload
│
└─ Database query?
  ├─ INSERT affected 0 rows? → FK constraint, ENUM mismatch
  ├─ SELECT returns empty? → Check WHERE clause, data exists
  ├─ Connection timeout? → Security group, VPC routing
  └─ Query slow? → Missing index, full table scan
Escalation Trigger:
- /trace shows root cause
- /validate shows success

Tools Used:
- /trace - Find root cause (backward trace from error)
- /validate - Verify fix works (test the solution)
- /consolidate - Update knowledge base (documentation, runbooks)
- /observe - Monitor for recurring issues (drift detection)
- /reflect - Assess whether the error represents a pattern or a one-off

Why This Works: Error investigation fits the retrying loop (find root cause, fix execution), but recurring errors trigger the synchronize loop (update knowledge/documentation).
See Thinking Process Architecture - Feedback Loops for structural overview.
From CLAUDE.md:
"Execution completion ≠ Operational success. Verify actual outcomes across multiple layers, not just the absence of exceptions."
Why This Matters:
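import json
import time
import boto3

lambda_client = boto3.client('lambda')
logs_client = boto3.client('logs')  # CloudWatch Logs client

# ❌ WRONG: Assumes 200 = success
response = lambda_client.invoke(FunctionName='worker', Payload='{}')
assert response['StatusCode'] == 200  # ✗ Weak validation

# ✅ RIGHT: Multi-layer verification
started_ms = int(time.time() * 1000)
response = lambda_client.invoke(FunctionName='worker', Payload='{}')
# Layer 1: Status code
assert response['StatusCode'] == 200
assert 'FunctionError' not in response  # present when the function itself raised
# Layer 2: Response payload
payload = json.loads(response['Payload'].read())
assert 'errorMessage' not in payload
# Layer 3: CloudWatch logs (only events since this invocation; delivery can lag a few seconds)
logs = logs_client.filter_log_events(
    logGroupName='/aws/lambda/worker',
    startTime=started_ms,
    filterPattern='ERROR'
)
assert len(logs['events']) == 0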
Note: This is the AWS-specific application of Progressive Evidence Strengthening (CLAUDE.md Principle #2). The general pattern applies across all domains—here we show how it manifests in AWS Lambda/API debugging.
The Three Layers:
| Layer | Signal Strength | What It Tells You | What It DOESN'T Tell You |
|-------|-----------------|-------------------|--------------------------|
| Status Code | Weakest | Service responded | Whether it succeeded |
| Response Payload | Stronger | Function returned data | Whether logs show errors |
| CloudWatch Logs | Strongest | What actually happened | Future issues |
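The three layers in the table can be combined into one check that reports which layer failed, rather than stopping at the first assertion. The following is a minimal sketch, not part of the original skill: the function name `verify_invocation` and its inputs (a status code, the raw payload text, and a list of already-fetched ERROR log lines) are assumptions made so the logic can run without AWS access.

```python
import json


def verify_invocation(status_code: int, payload_text: str, error_log_lines: list[str]) -> list[str]:
    """Return the failed layers; an empty list means all three layers passed."""
    failures = []

    # Layer 1: status code (weakest signal)
    if status_code != 200:
        failures.append(f"layer1: status {status_code}")

    # Layer 2: response payload
    try:
        payload = json.loads(payload_text)
    except json.JSONDecodeError:
        payload = None
        failures.append("layer2: payload is not valid JSON")
    if isinstance(payload, dict) and "errorMessage" in payload:
        failures.append(f"layer2: {payload['errorMessage']}")

    # Layer 3: ERROR lines found in the logs (strongest signal)
    if error_log_lines:
        failures.append(f"layer3: {len(error_log_lines)} ERROR log line(s)")

    return failures
```

A caller would pass in the values obtained from `invoke` and `filter_log_events`; a 200 response with a clean payload can still fail at layer 3.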
Pattern:
# Layer 1: Status code (weakest)
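# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload
aws lambda invoke --function-name worker \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' /tmp/response.json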
echo "Exit code: $?" # 0 = AWS CLI succeeded
# Layer 2: Response payload (stronger)
if grep -q "errorMessage" /tmp/response.json; then
echo "❌ Lambda returned error"
exit 1
fi
# Layer 3: CloudWatch logs (strongest)
ERROR_COUNT=$(aws logs filter-log-events \
--log-group-name /aws/lambda/worker \
--start-time $(($(date +%s) - 120))000 \
--filter-pattern "ERROR" \
--query 'length(events)' --output text)
if [ "$ERROR_COUNT" -gt 0 ]; then
echo "❌ Found errors in CloudWatch logs"
exit 1
fi
echo "✅ All 3 layers verified"
See AWS-DIAGNOSTICS.md for AWS-specific diagnostic patterns.
From CLAUDE.md:
"Log levels are not just severity indicators—they determine whether failures are discoverable by monitoring systems."
Log Level Impact:
| Log Level | Monitored? | Alerted? | Discoverable? |
|-----------|------------|----------|---------------|
| ERROR | ✅ Yes | ✅ Yes | ✅ Dashboards |
| WARNING | ✅ Yes | ❌ No | ⚠️ Manual review |
| INFO | ⚠️ Maybe | ❌ No | ❌ Active search |
| DEBUG | ❌ No | ❌ No | ❌ Hidden |
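The effect of the table can be reproduced locally with the standard `logging` module. In this sketch, `CollectingHandler` is a hypothetical stand-in for an alerting pipeline that only receives ERROR and above; it is not how CloudWatch alarms are implemented, only an illustration of why a failure logged at WARNING never reaches such a pipeline.

```python
import logging


class CollectingHandler(logging.Handler):
    """Hypothetical stand-in for an alerting pipeline that only sees ERROR and above."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def alerted_messages() -> list[str]:
    """Log the same failure at two levels and return what the 'alerting' handler saw."""
    logger = logging.getLogger("alert_demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo isolated from the root logger
    handler = CollectingHandler()
    logger.addHandler(handler)
    try:
        logger.warning("INSERT failed")                 # below the handler threshold
        logger.error("INSERT failed - 0 rows affected")  # reaches the handler
    finally:
        logger.removeHandler(handler)
    return handler.messages
```

Only the ERROR-level message is collected; the WARNING-level one is dropped before the handler runs.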
Investigation Pattern:
# Step 1: Check ERROR level first
aws logs filter-log-events \
--log-group-name /aws/lambda/worker \
--filter-pattern "ERROR"
# Step 2: If no ERRORs but operation failed → Check WARNING
aws logs filter-log-events \
--log-group-name /aws/lambda/worker \
--filter-pattern "WARNING"
# Step 3: Check both application AND service logs
# - Application logs: /aws/lambda/worker
# - Service logs: Lambda execution errors, timeouts
Why This Matters:
# ❌ BAD: Error logged at WARNING (invisible to monitoring)
try:
    result = db.execute(query, params)
    if result == 0:
        logger.warning("INSERT failed")  # ⚠️ Not monitored!
except Exception as e:
    logger.warning(f"DB error: {e}")  # ⚠️ Not alerted!
# ✅ GOOD: Error logged at ERROR (visible to monitoring)
try:
    result = db.execute(query, params)
    if result == 0:
        logger.error("INSERT failed - 0 rows affected")  # ✅ Monitored
        raise ValueError("Insert operation failed")
except Exception as e:
    logger.error(f"DB error: {e}")  # ✅ Alerted
    raise
From CLAUDE.md:
"AWS Lambda pre-configures logging before your code runs. Never use logging.basicConfig() in Lambda handlers—it's a no-op."
The Problem:
# ❌ This does NOTHING in Lambda
import logging
logging.basicConfig(level=logging.INFO) # No-op!
logger = logging.getLogger(__name__)
logger.info("Invisible in CloudWatch") # Filtered out
Why It Fails:
- basicConfig() only works if root logger has NO handlers

The Solution:
# ✅ Works in both Lambda and local dev
import logging
root_logger = logging.getLogger()
if root_logger.handlers:  # Lambda (already configured)
    root_logger.setLevel(logging.INFO)
else:  # Local dev (needs configuration)
    logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("Visible in CloudWatch") # ✅ Works
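The no-op behavior described above can be reproduced outside Lambda. The sketch below simulates Lambda's pre-installed handler with a `logging.NullHandler` (an assumption for illustration; the real runtime attaches its own handler) and checks that `basicConfig()` leaves the root level unchanged. The function name is hypothetical.

```python
import logging


def basic_config_ignored_when_handlers_exist() -> bool:
    """Return True if basicConfig() left the root logger level untouched."""
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    try:
        # Simulate Lambda: a handler is already attached before user code runs.
        root.handlers = [logging.NullHandler()]
        root.setLevel(logging.WARNING)
        logging.basicConfig(level=logging.DEBUG)  # ignored: handlers already exist
        return root.level == logging.WARNING
    finally:
        # Restore whatever configuration the process had before the demo.
        root.handlers = saved_handlers
        root.setLevel(saved_level)
```

Because the root level stays at WARNING, `logger.info()` calls are filtered out, which matches the symptom of INFO logs missing from CloudWatch.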
See LAMBDA-LOGGING.md for comprehensive Lambda logging patterns.
Symptom: Function completes, returns 200, but errors in logs.
Investigation Steps:
# 1. Invoke function
aws lambda invoke \
  --function-name worker \
  --cli-binary-format raw-in-base64-out \
  --payload '{"ticker": "NVDA19"}' \
  /tmp/response.json
# 2. Check response (Layer 2)
cat /tmp/response.json
# Output: {"result": {...}} # Looks fine
# 3. Check CloudWatch logs (Layer 3)
aws logs tail /aws/lambda/worker --since 1m --filter-pattern "ERROR"
# Output:
# [ERROR] 2024-01-15 10:23:45 INSERT affected 0 rows for NVDA19
# [ERROR] 2024-01-15 10:23:46 FK constraint violation: symbol not found
Root Cause: Silent database failure. The INSERT affected 0 rows and the error was logged at ERROR, but the exception was caught and the function still reported success.
Fix:
# Before:
def store_report(self, symbol, report):
    try:
        self.db.execute(query, params)
        return True  # ❌ Always returns True
    except Exception as e:
        logger.error(f"DB error: {e}")
        return True  # ❌ Still returns True!
# After:
def store_report(self, symbol, report):
    rowcount = self.db.execute(query, params)
    if rowcount == 0:
        logger.error(f"INSERT affected 0 rows for {symbol}")
        return False  # ✅ Returns False on failure
    return True
Symptom: logger.info() calls not appearing in CloudWatch.
Investigation Steps:
# 1. Check current log level
aws logs filter-log-events \
--log-group-name /aws/lambda/worker \
--start-time $(($(date +%s) - 300))000 \
--filter-pattern "INFO"
# No results (but INFO logs exist in code)
# 2. Check root logger configuration
# Add to Lambda handler:
import logging
print(f"Root logger level: {logging.getLogger().level}")
print(f"Root logger handlers: {logging.getLogger().handlers}")
Root Cause: Root logger set to WARNING, filters out INFO.
Fix:
# handler.py (entry point)
import logging
# Configure logging at module level
root_logger = logging.getLogger()
if root_logger.handlers:  # Lambda environment
    root_logger.setLevel(logging.INFO)  # ✅ Set root logger level
else:  # Local development
    logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def lambda_handler(event, context):
    logger.info("Handler invoked")  # Now visible
    # ...
See LAMBDA-LOGGING.md#troubleshooting for complete debugging guide.
Symptom: Lambda times out after long execution (600s+), logs show "PDF generation..." but no completion message.
Investigation Steps:
# 1. Check execution duration pattern
aws logs filter-log-events \
--log-group-name /aws/lambda/pdf-worker \
--filter-pattern "Duration:" \
--query 'events[*].message' \
| grep -o "Duration: [0-9]*" \
| sort -k2,2n  # sort on the numeric field (Lambda reports durations in ms)
# Look for pattern:
# - First 5 requests: Duration: 2-3s
# - Last 5 requests: Duration: 600s+ (timeout)
# 2. Check for connection timeout errors
aws logs filter-log-events \
--log-group-name /aws/lambda/pdf-worker \
--filter-pattern "ConnectTimeoutError" \
--query 'events[*].message'
# Output:
# botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:
# "https://bucket.s3.region.amazonaws.com/..."
# 3. Analyze timeline (deterministic vs random)
aws logs tail /aws/lambda/pdf-worker --since 30m | \
grep -E "START RequestId|✅ PDF job completed|ConnectTimeoutError" | \
awk '{print $1, $2, $NF}' | sort
# Deterministic pattern (first N succeed, last M fail) = infrastructure bottleneck
# Random pattern (scattered failures) = performance issue
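The deterministic-versus-random distinction in step 3 can be stated precisely. The sketch below assumes the invocation outcomes have already been extracted from the logs and ordered by start time; the function name and the three labels are hypothetical.

```python
def classify_failure_pattern(outcomes: list[bool]) -> str:
    """Classify outcomes ordered by start time (True = success, False = failure)."""
    if all(outcomes):
        return "no failures"
    first_failure = outcomes.index(False)
    # Deterministic: once failures begin, no later invocation succeeds,
    # which points at an infrastructure bottleneck rather than slow code.
    if not any(outcomes[first_failure:]):
        return "deterministic"
    # Otherwise successes and failures are interleaved.
    return "scattered"
```

For example, five successes followed by five timeouts classifies as deterministic, while failures interleaved with successes classify as scattered.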
Root Cause Analysis:
# 4. Check VPC configuration
aws ec2 describe-vpc-endpoints \
--filters "Name=vpc-id,Values=vpc-xxx" \
"Name=service-name,Values=com.amazonaws.region.s3"
# If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)
# 5. Verify NAT Gateway routing
aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-xxx" \
--query 'RouteTables[*].Routes[?GatewayId!=`local`]'
# If route 0.0.0.0/0 → nat-xxx → NAT Gateway saturated with concurrent connections
Root Cause: NAT Gateway connection saturation. When N concurrent Lambdas upload to S3:
Fix:
# terraform/s3_vpc_endpoint.tf
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.default.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = data.aws_route_tables.vpc_route_tables.ids
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
    }]
  })
}
Why This Works:
Verification:
# 1. Deploy VPC endpoint
cd terraform && terraform apply
# 2. Verify endpoint created
terraform output s3_vpc_endpoint_state # Should be "available"
# 3. Test full workflow
aws stepfunctions start-execution \
--state-machine-arn <pdf-workflow-arn> \
--input '{"report_date":"2026-01-05"}'
# 4. Monitor for 100% success rate
aws logs tail /aws/lambda/pdf-worker --follow
# Expected: All PDFs complete in 2-3s, no timeouts
Critical Insight: Execution Time ≠ Hang Location
Pattern Recognition:
See Bug Hunt Report for complete investigation.
Symptom: put_item() returns 200, but item not in table.
Investigation Steps:
# 1. Check response
response = table.put_item(Item={'ticker': 'NVDA19', 'data': {...}})
print(f"HTTP Status: {response['ResponseMetadata']['HTTPStatusCode']}")
# Output: 200
# 2. Verify item exists (strongly consistent read, so a fresh write is visible)
response = table.get_item(Key={'ticker': 'NVDA19'}, ConsistentRead=True)
print(response.get('Item'))
# Output: None (no item!)
# 3. Check for conditional write
response = table.put_item(
    Item={'ticker': 'NVDA19', 'data': {...}},
    ConditionExpression='attribute_not_exists(ticker)'  # ← Condition failed?
)
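Root Cause: The conditional write failed. boto3 raises ConditionalCheckFailedException in that case, so the failure only looks silent when the calling code catches and swallows the exception.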
Fix:
# Before:
response = table.put_item(Item=item)  # ❌ No verification
# After:
import botocore.exceptions

try:
    response = table.put_item(Item=item)
    # Verify write with a strongly consistent read
    verify = table.get_item(Key={'ticker': item['ticker']}, ConsistentRead=True)
    if 'Item' not in verify:
        logger.error(f"Item not found after put_item: {item['ticker']}")
        raise ValueError("DynamoDB write verification failed")
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
        logger.warning(f"Conditional write failed: {item['ticker']}")
    else:
        logger.error(f"DynamoDB error: {e}")
    raise  # re-raise so the failure is never swallowed
When to apply: Distributed system errors (Lambda, Aurora, S3, SQS, Step Functions)
Problem: Code looks correct locally but fails in AWS due to unverified execution boundaries
Common boundary-related error patterns:
# Error: KeyError: 'AURORA_HOST'
# Symptom: Lambda invocation fails immediately
# Root cause: Boundary violation (code → runtime)
# Code expects: os.environ['AURORA_HOST']
# Runtime provides: No such variable
# Verification:
aws lambda get-function-configuration \
--function-name [PROJECT_NAME]-worker-dev \
--query 'Environment.Variables'
# Compare with: Code's os.environ accesses
grep "os.environ" src/lambda_handler.py
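The manual comparison between the `grep` output and the configured variables can be automated. The sketch below is an assumption-laden helper, not part of the skill: `missing_env_vars` is a hypothetical name, it only recognizes literal `os.environ['X']` and `os.environ.get('X')` accesses (not `os.getenv` or dynamically built names), and `configured` stands for the `Environment.Variables` map returned by `get-function-configuration`.

```python
import re

# Matches os.environ['NAME'] and os.environ.get("NAME", ...) with a literal name.
ENV_ACCESS = re.compile(r"""os\.environ(?:\.get\(|\[)\s*['"]([A-Z0-9_]+)['"]""")


def missing_env_vars(source: str, configured: dict[str, str]) -> list[str]:
    """Return variable names the code reads that the runtime does not provide."""
    required = set(ENV_ACCESS.findall(source))
    return sorted(required - set(configured))
```

Any name in the returned list is a candidate for the `KeyError` boundary violation shown above.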
# Error: Unknown column 'pdf_s3_key' in 'field list'
# Symptom: INSERT query fails in production
# Root cause: Boundary violation (code → database)
# Code sends: INSERT INTO reports (symbol, pdf_s3_key)
# Aurora has: No pdf_s3_key column
# Verification:
mysql> SHOW COLUMNS FROM precomputed_reports;
# Compare with: Code's INSERT statements
grep "INSERT INTO" src/data/aurora/precompute_service.py
# Error: Task timed out after 30.00 seconds
# Symptom: Lambda stops mid-execution
# Root cause: Configuration mismatch (code requirements vs entity config)
# Code requires: 60s API call + 45s processing = 105s total
# Lambda configured: 30s timeout
# Verification:
aws lambda get-function-configuration \
--function-name [PROJECT_NAME]-worker-dev \
--query '{Timeout:Timeout, Memory:MemorySize}'
# Analyze code execution time:
grep "requests.get.*timeout" src/ -r # External API timeouts
# Sum: timeout values + processing overhead
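The "sum the timeouts" step is simple arithmetic, sketched here with the figures from the example above (a 60s API call plus 45s of processing against a 30s Lambda timeout). The function name and the optional `margin_s` parameter are hypothetical.

```python
def timeout_shortfall(configured_timeout_s: float,
                      step_durations_s: list[float],
                      margin_s: float = 0.0) -> float:
    """Seconds by which worst-case run time exceeds the configured timeout (0.0 if it fits)."""
    required = sum(step_durations_s) + margin_s
    return max(0.0, required - configured_timeout_s)
```

With the example values, `timeout_shortfall(30, [60, 45])` reports a 75-second gap, so the function is guaranteed to be killed mid-execution on its worst-case path.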
# Error: AccessDeniedException: User is not authorized to perform: s3:PutObject
# Symptom: S3 upload fails
# Root cause: Permission boundary violation (principal → resource)
# Code tries: s3.put_object(Bucket='reports', Key='file.pdf')
# IAM role allows: Only s3:GetObject (read-only)
# Verification:
aws iam get-role-policy \
--role-name [PROJECT_NAME]-worker-role-dev \
--policy-name S3Access
# Compare with: Code's boto3 operations
grep "s3.*put_object\|s3.*upload" src/ -r
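Once the policy document has been fetched, a rough first pass can check whether any Allow statement covers the action the code needs. This sketch is deliberately simplified and is not IAM's real evaluation logic: it ignores `Deny` statements, `Resource`, `Condition`, `NotAction`, and permission boundaries. The function names are hypothetical; `aws iam simulate-principal-policy` remains the authoritative check.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single value or a list for Statement and Action."""
    return value if isinstance(value, list) else [value]


def allows_action(policy_document: dict, action: str) -> bool:
    """Rough check: True if any Allow statement's Action pattern matches `action`."""
    for statement in _as_list(policy_document.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in _as_list(statement.get("Action", [])):
            # Action names are case-insensitive and may contain * and ? wildcards.
            if fnmatchcase(action.lower(), pattern.lower()):
                return True
    return False
```

A read-only policy that lists only `s3:GetObject` returns False for `s3:PutObject`, matching the permission boundary violation described above.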
# Error: API Gateway timeout after 30 seconds
# Symptom: Client sees timeout, Lambda still processing
# Root cause: Usage doesn't match intention (sync Lambda used for async work)
# Entity designed for: Synchronous API (< 30s response)
# Code uses it for: Long-running report generation (60s)
# Verification:
# Check Terraform comments
grep -B 5 -A 10 "api-handler" terraform/lambdas.tf
# Check Lambda timeout
aws lambda get-function-configuration \
  --function-name api-handler \
  --query 'Timeout'
# Compare: API Gateway 30s limit vs Lambda timeout
Boundary verification workflow for AWS errors:
1. Identify error type → Map to boundary category
   - Missing env var → Process boundary (code → runtime)
   - Schema mismatch → Data boundary (code → database)
   - Timeout → Configuration boundary (requirements → entity config)
   - Permission denied → Permission boundary (principal → resource)
   - API Gateway timeout → Intention boundary (usage → design)
2. Identify physical entities involved
   - WHICH Lambda (name, ARN)
   - WHICH Aurora cluster (endpoint, database)
   - WHICH S3 bucket (name, region)
   - WHICH IAM role (name, policies)
3. Verify contract at boundary
   - Code expectations → Infrastructure reality
   - Use the AWS CLI to inspect actual configuration
   - Compare code requirements vs entity properties
4. Apply Progressive Evidence Strengthening
   - Layer 1 (Surface): Error message
   - Layer 2 (Content): CloudWatch logs
   - Layer 3 (Observability): AWS resource configuration
   - Layer 4 (Ground Truth): Test actual execution
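Step 1 of the workflow (mapping an error to a boundary category) can be sketched as a small rule table. The patterns below are assumptions derived only from the five example error messages in this section; real error text varies, so the rule list and the `classify_boundary` name are illustrative rather than exhaustive.

```python
import re

# Order matters: the API Gateway rule must come before the generic Lambda timeout rule.
BOUNDARY_RULES = [
    (re.compile(r"KeyError"), "process boundary (code → runtime)"),
    (re.compile(r"Unknown column"), "data boundary (code → database)"),
    (re.compile(r"API Gateway timeout"), "intention boundary (usage → design)"),
    (re.compile(r"Task timed out"), "configuration boundary (requirements → entity config)"),
    (re.compile(r"AccessDenied|not authorized"), "permission boundary (principal → resource)"),
]


def classify_boundary(error_message: str) -> str:
    """Return the first boundary category whose pattern matches the error message."""
    for pattern, boundary in BOUNDARY_RULES:
        if pattern.search(error_message):
            return boundary
    return "unknown"
```

An unmatched message returns "unknown", which is a cue to read the logs (Layer 2) before guessing at a boundary.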
Integration with investigation workflow:
See: Execution Boundary Checklist for systematic AWS boundary verification
Related:
# Check all three layers
aws lambda invoke --function-name worker \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' /tmp/response.json
# Layer 1: Exit code
echo "Exit code: $?"
# Layer 2: Response payload
jq . /tmp/response.json
# Layer 3: CloudWatch logs
aws logs tail /aws/lambda/worker --since 5m --filter-pattern "ERROR"
Questions:
# Get full error details
aws logs filter-log-events \
--log-group-name /aws/lambda/worker \
--start-time $(($(date +%s) - 3600))000 \
--filter-pattern "ERROR" \
--query 'events[*].[timestamp,message]' \
--output table
# Get surrounding context (±5 lines) within the first matching, possibly multi-line, message
aws logs filter-log-events \
--log-group-name /aws/lambda/worker \
--filter-pattern "ERROR" \
| jq -r '.events[0].message' \
| grep -C 5 "ERROR"
# When did errors start?
aws logs filter-log-events \
--log-group-name /aws/lambda/worker \
--filter-pattern "ERROR" \
--query 'events[0].timestamp' \
--output text
# What deployed around that time?
gh run list --limit 10
# What changed in code?
git log --since="2 hours ago" --oneline
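The `events[0].timestamp` value returned by CloudWatch Logs is in milliseconds since the Unix epoch, which is awkward to compare against deploy and commit times. A minimal conversion sketch (the function name is hypothetical):

```python
from datetime import datetime, timezone


def event_time_utc(timestamp_ms: int) -> str:
    """Convert a CloudWatch Logs event timestamp (epoch milliseconds) to ISO 8601 UTC."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.isoformat(timespec="seconds")
```

The resulting UTC string can be lined up directly with the timestamps shown by `gh run list` and `git log`.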
See AWS-DIAGNOSTICS.md for service-specific diagnostic patterns.
| Symptom | Likely Cause | Investigation |
|---------|--------------|---------------|
| 200 OK but errors in logs | Silent failure | Check rowcount, verify writes |
| INFO logs not showing | Root logger level = WARNING | Set root logger to INFO |
| Timeout | Cold start, external API slow | Check duration metrics |
| Permission denied | IAM policy missing | Simulate permissions |
| 0 rows affected | FK constraint, ENUM mismatch | Check constraints |
.claude/skills/error-investigation/
├── SKILL.md # This file (entry point)
├── AWS-DIAGNOSTICS.md # AWS-specific diagnostic patterns
└── LAMBDA-LOGGING.md # Lambda logging configuration guide