.claude/skills/ecs-troubleshooting/SKILL.md
ECS troubleshooting and debugging guide covering task failures, service issues, networking problems, and performance diagnostics. Use when diagnosing ECS issues, debugging task failures (STOPPED, PENDING), resolving networking problems, investigating IAM/permissions errors, troubleshooting container health checks, or analyzing ECS service health.
Complete guide to diagnosing and resolving common ECS issues.
# Check service status
aws ecs describe-services \
--cluster production \
--services my-service \
--query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'
# List stopped tasks (failures)
aws ecs list-tasks \
--cluster production \
--service-name my-service \
--desired-status STOPPED
# Describe stopped task
aws ecs describe-tasks \
--cluster production \
--tasks <task-arn> \
--query 'tasks[0].{status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'
# View recent logs
aws logs tail /ecs/my-app --since 1h --follow
# Execute into container (debug)
aws ecs execute-command \
--cluster production \
--task <task-id> \
--container my-app \
--interactive \
--command "/bin/sh"
Tasks immediately stop after starting or fail to start.
import boto3

ecs = boto3.client('ecs')

def diagnose_stopped_task(cluster: str, task_arn: str):
    """Diagnose why a task stopped"""
    response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
    task = response['tasks'][0]
    print(f"Task Status: {task['lastStatus']}")
    print(f"Stop Code: {task.get('stopCode', 'N/A')}")
    print(f"Stopped Reason: {task.get('stoppedReason', 'N/A')}")
    for container in task['containers']:
        print(f"\nContainer: {container['name']}")
        print(f"  Status: {container['lastStatus']}")
        print(f"  Exit Code: {container.get('exitCode', 'N/A')}")
        print(f"  Reason: {container.get('reason', 'N/A')}")
1. Essential container failed
stoppedReason: "Essential container in task exited"
Solution: Check container logs for application errors
aws logs tail /ecs/my-app --since 30m
2. Task failed to start
stoppedReason: "Task failed to start"
Solution: Check execution role permissions
# Verify execution role can pull image
aws iam get-role-policy --role-name ecsTaskExecutionRole --policy-name ecr-access
3. CannotPullContainerError
reason: "CannotPullContainerError: Error response from daemon"
Solutions:
aws ecr describe-images --repository-name my-app
4. OutOfMemoryError
reason: "OutOfMemoryError: Container killed due to memory usage"
exitCode: 137
Solution: Increase memory in task definition
memory = 2048 # Increase from current value
5. Exit Code 1 (Application Error)
exitCode: 1
Solution: Check application logs for errors
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "ERROR"
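The exit codes listed above follow the usual convention that a value above 128 means the process was ended by a signal (137 = 128 + 9, SIGKILL). A minimal sketch of decoding them from one `describe_tasks` entry is shown here. The helper names `explain_exit_code` and `summarize_containers` and the small signal table are illustrative assumptions, not part of any AWS API. SIGKILL often accompanies an OOM kill but does not prove one.

```python
from typing import List, Optional

# Signal numbers commonly seen in container exit codes (exit code = 128 + signal).
# This table is a hand-picked assumption for illustration, not an exhaustive list.
SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(exit_code: Optional[int]) -> str:
    """Return a short interpretation of a container exit code."""
    if exit_code is None:
        return "no exit code recorded (container may never have started)"
    if exit_code == 0:
        return "clean exit"
    if exit_code > 128:
        signal_number = exit_code - 128
        name = SIGNALS.get(signal_number, f"signal {signal_number}")
        return f"terminated by {name}"
    return "application error; check the container logs"


def summarize_containers(task: dict) -> List[str]:
    """Summarize every container in one entry of a describe_tasks response."""
    lines = []
    for container in task.get("containers", []):
        detail = explain_exit_code(container.get("exitCode"))
        reason = container.get("reason")
        if reason:
            detail += f" ({reason})"
        lines.append(f"{container['name']}: {detail}")
    return lines
```

The functions take plain dictionaries, so they can be fed `response['tasks'][0]` from boto3 or a saved JSON dump from the CLI.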
Tasks stuck in PENDING state, not transitioning to RUNNING.
def diagnose_pending_tasks(cluster: str, service: str):
    """Check why tasks are stuck in PENDING"""
    # List tasks the service wants running; stuck tasks still report lastStatus PENDING
    pending = ecs.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus='RUNNING'
    )
    for task_arn in pending['taskArns']:
        task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])['tasks'][0]
        if task['lastStatus'] == 'PENDING':
            print(f"Task {task_arn.split('/')[-1]} is PENDING")
            # Check attachments for ENI issues
            for attachment in task.get('attachments', []):
                print(f"  Attachment: {attachment['type']} - {attachment['status']}")
                for detail in attachment.get('details', []):
                    print(f"    {detail['name']}: {detail['value']}")
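A task that has only just been launched is also PENDING, so it can help to separate tasks that are genuinely stuck from ones that are still starting. The sketch below filters `describe_tasks` entries by how long they have been PENDING. The function name `long_pending_tasks` and the threshold are assumptions for illustration. boto3 returns `createdAt` as a timezone-aware `datetime`, so `now` must be timezone-aware too.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def long_pending_tasks(tasks: List[dict], now: datetime, threshold: timedelta) -> List[str]:
    """Return ARNs of tasks that have been PENDING for longer than threshold."""
    stuck = []
    for task in tasks:
        if task.get("lastStatus") != "PENDING":
            continue
        created_at = task.get("createdAt")
        # Skip entries without a creation time rather than guessing their age.
        if created_at is not None and now - created_at > threshold:
            stuck.append(task["taskArn"])
    return stuck
```

A call might look like `long_pending_tasks(response['tasks'], datetime.now(timezone.utc), timedelta(minutes=10))`; the ten-minute cutoff is arbitrary and should be tuned to the image size and startup time.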
1. No available capacity
Service my-service was unable to place a task because no container instance met all of its requirements
Solutions for Fargate:
2. ENI provisioning issues
Attachment status: PRECREATED
Solutions:
3. Image pull taking too long
Container image: pulling
Solutions:
# Check service events
aws ecs describe-services \
--cluster production \
--services my-service \
--query 'services[0].events[:10]'
1. "service my-service is unable to place a task"
Check task placement constraints and capacity.
2. "service my-service has reached a steady state"
Service is healthy - tasks are running as expected.
3. "service my-service was unable to place a task because no container instance met all requirements"
For Fargate: Check CPU/memory configurations are valid combinations.
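The Fargate CPU/memory check can be sketched as a lookup table. The values below are the task-size combinations published in the Fargate documentation at the time of writing (CPU in CPU units, memory in MiB); treat them as an assumption and verify against the current documentation. The function name `is_valid_fargate_size` is illustrative.

```python
# Supported memory values (MiB) for each Fargate CPU value (CPU units).
# Assumed from the published Fargate task-size table; confirm before relying on it.
FARGATE_SIZES = {
    256: (512, 1024, 2048),
    512: tuple(range(1024, 4096 + 1, 1024)),
    1024: tuple(range(2048, 8192 + 1, 1024)),
    2048: tuple(range(4096, 16384 + 1, 1024)),
    4096: tuple(range(8192, 30720 + 1, 1024)),
    8192: tuple(range(16384, 61440 + 1, 4096)),
    16384: tuple(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if the cpu/memory pair appears in the table above."""
    return memory in FARGATE_SIZES.get(cpu, ())
```

For example, `is_valid_fargate_size(512, 1024)` is accepted while `is_valid_fargate_size(256, 4096)` is rejected, because 0.25 vCPU tops out at 2 GB in this table.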
Deployment never reaches COMPLETED state.
def check_deployment_status(cluster: str, service: str):
    """Check deployment progress"""
    response = ecs.describe_services(cluster=cluster, services=[service])
    svc = response['services'][0]
    for deployment in svc['deployments']:
        # rolloutState is only returned for rolling-update (ECS) deployments
        rollout_state = deployment.get('rolloutState', 'N/A')
        print(f"\nDeployment: {deployment['id']}")
        print(f"  Status: {deployment['status']}")
        print(f"  Rollout State: {rollout_state}")
        print(f"  Tasks: {deployment['runningCount']}/{deployment['desiredCount']}")
        if rollout_state == 'IN_PROGRESS':
            reason = deployment.get('rolloutStateReason', '')
            print(f"  Reason: {reason}")
1. Health check failures
rolloutStateReason: "ECS deployment circuit breaker: tasks failed to start"
Solutions:
Increase healthCheckGracePeriodSeconds
2. Insufficient capacity
rolloutStateReason: "Service my-service was unable to place a task"
Solutions:
Increase maximumPercent to allow more headroom
For private subnets:
# Option 1: NAT Gateway
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}
# Option 2: VPC Endpoints (recommended)
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type = "Interface"
  subnet_ids        = aws_subnet.private[*].id
}
Service-to-service communication fails.
# Check security group rules
aws ec2 describe-security-groups \
--group-ids sg-12345 \
--query 'SecurityGroups[0].IpPermissions'
# Allow traffic between ECS tasks
resource "aws_security_group_rule" "ecs_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.ecs_tasks.id
}
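To confirm that a rule like this is actually in place, the `IpPermissions` list returned by `describe-security-groups` can be checked in code. The sketch below only looks at security-group references (`UserIdGroupPairs`), not CIDR ranges or prefix lists, and the function name `allows_from_group` is an illustrative assumption.

```python
from typing import List


def allows_from_group(ip_permissions: List[dict], port: int,
                      source_group_id: str, protocol: str = "tcp") -> bool:
    """Return True if any ingress rule admits `port` from `source_group_id`."""
    for rule in ip_permissions:
        rule_protocol = rule.get("IpProtocol")
        # "-1" means all protocols and all ports.
        if rule_protocol not in (protocol, "-1"):
            continue
        if rule_protocol != "-1":
            if not rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535):
                continue
        pairs = rule.get("UserIdGroupPairs", [])
        if any(pair.get("GroupId") == source_group_id for pair in pairs):
            return True
    return False
```

Pass it `response['SecurityGroups'][0]['IpPermissions']` from boto3's `describe_security_groups`, or the JSON printed by the CLI query above.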
Target group app-tg: 0 healthy, 3 unhealthy
# Check target health
aws elbv2 describe-target-health \
--target-group-arn <target-group-arn>
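The output of `describe-target-health` can be long when many tasks are registered. A minimal sketch that condenses it into per-state counts and a list of failing targets is shown here. The function name `summarize_target_health` is an assumption, and the dictionary keys follow the response shape of the ELBv2 `DescribeTargetHealth` call.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_target_health(response: dict) -> Tuple[Dict[str, int], List[str]]:
    """Count targets per health state and list the ones that are not healthy."""
    states: Counter = Counter()
    problems = []
    for description in response.get("TargetHealthDescriptions", []):
        health = description.get("TargetHealth", {})
        state = health.get("State", "unknown")
        states[state] += 1
        if state != "healthy":
            target = description.get("Target", {})
            # Reason is a code such as Target.Timeout when the load balancer reports one.
            line = f"{target.get('Id')}:{target.get('Port')} {state} {health.get('Reason', '')}"
            problems.append(line.strip())
    return dict(states), problems
```

The reason codes in the second return value usually point at which of the causes below applies, for instance a timeout versus an unexpected response code.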
1. Wrong health check path
health_check {
  path = "/health" # Must match application endpoint
}
2. Container not listening on expected port
# Verify inside container
aws ecs execute-command --cluster production --task <task-id> \
--container my-app --interactive --command "netstat -tlnp"
3. Security group blocking ALB
# Allow ALB to reach ECS tasks
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.ecs_tasks.id
  source_security_group_id = aws_security_group.alb.id
}
CannotPullContainerError: Error response from daemon: pull access denied
resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
# For cross-account ECR
resource "aws_iam_role_policy" "cross_account_ecr" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ]
      Resource = "arn:aws:ecr:*:OTHER_ACCOUNT:repository/*"
    }]
  })
}
ResourceInitializationError: unable to pull secrets
resource "aws_iam_role_policy" "secrets_access" {
  role = aws_iam_role.ecs_task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:my-app/*"
      },
      {
        Effect   = "Allow"
        Action   = ["ssm:GetParameters"]
        Resource = "arn:aws:ssm:*:*:parameter/my-app/*"
      },
      {
        Effect   = "Allow"
        Action   = ["kms:Decrypt"]
        Resource = aws_kms_key.secrets.arn
      }
    ]
  })
}
SessionManagerPlugin is not found
or
Execute command is disabled
1. Enable execute command on service
resource "aws_ecs_service" "app" {
  # Other required service arguments omitted
  enable_execute_command = true
}
2. Add SSM permissions to task role
resource "aws_iam_role_policy" "ssm_exec" {
  role = aws_iam_role.ecs_task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ]
      Resource = "*"
    }]
  })
}
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def get_service_metrics(cluster: str, service: str):
    """Print CPU utilization for the last hour in 5-minute periods"""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/ECS',
        MetricName='CPUUtilization',
        Dimensions=[
            {'Name': 'ClusterName', 'Value': cluster},
            {'Name': 'ServiceName', 'Value': service}
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=['Average', 'Maximum']
    )
    # Datapoints are not returned in time order, so sort before printing
    for point in sorted(response['Datapoints'], key=lambda x: x['Timestamp']):
        print(f"{point['Timestamp']}: Avg={point['Average']:.1f}%, Max={point['Maximum']:.1f}%")
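When deciding whether a service needs more resources, it can be easier to look only at the periods that ran hot. The sketch below picks those out of the `Datapoints` list returned by `get_metric_statistics`. The function name `high_utilization_periods` and the 80% default are assumptions for illustration, not AWS recommendations.

```python
from typing import List


def high_utilization_periods(datapoints: List[dict], threshold: float = 80.0) -> list:
    """Return the timestamps, in time order, whose Maximum meets the threshold."""
    ordered = sorted(datapoints, key=lambda point: point["Timestamp"])
    return [point["Timestamp"] for point in ordered if point["Maximum"] >= threshold]
```

Sustained hits across most periods suggest right-sizing the task, while isolated spikes are usually better handled by auto-scaling.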
1. Right-size tasks
# Increase resources
cpu = "1024" # from 512
memory = "2048" # from 1024
2. Enable auto-scaling
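resource "aws_appautoscaling_policy" "cpu" {
  # name, policy_type, resource_id, scalable_dimension, and service_namespace omitted
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}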
1. Large container image
2. Slow application startup
Increase startPeriod in health check
3. Slow secret/config loading
# Find errors in last hour (GNU date syntax)
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--start-time $(date -d '-1 hour' +%s000) \
--filter-pattern "ERROR"
# Find OOM kills
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "OutOfMemory"
# Find slow requests
aws logs filter-log-events \
--log-group-name /ecs/my-app \
--filter-pattern "[timestamp, level, duration>1000, ...]"
-- Top errors by count
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 10
-- Average response time
fields @timestamp, responseTime
| stats avg(responseTime) as avgTime, max(responseTime) as maxTime by bin(5m)
| Symptom | First Check | Common Cause |
|---------|-------------|--------------|
| Task STOPPED | stoppedReason | Container crash, OOM |
| Task PENDING | Attachments | ENI/network issues |
| Deployment stuck | Health checks | ALB health check failing |
| Cannot pull image | Execution role | Missing ECR permissions |
| Cannot connect | Security groups | Wrong SG rules |