skills/async-messaging/SKILL.md
Event-driven architecture and asynchronous messaging patterns. Use when: designing pub/sub systems, implementing SQS/SNS/EventBridge workflows, building event-driven microservices, adding dead letter queues, designing message schemas, choosing between synchronous and asynchronous communication, auditing an existing system for messaging gaps, or implementing saga patterns. Covers AWS messaging services, message schema design, ordering guarantees, idempotency, error handling, and event sourcing.
Every message will be delivered at least once. Some will be delivered more than once. Some will arrive out of order. Design consumers to be idempotent and tolerant of duplicates and reordering.
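A minimal sketch of a consumer that tolerates both duplicates and reordering. It assumes each event carries a per-order `sequence` counter set by the producer (a hypothetical field, not part of the envelope described in this document) and keeps state in memory purely for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class OrderProjection:
    """In-memory view of order status; a stand-in for a real datastore."""

    seen_ids: set[str] = field(default_factory=set)
    last_sequence: dict[str, int] = field(default_factory=dict)
    status: dict[str, str] = field(default_factory=dict)

    def apply(self, event: dict) -> bool:
        """Apply an event; return False if it was a duplicate or stale."""
        # Duplicate delivery: the same message ID has been seen before.
        if event["id"] in self.seen_ids:
            return False
        self.seen_ids.add(event["id"])

        order_id = event["order_id"]
        # Reordered delivery: an older event arriving after a newer one.
        # "sequence" is a hypothetical per-order counter set by the producer.
        if event["sequence"] <= self.last_sequence.get(order_id, -1):
            return False

        self.last_sequence[order_id] = event["sequence"]
        self.status[order_id] = event["status"]
        return True
```

With this shape, redelivering an event or delivering an older event late leaves the stored status unchanged.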
| Use async when | Use sync when |
|---------------|--------------|
| Consumer doesn't need immediate response | Caller needs the result to proceed |
| Work can be deferred (emails, reports, analytics) | User is waiting for the response (API request) |
| Producer and consumer scale independently | Both sides are in the same process |
| Retries and DLQ are needed for reliability | Simple request-response suffices |
| Multiple consumers need the same event | Only one consumer exists |
| Spiky workloads need buffering | Load is steady and predictable |
| Service | Pattern | Use for |
|---------|---------|---------|
| SQS | Point-to-point queue | Background jobs, task distribution, buffering |
| SNS | Pub/sub fan-out | Notifying multiple subscribers of an event |
| EventBridge | Event bus with rules | Cross-service events, scheduled triggers, third-party integrations |
| Step Functions | Orchestration | Multi-step workflows with branching, retries, and state |
| Kinesis | Streaming | High-throughput ordered event processing, analytics |
```yaml
# serverless.yml
resources:
  Resources:
    OrderQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: ${self:service}-${sls:stage}-orders
        VisibilityTimeout: 300  # 6x Lambda timeout
        MessageRetentionPeriod: 1209600  # 14 days
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt OrderDLQ.Arn
          maxReceiveCount: 3  # Move to DLQ after 3 failures
    OrderDLQ:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: ${self:service}-${sls:stage}-orders-dlq
        MessageRetentionPeriod: 1209600

functions:
  processOrder:
    handler: handlers/orders.process
    timeout: 50
    events:
      - sqs:
          arn: !GetAtt OrderQueue.Arn
          batchSize: 10
          functionResponseType: ReportBatchItemFailures
```
Key settings:

- VisibilityTimeout: At least 6x the Lambda timeout to prevent duplicate processing during retries.
- maxReceiveCount: 3 is a good default; it retries transient failures without infinite loops.
- ReportBatchItemFailures: Return partial failures so only failed messages retry.

```yaml
resources:
  Resources:
    OrderEventsTopic:
      Type: AWS::SNS::Topic
      Properties:
        TopicName: ${self:service}-${sls:stage}-order-events
    # Each subscriber gets its own queue
    InventoryQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: ${self:service}-${sls:stage}-inventory
    EmailQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: ${self:service}-${sls:stage}-email
    InventorySubscription:
      Type: AWS::SNS::Subscription
      Properties:
        TopicArn: !Ref OrderEventsTopic
        Protocol: sqs
        Endpoint: !GetAtt InventoryQueue.Arn
        FilterPolicy:
          event_type:
            - order.placed
            - order.cancelled
```
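On the producer side, a filter policy on `event_type` only matches if the publisher sets that value as a message attribute, since SNS filter policies apply to message attributes by default. A hedged sketch, assuming a boto3 SNS client is passed in; the function names `build_publish_kwargs` and `publish_event` are illustrative, not part of any library:

```python
import json
from typing import Any


def build_publish_kwargs(topic_arn: str, event: dict[str, Any]) -> dict[str, Any]:
    """Build arguments for SNS Publish with event_type as a message attribute."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(event),
        # The event type is copied from the message body into an attribute
        # so that subscription filter policies can match on it.
        "MessageAttributes": {
            "event_type": {"DataType": "String", "StringValue": event["type"]},
        },
    }


def publish_event(sns_client: Any, topic_arn: str, event: dict[str, Any]) -> None:
    """Publish through a boto3 SNS client, e.g. boto3.client("sns")."""
    sns_client.publish(**build_publish_kwargs(topic_arn, event))
```

Keeping the argument builder separate from the network call makes the attribute mapping easy to unit test without AWS access.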
```yaml
resources:
  Resources:
    OrderPlacedRule:
      Type: AWS::Events::Rule
      Properties:
        EventBusName: ${self:service}-${sls:stage}
        EventPattern:
          source:
            - orders
          detail-type:
            - OrderPlaced
        Targets:
          - Arn: !GetAtt FulfillmentQueue.Arn
            Id: fulfillment
          - Arn: !GetAtt AnalyticsQueue.Arn
            Id: analytics
```
Use EventBridge when: Events cross service boundaries, you need content-based routing, or you want a central event bus.
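A sketch of a producer whose events would match the rule pattern above (source `orders`, detail-type `OrderPlaced`). It assumes a boto3 EventBridge client is passed in; `build_entry` and `put_order_placed` are hypothetical helper names:

```python
import json
from typing import Any


def build_entry(bus_name: str, detail: dict[str, Any]) -> dict[str, str]:
    """Build one PutEvents entry that the OrderPlaced rule pattern matches."""
    return {
        "EventBusName": bus_name,
        "Source": "orders",
        "DetailType": "OrderPlaced",
        "Detail": json.dumps(detail),
    }


def put_order_placed(events_client: Any, bus_name: str, detail: dict[str, Any]) -> None:
    """Send the event through a boto3 client, e.g. boto3.client("events")."""
    response = events_client.put_events(Entries=[build_entry(bus_name, detail)])
    # PutEvents can reject individual entries without raising, so check the count.
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"EventBridge rejected entries: {response['Entries']}")
```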
Every message follows the same envelope:
```json
{
  "id": "msg_01HXYZ123ABC",
  "source": "orders",
  "type": "order.placed",
  "timestamp": "2025-01-15T12:00:00Z",
  "version": "1.0",
  "data": {
    "order_id": "ord_456",
    "customer_id": "cust_789",
    "total_cents": 4999
  },
  "metadata": {
    "correlation_id": "req_abc123",
    "trace_id": "trace_def456"
  }
}
```
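A producer might build this envelope with a small helper. The sketch below is one possible shape, not a prescribed implementation: `build_envelope` is a hypothetical name, and it uses a UUID with a `msg_` prefix where the example above shows a ULID-style ID.

```python
import uuid
from datetime import datetime, timezone
from typing import Any


def build_envelope(
    source: str,
    event_type: str,
    data: dict[str, Any],
    correlation_id: str,
    trace_id: str,
    version: str = "1.0",
) -> dict[str, Any]:
    """Wrap an event payload in the standard envelope."""
    return {
        # Assumption: a random UUID is unique enough to serve as the message ID.
        "id": f"msg_{uuid.uuid4().hex}",
        "source": source,
        "type": event_type,
        # UTC timestamp in the same ISO 8601 form as the example envelope.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "data": data,
        "metadata": {"correlation_id": correlation_id, "trace_id": trace_id},
    }
```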
- id: Globally unique message ID (ULID or UUID). Used for idempotency.
- source: Service that produced the event.
- type: Dot-separated event name (domain.action).
- version: Schema version. Increment on breaking changes.
- data: Event payload. Contains only the facts, not commands or instructions.
- metadata: Correlation and tracing IDs for observability.

Every consumer must handle duplicate messages safely:
```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def process_order(message: dict[str, Any]) -> None:
    """Process an order event idempotently."""
    message_id = message["id"]

    # Check if already processed.
    # is_processed, create_fulfillment, and mark_processed are application
    # helpers defined elsewhere.
    if is_processed(message_id):
        logger.info("Duplicate message, skipping", extra={"message_id": message_id})
        return

    # Process the event
    order = message["data"]
    create_fulfillment(order["order_id"])

    # Mark as processed (with TTL matching message retention)
    mark_processed(message_id, ttl_days=14)
```
| Option | Best for | Notes |
|--------|----------|-------|
| DynamoDB table with TTL | Lambda consumers | Fast, serverless, auto-cleanup |
| Database unique constraint | Services with existing DB | Use INSERT ... ON CONFLICT DO NOTHING |
| Redis SET with TTL | High-throughput consumers | Fast but volatile |
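For the database unique constraint option, a minimal sketch using SQLite from the standard library. The table name `processed_messages` and the helper names are assumptions for illustration, and the `ON CONFLICT ... DO NOTHING` clause needs SQLite 3.24 or newer:

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection and make sure the idempotency table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_messages (message_id TEXT PRIMARY KEY)"
    )
    return conn


def record_once(conn: sqlite3.Connection, message_id: str) -> bool:
    """Return True if this call recorded the ID, False if it was already present."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO processed_messages (message_id) VALUES (?) "
            "ON CONFLICT (message_id) DO NOTHING",
            (message_id,),
        )
    # One row inserted means this is the first time the message was seen.
    return cursor.rowcount == 1
```

Because the insert and the duplicate check are a single statement, two consumers racing on the same message cannot both see it as new.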
```yaml
resources:
  Resources:
    IdempotencyTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: ${self:service}-${sls:stage}-idempotency
        BillingMode: PAY_PER_REQUEST
        AttributeDefinitions:
          - AttributeName: message_id
            AttributeType: S
        KeySchema:
          - AttributeName: message_id
            KeyType: HASH
        TimeToLiveSpecification:
          AttributeName: expires_at
          Enabled: true
```
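One way to use this table is a conditional write that claims the message ID atomically, which avoids the gap between a separate "check" and "mark" step. A sketch assuming a boto3 DynamoDB client is passed in; `claim_message` is a hypothetical helper name:

```python
import time
from typing import Any

RETENTION_SECONDS = 14 * 24 * 60 * 60  # matches the 14-day queue retention


def claim_message(dynamodb_client: Any, table_name: str, message_id: str) -> bool:
    """Atomically record message_id; return False if it was already recorded."""
    expires_at = int(time.time()) + RETENTION_SECONDS
    try:
        dynamodb_client.put_item(
            TableName=table_name,
            Item={
                "message_id": {"S": message_id},
                # DynamoDB TTL expects an epoch timestamp in seconds.
                "expires_at": {"N": str(expires_at)},
            },
            # The write fails if an item with this key already exists.
            ConditionExpression="attribute_not_exists(message_id)",
        )
    except dynamodb_client.exceptions.ConditionalCheckFailedException:
        return False
    return True
```

A consumer would call this before doing any work and skip the message when it returns False. If processing then fails, the claim should be deleted so a retry can proceed; that cleanup is left out of this sketch.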
When processing SQS batches, report individual failures so only failed messages retry:
```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    """SQS Lambda handler with partial batch failure reporting."""
    failed_ids: list[str] = []

    for record in event["Records"]:
        try:
            message = json.loads(record["body"])
            # process_message is the application's own handler, defined elsewhere.
            process_message(message)
        except Exception:
            logger.exception("Failed to process message", extra={
                "message_id": record["messageId"],
            })
            failed_ids.append(record["messageId"])

    return {
        "batchItemFailures": [
            {"itemIdentifier": msg_id} for msg_id in failed_ids
        ],
    }
```
| Service | Ordering | Use when |
|---------|----------|----------|
| SQS Standard | Best-effort (no guarantee) | Ordering doesn't matter |
| SQS FIFO | Strict within message group | Order-sensitive within an entity (per-customer, per-order) |
| Kinesis | Strict within shard (partition key) | High-throughput ordered streams |
| EventBridge | No ordering guarantee | Event routing, not sequencing |
```python
import json
import os
from typing import Any

import boto3

sqs = boto3.client("sqs")

# Assumption: the FIFO queue URL is supplied through an environment variable.
QUEUE_URL = os.environ["QUEUE_URL"]


def publish_order_event(order_id: str, event: dict[str, Any]) -> None:
    """Publish to FIFO queue with order-level ordering."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(event),
        MessageGroupId=order_id,  # All events for this order are ordered
        MessageDeduplicationId=event["id"],  # Prevent duplicates within 5-minute window
    )
```
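On the consuming side, partial batch failure reporting interacts with FIFO ordering: AWS guidance is to stop at the first failure and report that record plus every record after it, so later messages are not processed ahead of the one that failed. A sketch under that assumption; `handle_fifo_batch` and its `process` callback are hypothetical names:

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def handle_fifo_batch(
    event: dict[str, Any],
    process: Callable[[dict[str, Any]], None],
) -> dict[str, Any]:
    """Process records in order; on the first failure, fail the rest of the batch."""
    records = event["Records"]
    for index, record in enumerate(records):
        try:
            process(json.loads(record["body"]))
        except Exception:
            logger.exception(
                "Failed to process message", extra={"message_id": record["messageId"]}
            )
            # Report the failed record and every unprocessed record after it,
            # so the queue redelivers them in their original order.
            return {
                "batchItemFailures": [
                    {"itemIdentifier": r["messageId"]} for r in records[index:]
                ],
            }
    return {"batchItemFailures": []}
```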
Set alarms on DLQ depth:
```yaml
resources:
  Resources:
    DLQAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-${sls:stage}-dlq-depth
        MetricName: ApproximateNumberOfMessagesVisible
        Namespace: AWS/SQS
        Dimensions:
          - Name: QueueName
            Value: !GetAtt OrderDLQ.QueueName
        Statistic: Sum
        Period: 300
        EvaluationPeriods: 1
        Threshold: 1
        ComparisonOperator: GreaterThanOrEqualToThreshold
        AlarmActions:
          - !Ref AlertTopic
```
After fixing the root cause, replay DLQ messages back to the main queue:
```bash
# AWS CLI: move messages from DLQ back to main queue
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789:orders-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789:orders
```
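The same redrive can be started from Python. This sketch assumes a reasonably recent boto3 release that includes `start_message_move_task` and takes the client as a parameter; `redrive_dlq` is a hypothetical helper name. Capping the rate is optional, but it can keep a large replay from overwhelming the consumer.

```python
from typing import Any, Optional


def redrive_dlq(
    sqs_client: Any,
    dlq_arn: str,
    destination_arn: Optional[str] = None,
    max_per_second: Optional[int] = None,
) -> str:
    """Start moving messages out of a DLQ and return the task handle."""
    kwargs: dict[str, Any] = {"SourceArn": dlq_arn}
    if destination_arn is not None:
        # When omitted, SQS sends messages back to their original source queue.
        kwargs["DestinationArn"] = destination_arn
    if max_per_second is not None:
        kwargs["MaxNumberOfMessagesPerSecond"] = max_per_second
    response = sqs_client.start_message_move_task(**kwargs)
    return response["TaskHandle"]
```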
When a workflow spans multiple services, use a saga to coordinate:
Each service emits an event on completion. The next service reacts to it.
```
OrderPlaced → PaymentProcessed → InventoryReserved → ShipmentCreated
                    ↓ (failure)         ↓ (failure)
              OrderCancelled     PaymentRefunded
```
Pros: Decoupled, no central coordinator. Cons: Hard to trace, compensating actions spread across services.
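The choreography flow can be sketched with an in-memory event bus standing in for SNS or EventBridge. Everything here is illustrative: the `EventBus` class, the service functions, and the `card_declined` and `out_of_stock` flags are hypothetical, chosen only to trigger the two failure branches in the diagram.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a real pub/sub service."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Handler]] = defaultdict(list)
        self.log: list[str] = []  # event types in the order they were published

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_services(bus: EventBus) -> None:
    """Each service reacts to one event and emits the next; no coordinator."""

    def payment_service(order: dict) -> None:
        if order.get("card_declined"):
            bus.publish("OrderCancelled", order)
        else:
            bus.publish("PaymentProcessed", order)

    def inventory_service(order: dict) -> None:
        if order.get("out_of_stock"):
            bus.publish("PaymentRefunded", order)  # compensating event
        else:
            bus.publish("InventoryReserved", order)

    def shipping_service(order: dict) -> None:
        bus.publish("ShipmentCreated", order)

    bus.subscribe("OrderPlaced", payment_service)
    bus.subscribe("PaymentProcessed", inventory_service)
    bus.subscribe("InventoryReserved", shipping_service)
```

Publishing `OrderPlaced` drives the whole chain. The order of `bus.log` is the only place the end-to-end flow is visible, which illustrates why choreography is hard to trace.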
A central coordinator drives the workflow:
```json
{
  "StartAt": "ProcessPayment",
  "States": {
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:process-payment",
      "Next": "ReserveInventory",
      "Catch": [{
        "ErrorEquals": ["PaymentFailed"],
        "Next": "CancelOrder"
      }]
    },
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:reserve-inventory",
      "Next": "CreateShipment",
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "RefundPayment"
      }]
    },
    "CreateShipment": { "Type": "Succeed" },
    "RefundPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123:function:refund-payment",
      "Next": "CancelOrder"
    },
    "CancelOrder": { "Type": "Fail" }
  }
}
```
Pros: Centralized visibility, clear failure handling. Cons: Single coordinator is a dependency.
Use orchestration for: Multi-step workflows with branching, retries, and compensating actions.

Use choreography for: Simple fan-out where each consumer acts independently.
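The core of orchestration with compensating actions can be sketched in plain Python: run steps in order, and on failure undo the completed steps in reverse. This is a stand-in for what a coordinator such as Step Functions provides, not a replacement for it; `run_saga` and the step tuple shape are hypothetical.

```python
from typing import Callable

# (step name, action, compensating action)
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list[Step], context: dict) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns a trace of what happened, which is the centralized visibility
    an orchestrator gives you.
    """
    trace: list[str] = []
    completed: list[tuple[str, Callable[[dict], None]]] = []
    for name, action, compensate in steps:
        try:
            action(context)
        except Exception:
            trace.append(f"{name}:failed")
            # Undo the steps that already succeeded, most recent first.
            for done_name, done_compensate in reversed(completed):
                done_compensate(context)
                trace.append(f"{done_name}:compensated")
            return trace
        trace.append(f"{name}:ok")
        completed.append((name, compensate))
    return trace
```

With a payment step that succeeds and an inventory step that raises, the trace shows the payment being compensated, mirroring the ReserveInventory to RefundPayment path in the state machine.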
| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| No DLQ on any queue | Failed messages silently lost after max retries | Always attach a DLQ; alarm on depth > 0 |
| Processing messages without idempotency | Duplicates cause double-charges, double-sends | Use message ID deduplication with idempotency table |
| Synchronous HTTP in event handler | Coupling, latency, cascading failures | Use message queues between services |
| Unbounded batch size | Lambda timeout on large batches | Set batchSize ≤ 10 for SQS; tune for workload |
| No visibility timeout tuning | Messages re-appear while still processing | Set visibility timeout to 6x Lambda timeout |
| Tight coupling via message content | Consumer breaks when producer changes payload | Use schema versioning and tolerate unknown fields |
| No monitoring on DLQ | Failed messages accumulate unnoticed | Alarm on ApproximateNumberOfMessagesVisible > 0 |
| Processing order-dependent events on standard queue | Race conditions, inconsistent state | Use FIFO queue with message group ID per entity |
| No correlation ID in events | Cannot trace requests across services | Include correlation_id in message metadata |
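For the schema versioning fix, a sketch of a consumer that reads only the fields it needs, ignores unknown ones, and rejects major versions it does not understand. It assumes `version` follows a major.minor form where only a major bump is breaking; the `OrderPlaced` dataclass and `parse_order_placed` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any

SUPPORTED_MAJOR_VERSIONS = {"1"}


@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    total_cents: int


def parse_order_placed(message: dict[str, Any]) -> OrderPlaced:
    """Parse an order.placed envelope, tolerating fields added by the producer."""
    # Assumption: versions look like "1.0" and only the major part is breaking.
    major = str(message.get("version", "")).split(".")[0]
    if major not in SUPPORTED_MAJOR_VERSIONS:
        raise ValueError(f"Unsupported schema version: {message.get('version')!r}")

    data = message["data"]
    # Read only the fields this consumer needs; extra fields are ignored.
    return OrderPlaced(
        order_id=data["order_id"],
        customer_id=data["customer_id"],
        total_cents=int(data["total_cents"]),
    )
```

Raising on an unsupported version lets the message fail visibly and land in the DLQ, which is preferable to being misread silently.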
When auditing an existing system for messaging patterns:
ReportBatchItemFailures for SQS+Lambda)