skills/background-job-queue-design/SKILL.md
Use when designing a background-job system, choosing between BullMQ / Sidekiq / RQ / Temporal / SQS, deciding queue-vs-workflow, sizing concurrency vs rate limits, building dead-letter queues, or making handlers idempotent. Triggers: jobs running twice on retry, lost jobs after worker crash, DLQ filling up, Redis OOM from job backlog, exactly-once requested, "do we need Temporal?", visibility timeout / lockDuration confusion, exponential backoff vs jitter, fan-out fan-in workflows. NOT for outbound webhook publishing (different concerns), receiver-side webhook handling (different concerns), event-streaming/Kafka topology, or in-process async (event loop only).
npx skillsauth add curiositech/windags-skills background-job-queue-designInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A background-job system has three failure modes you have to design against from day one: lost jobs, duplicate execution, and runaway retries. Pick the wrong primitive and you'll spend the next year fixing all three. The recurring industry consensus, across BullMQ, Sidekiq, SQS, and Temporal docs alike, is at-least-once is the only honest delivery semantic — design every handler to be idempotent and stop chasing exactly-once. (Caduh — Queues 101)
Jump to your fire:
setTimeout / cron / homegrown queues to a real broker.| You have | Reach for | Reason |
|---|---|---|
| Short, side-effecting jobs (send email, resize image) on Node | BullMQ | Redis-native, mature DLQ, flow patterns, ~~5k jobs/sec/node typical |
| Same on Ruby | Sidekiq (Pro for reliability) | Same shape; Pro super_fetch adds reliable fetch via LMOVE (Sidekiq wiki) |
| Multi-step workflows w/ external API fan-out, hours-to-days durations | Temporal | Durable execution; replays from event history (Temporal docs) |
| AWS-native, simple fanout, don't want to run Redis | SQS + Lambda | Managed at-least-once; visibility timeout + DLQ first-class (AWS SQS DLQ) |
| Cross-region, multi-tenant isolation | Temporal or SQS | Single Redis won't survive multi-region cleanly |
| Need exactly-once across DB + email + payments | None — design for at-least-once + idempotent handlers | "Exactly-once" inside the broker doesn't extend across systems (Caduh) |
The hidden criterion: how complex is the recovery story? A simple BullMQ queue with retries is fine for "send a receipt." A 12-step purchase workflow that pauses for 24 hours waiting for a webhook is a Temporal workflow, not 12 chained queue jobs.
The mandatory property. Every popular queue's docs lead with this. From the BullMQ idempotent-jobs page: "it should not make a difference to the final state of the system if a job successfully completes on its first attempt, or if it fails initially and succeeds when retried." (BullMQ idempotent-jobs)
Practical patterns:
// 1. Idempotency key derived from the business event, not the queue.
// Use the upstream event_id if you have one (Stripe event.id, Shopify hook id).
const job = await queue.add(
'send-receipt',
{ orderId, amount },
{
jobId: `send-receipt:${orderId}`, // BullMQ dedups by jobId in the queue.
attempts: 5,
backoff: { type: 'exponential', delay: 1000 },
removeOnComplete: { age: 3600, count: 1000 },
removeOnFail: { age: 86400 },
}
);
// 2. Inside the handler, dedupe at the side-effect site too.
async function processSendReceipt(job) {
const { orderId, amount } = job.data;
// If we already sent, do nothing.
const sent = await db.queryOne(
'INSERT INTO email_sends (idempotency_key, order_id) VALUES ($1, $2) ON CONFLICT DO NOTHING RETURNING id',
[`send-receipt:${orderId}`, orderId]
);
if (!sent) return { skipped: 'already-sent' };
await emailProvider.send({ to: ..., subject: ..., body: render(amount) });
}
The jobId dedupes adds while a duplicate is still queued. The DB unique constraint dedupes across worker crashes, retries, and replays. Both are required; jobId alone is insufficient because BullMQ removes completed jobs and the dedup window evaporates.
When a worker pulls a job and crashes, what happens?
| System | Default behavior | Recovery |
|---|---|---|
| BullMQ | Job has a lockDuration (default 30s). Stalled-job checker re-queues after expiry (BullMQ production) | Workers MUST extend the lock or finish within lockDuration; otherwise it runs twice |
| Sidekiq OSS | Pop is non-atomic; if worker dies between pop and process, job is lost | Upgrade to Sidekiq Pro super_fetch (Redis 6.2+ LMOVE to a private working queue) |
| Sidekiq Pro super_fetch | LMOVE to per-process working list. Heartbeat expires at 60s; orphan check sweeps and re-enqueues (Sidekiq wiki) | Built-in |
| SQS | Visibility timeout (default 30s, max 12h). Message reappears after timeout if not deleted | Set timeout > p99 handler latency; extend with ChangeMessageVisibility for long jobs |
| Temporal | Activities have heartbeat + retry policies. Workflow history survives worker death | Replay from history; you don't think about this |
The trap: lockDuration / visibility timeout shorter than handler p99 → job runs twice. Longer than acceptable recovery time → crashed work waits too long. Measure your p99 first, set the timeout to ~3x that, and have the handler heartbeat / extend the lock for genuinely-long work.
Not every error should retry forever. From SQS docs: "Set the maximum receives... and the redrive policy to send messages to the DLQ once threshold is exceeded." (AWS SQS DLQ)
// Classify before throwing. Permanent errors should NOT retry.
class PermanentError extends Error { constructor(m: string) { super(m); this.name = 'PermanentError'; } }
class TransientError extends Error { constructor(m: string) { super(m); this.name = 'TransientError'; } }
async function process(job) {
try {
const user = await db.user(job.data.userId);
if (!user) throw new PermanentError('user-not-found'); // → DLQ on first try
if (user.suspended) throw new PermanentError('user-suspended');
await externalApi.call(user);
} catch (e) {
if (e instanceof PermanentError) {
// Move to DLQ immediately — retrying won't help.
job.discard(); // BullMQ: prevents retry
throw e;
}
throw e; // Network/timeout → retry per attempts policy
}
}
DLQ then needs:
grafana-dashboard-builder).Exponential backoff alone isn't enough. If 1000 jobs all fail at 12:00:00 because a downstream service blipped, plain exponential backoff has all 1000 retry at exactly 12:00:01, 12:00:03, 12:00:07 — a thundering herd that may keep the downstream service down. Add full jitter:
// AWS-style "full jitter": delay = random(0, exp_backoff)
const baseDelay = Math.min(2 ** attempt * 1000, 30_000);
const delay = Math.random() * baseDelay;
For the BullMQ recommended baseline, the going-to-production doc suggests retry intervals between 1s and 20s with retryStrategy. (BullMQ production)
Per-worker concurrency runs N jobs in parallel from one process. Multiple workers multiply that. Without a queue-level limiter, you can't cap aggregate calls to a downstream:
// BullMQ — global rate limit across all workers on this queue.
const queue = new Queue('outbound-emails', {
connection,
limiter: { max: 100, duration: 1000 }, // 100 emails/sec across the whole fleet
});
The right shape: many concurrent workers + a queue limiter for downstream contract limits. Sidekiq has equivalent rate limiters; SQS uses Lambda concurrency or per-message-rate.
The single most-emphasized BullMQ production gotcha: maxmemory-policy noeviction on the Redis instance. Anything else (allkeys-lru, volatile-lru) will silently evict queued jobs under pressure. (BullMQ production)
Also:
removeOnComplete aggressive (e.g. { age: 3600, count: 1000 }).removeOnFail even more aggressive once you've moved to a DLQ pattern.appendfsync everysec is the documented sweet spot for BullMQ. (BullMQ production)maxRetriesPerRequest: null to prevent ioredis from throwing during transient disconnects. (BullMQ production)enableOfflineQueue: false so producer fails fast instead of buffering.process.on('SIGTERM', async () => {
await worker.close(); // stop accepting; finish in-flight jobs.
await connection.quit();
process.exit(0);
});
process.on('SIGINT', async () => { /* same */ });
The BullMQ production guide explicitly notes: close workers before stopping the process; default stalling timeout is ~30s. (BullMQ production)
Useful framing from Temporal's docs: "Unlike message queues which move data between services, Temporal orchestrates entire processes... knows where you are in a workflow, what's completed, what's pending, and what needs to retry." (Temporal blog)
flowchart TD
A[Background work] --> B{Single side-effect or pipeline?}
B -->|Single| C[BullMQ / Sidekiq / SQS]
B -->|Multi-step| D{Steps span seconds or hours+ ?}
D -->|Seconds| E[BullMQ flows / Sidekiq batches]
D -->|Hours+ or human-in-loop| F[Temporal / DBOS / Restate]
C --> G{Need cross-region or multi-tenant isolation?}
G -->|Yes| H[SQS or Temporal — Redis-single-cluster won't]
G -->|No| I[Run with Redis cluster + AOF + noeviction]
Symptom: During a Redis blip or restart, the same job runs twice and creates duplicate side effects (charge, email, ticket).
Diagnosis: Redis is a cache; it can be evicted, restarted, or partitioned. SET-NX dedup that lives only in Redis is best-effort.
Fix: Idempotency key with a DB unique constraint on the side-effect record (e.g. email_sends.idempotency_key). The DB transaction that creates the side-effect IS the dedup primitive.
Symptom: Stalled-job log entries; same job processed by two workers. Customers complain.
Diagnosis: Default lockDuration is 30s; handler sometimes takes 45s; the second worker picks it up.
Fix: Measure p99, set lockDuration to ~3x p99, and call job.extendLock() inside long handlers. Or split the work.
Symptom: A downstream blip becomes a downstream outage; 10k jobs hammer the recovering service in lockstep. Diagnosis: Without jitter, retries are correlated. Fix: Full-jitter backoff. Cap maximum delay to bound DLQ time-to-failure.
maxmemory-policy allkeys-lru on RedisSymptom: Queue mysteriously loses jobs under load. No errors. (BullMQ production)
Diagnosis: Redis evicted job keys to make room.
Fix: CONFIG SET maxmemory-policy noeviction AND assert it at startup. Workers should refuse to start otherwise.
Symptom: Charged customer twice; sent two of the same email. Engineer is sure "the broker is exactly-once." Diagnosis: Even a FIFO/exactly-once broker doesn't extend its guarantee to your DB + email vendor + Stripe + ledger. Fix: Treat the queue as at-least-once. Idempotency at every side-effect boundary.
Symptom: "We have 3,200 jobs in the DLQ. We don't know what's in there." DLQ becomes an unmonitored graveyard. Diagnosis: Built the DLQ, didn't build the operator tool. Fix: Replay-by-id, replay-by-time-range, drain-with-confirmation. CLI or admin UI. Tested.
Symptom: BullMQ worker process hangs; heartbeats stop; lock expires; job re-runs.
Diagnosis: CPU-bound work in the same process as the queue heartbeat; event loop blocks for > lockDuration.
Fix: Spawn a child process or worker thread for CPU-bound work. Or run a sandboxed processor (BullMQ supports per-job process sandboxing).
lockDuration / visibility timeout ≥ 3× measured p99 handler latency, AND long handlers extend the lock.PermanentError class or equivalent) bypass retry and go straight to DLQ.maxmemory-policy=noeviction, AOF enabled (appendfsync everysec), asserted on worker startup. (BullMQ production)removeOnComplete and removeOnFail configured. Job count metric trended in grafana-dashboard-builder.worker.close() before exit.queue.name, job.id, job.attempts, job.outcome (see opentelemetry-instrumentation).webhook-receiver-design.python-asyncio-pitfalls for Python event-loop concurrency.tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.