plugins/backend-toolkit/skills/background-jobs/SKILL.md
Run work off the request thread reliably — queue design, retries with exponential backoff + jitter, dead-letter queues, concurrency control, and idempotent handlers. Use when an operation is slow/external, when jobs fail silently, or when retries cause duplicates. Not for write+event transactional reliability — the dual-write problem (use async-messaging Outbox) or webhook-receiver specifics (use webhook-design).
npx skillsauth add jaykim88/claude-ai-engineering background-jobs

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Move slow, external, or deferrable work off the request path into a durable queue with proper retries, dead-lettering, and idempotent handlers — so failures are visible and recoverable, not silent.
Universal — queue design, retry/backoff, DLQ, concurrency control, and idempotent handlers are job-processing principles; BullMQ is the default implementation.
1. Decide what belongs in a job
2. Make every handler idempotent
   - (see resilience-patterns idempotency)
3. Configure retries: exponential backoff + jitter, capped
   - attempts + backoff: { type: 'exponential' } (+ jitter)
4. Dead-letter queue for poison messages
5. Control concurrency, and apply backpressure

   5b. Set a visibility timeout / heartbeat for long jobs

   5c. Retain only what you need (job table cleanup)
   - removeOnComplete (or equivalent) with a retention count/age; archive critical job history elsewhere if needed
6. Make jobs observable
   - (see observability-setup)
7. Validate (validation loop)
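
The retry step above (exponential backoff + jitter, capped) can be sketched as a pure delay function. This is a minimal illustration, not BullMQ's built-in strategy: `backoffWithJitter`, `BackoffOptions`, and the injectable `random` parameter are hypothetical names introduced for this example, and it assumes the "full jitter" variant, where the delay is drawn uniformly between 0 and the capped exponential value.

```typescript
// Hypothetical helper (not a BullMQ API): delay before the next retry.
interface BackoffOptions {
  baseDelayMs: number;    // exponential value for the first retry
  maxDelayMs: number;     // upper bound so delays never grow without limit
  random?: () => number;  // RNG returning a value in [0, 1); injectable for tests
}

function backoffWithJitter(attemptsMade: number, opts: BackoffOptions): number {
  const random = opts.random ?? Math.random;
  // Treat anything below 1 as the first attempt.
  const attempt = Math.max(1, Math.floor(attemptsMade));
  // Exponential growth: base, 2*base, 4*base, ... capped at maxDelayMs.
  const exponential = Math.min(
    opts.maxDelayMs,
    opts.baseDelayMs * 2 ** (attempt - 1),
  );
  // Full jitter: spread retries uniformly over [0, exponential) so that
  // many jobs failing at once do not all retry at the same instant.
  return Math.floor(random() * exponential);
}
```

With a 1000 ms base and a 30000 ms cap, the upper bound of the delay doubles on each attempt until it reaches the cap. A function of this shape could plausibly be wired in as the custom strategy that the BullMQ notes mention, but the exact hook should be taken from the BullMQ documentation rather than from this sketch.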
| ❌ Anti-pattern | ✅ Correct |
|---|---|
| Non-idempotent handler (retry double-charges) | Idempotent handler (dedupe key) |
| Retry forever on any failure | Capped attempts → DLQ; retry only transient errors |
| Failed jobs silently dropped | DLQ + alert + replay path |
| Unbounded concurrency exhausting DB pool | Tuned concurrency + rate limits |
| Blocking the request on slow work | Enqueue, return fast |
| No visibility timeout / heartbeat → job stuck "in progress" forever on worker crash | Visibility timeout slightly above p99; heartbeat for long jobs |
| Completed jobs accumulating in Redis until OOM | removeOnComplete retention + archive critical history elsewhere |
| Producer faster than consumers; queue grows unbounded | Backpressure policy (slow producer / 429 / shed low-priority) |
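
The first row of the table (idempotent handler with a dedupe key) can be illustrated with a small sketch. It assumes an in-memory key store standing in for a durable one (for example a database table with a unique constraint); `ProcessedKeys`, `ChargeJob`, and `handleCharge` are hypothetical names, not part of BullMQ or of this skill.

```typescript
// In-memory stand-in for a durable dedupe store (assumption for illustration).
class ProcessedKeys {
  private readonly seen = new Set<string>();

  // Returns true only for the first caller that claims a given key.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }

  // Frees a key so that a later retry may run the side effect again.
  release(key: string): void {
    this.seen.delete(key);
  }
}

interface ChargeJob {
  idempotencyKey: string; // stable across retries of the same logical job
  amountCents: number;
}

function handleCharge(
  job: ChargeJob,
  keys: ProcessedKeys,
  charge: (job: ChargeJob) => void, // the side effect, e.g. a payment call
): "charged" | "skipped" {
  // A retry of an already-processed job becomes a no-op.
  if (!keys.claim(job.idempotencyKey)) return "skipped";
  try {
    charge(job);
    return "charged";
  } catch (err) {
    // The side effect failed: release the key so the retry is not skipped.
    keys.release(job.idempotencyKey);
    throw err;
  }
}
```

Calling `handleCharge` twice with the same `idempotencyKey` runs the charge once and returns `"skipped"` the second time. In a real system the claim and the side effect would ideally commit together (or the downstream API would accept the key itself); the claim-then-release flow here is only an approximation of that.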
| Tier | Examples | Action SLA |
|---|---|---|
| Critical | Non-idempotent job double-charging on retry; failed payment jobs silently dropped | Fix immediately |
| Major | No DLQ (infinite retry or data loss); no backoff (retry storm) | Fix this sprint |
| Minor | Concurrency untuned; missing queue-depth metric | Schedule within 2 sprints |
feat(jobs): add <job> with DLQ + backoff / fix(jobs): make <handler> idempotent

- defaultJobOptions: { attempts: 5, backoff: { type: 'exponential', delay: 1000 } } (+ jitter via custom strategy)
- failed after attempts exhausted; BullMQ failed events
- new Worker(name, fn, { concurrency: N, limiter: { max, duration } })
- jobId for natural dedup
- (acks_late=True, max_retries, autoretry_for) + dead-letter via routing

- async-messaging: the outbox relay and event consumers run as jobs
- resilience-patterns: job retries reuse backoff + idempotency
- webhook-design: received webhooks are processed as jobs

defaultJobOptions, cap attempts, and make handlers idempotent, because at-least-once execution means a job can run more than once on retry.

development
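
The rule "capped attempts, then DLQ; retry only transient errors" can be shown as a small synchronous simulation. This is a sketch under stated assumptions rather than how BullMQ moves jobs internally: `runWithRetries`, `DeadLetter`, `transientError`, and the `transient` flag on errors are all hypothetical, and the backoff wait between attempts is omitted.

```typescript
// Hypothetical convention: errors tagged transient: true are worth retrying.
type TaggedError = Error & { transient?: boolean };

function transientError(message: string): TaggedError {
  return Object.assign(new Error(message), { transient: true });
}

function isTransient(err: unknown): boolean {
  return err instanceof Error && (err as TaggedError).transient === true;
}

interface DeadLetter<T> {
  payload: T;
  reason: string;       // last error message, kept for triage and replay
  attemptsMade: number;
}

function runWithRetries<T>(
  payload: T,
  handler: (payload: T) => void,
  maxAttempts: number,
  deadLetters: DeadLetter<T>[],
): "completed" | "dead-lettered" {
  if (maxAttempts < 1) throw new RangeError("maxAttempts must be >= 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(payload);
      return "completed";
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      // Permanent errors and exhausted attempts go to the dead-letter list
      // instead of being retried forever or silently dropped.
      if (!isTransient(err) || attempt === maxAttempts) {
        deadLetters.push({ payload, reason, attemptsMade: attempt });
        return "dead-lettered";
      }
      // A real worker would wait for a backoff delay before the next attempt.
    }
  }
  throw new Error("unreachable");
}
```

A handler that fails twice with a transient error and then succeeds completes on the third attempt, while a handler that throws an untagged error is dead-lettered after a single attempt. The dead-letter entries keep the payload and reason so that an alert and a replay path can be built on top of them.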
Design webhooks correctly on both sides — sending (HMAC signing, retries with backoff, at-least-once) and receiving (verify signature on raw body, enqueue + 200 fast, dedupe on event id). Use when adding webhook delivery or consuming a provider's webhooks. Not for internal service-to-service events (use async-messaging) or general outbound-call retry policy (use resilience-patterns).
testing
Use transactions and isolation levels correctly — keep them short, no network calls inside, explicit isolation, retry on serialization conflicts, and choose optimistic vs pessimistic locking. Use when a write spans multiple tables, when concurrent updates corrupt data, or when designing money/inventory flows. Not for cross-service event delivery (use async-messaging Outbox) or schema-level constraints (use schema-design).
development
Backend testing pyramid — unit for pure logic, integration against a real DB (Testcontainers), and consumer-driven contract testing (Pact) for service boundaries. Use before a feature, after a bug fix, or when services break each other on deploy. Not for load testing (use performance-profiling) or security testing (use backend-security-audit).
data-ai
Design a relational schema — normalize to 3NF then denormalize with justification, choose the right Postgres index type per data shape, enforce constraints at the DB. Use when modeling a new domain, when queries are slow, or before a migration. Not for diagnosing slow queries (use query-optimization) or shipping the change without downtime (use migration-strategy).