skills/background-job-orchestrator/SKILL.md
Expert in background job processing with Bull/BullMQ (Redis), Celery, and cloud queues. Implements retries, scheduling, priority queues, and worker management. Use for async task processing, email campaigns, report generation, batch operations. Activate on "background job", "async task", "queue", "worker", "BullMQ", "Celery". NOT for real-time WebSocket communication, synchronous API calls, or simple setTimeout operations.
npx skillsauth add curiositech/windags-skills background-job-orchestratorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Expert in designing and implementing production-grade background job systems that handle long-running tasks without blocking API responses.
✅ Use for:
❌ NOT for:
Does this task:
├── Take >5 seconds? → Background job
├── Need to retry on failure? → Background job
├── Run on a schedule? → Background job (cron pattern)
├── Block user interaction? → Background job
├── Process in batches? → Background job
└── Return immediately? → Keep synchronous
When to use:
Why BullMQ over Bull:
When to use:
Alternatives:
When to use:
Novice thinking: "Retry 3 times, then fail silently"
Problem: Failed jobs disappear with no visibility or recovery path.
Correct approach:
// BullMQ with dead letter queue
const queue = new Queue('email-queue', {
connection: redis,
defaultJobOptions: {
attempts: 3,
backoff: {
type: 'exponential',
delay: 2000
},
removeOnComplete: 100, // Keep last 100 successful
removeOnFail: false // Keep all failed for inspection
}
});
// Monitor failed jobs
const failedJobs = await queue.getFailed();
Timeline:
Symptom: API endpoint waits for job completion
Problem:
// ❌ WRONG - Blocks API response
app.post('/send-email', async (req, res) => {
await sendEmail(req.body.to, req.body.subject);
res.json({ success: true });
});
Why wrong: Timeout, poor UX, wastes server resources
Correct approach:
// ✅ RIGHT - Queue and return immediately
app.post('/send-email', async (req, res) => {
const job = await emailQueue.add('send', {
to: req.body.to,
subject: req.body.subject
});
res.json({
success: true,
jobId: job.id,
status: 'queued'
});
});
// Separate worker processes the job
worker.process('send', async (job) => {
await sendEmail(job.data.to, job.data.subject);
});
Problem: Job runs twice → duplicate charges, double emails
Why it happens:
Correct approach:
// ✅ Idempotent job with deduplication key
await queue.add('charge-payment', {
userId: 123,
amount: 50.00
}, {
jobId: `payment-${orderId}`, // Prevents duplicates
attempts: 3
});
// In worker: Check if already processed
worker.process('charge-payment', async (job) => {
const { userId, amount } = job.data;
// Check idempotency
const existing = await db.payments.findOne({
jobId: job.id
});
if (existing) {
return existing; // Already processed
}
// Process payment
const result = await stripe.charges.create({...});
// Store idempotency record
await db.payments.create({
jobId: job.id,
result
});
return result;
});
Problem: Overwhelm third-party APIs or exhaust quotas
Symptom: "Rate limit exceeded" errors from Sendgrid, Stripe, etc.
Correct approach:
// BullMQ rate limiting
const queue = new Queue('api-calls', {
limiter: {
max: 100, // Max 100 jobs
duration: 60000 // Per 60 seconds
}
});
// Or: Priority-based rate limits
await queue.add('send-email', data, {
priority: user.isPremium ? 1 : 10,
rateLimiter: {
max: user.isPremium ? 1000 : 100,
duration: 3600000 // Per hour
}
});
Problem: Single worker can't keep up with queue depth
Symptom: Queue backs up, jobs delayed hours/days
Correct approach:
// Horizontal scaling with multiple workers
const worker = new Worker('email-queue', async (job) => {
await processEmail(job.data);
}, {
connection: redis,
concurrency: 5 // Process 5 jobs concurrently per worker
});
// Run multiple worker processes (PM2, Kubernetes, etc.)
// Each worker processes concurrency * num_workers jobs
Monitoring:
// Set up alerts for queue depth
setInterval(async () => {
const waiting = await queue.getWaitingCount();
if (waiting > 1000) {
alert('Queue depth exceeds 1000, scale workers!');
}
}, 60000);
// Queue setup
const emailQueue = new Queue('email-campaign', { connection: redis });
// Enqueue batch
async function sendCampaign(userIds: number[], template: string) {
const jobs = userIds.map(userId => ({
name: 'send',
data: { userId, template },
opts: {
attempts: 3,
backoff: { type: 'exponential', delay: 5000 }
}
}));
await emailQueue.addBulk(jobs);
}
// Worker with retry logic
const worker = new Worker('email-campaign', async (job) => {
const { userId, template } = job.data;
const user = await db.users.findById(userId);
const email = renderTemplate(template, user);
try {
await sendgrid.send({
to: user.email,
subject: email.subject,
html: email.body
});
} catch (error) {
if (error.code === 'ECONNREFUSED') {
throw error; // Retry
}
// Invalid email, don't retry
console.error(`Invalid email for user ${userId}`);
}
}, {
connection: redis,
concurrency: 10
});
// Daily report at 9 AM
await queue.add('daily-report', {
type: 'sales',
recipients: ['[email protected]']
}, {
repeat: {
pattern: '0 9 * * *', // Cron syntax
tz: 'America/New_York'
}
});
// Worker generates and emails report
worker.process('daily-report', async (job) => {
const { type, recipients } = job.data;
const data = await generateReport(type);
const pdf = await createPDF(data);
await emailQueue.add('send', {
to: recipients,
subject: `Daily ${type} Report`,
attachments: [{ filename: 'report.pdf', content: pdf }]
});
});
// Multi-stage job with progress tracking
await videoQueue.add('transcode', {
videoId: 123,
formats: ['720p', '1080p', '4k']
}, {
attempts: 2,
timeout: 3600000 // 1 hour timeout
});
worker.process('transcode', async (job) => {
const { videoId, formats } = job.data;
for (let i = 0; i < formats.length; i++) {
const format = formats[i];
// Update progress
await job.updateProgress((i / formats.length) * 100);
// Transcode
await ffmpeg.transcode(videoId, format);
}
await job.updateProgress(100);
});
// Client polls for progress
app.get('/videos/:id/status', async (req, res) => {
const job = await queue.getJob(req.params.jobId);
res.json({
state: await job.getState(),
progress: job.progress
});
});
// Queue health dashboard
async function getQueueMetrics() {
const [waiting, active, completed, failed, delayed] = await Promise.all([
queue.getWaitingCount(),
queue.getActiveCount(),
queue.getCompletedCount(),
queue.getFailedCount(),
queue.getDelayedCount()
]);
return {
waiting, // Jobs waiting to be processed
active, // Jobs currently processing
completed, // Successfully completed
failed, // Failed after retries
delayed, // Scheduled for future
health: waiting < 1000 && failed < 100 ? 'healthy' : 'degraded'
};
}
// Development: Monitor jobs visually
import { createBullBoard } from '@bull-board/api';
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter';
import { ExpressAdapter } from '@bull-board/express';
const serverAdapter = new ExpressAdapter();
createBullBoard({
queues: [
new BullMQAdapter(emailQueue),
new BullMQAdapter(videoQueue)
],
serverAdapter
});
app.use('/admin/queues', serverAdapter.getRouter());
// Visit http://localhost:3000/admin/queues
□ Dead letter queue configured
□ Retry strategy with exponential backoff
□ Job timeout limits set
□ Rate limiting for third-party APIs
□ Idempotency keys for critical operations
□ Worker concurrency tuned (CPU cores * 2)
□ Horizontal scaling configured (multiple workers)
□ Queue depth monitoring with alerts
□ Failed job inspection workflow
□ Job data doesn't contain PII in logs
□ Redis persistence enabled (AOF or RDB)
□ Graceful shutdown handling (SIGTERM)
| Scenario | Use Background Jobs? | |----------|---------------------| | Send welcome email on signup | ✅ Yes - can take 2-5 seconds | | Charge credit card | ⚠️ Maybe - depends on payment provider latency | | Generate PDF report (30 seconds) | ✅ Yes - definitely background | | Fetch user profile from DB | ❌ No - milliseconds, keep synchronous | | Process video upload (5 minutes) | ✅ Yes - always background | | Validate form input | ❌ No - synchronous validation | | Daily cron job | ✅ Yes - use repeatable jobs | | Real-time chat message | ❌ No - use WebSockets |
| Feature | BullMQ | Celery | AWS SQS | |---------|--------|--------|---------| | Language | Node.js | Python | Any (HTTP API) | | Backend | Redis | Redis/RabbitMQ/SQS | Managed | | Priorities | ✅ | ✅ | ✅ | | Rate Limiting | ✅ | ❌ | ✅ (via attributes) | | Repeat/Cron | ✅ | ✅ (celery-beat) | ❌ (use EventBridge) | | UI Dashboard | Bull Board | Flower | CloudWatch | | Workflows | ❌ | ✅ (chains, groups) | ❌ | | Learning Curve | Medium | Medium | Low | | Cost | Redis hosting | Redis hosting | $0.40/million requests |
/references/bullmq-patterns.md - Advanced BullMQ patterns and examples/references/celery-workflows.md - Celery chains, groups, and chords/references/job-observability.md - Monitoring, alerting, and debuggingscripts/setup_bullmq.sh - Initialize BullMQ with Redisscripts/queue_health_check.ts - Queue metrics dashboardscripts/retry_failed_jobs.ts - Bulk retry failed jobsThis skill guides: Background job implementation | Queue architecture | Retry strategies | Worker scaling | Job observability
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.