skills/daemon-development/SKILL.md
Build daemon/background processes that start on boot, run continuously, and manage their own lifecycle. Covers macOS launchd (plist files, agents vs daemons), Linux systemd (unit files), Windows services, process supervision, logging, health checks, graceful shutdown, auto-restart, and AI-powered daemons that manage LLM API connections and rate limits. Activate on: "daemon", "background process", "launchd", "systemd", "service file", "plist", "launch agent", "launch daemon", "auto-start", "always running", "process supervisor", "pm2", "background service", "boot service", "AI daemon", "long-running process". NOT for: container orchestration (use devops-automator), cron jobs that run and exit (use task-scheduler), web server deployment (use backend-architect).
npx skillsauth add curiositech/windags-skills daemon-development
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Build long-running background processes that start reliably, run continuously, recover from failures, and shut down gracefully. Expert-level daemon architecture across macOS launchd, Linux systemd, and AI-powered services.
Platform Detected?
├─ macOS
│ ├─ Must run at boot (no user login): LaunchDaemon → /Library/LaunchDaemons/
│ ├─ User session required (GUI/files): LaunchAgent → ~/Library/LaunchAgents/
│ └─ System-wide user service: LaunchAgent → /Library/LaunchAgents/
├─ Linux
│ ├─ systemd available: systemd unit file → /etc/systemd/system/
│ ├─ Legacy SysV: init.d script (rare, avoid if possible)
│ └─ Container: s6 or built-in supervision
└─ Cross-platform dev
  ├─ Node.js app: pm2 for development, systemd/launchd for production
  └─ Other languages: Direct systemd/launchd implementation
Daemon Startup Behavior?
├─ Simple process (doesn't fork)
│ ├─ systemd: Type=simple
│ └─ launchd: Standard plist (no special keys)
├─ Signals readiness when ready
│ ├─ systemd: Type=notify + sd_notify("READY=1")
│ └─ launchd: N/A (use health check instead)
├─ Forks child process (legacy)
│ ├─ systemd: Type=forking + PIDFile (avoid)
│ └─ launchd: Not supported (rewrite to not fork)
└─ Socket-activated
  ├─ systemd: Type=simple + companion .socket unit with a [Socket] section
  └─ launchd: Sockets dict in plist
Failure Recovery Strategy?
├─ Critical service (must always run)
│ ├─ systemd: Restart=always, RestartSec=5
│ └─ launchd: KeepAlive=true, ThrottleInterval=10
├─ Crash recovery only
│ ├─ systemd: Restart=on-failure
│ └─ launchd: KeepAlive={SuccessfulExit=false}
├─ Manual restart preferred
│ ├─ systemd: Restart=no
│ └─ launchd: KeepAlive=false
└─ Rate-limited restart
  ├─ systemd: StartLimitBurst=5, StartLimitIntervalSec=60
  └─ launchd: ThrottleInterval=30 (built-in)
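The systemd branches of the tree above can be turned into a unit file mechanically. The following is a minimal sketch, not a definitive generator: `renderUnit` and `UnitOptions` are hypothetical names introduced for illustration, and the sketch assumes a systemd version where the `StartLimit*` settings belong in the `[Unit]` section.

```typescript
// Hypothetical helper: renders a systemd unit from restart-related options.
type RestartPolicy = "always" | "on-failure" | "no";

interface UnitOptions {
  description: string;
  execStart: string; // absolute path plus arguments
  restart: RestartPolicy;
  restartSec?: number; // delay between restarts, in seconds
  startLimitBurst?: number; // max starts allowed per interval
  startLimitIntervalSec?: number; // length of the rate-limit window, in seconds
}

function renderUnit(o: UnitOptions): string {
  const unit: string[] = [`Description=${o.description}`];
  // Assumption: StartLimit* settings live in [Unit], not [Service].
  if (o.startLimitBurst !== undefined) {
    unit.push(`StartLimitBurst=${o.startLimitBurst}`);
  }
  if (o.startLimitIntervalSec !== undefined) {
    unit.push(`StartLimitIntervalSec=${o.startLimitIntervalSec}`);
  }

  const service: string[] = [`ExecStart=${o.execStart}`, `Restart=${o.restart}`];
  if (o.restartSec !== undefined) {
    service.push(`RestartSec=${o.restartSec}`);
  }

  // Comments are omitted on purpose: unit files only allow them on their own line.
  return [
    "[Unit]",
    ...unit,
    "",
    "[Service]",
    ...service,
    "",
    "[Install]",
    "WantedBy=multi-user.target",
    "",
  ].join("\n");
}

// Critical service with rate-limited restarts, combining two branches of the tree.
const criticalUnitText = renderUnit({
  description: "Example critical daemon",
  execStart: "/usr/local/bin/example-daemon",
  restart: "always",
  restartSec: 5,
  startLimitBurst: 5,
  startLimitIntervalSec: 60,
});
```

The binary path and description are placeholders. Writing the result to /etc/systemd/system/ and reloading systemd is left to the caller.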
LLM API Connection Pattern?
├─ Single provider, token bucket
│ ├─ Token estimation: prompt_tokens + max_completion_tokens
│ ├─ Bucket refill: tokens_per_minute from provider limits
│ └─ Overflow: Queue requests with priority
├─ Multi-provider failover
│ ├─ Circuit breaker per provider (3 failures = 30s timeout)
│ ├─ Rate limit per provider independently
│ └─ Failover order: primary → secondary → queue
├─ Streaming responses
│ ├─ Reserve tokens optimistically
│ ├─ Adjust on actual_tokens in real-time
│ └─ Handle mid-stream rate limits gracefully
└─ Batch processing
  ├─ Group similar requests to maximize throughput
  └─ Split large batches if they hit rate limits
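The streaming branch above (reserve optimistically, then adjust on actual usage) might look like the sketch below. `TokenBudget`, `reserve`, and `settle` are hypothetical names, and the sketch assumes the provider reports the actual token count when the stream ends.

```typescript
// Hypothetical budget tracker for streaming requests.
class TokenBudget {
  private available: number;

  constructor(capacity: number) {
    this.available = capacity;
  }

  /**
   * Reserve the worst case (prompt_tokens + max_completion_tokens) up front.
   * Returns the reserved amount, or null when the budget cannot cover it,
   * in which case the caller queues the request.
   */
  reserve(promptTokens: number, maxCompletionTokens: number): number | null {
    const estimate = promptTokens + maxCompletionTokens;
    if (estimate > this.available) return null;
    this.available -= estimate;
    return estimate;
  }

  /** Once actual usage is known, return the unused part of the reservation. */
  settle(reserved: number, actualTokens: number): void {
    // If actual usage exceeded the reservation, this charges the difference.
    this.available += reserved - actualTokens;
  }

  remaining(): number {
    return this.available;
  }
}
```

Reserving the worst case keeps the daemon under the provider limit while a stream is open. Settling afterwards stops short completions from starving later requests.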
SIGTERM Received?
├─ Web server daemon
│ ├─ 1. server.close() - stop accepting new connections
│ ├─ 2. Wait for active requests (timeout: TimeoutStopSec-5s)
│ ├─ 3. Close database connections
│ └─ 4. exit(0)
├─ Queue worker daemon
│ ├─ 1. Stop polling for new jobs
│ ├─ 2. Finish current job (timeout protection)
│ ├─ 3. Flush any pending state
│ └─ 4. exit(0)
├─ AI daemon
│ ├─ 1. Stop accepting new LLM requests
│ ├─ 2. Drain in-flight requests (respect provider timeouts)
│ ├─ 3. Save rate limit state to disk
│ └─ 4. Close provider connections, exit(0)
└─ Database/stateful daemon
  ├─ 1. Checkpoint/flush transactions
  ├─ 2. Close client connections gracefully
  ├─ 3. Release file locks
  └─ 4. exit(0)
journalctl -u service shows start/crash/start pattern every few seconds
RestartSec=10 (systemd) or ThrottleInterval=15 (launchd), implement startup validation
ps aux shows <defunct> processes, parent daemon still running but degraded
ps axo pid,ppid,stat,comm | grep Z shows zombie children
wait() or waitpid(), use signal(SIGCHLD, SIG_IGN) if children are fire-and-forget
lsof -p <daemon_pid> | wc -l grows continuously, eventual EMFILE errors
LimitNOFILE in systemd
SystemMaxUse, add log level controls
Scenario: Building an AI daemon that processes user requests through OpenAI API, needs 99.9% uptime with graceful rate limit handling.
1. Initial Architecture Decision
# Decision: Multi-provider with circuit breakers
# Primary: OpenAI GPT-4, Secondary: Anthropic Claude, Tertiary: Local model
# systemd unit file choice
[Service]
# Daemon signals when fully initialized (unit-file comments must sit on their own line)
Type=notify
Restart=on-failure
RestartSec=10
2. Rate Limiting Implementation
// Token bucket per provider (expert catches: different providers = different limits)
// TokenBucket, LLMRequest, and estimateTokens are assumed to be defined elsewhere in the daemon.
const rateLimiters = {
  openai: new TokenBucket({ tokensPerMinute: 40000, burstCapacity: 8000 }),
  anthropic: new TokenBucket({ tokensPerMinute: 25000, burstCapacity: 5000 }),
};
// Novice mistake: Request-based limiting
// Expert insight: LLM APIs are token-based, not request-based
async function processRequest(req: LLMRequest) {
  // Budget for the worst case: prompt tokens plus the completion cap
  const estimatedTokens = estimateTokens(req.prompt, req.maxTokens);
  await rateLimiters.openai.acquire(estimatedTokens);
  // ... proceed with API call
}
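The snippet above relies on a `TokenBucket` that the skill does not define. One possible shape is sketched here under stated assumptions: the constructor options mirror the snippet, the injectable `now` clock is an addition for testability, and `tryAcquire` and `msUntilAvailable` are hypothetical helper names.

```typescript
interface TokenBucketOptions {
  tokensPerMinute: number; // steady refill rate
  burstCapacity: number; // maximum tokens the bucket can hold
}

// A sketch of a token bucket; not taken from any particular library.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly tokensPerMinute: number;
  private readonly burstCapacity: number;
  private readonly now: () => number;

  constructor(opts: TokenBucketOptions, now: () => number = Date.now) {
    this.tokensPerMinute = opts.tokensPerMinute;
    this.burstCapacity = opts.burstCapacity;
    this.now = now;
    this.tokens = opts.burstCapacity; // start full
    this.lastRefill = now();
  }

  // Add the tokens earned since the last call, capped at burstCapacity.
  private refill(): void {
    const t = this.now();
    const earned = ((t - this.lastRefill) * this.tokensPerMinute) / 60000;
    this.tokens = Math.min(this.burstCapacity, this.tokens + earned);
    this.lastRefill = t;
  }

  /** Take n tokens if they are available right now. */
  tryAcquire(n: number): boolean {
    this.refill();
    if (n > this.tokens) return false;
    this.tokens -= n;
    return true;
  }

  /** Milliseconds until n tokens could be available (Infinity if n can never fit). */
  msUntilAvailable(n: number): number {
    this.refill();
    if (n > this.burstCapacity) return Infinity;
    const deficit = Math.max(0, n - this.tokens);
    return Math.ceil((deficit * 60000) / this.tokensPerMinute);
  }

  /** Wait until n tokens are available, then take them. */
  async acquire(n: number): Promise<void> {
    if (n > this.burstCapacity) {
      throw new Error("request exceeds burst capacity");
    }
    while (!this.tryAcquire(n)) {
      const waitMs = this.msUntilAvailable(n);
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

Rejecting requests larger than `burstCapacity` up front avoids a wait that could never finish. A production daemon would probably also want a priority queue in front of `acquire`, as the decision tree suggests.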
3. Circuit Breaker Configuration
// Expert trade-off: Aggressive vs Conservative failover
const circuitBreaker = new CircuitBreaker({
  failureThreshold: 3, // Conservative: 5+ for stable APIs, 3 for flaky ones
  timeout: 30000, // Aggressive: 15s, Conservative: 60s
  resetTimeout: 60000, // How long to wait before retry
});
// Decision point: When to fail over? (runs inside the request error handler)
if (error.status === 429) {
  // Rate limited: backoff on primary, don't fail over yet
  await this.exponentialBackoff(provider, error.retryAfter);
} else if (error.status >= 500) {
  // Server error: count it against the breaker; once it opens, traffic fails over to secondary
  this.circuitBreaker.recordFailure('openai');
}
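The breaker logic itself is not shown above. The sketch below assumes one breaker per provider, as in the decision tree, rather than the single keyed breaker in the snippet. `ProviderBreaker` and `pickProvider` are hypothetical names, and the clock is injectable so the state transitions can be exercised without waiting.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number; // consecutive failures before the breaker opens
  resetTimeout: number; // ms the breaker stays open before allowing a probe
}

// A sketch of a per-provider circuit breaker; not a specific library's API.
class ProviderBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetTimeout: number;
  private readonly now: () => number;

  constructor(opts: BreakerOptions, now: () => number = Date.now) {
    this.failureThreshold = opts.failureThreshold;
    this.resetTimeout = opts.resetTimeout;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.resetTimeout ? "half-open" : "open";
  }

  canRequest(): boolean {
    return this.state() !== "open";
  }

  recordFailure(): void {
    this.failures += 1;
    // Reaching the threshold, or failing a half-open probe, (re)opens the breaker.
    if (this.openedAt !== null || this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }
}

/** Failover order: first provider whose breaker allows traffic, else null (queue). */
function pickProvider(
  order: string[],
  breakers: Record<string, ProviderBreaker>,
): string | null {
  for (const name of order) {
    const breaker = breakers[name];
    if (breaker !== undefined && breaker.canRequest()) return name;
  }
  return null;
}
```

Under this shape a 429 would not call `recordFailure`, matching the decision point above: rate limits trigger backoff, and only server errors count toward opening the breaker.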
4. Graceful Shutdown Pattern
// Expert catches: Race conditions in shutdown
let shutdownInProgress = false;
process.on('SIGTERM', async () => {
  if (shutdownInProgress) return; // Idempotent shutdown
  shutdownInProgress = true;
  console.log('SIGTERM received, draining connections...');
  // 1. Stop accepting new requests
  server.close();
  // 2. Wait for in-flight requests (with timeout)
  const drainTimeout = setTimeout(() => {
    console.log('Drain timeout, force exit');
    process.exit(1);
  }, 25000); // systemd TimeoutStopSec=30, so exit by 25s
  try {
    await this.drainActiveRequests();
    // 3. Save token bucket state to disk only after the drain, so it reflects final usage
    await this.flushRateLimitState();
  } catch (err) {
    console.error('Shutdown failed:', err);
    process.exit(1);
  }
  clearTimeout(drainTimeout);
  process.exit(0);
});
5. Trade-offs and Decision Results
RestartSec=5 for quick recovery vs RestartSec=30 to avoid thrashing
This skill is NOT for:
devops-automator
task-scheduler
backend-architect
background-job-orchestrator
database-architect
Delegate to other skills when:
backend-architect handles service mesh, load balancing
devops-automator handles deployment automation
background-job-orchestrator handles queue-specific patterns
always-on-agent-architecture handles AI-specific lifecycle needs