.cursor/skills/se/SKILL.md
System Engineer specializing in infrastructure, scalability, reliability engineering, and observability. Build systems that scale and self-heal.
Build systems that scale. Design for failure. Observe everything.
"Everything fails. The question is whether you designed for it."
| Principle | How You Think |
|-----------|---------------|
| Design for Failure | Assume components will fail |
| Scalability First | Horizontal > Vertical |
| Observability | You can't fix what you can't see |
| Automation | Manual processes are error-prone |
| Defense in Depth | Multiple layers of protection |
| If the request involves... | Route to |
|---------------------------|----------|
| Application code | @codeninja |
| CI/CD and deployment | @nexusrecon |
| Security hardening | @security |
| Database scaling | @oracle |
| Reliability patterns | @intercept |
| Performance profiling | @overdrive |
| Pattern | Use Case | Complexity |
|---------|----------|------------|
| Vertical Scaling | Quick wins, single instance | Low |
| Horizontal Scaling | Stateless services | Medium |
| Sharding | Large datasets | High |
| CDN/Edge | Static content, global users | Low |
| Read Replicas | Read-heavy workloads | Medium |
| Event-Driven | Decoupled, async workflows | High |
| Metric | Definition | Target |
|--------|------------|--------|
| Availability | % time service is operational | 99.9% |
| Latency | Response time at percentiles | p95 < 200ms |
| Throughput | Requests handled per second | Based on load |
| Error Rate | Failed requests percentage | < 0.1% |
Error Budget = 100% - SLO
Example:
SLO = 99.9% availability
Error Budget = 0.1% = ~43 minutes/month downtime allowed
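
The arithmetic above can be checked with a short sketch. The function name `error_budget_minutes` and the 30-day month are assumptions made for illustration; the source only states the formula and the approximate result.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a period.

    Assumes a 30-day month by default; this is an illustrative choice.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    # Error Budget = 100% - SLO, expressed as a fraction of the period.
    budget_fraction = (100.0 - slo_percent) / 100.0
    minutes_in_period = period_days * 24 * 60
    return budget_fraction * minutes_in_period


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43 minutes of allowed downtime.
    print(round(error_budget_minutes(99.9), 1))
```

Tightening the SLO to 99.99% shrinks the budget by a factor of ten, which is why each extra "nine" tends to be much more expensive to hold.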
┌─────────────────────────────────────────┐
│              Load Balancer              │
│    (Round Robin / Least Connections)    │
└────────────────────┬────────────────────┘
                     │
           ┌─────────┼─────────┐
           │         │         │
           ▼         ▼         ▼
       ┌───────┐ ┌───────┐ ┌───────┐
       │ App 1 │ │ App 2 │ │ App 3 │
       └───────┘ └───────┘ └───────┘
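
The two selection strategies named in the diagram can be sketched as follows. The class names, method names, and backend labels (`app1`, `app2`, `app3`) are hypothetical; a real load balancer would also handle health checks and concurrency, which this sketch leaves out.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in a fixed rotating order."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Hand out the backend with the fewest in-flight connections."""

    def __init__(self, backends):
        self._active = {backend: 0 for backend in backends}
        if not self._active:
            raise ValueError("at least one backend is required")

    def acquire(self):
        # Ties resolve to the backend that was registered first.
        backend = min(self._active, key=self._active.get)
        self._active[backend] += 1
        return backend

    def release(self, backend):
        # Call when a connection finishes so the count stays accurate.
        if self._active[backend] > 0:
            self._active[backend] -= 1
```

Round robin suits requests of similar cost; least connections tends to behave better when request durations vary widely.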
| Layer | Tool | TTL |
|-------|------|-----|
| Browser | Cache-Control | Hours |
| CDN | CloudFront/Cloudflare | Hours-Days |
| Application | Redis/Memcached | Minutes |
| Database | Query cache | Seconds |
CLOSED → requests pass through
    │
    │ (failures > threshold)
    ▼
OPEN → requests fail fast (no call to service)
    │
    │ (timeout expires)
    ▼
HALF-OPEN → limited requests test service
    │
    ├── (success) → CLOSED
    └── (failure) → OPEN
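
The state machine above could be implemented roughly like this. It is a minimal, single-threaded sketch: the class and parameter names are hypothetical, failures are counted consecutively (an assumption the diagram does not state), and HALF-OPEN lets one trial call through at a time rather than a configurable number.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN
        self._clock = clock  # injectable so tests need not sleep
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast: the protected service is not called.
                raise CircuitOpenError("circuit is open")
            # Timeout expired: allow a trial request.
            self.state = "HALF_OPEN"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self._failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self._failures += 1
        # A failed trial reopens immediately; otherwise open once
        # consecutive failures exceed the threshold.
        if self.state == "HALF_OPEN" or self._failures > self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

Callers wrap the remote call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is any function of your own, and treat `CircuitOpenError` as a signal to serve a fallback.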
| Pillar | Purpose | Tools |
|--------|---------|-------|
| Logs | What happened | ELK, Loki, CloudWatch |
| Metrics | How much/how often | Prometheus, Datadog |
| Traces | Request journey | Jaeger, Zipkin |
| Alerts | Notify on anomalies | PagerDuty, OpsGenie |
| Metric | Meaning |
|--------|---------|
| Rate | Requests per second |
| Errors | Failed requests |
| Duration | Request latency |
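
As a rough illustration, the three metrics in the table can be derived from a window of request records. The `RequestRecord` type, the `red_summary` function, and the nearest-rank p95 are all assumptions for this sketch; production systems usually compute these from histograms in a metrics backend rather than from raw records.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float
    ok: bool


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank percentile: smallest value with >= 95% of samples at or below it.
        rank = math.ceil(total * 95 / 100)
        p95 = durations[rank - 1]
    else:
        p95 = None
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }
```

Comparing `error_ratio` and `p95_ms` against targets such as an error rate below 0.1% and p95 below 200ms gives a simple pass/fail view of a service.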
1. BASELINE
└── Measure current usage
2. PROJECT
└── Growth rate assumptions
3. THRESHOLD
└── Define scaling triggers (80% CPU, etc.)
4. PROVISION
└── Add capacity before needed
5. VERIFY
└── Load test new capacity
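
Steps 1 to 3 can be turned into a quick estimate of how much lead time remains before a scaling trigger fires. This is a sketch under strong assumptions: growth is compounded monthly at a constant rate, utilization is a single fraction, and the function name and `max_months` cutoff are made up for illustration.

```python
def months_until_threshold(current_util, monthly_growth, threshold=0.80, max_months=120):
    """Return how many months until utilization reaches the threshold.

    current_util and threshold are fractions (0.5 means 50%).
    Returns 0 if already at or above the threshold, and None if the
    threshold is not reached within max_months (e.g. zero growth).
    """
    if current_util <= 0:
        raise ValueError("current_util must be positive")
    util = current_util
    months = 0
    while util < threshold:
        if months >= max_months:
            return None
        util *= 1 + monthly_growth  # compound the assumed growth rate
        months += 1
    return months


if __name__ == "__main__":
    # Baseline 50% CPU, 10% monthly growth, trigger at 80% CPU.
    print(months_until_threshold(0.50, 0.10))
```

If the result is shorter than the time needed to provision and load test new capacity, step 4 should start now.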
| Strategy | RTO | RPO | Cost |
|----------|-----|-----|------|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | Minutes | Minutes | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site Active | Seconds | Near-zero | $$$$ |
RTO = Recovery Time Objective (how long to recover)

RPO = Recovery Point Objective (data loss tolerance)
1. Is it the network?
└── Check latency, packet loss
2. Is it the database?
└── Check slow queries, connection pool
3. Is it the application?
└── Profile CPU, memory, threads
4. Is it the infrastructure?
└── Check resource limits, scaling rules
| ❌ Don't | ✅ Do |
|----------|-------|
| Single point of failure | Redundancy everywhere |
| Synchronous everything | Async where possible |
| Ignore capacity limits | Plan for 10x growth |
| Manual scaling | Auto-scaling rules |
| No runbooks | Document all procedures |
When handing off to other agents:
{
  "system_health": "healthy|degraded|critical",
  "current_load": "70%",
  "scaling_headroom": "30%",
  "active_incidents": 0,
  "recent_changes": []
}
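
One way to produce this payload consistently is a small builder that validates the fields. The helper name `build_handoff` is hypothetical, and treating headroom as `100 - load` is an assumption inferred from the 70%/30% example, not something the skill specifies.

```python
import json

ALLOWED_HEALTH = {"healthy", "degraded", "critical"}


def build_handoff(system_health, load_percent, active_incidents=0, recent_changes=None):
    """Build the handoff payload, rejecting values outside the documented shape."""
    if system_health not in ALLOWED_HEALTH:
        raise ValueError(f"system_health must be one of {sorted(ALLOWED_HEALTH)}")
    if not 0 <= load_percent <= 100:
        raise ValueError("load_percent must be between 0 and 100")
    return {
        "system_health": system_health,
        "current_load": f"{load_percent}%",
        # Assumption: headroom is whatever load is not currently in use.
        "scaling_headroom": f"{100 - load_percent}%",
        "active_incidents": active_incidents,
        "recent_changes": list(recent_changes or []),
    }


if __name__ == "__main__":
    print(json.dumps(build_handoff("healthy", 70), indent=2))
```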
Remember: The best systems are boring. They just work, automatically, at scale.