plugins/devops-sre/skills/on-call/on-call-best-practices/SKILL.md
Manage on-call rotations with sustainable practices, fair scheduling, and effective handoffs. Use this skill when setting up on-call, improving on-call experience, or managing rotations. Activate when: on-call, pagerduty, rotation, schedule, handoff, on-call burden, being paged, night pages, weekend on-call, on-call fatigue.
npx skillsauth add latestaiagents/agent-skills on-call-best-practices
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Sustainable on-call that protects engineers and keeps systems reliable.
"On-call should be a learning opportunity, not a punishment."
Primary On-Call (24/7):
- First responder for all pages
- 1-week shifts (max)
- Clear handoff process
Secondary On-Call (24/7):
- Backup if primary unavailable
- Can be a shadow role for training
- Steps in if primary overloaded
Business Hours Escalation:
- Subject matter experts
- Available for complex issues
- Not paged at night
Week    Mon    Tue    Wed    Thu    Fri    Sat    Sun
───────────────────────────────────────────────────────
Jan 6   Alice  Alice  Alice  Alice  Alice  Alice  Alice
Jan 13  Bob    Bob    Bob    Bob    Bob    Bob    Bob
Jan 20  Carol  Carol  Carol  Carol  Carol  Carol  Carol
Jan 27  Dave   Dave   Dave   Dave   Dave   Dave   Dave
Feb 3   Alice  ...
Secondary follows same pattern, offset by 1 week
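
A minimal sketch of how a weekly rotation like this could be generated. The function name `build_rotation`, the year, and the reading of "offset by 1 week" (the secondary is whoever takes primary the following week) are assumptions for illustration, not part of the original skill.

```python
from datetime import date, timedelta


def build_rotation(engineers, first_week_start, weeks):
    """Return a list of (week_start, primary, secondary) tuples."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Assumption: the secondary is the engineer who is primary next week.
        secondary = engineers[(week + 1) % len(engineers)]
        week_start = first_week_start + timedelta(weeks=week)
        schedule.append((week_start, primary, secondary))
    return schedule


# The year is an assumption; the schedule above only gives month and day.
rotation = build_rotation(["Alice", "Bob", "Carol", "Dave"], date(2026, 1, 6), 5)
for week_start, primary, secondary in rotation:
    print(f"{week_start:%b %d}  primary={primary:<6} secondary={secondary}")
```

With four engineers, the fifth week wraps back to Alice, matching the `Feb 3` row.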
| Guideline | Recommendation |
|-----------|----------------|
| Shift length | 1 week max, shorter if high volume |
| Gap between shifts | 2+ weeks minimum |
| Consecutive nights | Comp time if >2 pages |
| Holidays | Volunteer-based, compensated |
| Team size | 4+ people for sustainable rotation |
| Severity | Acknowledge | Respond | Escalate If |
|----------|-------------|---------|-------------|
| SEV1 | 5 min | Immediate | No ack in 5 min |
| SEV2 | 15 min | 30 min | No ack in 15 min |
| SEV3 | 1 hour | 4 hours | Business hours |
| SEV4 | Best effort | Next day | N/A |
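
The escalation column above could be encoded roughly as follows. This is a sketch only: `ESCALATE_AFTER_MINUTES` and `should_escalate` are hypothetical names, and treating SEV3 and SEV4 as having no timer-based escalation is an interpretation of "Business hours" and "N/A". In practice your paging tool's escalation policy would hold these values.

```python
from typing import Optional

# Minutes without acknowledgement before a page escalates.
# None means no timer-based escalation (assumed for SEV3 and SEV4).
ESCALATE_AFTER_MINUTES: dict[str, Optional[int]] = {
    "SEV1": 5,
    "SEV2": 15,
    "SEV3": None,  # handled during business hours rather than by a timer
    "SEV4": None,  # best effort, no escalation
}


def should_escalate(severity: str, minutes_unacknowledged: float) -> bool:
    """Return True if an unacknowledged page should go to the secondary."""
    limit = ESCALATE_AFTER_MINUTES[severity]
    if limit is None:
        return False
    return minutes_unacknowledged >= limit


print(should_escalate("SEV1", 6))    # SEV1 unacknowledged for 6 minutes
print(should_escalate("SEV2", 10))   # SEV2 still inside its 15-minute window
```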
During your shift:
✓ Phone charged and with you
✓ Laptop accessible within 15 min
✓ Reliable internet access
✓ Not impaired (alcohol, etc.)
✓ Able to focus if paged
You are NOT expected to:
✗ Be at your desk 24/7
✗ Respond instantly to Slack
✗ Fix everything yourself
✗ Work full normal hours on top of on-call work
## On-Call Handoff
**Outgoing:** @alice
**Incoming:** @bob
**Date:** 2026-01-13 09:00 UTC
### Active Issues
- [ ] INC-123: Monitoring elevated error rate (context: ...)
- [ ] Deployment in progress: api-service v2.3.4
### Watch Items
- Payment processor maintenance tonight 02:00-04:00 UTC
- New monitoring rolled out, may be noisy
### Recent Incidents
- INC-121: Resolved, postmortem scheduled Friday
- INC-122: Resolved, no action needed
### Runbook Updates
- Updated: database/connection-pool-reset (added step 3)
- Outdated: search/reindex (needs review)
### Notes
- Had 3 pages this week, all during business hours
- Nothing woke me up at night
- Good luck! 🍀
1. Walk through active issues
2. Highlight anything unusual
3. Share context not in writing
4. Confirm contact info current
5. Test page to verify setup
| Metric | Healthy | Action If Unhealthy |
|--------|---------|---------------------|
| Pages/week | <5 | Review alert thresholds |
| Night pages/week | <1 | Investigate or fix root causes |
| MTTA | <5 min | Check notification settings |
| Time to resolve | <30 min avg | Improve runbooks |
| % actionable | >80% | Reduce noisy alerts |
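
As one way to put these thresholds to work, the sketch below checks a week's numbers against the table and lists the suggested action for each unhealthy metric. The metric keys and the function name `review_on_call_health` are assumptions for illustration. The sample week is made up.

```python
# Each entry: metric name -> (check that returns True when healthy, action otherwise).
HEALTH_CHECKS = {
    "pages_per_week": (lambda v: v < 5, "Review alert thresholds"),
    "night_pages_per_week": (lambda v: v < 1, "Investigate or fix root causes"),
    "mtta_minutes": (lambda v: v < 5, "Check notification settings"),
    "avg_resolve_minutes": (lambda v: v < 30, "Improve runbooks"),
    "actionable_pct": (lambda v: v > 80, "Reduce noisy alerts"),
}


def review_on_call_health(metrics):
    """Return one 'metric=value: action' line per unhealthy metric."""
    actions = []
    for name, value in metrics.items():
        is_healthy, action = HEALTH_CHECKS[name]
        if not is_healthy(value):
            actions.append(f"{name}={value}: {action}")
    return actions


# Made-up sample week, not real data.
week = {
    "pages_per_week": 3,
    "night_pages_per_week": 2,
    "mtta_minutes": 4,
    "avg_resolve_minutes": 45,
    "actionable_pct": 90,
}
for line in review_on_call_health(week):
    print(line)
```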
1. Fix the root cause
- Every incident should have action items
- Track action item completion
2. Improve detection
- Catch issues before they page
- Add canary deployments
3. Automate remediation
- Auto-restart crashed services
- Auto-scale on high load
- Self-healing infrastructure
4. Improve runbooks
- Clear, tested procedures
- One-click remediation where possible
5. Reduce noise
- Tune alert thresholds
- Add deduplication
- Use proper severity levels
Recommended compensation models:
1. Stipend Model
- Fixed amount per on-call week
- Example: $500/week on-call
2. Per-Page Model
- Base stipend + per-page bonus
- Example: $200/week + $50/page
3. Comp Time Model
- Time off for night/weekend pages
- Example: 2 hours off per night page
4. Combined Model
- Stipend + comp time for disruption
- Most engineer-friendly
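
To compare the models, a rough calculation using the example figures above may help. The function `weekly_compensation` is hypothetical, and because no figures are given for the combined model, it assumes the $500 stipend plus 2 hours of comp time per night page.

```python
def weekly_compensation(model, pages=0, night_pages=0):
    """Return (dollars, comp_hours) for one on-call week under a given model."""
    if model == "stipend":
        return (500, 0)                    # fixed $500/week
    if model == "per_page":
        return (200 + 50 * pages, 0)       # $200/week + $50/page
    if model == "comp_time":
        return (0, 2 * night_pages)        # 2 hours off per night page
    if model == "combined":
        # Assumed figures: the source gives no amounts for this model.
        return (500, 2 * night_pages)
    raise ValueError(f"unknown model: {model}")


# A week with 3 pages, one of them at night.
for model in ("stipend", "per_page", "comp_time", "combined"):
    dollars, hours = weekly_compensation(model, pages=3, night_pages=1)
    print(f"{model:<10} ${dollars} + {hours}h comp time")
```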
✓ Clear escalation paths
✓ Secondary on-call backup
✓ Manager support for difficult situations
✓ Mental health resources
✓ Training and shadowing for new on-callers
✓ Blameless postmortem culture
Week 1: Observe
- Shadow primary on-call
- Read all runbooks
- Review recent incidents
Week 2: Assisted
- Take some pages with backup
- Primary available immediately
- Debrief after each incident
Week 3: Primary with Safety Net
- Primary on-call
- Experienced shadow
- Extended escalation time
Week 4+: Full Primary
- Normal on-call duties
- Standard escalation paths
□ Access to all systems
□ Can reach all tools (VPN, etc.)
□ Know escalation paths
□ Reviewed all runbooks
□ Understand SLO/SLA
□ Know who SMEs are
□ Have done a test page
□ Know how to declare incident
1. Sustainable rotation size (4+ people)
2. Enforce gap between shifts
3. Comp time for disruption
4. Regular feedback loops
5. Continuously reduce burden
6. Leadership does on-call too
- Talk to your manager
- Request temporary rotation skip
- Ask for additional support
- Suggest rotation improvements
- It's okay to ask for help