compliance/continuity/business-continuity/SKILL.md
Develop business continuity plans and impact analysis. Implement BCP testing and communication procedures. Use when building organizational resilience.
npx skillsauth add bagelhole/devops-security-agent-skills business-continuityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Develop and maintain business continuity capabilities including Business Impact Analysis, communication plans, recovery procedures, and testing schedules for organizational resilience.
bcp_phases:
1_governance:
actions:
- Obtain executive sponsorship and funding
- Assign BCP coordinator and team
- Define BCP scope and policy
- Establish BCP committee with cross-functional representation
deliverables:
- BCP policy statement
- BCP team charter and roster
- Scope document
2_analysis:
actions:
- Conduct Business Impact Analysis (BIA)
- Perform risk assessment for continuity threats
- Identify critical business processes and dependencies
- Determine recovery priorities and resource requirements
deliverables:
- BIA report
- Risk assessment report
- Critical process inventory
3_strategy:
actions:
- Select recovery strategies for each critical process
- Identify alternate work arrangements (remote, alternate site)
- Define technology recovery strategies (DR plan)
- Establish vendor and supply chain contingencies
deliverables:
- Recovery strategy document
- Technology recovery plan
- Alternate site arrangements
4_plan_development:
actions:
- Write detailed recovery procedures
- Develop communication plans (internal and external)
- Create emergency response procedures
- Document roles, responsibilities, and contact information
deliverables:
- Business Continuity Plan document
- Communication plan
- Emergency response procedures
- Contact lists and call trees
5_testing:
actions:
- Develop test plan and schedule
- Conduct exercises (tabletop, functional, full-scale)
- Evaluate results and identify gaps
- Update plans based on lessons learned
deliverables:
- Test plan
- Exercise reports
- Updated BCP based on findings
6_maintenance:
actions:
- Review and update BCP annually (minimum)
- Update after significant organizational changes
- Refresh BIA when business processes change
- Maintain training and awareness program
deliverables:
- Annual BCP review record
- Updated BIA (if changes occurred)
- Training completion records
bia_template:
process_assessment:
process_name: ""
process_owner: ""
department: ""
description: ""
criticality_classification:
mission_critical:
max_tolerable_downtime: "0-4 hours"
description: "Failure causes immediate, severe impact to customers or revenue"
examples:
- Payment processing
- Authentication and authorization
- Core API serving customer requests
- Order fulfillment
essential:
max_tolerable_downtime: "4-24 hours"
description: "Failure causes significant degradation but not complete loss"
examples:
- Customer support systems
- Reporting and dashboards
- Email and notifications
- Billing and invoicing
important:
max_tolerable_downtime: "1-3 days"
description: "Failure causes inconvenience and workarounds are available"
examples:
- Internal collaboration tools
- Analytics and BI platforms
- HR self-service systems
- Knowledge base
non_essential:
max_tolerable_downtime: "3-7 days"
description: "Failure has minimal operational impact"
examples:
- Development and test environments
- Training platforms
- Archive systems
impact_categories:
financial:
revenue_loss_per_hour: ""
penalty_or_fine_risk: ""
recovery_cost_estimate: ""
operational:
affected_employees: ""
affected_customers: ""
workaround_available: "yes/no"
workaround_description: ""
reputational:
customer_visibility: "high/medium/low"
media_attention_risk: "high/medium/low"
regulatory_reporting_required: "yes/no"
legal_regulatory:
compliance_impact: ""
contractual_sla_breach: "yes/no"
sla_penalty_details: ""
dependencies:
technology:
- system: ""
rto: ""
rpo: ""
dr_strategy: ""
people:
- role: ""
minimum_staff: ""
remote_capable: "yes/no"
vendors:
- vendor: ""
service: ""
sla: ""
alternative: ""
facilities:
- location: ""
alternative: ""
recovery_requirements:
rto: ""
rpo: ""
minimum_recovery_level: "Description of minimum acceptable service"
full_recovery_target: "Time to full normal operations"
communication_plan:
activation_criteria:
- Event affecting multiple critical systems
- Physical facility unavailable
- Pandemic or workforce availability crisis
- Major vendor/partner outage
- Cybersecurity incident with operational impact
internal_communication:
executive_notification:
who: "CEO, CTO, CFO, VP Engineering, VP Operations"
when: "Within 15 minutes of BCP activation"
method: "Phone call (primary), SMS (secondary)"
message_template: |
BUSINESS CONTINUITY EVENT ACTIVATED
Incident: [Brief description]
Impact: [Systems/processes affected]
Status: [Current state]
Next update: [Time]
Bridge call: [Number/link]
team_notification:
who: "All affected department leads and their teams"
when: "Within 30 minutes of BCP activation"
method: "Slack/Teams (primary), Email (secondary), SMS (tertiary)"
message_template: |
BCP ACTIVATED - [Event Type]
What happened: [Description]
What is affected: [Systems/services]
What to do: [Immediate actions for your team]
Status updates: [Channel/frequency]
Questions: Contact [BCP coordinator]
all_staff_notification:
who: "All employees"
when: "Within 1 hour of BCP activation"
method: "Email, Slack/Teams announcement, intranet"
content: "Situation summary, impact on work, expectations"
status_updates:
frequency: "Every 2 hours during active event, daily after stabilization"
channel: "Dedicated Slack channel, email distribution list"
content: "Current status, actions taken, next steps, timeline"
external_communication:
customers:
who: "Affected customers"
when: "Within 2 hours of BCP activation (if customer-facing impact)"
method: "Status page update, email, in-app notification"
message_template: |
We are currently experiencing [issue description].
Impact: [What customers may notice]
Status: We are actively working to resolve this.
Updates: Follow our status page at status.example.com
ETA: [Estimated resolution time or "investigating"]
regulatory:
who: "Applicable regulatory bodies"
when: "Per regulatory requirements (e.g., 72 hours for GDPR breach)"
method: "Formal notification per regulatory procedure"
media:
who: "Press inquiries"
when: "Only if media attention occurs"
method: "Prepared statement through communications team"
rule: "All media inquiries routed to designated spokesperson"
vendors_partners:
who: "Critical vendors and business partners"
when: "Within 4 hours if partner services affected"
method: "Direct contact via relationship manager"
contact_lists:
maintenance: "Updated quarterly"
storage: "Accessible offline (printed, mobile app, cloud-independent)"
includes:
- BCP team members (name, role, phone, email, alternate phone)
- Executive team
- Department leads
- Key vendor contacts
- Regulatory contacts
- Legal counsel
- Insurance broker
- PR/communications firm
recovery_procedures:
immediate_response:
step_1: "Incident commander assesses situation and declares BCP activation"
step_2: "Notify BCP team and establish command structure"
step_3: "Activate communication plan"
step_4: "Assess damage and determine scope of disruption"
step_5: "Initiate appropriate recovery procedures based on scenario"
scenario_specific:
data_center_or_region_outage:
- Activate DR failover procedures
- Redirect traffic to DR region
- Verify service restoration
- Communicate status to stakeholders
- Plan return to primary when available
cybersecurity_incident:
- Engage incident response team
- Contain the threat (isolate affected systems)
- Assess data impact and potential breach
- Activate forensic investigation
- Restore from known-good backups if needed
- Notify legal and regulatory as required
pandemic_workforce_disruption:
- Activate remote work procedures
- Verify VPN and remote access capacity
- Redistribute critical functions if staff unavailable
- Implement shift rotations to maintain coverage
- Assess vendor ability to maintain service levels
key_vendor_failure:
- Assess impact on dependent business processes
- Activate vendor contingency plan
- Engage alternate vendor if available
- Implement manual workarounds as needed
- Communicate impact to affected stakeholders
facility_unavailable:
- Account for all personnel safety
- Activate alternate work site arrangements
- Redirect mail and deliveries
- Set up temporary communication channels
- Assess timeline for facility restoration
stabilization:
- Monitor recovered services continuously
- Address any residual issues
- Begin planning return to normal operations
- Continue stakeholder communication
- Document all actions and decisions
return_to_normal:
- Develop return-to-normal plan
- Execute failback procedures (if DR was activated)
- Verify data consistency and integrity
- Restore standard operating procedures
- Conduct post-event review
- Update BCP based on lessons learned
testing_schedule:
tabletop_exercise:
frequency: "Quarterly"
duration: "2-3 hours"
participants: "BCP team, department leads, executive sponsor"
format: "Facilitated discussion of a scenario"
scenarios_to_rotate:
- Major cloud provider region outage
- Ransomware attack on production systems
- Key employee unavailability (bus factor scenario)
- Critical vendor goes out of business
- Office building inaccessible
output: "Exercise report with findings and action items"
functional_exercise:
frequency: "Semi-annually"
duration: "4-8 hours"
participants: "BCP team, IT operations, affected departments"
format: "Execute specific recovery procedures without full disruption"
examples:
- "Activate remote work for one department for a day"
- "Failover a non-production database and verify application connectivity"
- "Execute communication plan and verify contact list accuracy"
- "Restore a critical system from backup in an isolated environment"
output: "Functional test report with measured recovery times"
full_scale_exercise:
frequency: "Annually"
duration: "1-2 days"
participants: "All BCP team members, IT, communications, management"
format: "Simulate a major disruption and execute full recovery"
includes:
- "Activate BCP command structure"
- "Execute DR failover for production systems"
- "Activate communication plan"
- "Operate from alternate arrangements for set period"
- "Execute failback and return to normal"
output: "Full exercise report with comprehensive metrics and lessons learned"
testing_metrics:
- "Time to activate BCP command structure"
- "Time to complete communication notifications"
- "Contact list accuracy (% reachable)"
- "Actual RTO vs. target RTO per system"
- "Actual RPO vs. target RPO per system"
- "Number of issues identified"
- "Number of runbook corrections needed"
bcp_maintenance_checklist:
quarterly:
- [ ] Contact lists verified and updated
- [ ] Tabletop exercise conducted
- [ ] BCP team roster reviewed
- [ ] Vendor contact information verified
- [ ] Communication channels tested
semi_annually:
- [ ] Functional exercise conducted
- [ ] Recovery procedures reviewed for accuracy
- [ ] Technology dependencies verified
- [ ] Vendor continuity capabilities confirmed
annually:
- [ ] Full-scale exercise conducted
- [ ] Business Impact Analysis refreshed
- [ ] Risk assessment updated
- [ ] BCP document fully reviewed and updated
- [ ] Executive review and sign-off obtained
- [ ] Training completed for all BCP team members
- [ ] Lessons learned from all exercises incorporated
triggered_by_change:
- [ ] New critical business process added
- [ ] Major organizational restructuring
- [ ] Technology platform migration
- [ ] New regulatory requirement
- [ ] Significant vendor change
- [ ] Actual disruption event (post-event update)
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.