.archive/ops-team/skills/ops-platform-onboarding/SKILL.md
Structured workflow for onboarding new services to the internal platform including infrastructure provisioning, observability setup, and documentation.
This skill defines the structured process for onboarding services to the internal developer platform. Use it to ensure consistent, compliant service deployments.
| Phase | Focus | Output |
|-------|-------|--------|
| 1. Requirements | Gather service requirements | Requirements doc |
| 2. Golden Path Selection | Choose deployment pattern | Selected template |
| 3. Infrastructure Provisioning | Create service resources | Infrastructure ready |
| 4. Observability Setup | Configure monitoring | Dashboards/alerts |
| 5. Security Configuration | Apply security controls | Security validated |
| 6. Documentation | Complete service docs | Runbook ready |
| 7. Handoff | Transfer to service team | Ownership confirmed |
## Service Onboarding Request
**Service Name:** [name]
**Team:** [owning team]
**Requested By:** [name]
**Target Date:** YYYY-MM-DD
### Service Information
| Attribute | Value |
|-----------|-------|
| Service type | [API / Worker / Batch / Frontend] |
| Language/runtime | [Go / Node.js / Python / etc.] |
| Criticality | [Tier 1/2/3/4] |
| External traffic | [Yes / No] |
| Data sensitivity | [PII / Financial / Public] |
### Resource Requirements
| Resource | Requirement | Notes |
|----------|-------------|-------|
| CPU | [cores] | [peak/average] |
| Memory | [GB] | [peak/average] |
| Storage | [GB] | [type: SSD/HDD] |
| Database | [type] | [shared/dedicated] |
| Cache | [type] | [shared/dedicated] |
### Dependencies
| Dependency | Type | SLA Required |
|------------|------|--------------|
| [service] | Internal | [Yes/No] |
| [external] | External | [Yes/No] |
### Compliance Requirements
- [ ] SOC2
- [ ] PCI-DSS
- [ ] GDPR
- [ ] HIPAA
- [ ] Other: ____________
| Golden Path | Use Case | Includes |
|-------------|----------|----------|
| api-service | REST/GraphQL APIs | ALB, EKS, RDS, ElastiCache |
| worker-service | Background processing | SQS, EKS, auto-scaling |
| batch-job | Scheduled jobs | EventBridge, Lambda/Fargate |
| frontend-app | Static sites, SPAs | CloudFront, S3, API Gateway |
| data-pipeline | ETL, streaming | Kinesis, Glue, S3 |

| Requirement | api-service | worker-service | batch-job |
|-------------|-------------|----------------|-----------|
| HTTP traffic | Yes | No | No |
| Queue processing | Optional | Yes | Optional |
| Scheduled runs | No | No | Yes |
| Real-time | Yes | Near-real-time | No |
| Auto-scaling | Yes | Yes | N/A |
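The decision matrix above can be read mechanically. The following is a minimal sketch, not part of the platform tooling: it encodes only the first three rows of the matrix and assumes that "Yes" means the path requires the capability, "No" means it excludes it, and "Optional" matches either way. The names `MATRIX` and `candidate_paths` are hypothetical.

```python
from typing import Dict, List

# First three rows of the decision matrix, transcribed as-is.
# "yes" / "no" / "optional" mirror the table cells.
MATRIX: Dict[str, Dict[str, str]] = {
    "api-service": {"http": "yes", "queue": "optional", "scheduled": "no"},
    "worker-service": {"http": "no", "queue": "yes", "scheduled": "no"},
    "batch-job": {"http": "no", "queue": "optional", "scheduled": "yes"},
}


def _fits(cell: str, needed: bool) -> bool:
    """Assumed reading: 'yes' requires the need, 'no' excludes it, 'optional' allows both."""
    if cell == "yes":
        return needed
    if cell == "no":
        return not needed
    return True


def candidate_paths(http: bool, queue: bool, scheduled: bool) -> List[str]:
    """Return every golden path whose matrix row is compatible with the stated needs."""
    needs = {"http": http, "queue": queue, "scheduled": scheduled}
    return [
        path
        for path, row in MATRIX.items()
        if all(_fits(row[key], needs[key]) for key in needs)
    ]


if __name__ == "__main__":
    print(candidate_paths(http=True, queue=False, scheduled=False))
    print(candidate_paths(http=True, queue=False, scheduled=True))
```

An empty result (for example, a service that needs both HTTP traffic and scheduled runs) suggests the request does not fit a single golden path and would need the customization and approval steps in the selection template.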
## Golden Path Selection
**Service:** [name]
**Selected Path:** [api-service / worker-service / etc.]
### Rationale
1. Service type [X] matches [golden path] pattern
2. Traffic requirements of [X] supported by [features]
3. Compliance requirements met by built-in [controls]
### Customizations Required
| Standard Component | Customization | Reason |
|--------------------|---------------|--------|
| [component] | [change] | [why] |
### Approval
- [ ] Platform team reviewed
- [ ] Security team reviewed (if customizations)
- [ ] Architecture team reviewed (if non-standard)
```hcl
# Example service provisioning
module "service" {
  source = "platform/service-template"

  service_name = var.service_name
  team         = var.team
  environment  = var.environment
  golden_path  = "api-service"

  # Compute
  cpu_request    = "500m"
  memory_request = "512Mi"
  replicas_min   = 2
  replicas_max   = 10

  # Database
  database_enabled = true
  database_class   = "db.t3.medium"

  # Tags
  tags = {
    Team        = var.team
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}
```
```bash
# Verify namespace
kubectl get namespace [service-name]

# Verify compute
kubectl get deployment -n [service-name]

# Verify database
aws rds describe-db-instances --db-instance-identifier [service-db]

# Verify DNS
dig [service-name].internal.example.com
```
Standard service dashboard includes:
| Panel | Metrics |
|-------|---------|
| Request rate | requests/sec, by status code |
| Error rate | 5xx rate, 4xx rate |
| Latency | p50, p95, p99 |
| Saturation | CPU, memory utilization |
| Dependencies | Upstream/downstream health |

| Alert | Condition | Severity | Response |
|-------|-----------|----------|----------|
| High error rate | 5xx > 1% for 5m | Critical | Page on-call |
| High latency | p99 > 1s for 5m | Warning | Alert team |
| Low availability | uptime < 99.9% | Critical | Page on-call |
| Resource saturation | CPU > 85% for 10m | Warning | Alert team |
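To make the "for 5m" and "for 10m" conditions in the alert table concrete, here is a rough sketch of how they could be evaluated. It assumes one sample per minute, oldest first, and treats a condition as met only when every sample in the window exceeds the threshold. The function names are hypothetical; a real setup would express these rules in the monitoring system's own alerting language.

```python
from typing import List, Tuple


def sustained_above(samples: List[float], threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` one-minute samples all exceed `threshold`."""
    if len(samples) < minutes:
        return False  # not enough history to say the condition held for the full window
    return all(value > threshold for value in samples[-minutes:])


def firing_alerts(
    error_rate_pct: List[float],
    p99_seconds: List[float],
    cpu_pct: List[float],
    uptime_pct: float,
) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs using the thresholds from the alert table."""
    alerts: List[Tuple[str, str]] = []
    if sustained_above(error_rate_pct, 1.0, 5):
        alerts.append(("High error rate", "Critical"))
    if sustained_above(p99_seconds, 1.0, 5):
        alerts.append(("High latency", "Warning"))
    if uptime_pct < 99.9:
        alerts.append(("Low availability", "Critical"))
    if sustained_above(cpu_pct, 85.0, 10):
        alerts.append(("Resource saturation", "Warning"))
    return alerts


if __name__ == "__main__":
    print(firing_alerts([2.0] * 5, [0.2] * 5, [50.0] * 10, 99.95))
```

Requiring every sample in the window to breach the threshold keeps a single spike from paging on-call, which is the usual intent behind a "for 5m" clause.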
## Service Level Objectives
**Service:** [name]
**SLO Version:** 1.0
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful requests / total requests |
| Latency | p99 < 500ms | Request duration percentile |
| Error rate | < 0.1% | 5xx responses / total responses |
### Error Budget
- Monthly budget: 43.2 minutes downtime
- Current consumption: [X]%
- Actions if budget exceeded: [escalation process]
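The 43.2-minute figure follows from the 99.9% availability target over a 30-day window: 30 × 24 × 60 = 43,200 minutes, of which 0.1% may be downtime. A small sketch of that arithmetic, assuming a time-based budget and a 30-day month (both function names are hypothetical):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the given window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_consumed_percent(
    downtime_minutes: float, slo_percent: float, window_days: int = 30
) -> float:
    """Share of the error budget already used, as a percentage."""
    return downtime_minutes / error_budget_minutes(slo_percent, window_days) * 100.0


if __name__ == "__main__":
    # Rounded for display; floating-point arithmetic is not exact.
    print(round(error_budget_minutes(99.9), 1))
    print(round(budget_consumed_percent(21.6, 99.9), 1))
```

The second helper fills in the "Current consumption" line of the template: 21.6 minutes of downtime against a 99.9% target uses about half of the monthly budget.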
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: service-policy
  namespace: [service-name]
spec:
  podSelector:
    matchLabels:
      app: [service-name]
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: istio-system
      ports:
        - port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: database
      ports:
        - port: 5432
```
## Security Configuration Review
**Service:** [name]
**Reviewer:** @security-team
| Control | Status | Notes |
|---------|--------|-------|
| mTLS enabled | PASS | Istio strict mode |
| Network policies | PASS | Ingress/egress restricted |
| Secrets management | PASS | Using Vault |
| Least privilege IAM | PASS | Scoped to required resources |
| Vulnerability scanning | PASS | Trivy in CI/CD |
| Document | Purpose | Template |
|----------|---------|----------|
| Service Overview | What the service does | README.md |
| Runbook | Operational procedures | runbook.md |
| Architecture | Design decisions | architecture.md |
| API Docs | Interface documentation | OpenAPI spec |
| On-call Guide | Incident handling | oncall.md |
## [Service Name] Runbook
### Service Overview
[Brief description of what the service does]
### Quick Reference
| Item | Value |
|------|-------|
| Repository | [link] |
| Dashboard | [link] |
| Logs | [query link] |
| On-call | [PagerDuty service] |
### Common Operations
#### Restart Service
```bash
kubectl rollout restart deployment/[service] -n [namespace]
```
#### Scale Service
```bash
kubectl scale deployment/[service] -n [namespace] --replicas=X
```
#### View Logs
```bash
kubectl logs -l app=[service] -n [namespace] --tail=100
```
| Symptom | Possible Cause | Resolution |
|---------|----------------|------------|
| High latency | DB connection pool | Scale DB or optimize queries |
| 5xx errors | Dependency down | Check upstream services |
| OOM kills | Memory leak | Investigate heap, restart |

| Level | Contact | When |
|-------|---------|------|
| L1 | [team Slack channel] | First response |
| L2 | [on-call engineer] | Cannot resolve in 15m |
| L3 | [service owner] | Critical/extended outage |
---
## Phase 7: Handoff
### Handoff Checklist
- [ ] Service owner identified and trained
- [ ] On-call rotation set up
- [ ] Access provisioned to team
- [ ] Documentation reviewed by team
- [ ] Shadowing session completed
- [ ] Ownership officially transferred
### Handoff Template
```markdown
## Service Handoff Confirmation
**Service:** [name]
**Date:** YYYY-MM-DD
**Platform Team:** @[name]
**Service Owner:** @[name]
### Completed Items
- [x] Infrastructure provisioned and documented
- [x] Observability configured
- [x] Security controls applied
- [x] Runbook created and reviewed
- [x] On-call rotation configured
- [x] Training session completed
### Outstanding Items
| Item | Owner | Due Date |
|------|-------|----------|
| [item] | [owner] | YYYY-MM-DD |
### Acknowledgment
By signing below, the service owner confirms:
1. Receipt of all documentation
2. Understanding of operational procedures
3. Acceptance of on-call responsibilities
**Service Owner:** _________________ Date: _______
**Platform Team:** _________________ Date: _______
```
| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Skip documentation, code is self-explanatory" | On-call != developers | Complete runbook |
| "We'll add observability later" | Blind deployments = incidents | Observability on day 1 |
| "Golden path doesn't fit exactly" | Customizations add complexity | Justify every deviation |
| "Security can come later" | Later = never for security | Security from start |
| "Team can figure it out" | Assumptions cause outages | Complete handoff process |
For platform onboarding tasks, dispatch:
```yaml
Task tool:
  subagent_type: "ring:platform-engineer"
  prompt: |
    SERVICE ONBOARDING REQUEST
    Service: [name]
    Team: [team]
    Type: [API/Worker/Batch]
    Requirements: [summary]
    Golden Path: [if known]
```