skills/vp-engineering/SKILL.md
VP Engineering perspective - org design (team topologies), process improvement, cross-team dependencies, engineering culture, OKRs, incident management maturity, platform strategy, DX optimization, release management at scale

| Type | Purpose | Characteristics | Size |
|------|---------|----------------|------|
| Stream-aligned | Deliver user/business value | Full-stack, autonomous, owns entire feature slice | 5-8 |
| Platform | Reduce cognitive load for stream teams | Internal products, self-service APIs/tools | 3-6 |
| Enabling | Help teams adopt new capabilities | Coaching, not doing; temporary engagement | 2-3 |
| Complicated subsystem | Deep specialist expertise | ML, payments, security, real-time systems | 2-4 |

| Mode | Description | When to Use |
|------|-------------|-------------|
| Collaboration | Teams work together closely | New capability discovery, high uncertainty |
| X-as-a-Service | One team provides, other consumes | Well-defined API/platform capability |
| Facilitating | One team coaches another | Skill transfer, technology adoption |

## Engineering Organization: [Company Name]
### Team Map
Stream-aligned Teams:
├── Team Alpha: [product area] (5 people)
│ Owns: [service/feature list]
│ Stack: [tech stack]
├── Team Beta: [product area] (6 people)
│ Owns: [service/feature list]
│ Stack: [tech stack]
└── Team Gamma: [product area] (5 people)
Owns: [service/feature list]
Stack: [tech stack]
Platform Team:
└── Team Platform (4 people)
Provides: CI/CD, observability, developer portal
Interaction: X-as-a-Service
Enabling Team:
└── Team Enable (2 people)
Focus: [current initiative - e.g., Kubernetes migration]
Interaction: Facilitating (rotates every quarter)
Complicated Subsystem:
└── Team ML (3 people)
Owns: ML pipeline, model serving, feature store
Interaction: Collaboration with stream teams
### Cognitive Load Assessment
| Team | Intrinsic (domain) | Extraneous (tools) | Total | Status |
|------|-------------------|--------------------|----|--------|
| Alpha | 6/10 | 3/10 | 9/10 | At capacity |
| Beta | 5/10 | 4/10 | 9/10 | At capacity |
| Platform | 7/10 | 2/10 | 9/10 | At capacity |
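
The assessment above can be computed rather than filled in by hand. The following is a minimal sketch, assuming the total is simply intrinsic plus extraneous load and that a total of 9 marks "At capacity", as in the table. The `TeamLoad` class, the `load_status` function, and the other status labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TeamLoad:
    name: str
    intrinsic: int   # domain complexity score, 0-10
    extraneous: int  # tooling/process overhead score, 0-10

    @property
    def total(self) -> int:
        return self.intrinsic + self.extraneous


def load_status(team: TeamLoad, capacity: int = 9) -> str:
    """Classify a team against an assumed capacity threshold."""
    if team.total > capacity:
        return "Overloaded"
    if team.total == capacity:
        return "At capacity"
    return "Has headroom"


# Scores taken from the example table above.
teams = [TeamLoad("Alpha", 6, 3), TeamLoad("Beta", 5, 4), TeamLoad("Platform", 7, 2)]
report = {t.name: (t.total, load_status(t)) for t in teams}
```

Keeping intrinsic and extraneous scores separate matters because only the extraneous part is something a platform team can reduce directly.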

| Anti-Pattern | Symptom | Fix |
|-------------|---------|-----|
| Conway's Law violation | Architecture doesn't match team structure | Align teams to desired architecture |
| Shared services bottleneck | Every team waits for "core team" | Split into platform + self-service |
| Matrix management | Unclear ownership, split loyalty | Single reporting line per IC |
| Too many meetings | "Alignment" overhead > execution | Reduce interaction surface, use async |
| Hero culture | One person knows everything | Document, pair, rotate on-call |

| Level | Name | Characteristics |
|-------|------|----------------|
| 1 | Initial | Ad-hoc, no process, firefighting |
| 2 | Managed | Basic scrum/kanban, inconsistent |
| 3 | Defined | Consistent process, metrics tracked |
| 4 | Measured | Data-driven decisions, predictable delivery |
| 5 | Optimizing | Continuous improvement, experiments |

## Process Improvement Cycle
### 1. Observe (1 sprint)
- Shadow team ceremonies
- Measure cycle time, WIP, defects
- Interview team members (1-on-1)
### 2. Diagnose
| Problem | Root Cause | Impact |
|---------|-----------|--------|
| [symptom] | [why] | [what it costs] |
### 3. Hypothesize
"If we [change], then [expected outcome], measured by [metric]"
### 4. Experiment (2-3 sprints)
- Implement ONE change at a time
- Measure baseline vs. new
- Collect team feedback
### 5. Evaluate
- Did the metric improve?
- Did the team feel the improvement?
- Any unintended side effects?
### 6. Adopt or Revert
- Improvement verified: document and standardize
- No improvement: revert and try next hypothesis
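
The Evaluate and Adopt-or-Revert steps can be made explicit with a small comparison of baseline and experiment measurements. This is a minimal sketch, assuming the metric is one where lower is better (such as cycle time in days) and assuming a 10% relative improvement is the bar for adoption. The function name `evaluate_experiment`, the threshold, and the sample numbers are all hypothetical.

```python
from statistics import median


def evaluate_experiment(baseline, experiment, min_improvement=0.10):
    """Compare median values of a lower-is-better metric and suggest a decision."""
    base = median(baseline)
    new = median(experiment)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    change = (base - new) / base  # positive means the metric improved
    decision = "adopt" if change >= min_improvement else "revert"
    return {
        "baseline_median": base,
        "experiment_median": new,
        "relative_improvement": change,
        "decision": decision,
    }


# Illustrative cycle times (days) before and during the experiment.
result = evaluate_experiment([5, 6, 7, 8, 6], [4, 5, 4, 6, 5])
```

The number only answers "did the metric improve?"; team feedback and side effects from step 5 still need a human read before standardizing the change.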

| Problem | Fix | Metric |
|---------|-----|--------|
| Missed deadlines | Smaller stories, better estimation | Story completion rate |
| Too much WIP | WIP limits (Kanban) | Cycle time |
| Unclear requirements | Refinement meetings, acceptance criteria | Defect rate |
| Deployment fear | Feature flags, canary deploys | Deploy frequency |
| Slow code reviews | SLA (24h max), small PRs | Review turnaround |
| Meeting overload | No-meeting days, async updates | Focus time % |

## Cross-Team Dependencies: [Quarter]
### Dependency Matrix
| Providing Team | Consuming Team | Dependency | Type | Status | Risk |
|---------------|---------------|-----------|------|--------|------|
| Platform | Alpha | Auth service v2 | Blocking | In progress | Medium |
| Alpha | Beta | User API | Non-blocking | Available | Low |
| ML | Gamma | Rec engine | Blocking | Not started | High |
### Dependency Types
- **Blocking:** Must be completed before consumer can start
- **Non-blocking:** Can work in parallel with mocked interface
- **Soft:** Nice to have, workaround exists
### Visualization
Team Alpha ──provides──→ Team Beta (User API, non-blocking)
Team Platform ──blocks──→ Team Alpha (Auth v2)
Team ML ──blocks──→ Team Gamma (Rec engine) ← HIGH RISK
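
The dependency matrix is easier to act on when it is kept as data. The sketch below, which uses the three rows from the matrix above, lists unfinished blocking dependencies with the highest risk first. The `Dependency` class, `RISK_ORDER`, and `escalation_list` are hypothetical names, and treating "Available" as the only finished status is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    provider: str
    consumer: str
    item: str
    kind: str    # "Blocking", "Non-blocking", or "Soft"
    status: str  # e.g. "Not started", "In progress", "Available"
    risk: str    # "Low", "Medium", or "High"


MATRIX = [
    Dependency("Platform", "Alpha", "Auth service v2", "Blocking", "In progress", "Medium"),
    Dependency("Alpha", "Beta", "User API", "Non-blocking", "Available", "Low"),
    Dependency("ML", "Gamma", "Rec engine", "Blocking", "Not started", "High"),
]

RISK_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def escalation_list(deps):
    """Return unfinished blocking dependencies, highest risk first."""
    open_blockers = [d for d in deps if d.kind == "Blocking" and d.status != "Available"]
    return sorted(open_blockers, key=lambda d: RISK_ORDER[d.risk])


for dep in escalation_list(MATRIX):
    print(f"{dep.provider} -> {dep.consumer}: {dep.item} [{dep.risk}]")
```

Running this during quarterly planning gives a short list to review first, with the ML to Gamma dependency at the top.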

| Strategy | How It Works |
|----------|-------------|
| Contract-first | Define API contract, both teams implement independently |
| Embedded engineer | Loan an engineer from providing team |
| Shared interface | Agree on interface, mock until ready |
| Prioritize differently | Move blocking work to top of providing team's backlog |
| Decouple | Feature flags, adapter pattern, event-driven |
| Eliminate | Redesign to remove dependency entirely |

| Anti-Pattern | Why It's Wrong | Right Way |
|-------------|-------------|-----------|
| Hidden dependencies | Discovered too late | Map dependencies in planning |
| Dependency as excuse | "Blocked by Team X" for weeks | Escalate immediately, find alternatives |
| Hub team (everything flows through one) | Bottleneck | Distribute ownership, self-service |
| Cross-team code ownership | Slow PRs, merge conflicts | Clear ownership boundaries |

## Engineering Culture: [Company Name]
### Our Values (with behaviors)
1. **Ownership**
- Do: Take responsibility end-to-end (build, deploy, monitor)
- Don't: "Not my code" / "That's an ops problem"
- Measure: On-call engagement, post-incident participation
2. **Craft**
- Do: Write tests, review thoughtfully, refactor proactively
- Don't: "Ship now, fix later" (unless P0)
- Measure: Code review quality, tech debt ratio
3. **Transparency**
- Do: Share context, document decisions, default to public channels
- Don't: Hoard information, use private DMs for team decisions
- Measure: Documentation coverage, team survey
4. **Learning**
- Do: Blameless retros, share mistakes, invest in growth
- Don't: Blame individuals, hide failures
- Measure: Retro action items completed, conference talks
5. **Speed**
- Do: Small PRs, feature flags, iterate quickly
- Don't: Big bang releases, analysis paralysis
- Measure: Lead time, deploy frequency

| Practice | Frequency | Owner | Goal |
|----------|-----------|-------|------|
| Blameless post-mortems | Per incident | Engineering managers | Learn from failures |
| Engineering all-hands | Monthly | VP Engineering | Alignment, wins, direction |
| Tech talks / brown bags | Biweekly | Rotating engineers | Knowledge sharing |
| Hack days / hackathon | Quarterly | Engineering leads | Innovation, morale |
| Architecture review | Biweekly | Architects | Consistency, quality |
| 1-on-1s | Weekly | Managers | Growth, retention |
| Skip-level 1-on-1s | Monthly | VP/Director | Pulse check, escalation |
| Engineering blog | Monthly+ | Rotating authors | Employer branding |
| Open source contributions | Continuous | Anyone | Community, recruitment |

## Engineering OKRs: Q[X] [Year]
### Objective 1: Accelerate delivery velocity
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR1.1: Reduce lead time from code to production | < 4 hours | 2 days | [on/off track] |
| KR1.2: Increase deploy frequency | 5x/day | 2x/week | [on/off track] |
| KR1.3: Reduce change failure rate | < 5% | 12% | [on/off track] |
### Objective 2: Improve developer experience
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR2.1: Developer satisfaction score | > 4.2/5 | 3.6/5 | [on/off track] |
| KR2.2: Reduce CI build time | < 5 min | 12 min | [on/off track] |
| KR2.3: New hire productive in < 2 weeks | 90% | 60% | [on/off track] |
### Objective 3: Strengthen reliability
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR3.1: Achieve 99.95% uptime | 99.95% | 99.8% | [on/off track] |
| KR3.2: Reduce MTTR to < 30 min | 30 min | 2 hours | [on/off track] |
| KR3.3: Zero P0 incidents from known issues | 0 | 3/quarter | [on/off track] |
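
To replace the `[on/off track]` placeholders with a progress signal, each KR can be scored as the fraction of the distance from its baseline to its target. This is a minimal sketch, assuming linear progress and treating the "Current" column above as the quarter-start baseline. The function name `kr_progress` and the mid-quarter readings in the example are hypothetical.

```python
def kr_progress(baseline: float, target: float, current: float) -> float:
    """Fraction of the way from baseline to target, clamped to [0, 1].

    Works for both "increase" KRs (target > baseline) and
    "reduce" KRs (target < baseline).
    """
    if target == baseline:
        raise ValueError("target must differ from baseline")
    fraction = (current - baseline) / (target - baseline)
    return max(0.0, min(1.0, fraction))


# KR2.2: CI build time, baseline 12 min, target 5 min, hypothetical reading of 9 min.
ci_progress = kr_progress(baseline=12, target=5, current=9)

# KR2.1: satisfaction score, baseline 3.6, target 4.2, hypothetical reading of 3.9.
satisfaction_progress = kr_progress(baseline=3.6, target=4.2, current=3.9)
```

A KR that regresses past its baseline scores 0 rather than going negative, which keeps roll-ups simple but hides how far it slipped; track the raw value alongside the score.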

| Anti-Pattern | Why It's Wrong | Right Way |
|-------------|-------------|-----------|
| Feature-based OKRs | "Ship feature X" is a task, not an outcome | Focus on outcomes ("Reduce churn by 10%") |
| Too many OKRs | Diluted focus | 3 objectives, 3-4 KRs each max |
| Binary KRs | No progress signal | Quantitative, measurable, with baseline |
| No alignment | Disconnected from company OKRs | Cascade from company → engineering → team |
| Set and forget | No mid-quarter check | Weekly tracking, monthly review |


| Level | Characteristics | Actions |
|-------|----------------|---------|
| 1: Reactive | No process, ad-hoc response, hero-driven | Document basic runbooks, assign on-call |
| 2: Organized | On-call rotation, basic alerting, Slack channel | Add severity classification, escalation paths |
| 3: Systematic | Incident commander role, structured comms, SLOs | Add blameless post-mortems, action item tracking |
| 4: Proactive | Error budgets, chaos engineering, SLO dashboards | Game days, automated remediation |
| 5: Predictive | ML-based anomaly detection, self-healing | Continuous improvement, near-zero MTTR |

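
Error budgets, which appear at the Proactive level, follow directly from an availability target. The sketch below converts an SLO percentage into allowed downtime, assuming a 90-day quarter as the measurement window. The function names are hypothetical, and the 40 minutes of downtime in the example is illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, window_days: int, downtime_minutes: float) -> float:
    """Minutes of budget left; a negative value means the budget is spent."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# A 99.95% target over 90 days allows roughly 64.8 minutes of downtime.
quarter_budget = error_budget_minutes(99.95, 90)
left = budget_remaining(99.95, 90, downtime_minutes=40)
```

Under the same assumption, 99.8% uptime corresponds to roughly 259 minutes of downtime per quarter, about four times the 99.95% budget.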
## Incident Response Structure
### Roles
| Role | Responsibility |
|------|---------------|
| Incident Commander (IC) | Coordinates response, makes decisions |
| Technical Lead | Diagnoses and fixes the issue |
| Communications Lead | Stakeholder updates, status page |
| Scribe | Documents timeline and actions |
### Severity Levels
| Level | Definition | Response Time | IC Required | Status Page | Exec Notify |
|-------|-----------|--------------|-------------|-------------|-------------|
| SEV-1 | Full outage | 5 min | Yes | Yes | Immediately |
| SEV-2 | Major degradation | 15 min | Yes | Yes | Within 1h |
| SEV-3 | Minor impact | 1 hour | No | Optional | No |
| SEV-4 | No user impact | Next business day | No | No | No |
### Communication Cadence
| SEV | Internal Update | External Update | Exec Update |
|-----|----------------|----------------|-------------|
| SEV-1 | Every 15 min | Every 30 min | Every 30 min |
| SEV-2 | Every 30 min | Every 1h | Every 2h |
| SEV-3 | Every 2h | If customer-facing | None |
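
The cadence table can be encoded as a lookup so that tooling, for example a chat reminder, can tell the Communications Lead when the next internal update is due. This is a minimal sketch using only the internal-update intervals from the table above. The dictionary and function names are hypothetical, and returning `None` for SEV-4 reflects that the table defines no cadence for it.

```python
from datetime import datetime, timedelta
from typing import Optional

# Internal update intervals from the Communication Cadence table.
INTERNAL_UPDATE_EVERY = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=2),
}

KNOWN_SEVERITIES = {"SEV-1", "SEV-2", "SEV-3", "SEV-4"}


def next_internal_update(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next internal update is due, or None if no cadence applies."""
    if severity not in KNOWN_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    interval = INTERNAL_UPDATE_EVERY.get(severity)
    if interval is None:
        return None
    return last_update + interval


due = next_internal_update("SEV-1", datetime(2024, 1, 1, 12, 0))
```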
## Platform Team Charter
### Mission
Reduce cognitive load on stream-aligned teams by providing self-service
infrastructure, tooling, and abstractions.
### Principles
1. Treat internal teams as customers
2. Self-service > ticket-based requests
3. Paved roads, not mandates
4. Measure developer experience, not just uptime
### Product Areas
| Area | What We Provide | Maturity |
|------|----------------|----------|
| CI/CD | Build pipelines, deploy automation | Mature |
| Observability | Logging, metrics, tracing, dashboards | Growing |
| Developer portal | Service catalog, docs, templates | Early |
| Infrastructure | K8s, databases, caching, queues | Mature |
| Security | Secret management, vulnerability scanning | Growing |
### Success Metrics
| Metric | Target | Current |
|--------|--------|---------|
| Time to onboard new service | < 1 day | 1 week |
| Developer satisfaction (platform) | > 4.0/5 | 3.5/5 |
| Self-service adoption rate | > 80% | 50% |
| Support tickets per team per month | < 5 | 12 |
### Roadmap (Next 2 Quarters)
| Quarter | Initiative | Impact |
|---------|-----------|--------|
| Q1 | Internal developer portal | Reduce onboarding time 50% |
| Q1 | Standardized service template | Consistent microservices |
| Q2 | Golden path for new services | < 1 hour to first deploy |
| Q2 | Self-service database provisioning | Remove DBA bottleneck |

| Anti-Pattern | Symptom | Fix |
|-------------|---------|-----|
| Building for no one | Platform features nobody asked for | Customer interviews, usage metrics |
| Mandatory adoption | Teams forced to use half-baked tools | Make it so good they want to use it |
| Ticket-based everything | Slow provisioning, frustrated teams | Self-service APIs and UIs |
| No documentation | Teams can't use platform without help | Treat docs as product |
| Ivory tower | Platform team disconnected from users | Embed with stream teams periodically |

| Metric | How to Measure | Target |
|--------|---------------|--------|
| Dev environment setup | Time from clone to running | < 15 min |
| CI build time | Pipeline duration (p50/p95) | < 5 min (p50) |
| Code review turnaround | PR open to first review | < 4 hours |
| Deploy to production | Merge to live | < 1 hour |
| Incident notification | Alert to human eyes | < 5 min |
| Documentation freshness | % docs updated in last 90 days | > 80% |
| On-call burden | Pages per week per person | < 2 |
| Context switching | Interruptions per focus block | < 1 |

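
The CI build time metric is stated as p50/p95, so it helps to be explicit about how those percentiles are computed. The following is a minimal sketch using the nearest-rank method, one of several common percentile definitions. The `percentile` helper and the sample durations are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative pipeline durations in minutes.
durations_min = [3.8, 4.1, 4.4, 4.6, 4.9, 5.2, 5.5, 6.0, 7.5, 12.0]

p50 = percentile(durations_min, 50)
p95 = percentile(durations_min, 95)
meets_target = p50 < 5.0  # target from the table: < 5 min (p50)
```

In this sample the median meets the target while the p95 is more than twice as long, which is why tracking both is worthwhile: the slow tail is what developers remember.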
## DX Improvement Plan
### Quick Wins (< 1 week each)
- [ ] Pre-configured dev containers / devbox
- [ ] One-command project setup script
- [ ] PR template with checklist
- [ ] Slack bot for deploy status
- [ ] Auto-assign code reviewers
### Medium Term (1-4 weeks)
- [ ] Reduce CI build time by 50%
- [ ] Local development matches production (docker-compose)
- [ ] API documentation auto-generated from code
- [ ] Error messages link to runbooks
- [ ] Feature flag self-service UI
### Long Term (1-3 months)
- [ ] Internal developer portal (Backstage/custom)
- [ ] Self-service infrastructure provisioning
- [ ] Automated dependency updates (Renovate)
- [ ] Golden path templates for new services
- [ ] DX survey and tracking dashboard
## Developer Experience Survey (Quarterly)
Rate 1-5 (1 = terrible, 5 = excellent):
### Development
1. How easy is it to set up your local dev environment?
2. How reliable is your local dev environment?
3. How fast is your CI/CD pipeline?
4. How easy is it to find and understand documentation?
### Collaboration
5. How efficient is your code review process?
6. How well does cross-team collaboration work?
7. How effective are your team's meetings?
### Operations
8. How manageable is on-call?
9. How good are your monitoring and alerting tools?
10. How confident are you in deploying to production?
### Growth
11. How supported do you feel in your career growth?
12. How much time do you spend on meaningful work vs. toil?
### Open Ended
13. What is the biggest time-waster in your day?
14. If you could change one thing about engineering, what would it be?
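
Survey results are easier to track quarter over quarter when the rated questions are rolled up by section. This is a minimal sketch, assuming responses arrive as dictionaries mapping question number to a 1-5 rating. The `CATEGORIES` mapping mirrors the sections above, while `category_scores` and the two sample responses are hypothetical.

```python
from statistics import mean

# Question numbers per survey section (open-ended questions 13-14 are excluded).
CATEGORIES = {
    "Development": range(1, 5),
    "Collaboration": range(5, 8),
    "Operations": range(8, 11),
    "Growth": range(11, 13),
}


def category_scores(responses):
    """Average rating per section; None if a section has no answers."""
    scores = {}
    for category, questions in CATEGORIES.items():
        ratings = [r[q] for r in responses for q in questions if q in r]
        scores[category] = round(mean(ratings), 2) if ratings else None
    return scores


# Two illustrative respondents.
responses = [
    {1: 4, 2: 4, 3: 2, 4: 2, 5: 5, 6: 3, 7: 4, 8: 3, 9: 3, 10: 3, 11: 4, 12: 2},
    {1: 2, 2: 2, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 5, 9: 5, 10: 5, 11: 4, 12: 4},
]
scores = category_scores(responses)
weakest = min((c for c in scores if scores[c] is not None), key=lambda c: scores[c])
```

The weakest section is a reasonable starting point for the next DX improvement plan, read together with the open-ended answers.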

| Strategy | When to Use | Complexity |
|----------|-------------|------------|
| Continuous deployment | Mature CI/CD, high test confidence | Low (automated) |
| Release train | Multi-team, coordinated releases | Medium |
| Feature flags | Decouple deploy from release | Medium |
| Blue-green deploy | Zero-downtime requirement | Medium |
| Canary release | Gradual rollout, risk mitigation | High |
| Ring deployment | Internal -> beta -> GA | High |

## Release Checklist: v[X.Y.Z]
### Pre-Release (T-2 days)
- [ ] All feature branches merged to release branch
- [ ] Release branch passes all tests
- [ ] Cross-team integration tests passing
- [ ] Dependent services compatible (API contracts)
- [ ] Database migrations tested
- [ ] Feature flags configured for new features
- [ ] Rollback plan documented
### Release Day (T-0)
- [ ] Release branch deployed to staging
- [ ] QA sign-off on staging
- [ ] Monitoring dashboards reviewed (baseline)
- [ ] On-call team briefed
- [ ] Canary deployment initiated
- [ ] Canary metrics monitored (error rate, latency, business KPIs)
- [ ] Full rollout completed
- [ ] Post-deploy verification
### Post-Release (T+1)
- [ ] Metrics compared to baseline
- [ ] No regression in error rates or latency
- [ ] Customer support briefed on changes
- [ ] Release notes published
- [ ] Feature flags cleaned up (remove old)
- [ ] Retrospective scheduled (if issues occurred)

| Metric | What It Measures | Target |
|--------|-----------------|--------|
| Release frequency | How often we ship | Weekly or more |
| Release lead time | Code complete to production | < 1 day |
| Release success rate | Releases without rollback | > 95% |
| Rollback rate | How often we revert | < 5% |
| Hotfix frequency | Emergency fixes needed | < 1/month |
| Feature flag cleanup | Stale flags removed | Within 30 days |

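
Several of these release metrics can be derived from a simple release log. The sketch below assumes each release is recorded as a dictionary with `rolled_back` and `hotfix` flags. That record shape, the function name `release_metrics`, and the sample log are hypothetical.

```python
def release_metrics(releases):
    """Compute success rate, rollback rate, and hotfix count from a release log."""
    total = len(releases)
    if total == 0:
        raise ValueError("release log is empty")
    rollbacks = sum(1 for r in releases if r["rolled_back"])
    hotfixes = sum(1 for r in releases if r["hotfix"])
    return {
        "success_rate": (total - rollbacks) / total,  # releases without rollback
        "rollback_rate": rollbacks / total,
        "hotfixes": hotfixes,
    }


# Illustrative log: 20 releases, one rolled back, one of them a hotfix.
log = [{"rolled_back": False, "hotfix": False} for _ in range(18)]
log.append({"rolled_back": True, "hotfix": False})
log.append({"rolled_back": False, "hotfix": True})

metrics = release_metrics(log)
```

With one rollback in 20 releases the success rate is exactly 95%, which sits on the boundary and does not satisfy a strict "> 95%" target.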

| Anti-Pattern | Why It's Wrong | Right Way |
|-------------|-------------|-----------|
| "Big bang" releases | High risk, hard to debug | Small, frequent releases |
| Release branch lives too long | Merge conflicts, integration hell | Short-lived, merge daily |
| Manual release process | Error-prone, slow | Fully automated pipeline |
| No rollback plan | Stuck with broken release | Always have rollback procedure |
| Feature flags never cleaned | Combinatorial explosion | Clean up within 30 days |
| Friday deployments | Nobody around for issues | Deploy Mon-Thu, observe Fri |
| No release notes | Users/support confused | Automated changelog generation |