skills/ml-system-design-interview/SKILL.md
Coaches end-to-end ML system design interviews covering inference pipelines, recommendation systems, RAG, feature stores, and monitoring. Use for L6+ design rounds, ML architecture whiteboarding, system design practice, serving tradeoff analysis. Activate on "ML system design", "ML interview", "recommendation system design", "RAG architecture", "feature store design", "model serving". NOT for coding interviews, behavioral questions, ML theory quizzes, or paper implementations.
Install this skill globally with one command: `npx skillsauth add curiositech/windags-skills ml-system-design-interview`. Works with Claude Code, Cursor, and Windsurf.
End-to-end ML pipeline design coaching for staff+ engineers. Covers the full arc from problem definition through production monitoring -- the scope expected at L6+ interviews at top-tier ML organizations.
Query: What inference pattern should we use?
1. Input Predictability?
├─ Finite/enumerable → BATCH PREDICTION
│  └─ Example: recommend videos for all users nightly
└─ Infinite/dynamic → ONLINE INFERENCE
   └─ Go to #2
2. Latency Requirement?
├─ <100ms → ONLINE INFERENCE (cached features)
│  └─ Example: search ranking, fraud detection
├─ 100ms-1s → NEAR-REAL-TIME
│  └─ Example: feed ranking with fresh user signals
└─ >1s → STREAMING COMPUTATION
   └─ Example: complex multi-modal analysis
3. Feature Freshness?
├─ Static features only → Precompute embeddings
├─ Daily updates acceptable → Feature store + cache
└─ Real-time required → Online feature computation
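The three questions above can be read as a small lookup. The following is a minimal sketch of that reading, not a definitive implementation; the names `Pattern`, `choose_inference_pattern`, and `choose_feature_strategy`, and the string keys for freshness, are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Pattern(Enum):
    BATCH = "batch prediction"
    ONLINE = "online inference (cached features)"
    NEAR_REAL_TIME = "near-real-time"
    STREAMING = "streaming computation"


def choose_inference_pattern(inputs_enumerable: bool, latency_budget_ms: float) -> Pattern:
    """Questions 1 and 2 of the tree: input predictability, then latency."""
    if inputs_enumerable:
        # Finite input space: predictions can be precomputed offline.
        return Pattern.BATCH
    if latency_budget_ms < 100:
        return Pattern.ONLINE
    if latency_budget_ms <= 1000:
        return Pattern.NEAR_REAL_TIME
    return Pattern.STREAMING


def choose_feature_strategy(freshness: str) -> str:
    """Question 3 of the tree. Keys are illustrative labels, not a standard."""
    strategies = {
        "static": "precompute embeddings",
        "daily": "feature store + cache",
        "real_time": "online feature computation",
    }
    return strategies[freshness]
```

In an interview the value is less in the code than in saying the branch conditions out loud before naming an architecture.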
Given a vague prompt like "design a recommendation system":
1. Business Objective Clarification
├─ Revenue optimization? → Focus on purchase conversion metrics
├─ Engagement optimization? → Focus on time-spent, retention metrics
└─ Exploration/discovery? → Focus on diversity, serendipity metrics
2. Scale Assessment
├─ <1M users → Simple collaborative filtering acceptable
├─ 1-100M users → Need distributed training, caching layer
└─ >100M users → Multi-stage retrieval mandatory (retrieval→ranking)
3. Cold Start Criticality
├─ High new user/item churn → Content-based features essential
├─ Stable catalog → Collaborative filtering sufficient
└─ Mixed → Hybrid approach with exploration bonus
Start simple, justify complexity upgrades:
1. Baseline Decision
├─ Tabular features → XGBoost/LightGBM
├─ Text/sequences → BERT/RoBERTa fine-tuned
└─ Multi-modal → Pre-trained encoders + MLP
2. Production Upgrade Triggers
├─ Baseline hits accuracy ceiling → Deep architecture
├─ Cold start problems → Content-based + collaborative hybrid
├─ Real-time features critical → Online learning/streaming updates
└─ Multi-objective conflicts → Multi-task architecture
3. Scaling Triggers
├─ Inference latency >SLA → Model distillation/quantization
├─ Training time >acceptable → Distributed training
└─ Serving cost >budget → Cheaper architecture or caching
Symptoms: New users see irrelevant recommendations, bounce immediately, and never generate the engagement signals the system needs to learn from.
Root Cause: The system relies entirely on collaborative filtering, with no content-based fallback.
Detection Rule: If new-user retention is <50% of established-user retention, you have a cold start problem.
Fix: Add content-based features (item metadata, user demographics), an exploration bonus for new users, and an onboarding questionnaire to bootstrap preferences.
Symptoms: Model performs well offline but poorly in production. Offline metrics improve but online metrics stagnate.
Root Cause: Training data differs from serving data (features computed differently, timing gaps, selection bias).
Detection Rule: If the offline metric improves >5% but the online metric shows <1% gain, suspect training-serving skew.
Fix: Log exact serving features for training, implement a feature validation pipeline, and use identical preprocessing code paths.
Symptoms: The model optimizes for clicks but the business wants purchases. It achieves high CTR but low conversion.
Root Cause: The proxy metric (clicks) diverges from the business metric (revenue) because of clickbait or low-intent traffic.
Detection Rule: If the primary metric improves but the business KPI degrades or stagnates, your metrics are misaligned.
Fix: Use multi-objective optimization with the business metric as a constraint, or optimize the business metric directly despite its noise.
Symptoms: Model performance is good, but p99 latency gradually increases over time and eventually violates the SLA.
Root Cause: Feature computation becomes more expensive, model complexity grows, and cache hit rates degrade.
Detection Rule: If p99 latency increases >20% over 3 months without architecture changes, you have latency creep.
Fix: Implement latency budgeting, feature computation timeouts, model complexity governance, and cache performance monitoring.
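The latency-creep detection rule is simple enough to automate. Below is a minimal sketch, assuming latency samples are available as plain lists of milliseconds; the nearest-rank p99 definition and the function names `p99` and `latency_creep` are assumptions made for illustration.

```python
from typing import Sequence


def p99(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile (one of several common definitions)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def latency_creep(baseline_p99_ms: float, current_p99_ms: float,
                  threshold: float = 0.20) -> bool:
    """True when p99 grew by more than `threshold` relative to the baseline."""
    if baseline_p99_ms <= 0:
        raise ValueError("baseline p99 must be positive")
    growth = (current_p99_ms - baseline_p99_ms) / baseline_p99_ms
    return growth > threshold
```

For example, a move from 80ms to 100ms is 25% growth and trips the rule, while 80ms to 95ms (about 19%) does not.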
Symptoms: Model performance silently degrades over weeks or months. No alerts fire until business impact is severe.
Root Cause: The input data distribution shifts, but the model keeps making confident predictions on out-of-distribution examples.
Detection Rule: If model confidence remains high (>0.8) but accuracy drops >10%, suspect distribution shift.
Fix: Population Stability Index monitoring, feature distribution alerts, model uncertainty calibration, and automatic retraining triggers.
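Population Stability Index compares the binned distribution of a feature at training time against its distribution in production. The sketch below is one common formulation, assuming equal-width bins derived from the training sample and a small floor `eps` to avoid division by zero; the function name, bin count, and floor are illustrative choices, not something the fix above prescribes.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the training range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift; those cutoffs are conventions and should be tuned per feature.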
Context: Design real-time fraud detection for payment processing. 10M transactions/day, must decide approve/deny in <100ms, false positive cost $50/transaction, fraud loss $500/transaction on average.
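The two cost figures in the context already imply an operating point. As a worked sketch, assuming the model outputs a calibrated fraud probability p and that these two costs are the only ones that matter: approving costs p × $500 in expectation, denying costs (1 − p) × $50, so denying is cheaper once p exceeds 50 / (50 + 500) ≈ 0.09. The function names here are hypothetical.

```python
def deny_threshold(false_positive_cost: float, fraud_loss: float) -> float:
    """Probability above which denying has lower expected cost than approving.

    Deny when p * fraud_loss > (1 - p) * false_positive_cost.
    """
    return false_positive_cost / (false_positive_cost + fraud_loss)


def decide_by_cost(p_fraud: float, false_positive_cost: float = 50.0,
                   fraud_loss: float = 500.0) -> str:
    """Assumes p_fraud is calibrated; an uncalibrated score needs its own threshold."""
    threshold = deny_threshold(false_positive_cost, fraud_loss)
    return "deny" if p_fraud > threshold else "approve"
```

The threshold sits far below 0.5 because a missed fraud costs ten times as much as a false positive, which is worth stating explicitly when discussing precision/recall trade-offs.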
Requirements Analysis:
Serving Architecture Decision:
Option 1: Online ML model only
├─ Latency: 50-80ms (acceptable)
├─ Accuracy: High (full feature set)
└─ Risk: Model failure = no fraud protection
Option 2: Rules + ML model
├─ Latency: 30ms rules + 50ms ML = 80ms
├─ Accuracy: Rules catch obvious fraud, ML handles edge cases
└─ Risk: Rule maintenance overhead
DECISION: Option 2 - Rules as first line, ML for nuanced cases
REASONING: 80ms meets SLA, rules provide failure safety, ML adds precision
Feature Strategy:
Model Selection:
Final Architecture: Rules Engine (30ms) → If uncertain → XGBoost Model (50ms) → Decision
Total worst-case latency: 80ms (within 100ms SLA)
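The rules-then-model cascade can be sketched as a single function. This is an illustration under stated assumptions, not the production design: the rule thresholds in `example_rules`, the 0.5 score threshold, and the `fallback` verdict used when the model call fails are all hypothetical, and the model is passed in as a plain callable rather than a real XGBoost object.

```python
from typing import Callable, Dict, Optional

Txn = Dict[str, float]


def example_rules(txn: Txn) -> Optional[str]:
    """Hypothetical rules: return a verdict for obvious cases, None when uncertain."""
    if txn["amount"] > 10_000:
        return "deny"
    if txn["amount"] < 10:
        return "approve"
    return None


def cascade_decision(txn: Txn,
                     rules: Callable[[Txn], Optional[str]],
                     model: Callable[[Txn], float],
                     threshold: float = 0.5,
                     fallback: str = "deny") -> str:
    """Rules first; the model is only consulted when the rules are uncertain."""
    verdict = rules(txn)
    if verdict is not None:
        return verdict
    try:
        score = model(txn)
    except Exception:
        # Model outage: rules have already run, so only uncertain traffic
        # reaches this branch. The fallback verdict is a policy assumption.
        return fallback
    return "deny" if score > threshold else "approve"
```

Because the model runs only on traffic the rules cannot settle, the 80ms figure applies to that subset; transactions resolved by rules return after the first stage.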
Context: Build RAG system for company knowledge base. 50k documents, 1000 employees, must handle complex multi-hop questions about policies, procedures, and project history.
Requirements Analysis:
Retrieval Strategy Decision:
Question: How to find relevant documents for complex queries?
Option 1: Keyword search only (BM25)
├─ Pros: Fast, interpretable, good for exact matches
├─ Cons: Misses semantic similarity, poor on multi-hop
└─ Verdict: Insufficient for complex queries
Option 2: Embedding search only
├─ Pros: Semantic understanding, handles paraphrasing
├─ Cons: May miss exact keyword matches, harder to debug
└─ Verdict: Good but incomplete
Option 3: Hybrid search (BM25 + embeddings)
├─ Pros: Gets both exact matches AND semantic similarity
├─ Cons: More complex, need score fusion strategy
└─ CHOSEN: Best of both worlds
Fusion Strategy: RRF (Reciprocal Rank Fusion)
Score = 1/(rank_bm25 + k) + 1/(rank_embedding + k) where k=60
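The fusion formula above translates directly into code. A minimal sketch, assuming each retriever returns an ordered list of document IDs with ranks starting at 1, and that a document missing from one list simply receives no contribution from it; the function name `rrf` and the sample IDs are illustrative.

```python
from typing import Dict, List, Sequence, Tuple


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (rank + k)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_results = ["a", "b", "c"]        # hypothetical BM25 ranking
embedding_results = ["b", "c", "d"]   # hypothetical embedding ranking
fused = rrf([bm25_results, embedding_results])
```

Because RRF uses ranks rather than raw scores, it needs no normalization between BM25 scores and cosine similarities, which is the main reason it is a convenient default for hybrid search.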
Chunking Strategy:
Reranking Decision:
After retrieval, should we rerank results?
Raw retrieval quality assessment:
├─ Top-5 contains answer: 75% of time
├─ Top-1 contains answer: 40% of time
└─ Conclusion: Retrieval recall good, precision needs work
Reranking options:
├─ Cross-encoder (BERT-based): +15% top-1 precision, +200ms latency
├─ LLM-based: +25% top-1 precision, +800ms latency
└─ CHOSEN: Cross-encoder (better precision/latency trade-off)
Final pipeline: Hybrid retrieval → Cross-encoder reranking → LLM generation
Context: Design homepage recommendations for social media app. 100M monthly users, infinite scroll feed, optimize for time spent and user satisfaction.
Cold Start Strategy:
Multi-Stage Architecture Decision:
Scale constraint: Cannot run complex model on all user-item pairs
Stage 1: Candidate Retrieval
├─ Input: User profile
├─ Output: 10k candidate items (from 100M+ total)
├─ Methods: Collaborative filtering, content similarity, trending
├─ Latency budget: 50ms
Stage 2: Ranking
├─ Input: 10k candidates
├─ Output: Ranked list of 100 items
├─ Model: Deep neural network with rich features
├─ Latency budget: 100ms
Stage 3: Diversification
├─ Input: Ranked 100 items
├─ Output: Final 20 items for feed
├─ Logic: MMR algorithm (maximal marginal relevance)
├─ Latency budget: 20ms
Total: 170ms (within 200ms SLA)
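Stage 3 names MMR for diversification. The sketch below shows one straightforward greedy form of it, assuming each candidate has a relevance score from Stage 2 and an embedding vector, with cosine similarity as the redundancy measure; the trade-off weight `lam`, the function names, and the data layout are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(items: Sequence[str], relevance: Dict[str, float],
        embeddings: Dict[str, Sequence[float]], k: int, lam: float = 0.7) -> List[str]:
    """Greedy maximal marginal relevance.

    Each step picks the item maximizing
    lam * relevance - (1 - lam) * max similarity to anything already selected.
    """
    selected: List[str] = []
    remaining = list(items)
    while remaining and len(selected) < k:
        def score(item: str) -> float:
            redundancy = max(
                (cosine(embeddings[item], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[item] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this reduces to a plain relevance sort; lowering `lam` pushes near-duplicate items further down the feed. The greedy loop is O(k × n) similarity evaluations per step over 100 candidates, which is small enough to fit the 20ms budget in most implementations, though that should be measured rather than assumed.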
This skill should NOT be used for coding interviews (use senior-coding-interview instead), behavioral questions, ML theory quizzes, or paper implementations.
Where to delegate:
- senior-coding-interview
- interview-loop-strategist
- anthropic-technical-deep-dive
- interview-simulator