skills/knowledge-distillation-deep/SKILL.md
Deep analysis of knowledge distillation techniques for compressing large models into smaller efficient ones
npx skillsauth add curiositech/windags-skills knowledge-distillation-deep
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Version: 1.0
Domain: AI/ML System Design, Agent Architecture, Knowledge Transfer
Cognitive Load: Medium (requires understanding of model training, system tradeoffs)
Given: [Task Type, Model Size Constraint, Accuracy Target]
IF task_similarity(student, teacher) > 0.8:
├── Use Response-Based Distillation
│ ├── Temperature T = 3-6 for classification
│ ├── Temperature T = 1-3 for regression
│ └── Focus on soft label matching
│
ELIF abstraction_level_needed == "high":
├── Use Feature-Based Distillation
│ ├── Match intermediate layer activations
│ ├── Use attention transfer if available
│ └── Preserve semantic representations
│
ELIF compression_ratio > 10x:
├── Use Cascade Architecture
│ ├── Small model handles confidence > 0.9 cases
│ ├── Large model backup for confidence < 0.9
│ └── Route based on input complexity score
│
ELSE:
└── Use Relation-Based Distillation
    ├── Match similarity matrices between samples
    ├── Preserve ranking relationships
    └── Focus on structural knowledge transfer
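The branch order of the tree above can be sketched as a small function. This is a minimal illustration, not a reference implementation: the function names and return strings are hypothetical, and sending a confidence of exactly 0.9 to the large model is an assumption made here.

```python
def choose_distillation_method(task_similarity: float,
                               abstraction_level_needed: str,
                               compression_ratio: float) -> str:
    """Return the first matching branch, in the same order as the tree."""
    if task_similarity > 0.8:
        return "response-based"
    if abstraction_level_needed == "high":
        return "feature-based"
    if compression_ratio > 10:
        return "cascade"
    return "relation-based"


def route_cascade(small_model_confidence: float, threshold: float = 0.9) -> str:
    """Pick which cascade stage answers a given input.

    Assumption: a confidence equal to the threshold falls back to the large model.
    """
    return "small" if small_model_confidence > threshold else "large"


# A dissimilar task that needs 16x compression lands on the cascade branch.
method = choose_distillation_method(0.5, "low", 16)
```

Because the first matching branch wins, a task with similarity above 0.8 gets response-based distillation even when the compression ratio is above 10x.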
IF deployment_environment in ["mobile", "edge", "embedded"]:
├── SET α = 0.6-0.8 (prioritize size)
└── Accept accuracy loss for resource constraints
ELIF application_domain in ["safety_critical", "medical", "financial"]:
├── SET α = 0.1-0.3 (prioritize accuracy)
└── Use larger models, cascade for uncertainty
ELIF can_dynamically_route == True:
├── SET α = 0.5 for base model
├── Build multiple checkpoints at different α values
└── Route based on input difficulty estimation
ELSE:
└── SET α = 0.4-0.6 (balanced approach)
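As a rough sketch, the α rules above map onto a lookup that returns a (low, high) range. The function name, the set names, and the tuple return shape are assumptions for illustration only.

```python
from typing import Tuple

SIZE_FIRST_ENVIRONMENTS = {"mobile", "edge", "embedded"}
ACCURACY_FIRST_DOMAINS = {"safety_critical", "medical", "financial"}


def choose_alpha_range(deployment_environment: str,
                       application_domain: str,
                       can_dynamically_route: bool = False) -> Tuple[float, float]:
    """Return the (low, high) alpha range for the first matching rule."""
    if deployment_environment in SIZE_FIRST_ENVIRONMENTS:
        return (0.6, 0.8)  # prioritize size
    if application_domain in ACCURACY_FIRST_DOMAINS:
        return (0.1, 0.3)  # prioritize accuracy
    if can_dynamically_route:
        return (0.5, 0.5)  # base model; other checkpoints use other alphas
    return (0.4, 0.6)      # balanced approach


mobile_range = choose_alpha_range("mobile", "medical")
```

Note that the rule order matters: a medical application deployed on mobile hardware matches the size-first branch before the accuracy-first one, which may or may not be the intended priority.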
IF environment_stability == "static" AND task_distribution_known == True:
└── Use Hierarchical Distillation (teacher → student)
ELIF multiple_experts_available == True:
├── Use Ensemble Distillation
├── Weight teachers by domain expertise
└── Student learns from weighted combination
ELIF need_continuous_adaptation == True:
├── Use Online Distillation
├── Agents co-evolve simultaneously
└── Bidirectional knowledge flow
ELSE:
└── Use Self-Distillation with iterative refinement
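For the Ensemble Distillation branch, "student learns from weighted combination" can be read as a weighted average of the teachers' output distributions. The sketch below assumes that reading; the function name and the choice to normalize weights are hypothetical, not taken from the skill.

```python
from typing import List, Sequence


def ensemble_soft_targets(teacher_probs: Sequence[Sequence[float]],
                          expertise_weights: Sequence[float]) -> List[float]:
    """Blend per-teacher class probabilities into one soft target."""
    if len(teacher_probs) != len(expertise_weights):
        raise ValueError("need exactly one weight per teacher")
    total = sum(expertise_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in expertise_weights]
    n_classes = len(teacher_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(normalized, teacher_probs))
        for c in range(n_classes)
    ]


# Two teachers; the first is trusted three times as much as the second.
targets = ensemble_soft_targets([[0.8, 0.2], [0.4, 0.6]], [3.0, 1.0])
```

Normalizing the weights keeps the blended target a valid probability distribution as long as each teacher's output is one.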
Symptoms: High average accuracy (>90%) but catastrophic failures on edge cases
Diagnosis: Standard metrics hide stratified performance degradation
Detection Rule: If per-class accuracy variance > 20% or minority class accuracy < 70% of overall accuracy
Fix: Implement stratified validation with per-class thresholds; use weighted loss functions for rare classes
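The detection rule can be checked with a few lines of standard-library Python. Two readings here are assumptions: "per-class accuracy variance > 20%" is taken as the gap between the best and worst class, and the "minority class" is taken as the least frequent label. The function names are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def stratified_degradation(y_true, y_pred, max_spread=0.20, minority_ratio=0.70):
    """True if either half of the detection rule fires."""
    totals = Counter(y_true)
    accuracy = per_class_accuracy(y_true, y_pred)
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    spread = max(accuracy.values()) - min(accuracy.values())
    minority = min(totals, key=totals.get)  # least frequent label
    return spread > max_spread or accuracy[minority] < minority_ratio * overall


# 94% overall accuracy, but the rare class is right only 4 times out of 10.
y_true = ["common"] * 90 + ["rare"] * 10
y_pred = ["common"] * 96 + ["rare"] * 4
flagged = stratified_degradation(y_true, y_pred)
```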
Symptoms: Student model overconfident, poor calibration, loses "dark knowledge"
Diagnosis: Training with temperature=1.0, ignoring probability distribution structure
Detection Rule: If student confidence > teacher confidence on incorrect predictions
Fix: Use temperature T=3-20 during distillation training; add calibration validation step
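To illustrate why temperature matters, the following sketch softens a set of logits and computes a soft-label loss. It assumes the common formulation of KL divergence between softened teacher and student distributions, scaled by T², and it uses plain Python rather than any particular training framework. The function names and example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T; larger T moves mass onto non-argmax classes."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T**2.

    Assumes moderate logits so no student probability underflows to zero.
    """
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher_logits = [6.0, 2.0, 1.0]
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)
```

At T=1 nearly all probability sits on the top class, so the student sees little of the teacher's ranking among the remaining classes; at T=4 that ranking becomes visible in the targets.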
Symptoms: Model works fine until slight compression increase causes dramatic accuracy drop
Diagnosis: Hit minimum capacity threshold for task complexity
Detection Rule: If >5% accuracy drop from <10% parameter reduction
Fix: Set hard minimum model size; use pruning instead of architecture changes; implement cascade routing
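One way to apply the detection rule is to scan a compression sweep for adjacent checkpoints where a small parameter cut causes a large accuracy drop. The sketch assumes accuracy is a fraction in [0, 1] and that the 5% drop is absolute rather than relative. The function name and the sample sweep values are made up for illustration.

```python
def find_capacity_cliffs(checkpoints, max_param_cut=0.10, min_accuracy_drop=0.05):
    """Return (larger, smaller) parameter-count pairs that trip the rule.

    checkpoints: iterable of (parameter_count, accuracy) tuples.
    """
    ordered = sorted(checkpoints, key=lambda c: c[0], reverse=True)
    cliffs = []
    for (big_params, big_acc), (small_params, small_acc) in zip(ordered, ordered[1:]):
        param_cut = (big_params - small_params) / big_params
        accuracy_drop = big_acc - small_acc
        if param_cut < max_param_cut and accuracy_drop > min_accuracy_drop:
            cliffs.append((big_params, small_params))
    return cliffs


# Hypothetical sweep: parameter counts in millions, accuracy as a fraction.
sweep = [(100, 0.92), (95, 0.91), (90, 0.83)]
cliffs = find_capacity_cliffs(sweep)
```

Here the step from 95M to 90M parameters removes about 5% of the parameters but costs 8 points of accuracy, so it is reported as a cliff.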
Symptoms: System can't adapt to new patterns; student errors persist despite available corrections
Diagnosis: Fixed teacher-student roles prevent bidirectional learning
Detection Rule: If student discovers edge cases but can't update teacher knowledge
Fix: Implement feedback loops; use ensemble loss functions; enable peer learning between agents
Symptoms: Compressed model maintains average performance but amplifies existing biases
Diagnosis: Distillation preserves teacher's biases while losing error correction capacity
Detection Rule: If demographic parity decreases >10% or fairness metrics degrade disproportionately
Fix: Use bias-aware distillation loss; oversample minority classes; validate on adversarial fairness benchmarks
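A simple check for the first half of the detection rule compares the teacher's and student's demographic parity gaps on the same inputs. This sketch assumes binary predictions (1 = positive outcome) and reads "decreases >10%" as the student's gap exceeding the teacher's by more than 0.10 in absolute terms. Both function names and that interpretation are assumptions.

```python
from collections import defaultdict


def positive_rate_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    positives = defaultdict(int)
    counts = defaultdict(int)
    for pred, group in zip(predictions, groups):
        counts[group] += 1
        positives[group] += pred
    rates = [positives[g] / counts[g] for g in counts]
    return max(rates) - min(rates)


def parity_degraded(teacher_preds, student_preds, groups, tolerance=0.10):
    """True if the student's parity gap is worse than the teacher's by > tolerance."""
    teacher_gap = positive_rate_gap(teacher_preds, groups)
    student_gap = positive_rate_gap(student_preds, groups)
    return student_gap - teacher_gap > tolerance


groups = ["a"] * 4 + ["b"] * 4
teacher_preds = [1, 1, 0, 0, 1, 1, 0, 0]  # equal positive rates across groups
student_preds = [1, 1, 1, 0, 1, 0, 0, 0]  # 0.75 for group a, 0.25 for group b
degraded = parity_degraded(teacher_preds, student_preds, groups)
```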
Context: Deploying sentiment analysis to a mobile app, 50MB model limit, 200ms latency requirement
Teacher: 800MB BERT model, 95% accuracy, 1.2s inference
Constraints: α = 0.7 (heavily prioritize size), maintain >90% accuracy
Decision Process:
Implementation:
Outcome: DS = 0.7×(50/800) + 0.3×(1-91.2/95) = 0.7×0.0625 + 0.3×0.04 = 0.056 (excellent score)
Trade-off: 3.8 percentage-point accuracy loss for a 16x size reduction and 10x speed improvement
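The arithmetic in the Outcome line can be reproduced as follows. The formula is inferred from the worked numbers (α weights the size ratio, 1-α weights the relative accuracy loss, and lower is better); the function name and parameter names are hypothetical.

```python
def distillation_score(alpha, student_size, teacher_size,
                       student_accuracy, teacher_accuracy):
    """DS = alpha * size_ratio + (1 - alpha) * relative_accuracy_loss."""
    size_ratio = student_size / teacher_size
    relative_accuracy_loss = 1 - student_accuracy / teacher_accuracy
    return alpha * size_ratio + (1 - alpha) * relative_accuracy_loss


# Values from the mobile sentiment example: 50MB vs 800MB, 91.2% vs 95%.
score = distillation_score(0.7, 50, 800, 91.2, 95)
print(round(score, 3))  # 0.056
```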
Context: Autonomous vehicle perception, multiple specialized agents for object detection, depth estimation, trajectory planning
Constraints: α = 0.2 (heavily prioritize accuracy), 99.9% reliability requirement
Decision Process:
Implementation:
Agent Architecture:
Outcome: System maintains 99.92% safety threshold while reducing compute by 60%
Trade-off: Modest efficiency gain for maintained safety with reduced single points of failure
Context: Multi-agent system for scientific literature analysis, agents specialize in different domains but share knowledge
Constraints: Dynamic environment, new papers daily, agents must learn from each other
Decision Process:
Implementation:
Mechanism:
Outcome: Collective accuracy improves 8% over 6 months vs. isolated training
Trade-off: Increased coordination complexity for better adaptation and knowledge sharing
Do NOT use this skill for:
Delegate to other skills:
neural-architecture-search skill
model-deployment-optimization skill
explainable-ai-design skill
distributed-learning-coordination skill
adversarial-defense-strategies skill