SKILL: Knowledge Distillation and Intelligent Compression

Version: 1.0
Domain: AI/ML System Design, Agent Architecture, Knowledge Transfer
Cognitive Load: Medium (requires understanding of model training, system tradeoffs)

DECISION POINTS

Primary Decision Tree: Choosing Distillation Strategy

Given: [Task Type, Model Size Constraint, Accuracy Target]

IF task_similarity(student, teacher) > 0.8:
    ├── Use Response-Based Distillation
    │   ├── Temperature T = 3-6 for classification
    │   ├── Temperature T = 1-3 for regression
    │   └── Focus on soft label matching
    │
ELIF abstraction_level_needed == "high":
    ├── Use Feature-Based Distillation
    │   ├── Match intermediate layer activations
    │   ├── Use attention transfer if available
    │   └── Preserve semantic representations
    │
ELIF compression_ratio > 10x:
    ├── Use Cascade Architecture
    │   ├── Small model handles confidence > 0.9 cases
    │   ├── Large model backup for confidence < 0.9
    │   └── Route based on input complexity score
    │
ELSE:
    └── Use Relation-Based Distillation
        ├── Match similarity matrices between samples
        ├── Preserve ranking relationships
        └── Focus on structural knowledge transfer

Secondary Decision: Setting Alpha (Size vs Accuracy Priority)

IF deployment_environment in ["mobile", "edge", "embedded"]:
    ├── SET α = 0.6-0.8 (prioritize size)
    └── Accept accuracy loss for resource constraints

ELIF application_domain in ["safety_critical", "medical", "financial"]:
    ├── SET α = 0.1-0.3 (prioritize accuracy)
    └── Use larger models, cascade for uncertainty

ELIF can_dynamically_route == True:
    ├── SET α = 0.5 for base model
    ├── Build multiple checkpoints at different α values
    └── Route based on input difficulty estimation

ELSE:
    └── SET α = 0.4-0.6 (balanced approach)

Tertiary Decision: Agent Coordination Pattern

IF environment_stability == "static" AND task_distribution_known == True:
    └── Use Hierarchical Distillation (teacher → student)

ELIF multiple_experts_available == True:
    ├── Use Ensemble Distillation
    ├── Weight teachers by domain expertise
    └── Student learns from weighted combination

ELIF need_continuous_adaptation == True:
    ├── Use Online Distillation
    ├── Agents co-evolve simultaneously
    └── Bidirectional knowledge flow

ELSE:
    └── Use Self-Distillation with iterative refinement

FAILURE MODES

1. Accuracy Mirage

Symptoms: High average accuracy (>90%) but catastrophic failures on edge cases Diagnosis: Standard metrics hide stratified performance degradation Detection Rule: If per-class accuracy variance > 20% or minority class accuracy < 70% of overall accuracy Fix: Implement stratified validation with per-class thresholds; use weighted loss functions for rare classes

2. Temperature Blindness

Symptoms: Student model overconfident, poor calibration, loses "dark knowledge" Diagnosis: Training with temperature=1.0, ignoring probability distribution structure Detection Rule: If student confidence > teacher confidence on incorrect predictions Fix: Use temperature T=3-20 during distillation training; add calibration validation step

3. Capacity Cliff Crash

Symptoms: Model works fine until slight compression increase causes dramatic accuracy drop Diagnosis: Hit minimum capacity threshold for task complexity Detection Rule: If >5% accuracy drop from <10% parameter reduction Fix: Set hard minimum model size; use pruning instead of architecture changes; implement cascade routing

4. Hierarchical Rigidity

Symptoms: System can't adapt to new patterns; student errors persist despite available corrections Diagnosis: Fixed teacher-student roles prevent bidirectional learning Detection Rule: If student discovers edge cases but can't update teacher knowledge Fix: Implement feedback loops; use ensemble loss functions; enable peer learning between agents

5. Compression Amplification Bias

Symptoms: Compressed model maintains average performance but amplifies existing biases Diagnosis: Distillation preserves teacher's biases while losing error correction capacity Detection Rule: If demographic parity decreases >10% or fairness metrics degrade disproportionately Fix: Use bias-aware distillation loss; oversample minority classes; validate on adversarial fairness benchmarks

WORKED EXAMPLES

Example 1: Mobile Deployment Scenario

Context: Deploying sentiment analysis to mobile app, 50MB model limit, 200ms latency requirement Teacher: 800MB BERT model, 95% accuracy, 1.2s inference Constraints: α = 0.7 (heavily prioritize size), maintain >90% accuracy

Decision Process:

Task similarity high (same domain) → Response-based distillation
Compression ratio 16x → Expect significant accuracy loss, need mitigation
Temperature selection: T=4 for sentiment (discrete classes with similarity structure)

Implementation:

Train 50MB DistilBERT student on soft labels from BERT teacher
Temperature T=4 during training, T=1 during inference
Achieve 91.2% accuracy (3.8% drop) with 16x compression
Add uncertainty threshold: defer to cloud API when confidence <0.85

Outcome: DS = 0.7×(50/800) + 0.3×(1-91.2/95) = 0.7×0.0625 + 0.3×0.04 = 0.056 (excellent score) Trade-off: 3.8% accuracy loss for 16x size reduction and 10x speed improvement

Example 2: Safety-Critical Multi-Agent System

Context: Autonomous vehicle perception, multiple specialized agents for object detection, depth estimation, trajectory planning Constraints: α = 0.2 (heavily prioritize accuracy), 99.9% reliability requirement

Decision Process:

Different tasks per agent → Feature-based distillation for shared representations
Safety critical → Cascade architecture with redundancy
Multi-expert system → Ensemble distillation with specialist weighting

Implementation:

Large teacher models for each perception task
Medium student agents learn shared feature representations
Tiny monitoring agent validates cross-agent consistency
Cascade: students handle confidence >0.95, teachers handle edge cases

Agent Architecture:

Object detection student: 100MB (from 500MB teacher)
Depth estimation student: 80MB (from 400MB teacher)
Trajectory planning: Keep full teacher (no compression for final decisions)
Monitor agent: 10MB, watches for inconsistencies

Outcome: System maintains 99.92% safety threshold while reducing compute by 60% Trade-off: Modest efficiency gain for maintained safety with reduced single points of failure

Example 3: Collaborative Research Assistant Agents

Context: Multi-agent system for scientific literature analysis, agents specialize in different domains but share knowledge Constraints: Dynamic environment, new papers daily, agents must learn from each other

Decision Process:

Dynamic environment → Online distillation with peer learning
Different specializations → Relation-based distillation for structural knowledge
Continuous adaptation needed → Bidirectional knowledge flow

Implementation:

5 specialist agents (biology, chemistry, physics, computer science, medicine)
Each agent maintains domain expertise but learns general patterns from peers
Weekly ensemble sessions where agents teach each other via attention transfer
Self-distillation within each agent to compress learned knowledge

Mechanism:

Agent A discovers new pattern in biology papers
Shares attention weights and feature representations with other agents
Other agents evaluate if pattern applies to their domains
Successful transfers update shared knowledge base

Outcome: Collective accuracy improves 8% over 6 months vs. isolated training Trade-off: Increased coordination complexity for better adaptation and knowledge sharing

QUALITY GATES

[ ] Distillation Score (DS) calculated with explicit α parameter and documented rationale
[ ] Stratified validation completed with per-class accuracy thresholds set and met
[ ] Temperature parameter tuned (T>1 during training) and calibration validated
[ ] Capacity cliff analysis performed - confirmed model size is above minimum threshold
[ ] Failure mode monitoring implemented for bias amplification and edge case degradation
[ ] Adversarial test suite created covering out-of-distribution and minority class scenarios
[ ] If compression ratio >5x, cascade architecture evaluated and routing strategy defined
[ ] If multi-agent system, coordination pattern chosen and feedback loops implemented
[ ] Deployment readiness confirmed with latency, memory, and accuracy benchmarks met
[ ] Rollback plan prepared with performance monitoring alerts and degradation thresholds

NOT-FOR BOUNDARIES

Do NOT use this skill for:

Simple rule-based systems with <1000 parameters (use direct optimization instead)
Single-task agents with abundant computational resources (compression unnecessary)
Systems where interpretability is primary requirement (use interpretable-models skill instead)
Real-time systems with <10ms latency requirements (use hardware-acceleration skill instead)
Prototype/research phases before performance requirements defined (premature optimization)

Delegate to other skills:

For model architecture selection → use neural-architecture-search skill
For hardware optimization → use model-deployment-optimization skill
For interpretable AI requirements → use explainable-ai-design skill
For federated learning scenarios → use distributed-learning-coordination skill
For adversarial robustness → use adversarial-defense-strategies skill

SKILL: Knowledge Distillation and Intelligent Compression

Version: 1.0
Domain: AI/ML System Design, Agent Architecture, Knowledge Transfer
Cognitive Load: Medium (requires understanding of model training, system tradeoffs)

DECISION POINTS

Primary Decision Tree: Choosing Distillation Strategy

Given: [Task Type, Model Size Constraint, Accuracy Target]

IF task_similarity(student, teacher) > 0.8:
    ├── Use Response-Based Distillation
    │   ├── Temperature T = 3-6 for classification
    │   ├── Temperature T = 1-3 for regression
    │   └── Focus on soft label matching
    │
ELIF abstraction_level_needed == "high":
    ├── Use Feature-Based Distillation
    │   ├── Match intermediate layer activations
    │   ├── Use attention transfer if available
    │   └── Preserve semantic representations
    │
ELIF compression_ratio > 10x:
    ├── Use Cascade Architecture
    │   ├── Small model handles confidence > 0.9 cases
    │   ├── Large model backup for confidence < 0.9
    │   └── Route based on input complexity score
    │
ELSE:
    └── Use Relation-Based Distillation
        ├── Match similarity matrices between samples
        ├── Preserve ranking relationships
        └── Focus on structural knowledge transfer

Secondary Decision: Setting Alpha (Size vs Accuracy Priority)

IF deployment_environment in ["mobile", "edge", "embedded"]:
    ├── SET α = 0.6-0.8 (prioritize size)
    └── Accept accuracy loss for resource constraints

ELIF application_domain in ["safety_critical", "medical", "financial"]:
    ├── SET α = 0.1-0.3 (prioritize accuracy)
    └── Use larger models, cascade for uncertainty

ELIF can_dynamically_route == True:
    ├── SET α = 0.5 for base model
    ├── Build multiple checkpoints at different α values
    └── Route based on input difficulty estimation

ELSE:
    └── SET α = 0.4-0.6 (balanced approach)

Tertiary Decision: Agent Coordination Pattern

IF environment_stability == "static" AND task_distribution_known == True:
    └── Use Hierarchical Distillation (teacher → student)

ELIF multiple_experts_available == True:
    ├── Use Ensemble Distillation
    ├── Weight teachers by domain expertise
    └── Student learns from weighted combination

ELIF need_continuous_adaptation == True:
    ├── Use Online Distillation
    ├── Agents co-evolve simultaneously
    └── Bidirectional knowledge flow

ELSE:
    └── Use Self-Distillation with iterative refinement

FAILURE MODES

1. Accuracy Mirage

2. Temperature Blindness

3. Capacity Cliff Crash

4. Hierarchical Rigidity

5. Compression Amplification Bias

WORKED EXAMPLES

Example 1: Mobile Deployment Scenario

Decision Process:

Task similarity high (same domain) → Response-based distillation
Compression ratio 16x → Expect significant accuracy loss, need mitigation
Temperature selection: T=4 for sentiment (discrete classes with similarity structure)

Implementation:

Train 50MB DistilBERT student on soft labels from BERT teacher
Temperature T=4 during training, T=1 during inference
Achieve 91.2% accuracy (3.8% drop) with 16x compression
Add uncertainty threshold: defer to cloud API when confidence <0.85

Outcome: DS = 0.7×(50/800) + 0.3×(1-91.2/95) = 0.7×0.0625 + 0.3×0.04 = 0.056 (excellent score) Trade-off: 3.8% accuracy loss for 16x size reduction and 10x speed improvement

Example 2: Safety-Critical Multi-Agent System

Decision Process:

Different tasks per agent → Feature-based distillation for shared representations
Safety critical → Cascade architecture with redundancy
Multi-expert system → Ensemble distillation with specialist weighting

Implementation:

Large teacher models for each perception task
Medium student agents learn shared feature representations
Tiny monitoring agent validates cross-agent consistency
Cascade: students handle confidence >0.95, teachers handle edge cases

Agent Architecture:

Object detection student: 100MB (from 500MB teacher)
Depth estimation student: 80MB (from 400MB teacher)
Trajectory planning: Keep full teacher (no compression for final decisions)
Monitor agent: 10MB, watches for inconsistencies

Outcome: System maintains 99.92% safety threshold while reducing compute by 60% Trade-off: Modest efficiency gain for maintained safety with reduced single points of failure

Example 3: Collaborative Research Assistant Agents

Decision Process:

Dynamic environment → Online distillation with peer learning
Different specializations → Relation-based distillation for structural knowledge
Continuous adaptation needed → Bidirectional knowledge flow

Implementation:

5 specialist agents (biology, chemistry, physics, computer science, medicine)
Each agent maintains domain expertise but learns general patterns from peers
Weekly ensemble sessions where agents teach each other via attention transfer
Self-distillation within each agent to compress learned knowledge

Mechanism:

Agent A discovers new pattern in biology papers
Shares attention weights and feature representations with other agents
Other agents evaluate if pattern applies to their domains
Successful transfers update shared knowledge base

Outcome: Collective accuracy improves 8% over 6 months vs. isolated training Trade-off: Increased coordination complexity for better adaptation and knowledge sharing

QUALITY GATES

[ ] Distillation Score (DS) calculated with explicit α parameter and documented rationale
[ ] Stratified validation completed with per-class accuracy thresholds set and met
[ ] Temperature parameter tuned (T>1 during training) and calibration validated
[ ] Capacity cliff analysis performed - confirmed model size is above minimum threshold
[ ] Failure mode monitoring implemented for bias amplification and edge case degradation
[ ] Adversarial test suite created covering out-of-distribution and minority class scenarios
[ ] If compression ratio >5x, cascade architecture evaluated and routing strategy defined
[ ] If multi-agent system, coordination pattern chosen and feedback loops implemented
[ ] Deployment readiness confirmed with latency, memory, and accuracy benchmarks met
[ ] Rollback plan prepared with performance monitoring alerts and degradation thresholds

NOT-FOR BOUNDARIES

Do NOT use this skill for:

Simple rule-based systems with <1000 parameters (use direct optimization instead)
Single-task agents with abundant computational resources (compression unnecessary)
Systems where interpretability is primary requirement (use interpretable-models skill instead)
Real-time systems with <10ms latency requirements (use hardware-acceleration skill instead)
Prototype/research phases before performance requirements defined (premature optimization)

Delegate to other skills:

For model architecture selection → use neural-architecture-search skill
For hardware optimization → use model-deployment-optimization skill
For interpretable AI requirements → use explainable-ai-design skill
For federated learning scenarios → use distributed-learning-coordination skill
For adversarial robustness → use adversarial-defense-strategies skill

Adoption

curiositech/knowledge-distillation-deep

$ install --global

Security Scan Results

SKILL.md

SKILL: Knowledge Distillation and Intelligent Compression

DECISION POINTS

Primary Decision Tree: Choosing Distillation Strategy

Secondary Decision: Setting Alpha (Size vs Accuracy Priority)

Tertiary Decision: Agent Coordination Pattern

FAILURE MODES

1. Accuracy Mirage

2. Temperature Blindness

3. Capacity Cliff Crash

4. Hierarchical Rigidity

5. Compression Amplification Bias

WORKED EXAMPLES

Example 1: Mobile Deployment Scenario

Example 2: Safety-Critical Multi-Agent System

Example 3: Collaborative Research Assistant Agents

QUALITY GATES

NOT-FOR BOUNDARIES

Related Skills

curiositech/revisiting-interview-data-analysing-turn

curiositech/redis-patterns-expert

curiositech/react-server-components-boundary

curiositech/rate-limiting-strategy

curiositech/knowledge-distillation-deep

$ install --global

Security Scan Results

SKILL.md

SKILL: Knowledge Distillation and Intelligent Compression

DECISION POINTS

Primary Decision Tree: Choosing Distillation Strategy

Secondary Decision: Setting Alpha (Size vs Accuracy Priority)

Tertiary Decision: Agent Coordination Pattern

FAILURE MODES

1. Accuracy Mirage

2. Temperature Blindness

3. Capacity Cliff Crash

4. Hierarchical Rigidity

5. Compression Amplification Bias

WORKED EXAMPLES

Example 1: Mobile Deployment Scenario

Example 2: Safety-Critical Multi-Agent System

Example 3: Collaborative Research Assistant Agents

QUALITY GATES

NOT-FOR BOUNDARIES

Related Skills

curiositech/revisiting-interview-data-analysing-turn

curiositech/redis-patterns-expert

curiositech/react-server-components-boundary

curiositech/rate-limiting-strategy