
curiositech/knowledge-distillation-survey

Name: knowledge-distillation-survey
Author: curiositech

skills/knowledge-distillation-survey/SKILL.md

npx skillsauth add curiositech/windags-skills knowledge-distillation-survey

Knowledge Distillation: A Survey

Core Concept

Knowledge distillation transfers learning from teacher systems to students across capability gaps. The key insight: effective transfer requires matching knowledge type, teacher capacity, and transfer mechanism to the student's representational constraints.

DECISION POINTS

Primary Decision Tree: Choosing Transfer Approach

1. ASSESS CAPACITY GAP
   ├── Small gap (teacher 1-2x student params) → Direct offline distillation
   ├── Large gap (teacher 10x+ student params) → Progressive with teacher assistants  
   └── Similar capacity/peer learning → Online collaborative distillation

2. IDENTIFY KNOWLEDGE TYPE NEEDED
   ├── Task needs final decisions → Response-based (soft labels, logits)
   ├── Task needs pattern recognition → Feature-based (intermediate representations)
   ├── Task needs structural reasoning → Relation-based (similarity matrices, attention maps)
   └── Task needs multiple types → Multi-level distillation

3. SELECT TRANSFER MECHANISM
   ├── Proven expert + stable domain → Offline (sequential training)
   ├── Exploration needed + multiple learners → Online (mutual learning)
   ├── No external teacher available → Self-distillation (temporal/spatial)
   └── Cross-modal/domain required → Alignment-based transfer

4. HANDLE REPRESENTATION GAPS
   ├── Same modality → Direct feature matching
   ├── Different modalities → Paired sample alignment
   ├── Different architectures → Attention transfer or feature adaptation
   └── Different tasks → Structural knowledge extraction
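To make the response-based branch of the tree concrete, here is a minimal sketch of a soft-label loss: the KL divergence between temperature-softened teacher and student outputs. The function names, the default temperature of 4.0, and the T² scaling (a common convention, not something this skill specifies) are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def response_distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs.

    Assumption: the loss is scaled by temperature**2, a common convention
    that keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = response_distillation_loss(teacher, student)
```

In practice this term is usually combined with an ordinary hard-label loss; the weighting between the two is a tuning choice this sketch leaves out.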

Capacity Gap Diagnostic

| Teacher-Student Performance Ratio | Approach | Rationale |
|-----------------------------------|----------|-----------|
| 1.1-1.5x | Direct distillation | Student can decode teacher knowledge |
| 1.5-3x | Add 1 teacher assistant | Bridge representational gap |
| 3x+ | Multi-step progressive | Prevent knowledge loss in translation |
| Similar performance | Collaborative learning | Mutual improvement through diversity |
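The diagnostic table can be read as a simple lookup. The sketch below is one possible encoding, under several assumptions not stated in the table: "performance" is a higher-is-better score such as accuracy, a ratio below 1.1 counts as "similar performance", and each range includes its lower bound but not its upper bound. The function name is hypothetical.

```python
def choose_approach(teacher_score, student_score):
    """Map a teacher/student performance ratio to the approach in the table."""
    if student_score <= 0:
        raise ValueError("student_score must be positive")
    ratio = teacher_score / student_score
    # Assumption: anything under the table's lowest bound (1.1x) is "similar".
    if ratio < 1.1:
        return "Collaborative learning"
    # Assumption: ranges are half-open, so exactly 1.5x falls in the next row.
    if ratio < 1.5:
        return "Direct distillation"
    if ratio < 3.0:
        return "Add 1 teacher assistant"
    return "Multi-step progressive"

# Hypothetical accuracies: teacher 0.90, standalone student 0.40 (ratio 2.25).
print(choose_approach(0.90, 0.40))
```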

FAILURE MODES

1. Capacity Mismatch Collapse

Symptom: Student performs worse with distillation than training alone
Detection: Student accuracy drops >2% vs independent baseline
Diagnosis: Teacher-student gap too large; student cannot represent teacher's compressed knowledge
Fix: Insert teacher assistant with intermediate capacity, or reduce teacher complexity
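The detection rule above amounts to one comparison. A minimal sketch, assuming accuracies are fractions in [0, 1] and that ">2%" means two percentage points (the source does not say whether the drop is absolute or relative); the function name is hypothetical.

```python
def capacity_mismatch(distilled_acc, baseline_acc, max_drop=0.02):
    """Return True when the distilled student trails the independent baseline by more than max_drop."""
    # Assumption: max_drop is an absolute difference in accuracy (percentage points).
    return (baseline_acc - distilled_acc) > max_drop

# Hypothetical numbers: baseline 0.80, distilled student 0.77.
flagged = capacity_mismatch(distilled_acc=0.77, baseline_acc=0.80)
```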

2. Knowledge Type Confusion

Symptom: Good distillation loss but poor task performance
Detection: Temperature scaling doesn't improve results; student mimics the teacher's confidences but makes wrong decisions
Diagnosis: Transferring wrong knowledge dimension (e.g., response-based when student needs feature-based)
Fix: Switch to feature distillation or relation-based transfer matching student's architectural constraints

3. Teacher Overconfidence Amplification

Symptom: Student more confident than teacher but less accurate
Detection: Student entropy lower than teacher entropy on validation set
Diagnosis: Teacher's wrong predictions transferred with high confidence
Fix: Increase temperature, add label smoothing, or use multiple diverse teachers
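The entropy comparison used for detection could be sketched as follows. It assumes each model's validation outputs are available as lists of probability vectors and that the comparison uses mean Shannon entropy over the set; the helper names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(batch):
    """Average entropy over a list of probability vectors."""
    return sum(entropy(p) for p in batch) / len(batch)

def overconfidence_flag(teacher_probs, student_probs):
    """True when the student is, on average, more confident than the teacher."""
    # Assumption: both lists cover the same validation examples.
    return mean_entropy(student_probs) < mean_entropy(teacher_probs)

# Hypothetical outputs on two validation examples.
teacher_probs = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
student_probs = [[0.9, 0.05, 0.05], [0.85, 0.1, 0.05]]
flag = overconfidence_flag(teacher_probs, student_probs)
```

A raised flag is only the first half of the symptom; it should be read together with the student's accuracy.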

4. Cross-Modal Alignment Failure

Symptom: Same-modality distillation works, cross-modal distillation fails completely
Detection: Cross-modal student performs at random baseline despite good teacher
Diagnosis: Modality representations not properly aligned; no semantic correspondence
Fix: Add alignment losses, use paired training data, or learn modality translation first

5. Progressive Learning Degradation

Symptom: Each step in teacher assistant chain performs worse than previous
Detection: Teacher assistant accuracy < 90% of previous teacher accuracy
Diagnosis: Knowledge corruption cascading through chain; information bottlenecks
Fix: Skip connections to original teacher, ensemble multiple assistants, or reduce chain length
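The 90% retention rule can be checked link by link along the chain. A minimal sketch, assuming accuracies are listed in chain order starting with the original teacher; the function name is hypothetical.

```python
def degraded_links(chain_accuracies, min_retention=0.90):
    """Return the indices of chain members that keep less than min_retention of their direct teacher's accuracy."""
    flagged = []
    for i in range(1, len(chain_accuracies)):
        previous, current = chain_accuracies[i - 1], chain_accuracies[i]
        if current < min_retention * previous:
            flagged.append(i)
    return flagged

# Hypothetical chain: teacher, teacher assistant, student.
print(degraded_links([0.92, 0.88, 0.75]))  # the last hop keeps under 90% of 0.88
```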

WORKED EXAMPLES

Example 1: Large Language Model Compression

Scenario: Compress GPT-3.5 (assumed 175B params, the published size of GPT-3; GPT-3.5's size has not been publicly disclosed) to run on mobile (500M params max)

Decision Process:

Capacity gap assessment: 350x parameter difference = massive gap
Knowledge type: Need response-based (text generation) + relation-based (reasoning chains)
Progressive approach: GPT-3.5 → GPT-2 (1.5B) → GPT-small (500M)
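The gap arithmetic for this chain is worth spelling out per hop. The sketch below uses the parameter counts quoted in the scenario; the function name is hypothetical.

```python
def hop_ratios(param_counts):
    """Parameter ratio between each model in the chain and the next one."""
    return [big / small for big, small in zip(param_counts, param_counts[1:])]

# Parameter counts from the scenario: 175B -> 1.5B -> 500M.
chain = [175e9, 1.5e9, 500e6]
overall = chain[0] / chain[-1]
per_hop = hop_ratios(chain)
print(overall)   # 350.0
print(per_hop)   # roughly [116.7, 3.0]
```

Note that the first hop is still about 117x, well above the "10x+" threshold in the decision tree, so by the tree's own rule further intermediate assistants may be worth considering.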

Execution:

Stage 1: Distill GPT-3.5 → GPT-2 using response-based (next token prediction) + attention transfer
Stage 2: Distill GPT-2 → GPT-small using feature-based (hidden states) + relation-based (attention patterns)
Validation: Test reasoning on held-out problems, not just perplexity

Expert catches: Temperature needs adjustment per stage; mobile model needs different attention patterns optimized for inference speed

Example 2: Cross-Modal Robotics Transfer

Scenario: Robot trained on RGB cameras must work with depth sensors only

Decision Process:

Cross-modal gap: RGB (3 channels) → Depth (1 channel) with different semantic info
Knowledge type: Feature-based (spatial understanding) + relation-based (object relationships)
Alignment strategy: Paired RGB-depth data during distillation

Execution:

Collect paired RGB-depth sequences of same scenes
Train alignment network to map depth features → RGB feature space
Distill RGB teacher → depth student using aligned feature representations
Validate on manipulation tasks requiring spatial reasoning
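The alignment step can be illustrated with a deliberately small sketch: a linear map projects depth features into the RGB feature space, and the mean squared error against the paired RGB features is the quantity to minimize. The linear map, the feature sizes, and the function names are all assumptions; a real alignment network would be learned and likely nonlinear.

```python
def project(features, weights):
    """Apply a linear map; weights holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def alignment_loss(depth_features, rgb_features, weights):
    """Mean squared error between projected depth features and paired RGB features."""
    projected = project(depth_features, weights)
    return sum((p - r) ** 2 for p, r in zip(projected, rgb_features)) / len(rgb_features)

# Hypothetical paired sample: 2-dim depth features, 3-dim RGB features.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
depth = [1.0, 2.0]
rgb = [1.0, 2.0, 3.0]
loss = alignment_loss(depth, rgb, weights)  # 0.0: projection matches exactly
```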

Expert catches: Depth lacks color/texture info; student needs different attention for material properties

QUALITY GATES

Task completion checklist for knowledge distillation:

[ ] Student validation accuracy ≥ 95% of teacher accuracy (or ≥ 105% of independent baseline)
[ ] Knowledge transfer verified: Student generalizes to held-out test set, not just mimicking training
[ ] Capacity appropriateness: Teacher-student gap assessed; progressive learning used if gap >3x performance
[ ] Knowledge type alignment: Response/feature/relation distillation matches student architecture capabilities
[ ] Cross-validation performed: Student tested on different data distribution than teacher training
[ ] Failure mode check: No overconfidence, capacity mismatch, or degradation symptoms detected
[ ] Compression ratio documented: Inference speedup and memory reduction quantified vs. teacher
[ ] Transfer mechanism justified: Offline/online/self-distillation choice explained with rationale
[ ] Boundary conditions tested: Student performance verified at edge cases where teacher might fail
[ ] Knowledge preservation verified: Critical capabilities from teacher successfully transferred and validated
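Two of these gates are directly computable. The sketch below encodes the accuracy gate and a compression summary, assuming accuracies are fractions in [0, 1] and using parameter count as a rough proxy for memory; the function names and inputs are hypothetical.

```python
def accuracy_gate(student_acc, teacher_acc, baseline_acc):
    """Pass if the student reaches 95% of the teacher or 105% of the independent baseline."""
    return student_acc >= 0.95 * teacher_acc or student_acc >= 1.05 * baseline_acc

def compression_summary(teacher_params, student_params, teacher_latency_ms, student_latency_ms):
    """Quantify compression relative to the teacher."""
    return {
        # Assumption: parameter reduction stands in for memory reduction.
        "param_reduction": teacher_params / student_params,
        "speedup": teacher_latency_ms / student_latency_ms,
    }

# Hypothetical measurements.
passed = accuracy_gate(student_acc=0.88, teacher_acc=0.92, baseline_acc=0.85)
summary = compression_summary(175e9, 500e6, 900.0, 30.0)
```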

NOT-FOR BOUNDARIES

Do NOT use knowledge distillation for:

Simple rule-based systems: Use direct rule transfer instead - no learning gap to bridge
When teacher and student solve fundamentally different problems: Use transfer-learning-frameworks for task adaptation
Real-time systems where student must exceed teacher speed: Use model-optimization-techniques for architecture efficiency
When student needs capabilities teacher lacks: Use curriculum-learning-design for progressive skill building
Safety-critical systems without validation frameworks: Use robust-ai-validation for safety assurance first
When interpretability is primary concern: Use explainable-ai-methods - distillation often reduces interpretability

Delegate to other skills:

Model architecture optimization → neural-architecture-search
Cross-task transfer learning → transfer-learning-frameworks
Training data efficiency → few-shot-learning-strategies
Model safety and robustness → robust-ai-validation
Deployment optimization → model-optimization-techniques

Comprehensive survey of knowledge distillation methods, architectures, and applications in neural network compression

tools

Updated Apr 4, 2026


npx skillsauth add curiositech/windags-skills knowledge-distillation-survey

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 4, 2026, 2:15 PM (6.7s, 13 files scanned)

SKILL.md

license: Apache-2.0
name: knowledge-distillation-survey
description: Comprehensive survey of knowledge distillation methods, architectures, and applications in neural network compression
category: Research & Academic

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/knowledge-distillation-survey ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT