CoreML Diagnostics

Quick Reference

| Symptom | First Check | Pattern |
|---------|-------------|---------|
| Model won't load | Deployment target | 1a-1c |
| Slow first load | Cache miss | 2a |
| Slow inference | Compute units | 2b-2c |
| High memory | Concurrent predictions | 3a-3b |
| Bad accuracy after compression | Granularity | 4a-4c |
| Conversion fails | Operation support | 5a-5b |

Decision Tree

CoreML issue
├─ Load failure?
│   ├─ "Unsupported model version" → 1a
│   ├─ "Failed to create compute plan" → 1b
│   └─ Other load error → 1c
├─ Performance issue?
│   ├─ First load slow, subsequent fast? → 2a
│   ├─ All predictions slow? → 2b
│   └─ Slow only on specific device? → 2c
├─ Memory issue?
│   ├─ Memory grows during predictions? → 3a
│   └─ Out of memory on load? → 3b
├─ Accuracy degraded?
│   ├─ After palettization? → 4a
│   ├─ After quantization? → 4b
│   └─ After pruning? → 4c
└─ Conversion issue?
    ├─ Operation not supported? → 5a
    └─ Wrong output? → 5b

Pattern 1a - "Unsupported model version"

Symptom: Model fails to load with version error.

Cause: Model compiled for newer OS than device supports.

Diagnosis:

# Check model's minimum deployment target
import coremltools as ct
model = ct.models.MLModel("Model.mlpackage")
print(model.get_spec().specificationVersion)

| Spec Version | Minimum iOS |
|--------------|-------------|
| 4 | iOS 13 |
| 5 | iOS 14 |
| 6 | iOS 15 |
| 7 | iOS 16 |
| 8 | iOS 17 |
| 9 | iOS 18 |
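
The table above can be turned into a quick pre-flight check. This is a minimal sketch: `MIN_IOS_FOR_SPEC` and `can_load` are hypothetical helper names, and the mapping is copied from the table rather than queried from coremltools.

```python
# Mapping copied from the table above: specification version -> minimum iOS major version.
MIN_IOS_FOR_SPEC = {4: 13, 5: 14, 6: 15, 7: 16, 8: 17, 9: 18}


def can_load(spec_version: int, device_ios_major: int) -> bool:
    """Return True if a model with this spec version should load on the given iOS major version."""
    if spec_version not in MIN_IOS_FOR_SPEC:
        raise ValueError(f"Spec version {spec_version} is not in the table")
    return device_ios_major >= MIN_IOS_FOR_SPEC[spec_version]


if __name__ == "__main__":
    # A spec-version-8 model needs iOS 17, so an iOS 16 device fails the check.
    print(can_load(8, 16))
    print(can_load(7, 16))
```

Feeding it the value printed by `specificationVersion` and the oldest iOS version you support shows whether a re-conversion is needed before shipping.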

Fix: Re-convert with lower deployment target:

mlmodel = ct.convert(
    traced,
    minimum_deployment_target=ct.target.iOS16  # Lower target
)

Tradeoff: Loses newer optimizations (SDPA fusion, per-block quantization, MLTensor).

Pattern 1b - "Failed to create compute plan"

Symptom: Model loads on some devices but not others.

Cause: Unsupported operations for target compute unit.

Diagnosis:

Open model in Xcode
Create Performance Report
Check "Unsupported" operations
Hover for hints

Fix:

// Force CPU-only to bypass unsupported GPU/NE operations
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly
let model = try MLModel(contentsOf: url, configuration: config)

Better fix: Update model precision or operations during conversion:

# Float16 often better supported
mlmodel = ct.convert(traced, compute_precision=ct.precision.FLOAT16)

Pattern 1c - General Load Failures

Symptom: Model fails to load with unclear error.

Checklist:

Check file exists and is readable
Check compiled vs source model (runtime needs .mlmodelc)
Check available disk space (cache needs room)
Check model isn't corrupted (re-convert)

// Debug logging: print the full error, not only its localized description
do {
    _ = try MLModel(contentsOf: url, configuration: MLModelConfiguration())
} catch {
    print("Model load failed: \(error)")
}

Pattern 2a - Slow First Load (Cache Miss)

Symptom: First prediction after install/update is slow, subsequent are fast.

Cause: Device specialization not cached.

Diagnosis:

Profile with Core ML Instrument
Look at Load event label:
- "prepare and cache" = cache miss (slow)
- "cached" = cache hit (fast)

Why cache misses:

First launch after install
System update invalidated cache
Low disk space cleared cache
Model file was modified

Mitigation:

// Warm cache in background at app launch
Task.detached(priority: .background) {
    _ = try? await MLModel.load(contentsOf: modelURL)
}

Note: Cache is tied to (model path + configuration + device). Different configs = different cache entries.

Pattern 2b - All Predictions Slow

Symptom: Predictions consistently slow, not just first one.

Diagnosis:

Create Xcode Performance Report
Check compute unit distribution
Look for high-cost operations

Common causes:

| Cause | Fix |
|-------|-----|
| Running on CPU when GPU/NE available | Check computeUnits config |
| Model too large for Neural Engine | Compress model |
| Frequent CPU↔GPU↔NE transfers | Adjust segmentation |
| Dynamic shapes recompiling | Use fixed/enumerated shapes |

Profile compute unit usage:

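// iOS 17.4+ / macOS 14.4+; assumes an ML Program with a "main" function
let plan = try await MLComputePlan.load(contentsOf: modelURL, configuration: MLModelConfiguration())
if case .program(let program) = plan.modelStructure, let main = program.functions["main"] {
    for op in main.block.operations {
        let preferred = plan.deviceUsage(for: op)?.preferred
        print("\(op.operatorName): \(String(describing: preferred))")
    }
}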

Pattern 2c - Slow on Specific Device

Symptom: Fast on Mac, slow on iPhone (or vice versa).

Cause: Different hardware characteristics.

Diagnosis:

// Check available compute
let devices = MLModel.availableComputeDevices
print(devices)  // Different per device

Common issues:

| Scenario | Cause | Fix |
|----------|-------|-----|
| Fast on M-series Mac, slow on iPhone | Model optimized for GPU | Use palettization (Neural Engine) |
| Fast on iPhone, slow on Intel Mac | No Neural Engine | Use quantization (GPU) |
| Slow on older devices | Less compute power | Use more aggressive compression |

Recommendation: Profile on target devices, not just development Mac.

Pattern 3a - Memory Grows During Predictions

Symptom: Memory increases with each prediction, doesn't release.

Cause: Input/output buffers accumulating from concurrent predictions.

Diagnosis:

Instruments → Allocations + Core ML template
Look for: Many concurrent prediction intervals
Check: MLMultiArray allocations growing

Fix: Limit concurrent predictions:

actor PredictionLimiter {
    private let maxConcurrent = 2
    private var inFlight = 0

    func predict(_ model: MLModel, input: MLFeatureProvider) async throws -> MLFeatureProvider {
        while inFlight >= maxConcurrent {
            await Task.yield()
        }
        inFlight += 1
        defer { inFlight -= 1 }
        return try await model.prediction(from: input)
    }
}
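
The same limiting idea can be sketched in Python with a semaphore, which suspends waiting callers instead of polling in a loop. This is an illustration only: `PredictionLimiter` here is a Python stand-in for the actor above, and `fake_prediction` is a hypothetical placeholder for a real model call.

```python
import asyncio


class PredictionLimiter:
    """Caps the number of predictions running at the same time."""

    def __init__(self, max_concurrent: int = 2) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def predict(self, run_prediction, model_input):
        # Callers beyond the limit are suspended here until a slot frees up.
        async with self._semaphore:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await run_prediction(model_input)
            finally:
                self.in_flight -= 1


async def fake_prediction(x):
    # Hypothetical stand-in for a real model call.
    await asyncio.sleep(0.01)
    return x * 2


async def run_demo():
    limiter = PredictionLimiter(max_concurrent=2)
    results = await asyncio.gather(
        *(limiter.predict(fake_prediction, i) for i in range(8))
    )
    return results, limiter.peak_in_flight


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

Tracking `peak_in_flight` gives a cheap way to confirm in a test that the cap holds, which is the property that keeps input and output buffers from piling up.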

Pattern 3b - Out of Memory on Load

Symptom: App crashes or model fails to load on memory-constrained devices.

Cause: Model too large for device memory.

Diagnosis:

# Check model size
ls -lh Model.mlpackage/Data/com.apple.CoreML/weights/

Fix options:

| Approach | Compression | Memory Impact |
|----------|-------------|---------------|
| 8-bit palettization | 2x smaller | 2x less memory |
| 4-bit palettization | 4x smaller | 4x less memory |
| Pruning (50%) | ~2x smaller | ~2x less memory |

Note: Compressed weights are decompressed just-in-time (iOS 17+), so smaller on-disk = smaller in memory.
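
To make the ratios in the table concrete, here is a back-of-the-envelope estimate. It assumes the ratios are relative to a Float16 baseline and ignores lookup tables, scales, and sparse-index overhead; `estimated_weight_mb` and the 100M-parameter model are hypothetical.

```python
def estimated_weight_mb(num_params: int, bits_per_weight: float, sparsity: float = 0.0) -> float:
    """Rough weight size in MB (10**6 bytes), ignoring LUT, scale, and sparse-index overhead."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    kept = num_params * (1.0 - sparsity)  # weights that still need storage
    return kept * bits_per_weight / 8 / 1_000_000


if __name__ == "__main__":
    params = 100_000_000  # hypothetical 100M-parameter model
    for label, bits, sparsity in [
        ("Float16 baseline", 16, 0.0),
        ("8-bit palettization", 8, 0.0),
        ("4-bit palettization", 4, 0.0),
        ("Float16 + 50% pruning", 16, 0.5),
    ]:
        print(f"{label}: {estimated_weight_mb(params, bits, sparsity):.0f} MB")
```

Comparing the estimate with the size reported by `ls -lh` is a quick sanity check that the compression pass actually touched the weights.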

Pattern 4a - Bad Accuracy After Palettization

Symptom: Model output degraded after palettization.

Diagnosis:

What bit depth? (2-bit most likely to fail)
What granularity? (per-tensor loses more than per-grouped-channel)

Fix progression:

# Step 1: Try grouped channels (iOS 18+)
from coremltools.optimize.coreml import OpPalettizerConfig

config = OpPalettizerConfig(
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16
)

# Step 2: If still bad, try more bits
config = OpPalettizerConfig(nbits=6, ...)

# Step 3: If you still need 4-bit, use training-time palettization (DKM)
from coremltools.optimize.torch.palettization import DKMPalettizer
# ... training-time compression

Key insight: 4-bit per-tensor has only 16 clusters for entire weight matrix. Grouped channels = 16 clusters per 16 channels = much better granularity.
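
The arithmetic behind that insight, as a small sketch. `centroid_count` is a hypothetical helper, and the 1024-channel layer is an example size, not a measurement.

```python
import math


def centroid_count(nbits: int, out_channels: int, group_size=None) -> int:
    """Total palette entries for one weight tensor.

    group_size=None models per-tensor palettization (a single lookup table);
    otherwise each group of `group_size` channels gets its own table.
    """
    entries_per_table = 2 ** nbits
    tables = 1 if group_size is None else math.ceil(out_channels / group_size)
    return tables * entries_per_table


if __name__ == "__main__":
    channels = 1024  # example layer width
    print("per-tensor:", centroid_count(4, channels))
    print("per-grouped-channel (16):", centroid_count(4, channels, group_size=16))
```

For this example layer, grouping gives 64 lookup tables instead of one, so the weights share 1024 cluster centers rather than 16, at the cost of storing the extra tables.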

Pattern 4b - Bad Accuracy After Quantization

Symptom: Model output degraded after INT8/INT4 quantization.

Diagnosis:

What bit depth?
What granularity?

Fix progression:

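# Step 1: Use per-block (iOS 18+)
from coremltools.optimize.coreml import OpLinearQuantizerConfig

config = OpLinearQuantizerConfig(
    dtype="int4",
    granularity="per_block",
    block_size=32
)

# Step 2: Use calibration data (applied to the PyTorch model before conversion)
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)
# LayerwiseCompressor takes a LayerwiseCompressorConfig (for example with the
# "gptq" algorithm), not the OpLinearQuantizerConfig from Step 1
layerwise_config = LayerwiseCompressorConfig.from_dict({"global_config": {"algorithm": "gptq"}})
compressor = LayerwiseCompressor(model, layerwise_config)
quantized = compressor.compress(calibration_loader, device="cpu")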

Note: INT4 quantization works best on Mac GPU. For Neural Engine, prefer palettization.
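
Why per-block granularity helps can be seen with a toy round trip in plain Python. This is a sketch of symmetric absmax quantization under simple assumptions, not coremltools' implementation; `quantize_dequantize`, `mean_abs_error`, and the weight values are made up for illustration.

```python
def quantize_dequantize(values, block_size=None, qmax=7):
    """Symmetric int4-style round trip with one scale per block (or one scale for all values)."""
    block_size = block_size or len(values)
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        scale = absmax / qmax if absmax > 0 else 1.0
        for v in block:
            q = max(-qmax - 1, min(qmax, round(v / scale)))  # clamp to the int4 range
            restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    # Four small weights followed by a block that contains one large outlier.
    weights = [0.01, 0.02, -0.03, 0.04, 0.5, -0.7, 0.6, 7.0]
    per_tensor = mean_abs_error(weights, quantize_dequantize(weights))
    per_block = mean_abs_error(weights, quantize_dequantize(weights, block_size=4))
    print(f"per-tensor error: {per_tensor:.4f}")
    print(f"per-block error:  {per_block:.4f}")
```

With a single scale, the outlier forces every small weight to round to zero. With one scale per block, only the block that contains the outlier pays that cost.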

Pattern 4c - Bad Accuracy After Pruning

Symptom: Model output degraded after weight pruning.

Diagnosis:

What sparsity level?
Post-training or training-time?

Thresholds (model-dependent):

0-30% sparsity: Usually safe
30-50% sparsity: May need calibration
50%+ sparsity: Usually needs training-time
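
The thresholds above can be encoded as a small helper for build scripts. This is a sketch: `pruning_strategy` is a hypothetical name, the boundaries are the rough, model-dependent ones listed above, and treating exactly 30% as "usually safe" is an assumption.

```python
def pruning_strategy(target_sparsity: float) -> str:
    """Map a target sparsity (0.0 to 1.0) to the guidance in the thresholds above."""
    if not 0.0 <= target_sparsity < 1.0:
        raise ValueError("target_sparsity must be in [0, 1)")
    if target_sparsity <= 0.30:
        return "post-training"   # usually safe without extra data
    if target_sparsity < 0.50:
        return "calibration"     # may need calibration data
    return "training-time"       # usually needs training-time pruning


if __name__ == "__main__":
    for sparsity in (0.2, 0.4, 0.75):
        print(sparsity, "->", pruning_strategy(sparsity))
```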

Fix:

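# Use calibration-based pruning (SparseGPT via layer-wise compression)
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

config = LayerwiseCompressorConfig.from_dict({
    "global_config": {"algorithm": "sparse_gpt", "target_sparsity": 0.4},
    "calibration_nsamples": 128,
})
compressor = LayerwiseCompressor(model, config)
sparse = compressor.compress(calibration_loader, device="cpu")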

Pattern 5a - Operation Not Supported

Symptom: Conversion fails with unsupported operation error.

Diagnosis:

Error: "Op 'custom_op' is not supported for conversion"

Options:

Check if op is in coremltools: May need newer version

pip install --upgrade coremltools

Use composite ops: Split into supported primitives

# Instead of custom_op(x)
# Use: supported_op1(supported_op2(x))

Register custom op: Advanced, requires MIL programming

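from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op

@register_torch_op
def custom_op(context, node):
    # Map to MIL operations built with mb.*, then register the result with context.add(...)
    ...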

Pattern 5b - Conversion Succeeds but Wrong Output

Symptom: Model converts but predictions differ from PyTorch.

Diagnosis checklist:

Input normalization: Ensure preprocessing matches

# PyTorch often uses ImageNet normalization
# CoreML may need explicit preprocessing

Shape ordering: PyTorch (NCHW) vs CoreML (NHWC for some ops)

# Check shapes in conversion
ct.convert(..., inputs=[ct.ImageType(shape=(1, 3, 224, 224))])

Precision differences: Float16 may differ from Float32

# Force Float32 to match PyTorch
ct.convert(..., compute_precision=ct.precision.FLOAT32)

Random ops: Dropout, random initialization differ

# Ensure eval mode
model.eval()
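
For the input normalization item in the checklist, the usual mismatch is arithmetic. The sketch below assumes the converted model applies `scale * x + bias` to 0-255 pixel values with a single shared `scale`, which is how the `scale` and `bias` arguments of `ct.ImageType` behave. The mean and std values are the common torchvision ImageNet constants, and the helper names are hypothetical.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def image_scale_and_bias(mean, std):
    """Approximate (x / 255 - mean) / std as scale * x + bias with one shared scale."""
    avg_std = sum(std) / len(std)
    scale = 1.0 / (255.0 * avg_std)
    bias = [-m / s for m, s in zip(mean, std)]
    return scale, bias


def exact_normalize(pixel, mean, std):
    # What the PyTorch preprocessing computes per channel.
    return [(p / 255.0 - m) / s for p, m, s in zip(pixel, mean, std)]


def approx_normalize(pixel, scale, bias):
    # What a single-scale, per-channel-bias preprocessing computes.
    return [scale * p + b for p, b in zip(pixel, bias)]


if __name__ == "__main__":
    scale, bias = image_scale_and_bias(IMAGENET_MEAN, IMAGENET_STD)
    pixel = [255, 255, 255]
    exact = exact_normalize(pixel, IMAGENET_MEAN, IMAGENET_STD)
    approx = approx_normalize(pixel, scale, bias)
    print("scale:", scale, "bias:", bias)
    print("max channel difference:", max(abs(e - a) for e, a in zip(exact, approx)))
```

Because the three channels have slightly different standard deviations, a shared scale leaves a small residual difference per channel. If that matters for your model, keep the normalization inside the traced PyTorch module instead.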

Debug:

# Compare final outputs (the "input" and "output" names must match the converted model's feature names)
import numpy as np

torch_output = model(input).detach().numpy()
coreml_output = mlmodel.predict({"input": input.numpy()})["output"]

print(f"Max diff: {np.max(np.abs(torch_output - coreml_output))}")
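
A single max-diff number is easier to act on with an explicit tolerance. The sketch below works on flattened lists of floats; `compare_outputs` is a hypothetical helper and the default tolerances are assumptions you should tune for Float16 versus Float32 models.

```python
import math


def compare_outputs(reference, candidate, atol=1e-2, rtol=1e-2):
    """Return (max_abs_diff, within_tolerance) for two flat lists of floats."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths; check shapes first")
    max_abs_diff = max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)
    within = all(
        math.isclose(r, c, rel_tol=rtol, abs_tol=atol)
        for r, c in zip(reference, candidate)
    )
    return max_abs_diff, within


if __name__ == "__main__":
    torch_probs = [0.10, 0.70, 0.20]      # placeholder reference values
    coreml_probs = [0.101, 0.698, 0.201]  # placeholder converted-model values
    print(compare_outputs(torch_probs, coreml_probs))
```

A length mismatch raises immediately, which separates shape problems from precision problems before any numeric comparison happens.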

Pressure Scenario - "Model works on simulator but not device"

Wrong approach: Assume simulator bug, ignore.

Right approach:

Check model spec version vs device iOS version (Pattern 1a)
Check compute unit availability (Pattern 2c)
Profile on actual device, not simulator
Simulator uses host Mac's GPU/CPU, not device Neural Engine

Pressure Scenario - "Ship now, optimize later"

Wrong approach: Compress to smallest possible size without testing.

Right approach:

Ship Float16 baseline first
Profile on target devices
Apply compression incrementally with accuracy testing
Document compression settings for future optimization
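
The incremental approach above can be sketched as a loop that stops at the first setting that costs too much accuracy. Everything here is illustrative: `pick_compression` is a hypothetical helper, and the accuracy numbers are placeholders, not measurements.

```python
def pick_compression(candidates, evaluate, baseline_accuracy, max_drop=0.01):
    """Walk candidates from least to most aggressive; keep the last one within the accuracy budget."""
    chosen = None
    for name in candidates:
        accuracy = evaluate(name)
        if baseline_accuracy - accuracy > max_drop:
            break  # stop at the first setting that costs too much accuracy
        chosen = name
    return chosen


if __name__ == "__main__":
    # Placeholder accuracies; in practice evaluate() would run your test set on device.
    placeholder_accuracy = {
        "8-bit palettization": 0.910,
        "6-bit palettization": 0.905,
        "4-bit palettization": 0.871,
    }
    order = list(placeholder_accuracy)
    print(pick_compression(order, placeholder_accuracy.get, baseline_accuracy=0.912))
```

Recording the chosen setting and the measured drop alongside the model covers the documentation step in the list above.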

Diagnostic Checklist

When CoreML isn't working:

[ ] Check deployment target matches device iOS
[ ] Check model file is compiled (.mlmodelc)
[ ] Profile load: cached vs uncached
[ ] Profile prediction: which compute units
[ ] Check memory: concurrent predictions limited
[ ] For compression issues: try higher granularity
[ ] For conversion issues: check op support, precision

Resources

WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161

Docs: /coreml, /coreml/mlmodel

Skills: coreml, coreml-ref

