ai/ios-skills/ios-axiom-ios-ml/coreml-diag/SKILL.md
CoreML diagnostics - model load failures, slow inference, memory issues, compression accuracy loss, compute unit problems, conversion errors.
npx skillsauth add kurko/dotfiles coreml-diagInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
| Symptom | First Check | Pattern | |---------|-------------|---------| | Model won't load | Deployment target | 1a-1c | | Slow first load | Cache miss | 2a | | Slow inference | Compute units | 2b-2c | | High memory | Concurrent predictions | 3a-3b | | Bad accuracy after compression | Granularity | 4a-4c | | Conversion fails | Operation support | 5a-5b |
CoreML issue
├─ Load failure?
│ ├─ "Unsupported model version" → 1a
│ ├─ "Failed to create compute plan" → 1b
│ └─ Other load error → 1c
├─ Performance issue?
│ ├─ First load slow, subsequent fast? → 2a
│ ├─ All predictions slow? → 2b
│ └─ Slow only on specific device? → 2c
├─ Memory issue?
│ ├─ Memory grows during predictions? → 3a
│ └─ Out of memory on load? → 3b
├─ Accuracy degraded?
│ ├─ After palettization? → 4a
│ ├─ After quantization? → 4b
│ └─ After pruning? → 4c
└─ Conversion issue?
├─ Operation not supported? → 5a
└─ Wrong output? → 5b
Symptom: Model fails to load with version error.
Cause: Model compiled for newer OS than device supports.
Diagnosis:
# Check model's minimum deployment target
import coremltools as ct
model = ct.models.MLModel("Model.mlpackage")
print(model.get_spec().specificationVersion)
| Spec Version | Minimum iOS | |--------------|-------------| | 4 | iOS 13 | | 5 | iOS 14 | | 6 | iOS 15 | | 7 | iOS 16 | | 8 | iOS 17 | | 9 | iOS 18 |
Fix: Re-convert with lower deployment target:
mlmodel = ct.convert(
traced,
minimum_deployment_target=ct.target.iOS16 # Lower target
)
Tradeoff: Loses newer optimizations (SDPA fusion, per-block quantization, MLTensor).
Symptom: Model loads on some devices but not others.
Cause: Unsupported operations for target compute unit.
Diagnosis:
Fix:
// Force CPU-only to bypass unsupported GPU/NE operations
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly
let model = try MLModel(contentsOf: url, configuration: config)
Better fix: Update model precision or operations during conversion:
# Float16 often better supported
mlmodel = ct.convert(traced, compute_precision=ct.precision.FLOAT16)
Symptom: Model fails to load with unclear error.
Checklist:
.mlmodelc)// Debug logging
let config = MLModelConfiguration()
config.parameters = [.reporter: { print($0) }] // iOS 17+
Symptom: First prediction after install/update is slow, subsequent are fast.
Cause: Device specialization not cached.
Diagnosis:
Why cache misses:
Mitigation:
// Warm cache in background at app launch
Task.detached(priority: .background) {
_ = try? await MLModel.load(contentsOf: modelURL)
}
Note: Cache is tied to (model path + configuration + device). Different configs = different cache entries.
Symptom: Predictions consistently slow, not just first one.
Diagnosis:
Common causes:
| Cause | Fix |
|-------|-----|
| Running on CPU when GPU/NE available | Check computeUnits config |
| Model too large for Neural Engine | Compress model |
| Frequent CPU↔GPU↔NE transfers | Adjust segmentation |
| Dynamic shapes recompiling | Use fixed/enumerated shapes |
Profile compute unit usage:
let plan = try await MLComputePlan.load(contentsOf: modelURL)
for op in plan.modelStructure.operations {
let info = plan.computeDeviceInfo(for: op)
print("\(op.name): \(info.preferredDevice)")
}
Symptom: Fast on Mac, slow on iPhone (or vice versa).
Cause: Different hardware characteristics.
Diagnosis:
// Check available compute
let devices = MLModel.availableComputeDevices
print(devices) // Different per device
Common issues:
| Scenario | Cause | Fix | |----------|-------|-----| | Fast on M-series Mac, slow on iPhone | Model optimized for GPU | Use palettization (Neural Engine) | | Fast on iPhone, slow on Intel Mac | No Neural Engine | Use quantization (GPU) | | Slow on older devices | Less compute power | Use more aggressive compression |
Recommendation: Profile on target devices, not just development Mac.
Symptom: Memory increases with each prediction, doesn't release.
Cause: Input/output buffers accumulating from concurrent predictions.
Diagnosis:
Instruments → Allocations + Core ML template
Look for: Many concurrent prediction intervals
Check: MLMultiArray allocations growing
Fix: Limit concurrent predictions:
actor PredictionLimiter {
private let maxConcurrent = 2
private var inFlight = 0
func predict(_ model: MLModel, input: MLFeatureProvider) async throws -> MLFeatureProvider {
while inFlight >= maxConcurrent {
await Task.yield()
}
inFlight += 1
defer { inFlight -= 1 }
return try await model.prediction(from: input)
}
}
Symptom: App crashes or model fails to load on memory-constrained devices.
Cause: Model too large for device memory.
Diagnosis:
# Check model size
ls -lh Model.mlpackage/Data/com.apple.CoreML/weights/
Fix options:
| Approach | Compression | Memory Impact | |----------|-------------|---------------| | 8-bit palettization | 2x smaller | 2x less memory | | 4-bit palettization | 4x smaller | 4x less memory | | Pruning (50%) | ~2x smaller | ~2x less memory |
Note: Compressed weights are decompressed just-in-time (iOS 17+), so smaller on-disk = smaller in memory.
Symptom: Model output degraded after palettization.
Diagnosis:
Fix progression:
# Step 1: Try grouped channels (iOS 18+)
config = OpPalettizerConfig(
nbits=4,
granularity="per_grouped_channel",
group_size=16
)
# Step 2: If still bad, try more bits
config = OpPalettizerConfig(nbits=6, ...)
# Step 3: If still need 4-bit, use calibration
from coremltools.optimize.torch.palettization import DKMPalettizer
# ... training-time compression
Key insight: 4-bit per-tensor has only 16 clusters for entire weight matrix. Grouped channels = 16 clusters per 16 channels = much better granularity.
Symptom: Model output degraded after INT8/INT4 quantization.
Diagnosis:
Fix progression:
# Step 1: Use per-block (iOS 18+)
config = OpLinearQuantizerConfig(
dtype="int4",
granularity="per_block",
block_size=32
)
# Step 2: Use calibration data
from coremltools.optimize.torch.quantization import LayerwiseCompressor
compressor = LayerwiseCompressor(model, config)
quantized = compressor.compress(calibration_loader)
Note: INT4 quantization works best on Mac GPU. For Neural Engine, prefer palettization.
Symptom: Model output degraded after weight pruning.
Diagnosis:
Thresholds (model-dependent):
Fix:
# Use calibration-based pruning
from coremltools.optimize.torch.pruning import LayerwiseCompressor
config = MagnitudePrunerConfig(
target_sparsity=0.4,
n_samples=128
)
compressor = LayerwiseCompressor(model, config)
sparse = compressor.compress(calibration_loader)
Symptom: Conversion fails with unsupported operation error.
Diagnosis:
Error: "Op 'custom_op' is not supported for conversion"
Options:
pip install --upgrade coremltools
# Instead of custom_op(x)
# Use: supported_op1(supported_op2(x))
from coremltools.converters.mil import Builder as mb
@mb.register_torch_op
def custom_op(context, node):
# Map to MIL operations
...
Symptom: Model converts but predictions differ from PyTorch.
Diagnosis checklist:
# PyTorch often uses ImageNet normalization
# CoreML may need explicit preprocessing
# Check shapes in conversion
ct.convert(..., inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
# Force Float32 to match PyTorch
ct.convert(..., compute_precision=ct.precision.FLOAT32)
# Ensure eval mode
model.eval()
Debug:
# Compare outputs layer by layer
import numpy as np
torch_output = model(input).detach().numpy()
coreml_output = mlmodel.predict({"input": input.numpy()})["output"]
print(f"Max diff: {np.max(np.abs(torch_output - coreml_output))}")
Wrong approach: Assume simulator bug, ignore.
Right approach:
Wrong approach: Compress to smallest possible size without testing.
Right approach:
When CoreML isn't working:
WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161
Docs: /coreml, /coreml/mlmodel
Skills: coreml, coreml-ref
data-ai
Merge the current worktree branch into main and sync main back. Use when the user says "merge to main", "ship it", "merge and continue", or after completing a task in a worktree and wanting to continue with the next one.
tools
Synchronize AI agent skills, commands, configs, permissions, hooks, and instructions across Claude Code, Codex CLI, and other Agent Skills-compatible tools. Use when the user asks to pull skills from Claude into Codex, sync Codex work back to Claude, migrate agent commands, reconcile frontmatter, update permissions, or keep agent setup files in parity.
testing
Write or update UI-independent use cases for QA. Use when the user says "write use cases", "add use cases", "QA use cases", "update use cases", "compose use cases", or when starting implementation of a new feature (after plan approval). Also activates for "what should we test", "regression cases", or "use cases for QA".
documentation
Skill on how to write a task. Use when user asks you to write a task (for Asana, Linear, Jira, Notion and equivalent). Also activates when user says "create task", "write task", or similar task creation workflow requests.