skills/apple-on-device-ai/SKILL.md
Integrate on-device AI using Foundation Models framework, Core ML, and open-source LLM runtimes on Apple Silicon. Covers Foundation Models (LanguageModelSession, @Generable, @Guide, SystemLanguageModel, structured output, tool calling), Core ML (coremltools, model conversion, quantization, palettization, pruning, Neural Engine, MLTensor), MLX Swift (transformer inference, unified memory), and llama.cpp (GGUF, cross-platform LLM). Use when building tool-calling AI features, working with guided generation schemas, converting models, or running on-device inference.
Guide for selecting, deploying, and optimizing on-device ML models. Covers Apple Foundation Models, Core ML, MLX Swift, and llama.cpp.
Use this decision tree to pick the right framework for your use case.
Foundation Models
When to use: Text generation, summarization, entity extraction, structured output, and short dialog on iOS 26+ / macOS 26+ devices with Apple Intelligence enabled. No app-managed API key, network round trip, or model hosting is needed, but you still have to handle system model asset readiness.
Best for:
- Structured output with @Generable types
- Tool calling with the Tool protocol

Not suited for: complex math, code generation, tasks that depend on factual accuracy, or apps targeting devices older than iOS 26.
Core ML
When to use: Deploying custom trained models (vision, NLP, audio) across all Apple platforms, including models converted from PyTorch, TensorFlow, or scikit-learn with coremltools.
MLX Swift
When to use: Running specific open-source LLMs (Llama, Mistral, Qwen, Gemma) on Apple Silicon with maximum throughput, and for research and prototyping.
Best for: pre-converted models from the mlx-community organization on Hugging Face.

llama.cpp
When to use: Cross-platform LLM inference using the GGUF model format, and production deployments that need broad device support.
| Scenario | Framework |
|---|---|
| Text generation on Apple Intelligence devices (iOS 26+) | Foundation Models |
| Structured output from on-device LLM | Foundation Models (@Generable) |
| Image classification, object detection | Core ML |
| Custom model from PyTorch/TensorFlow | Core ML + coremltools |
| Running specific open-source LLMs | MLX Swift or llama.cpp |
| Maximum throughput on Apple Silicon | MLX Swift |
| Cross-platform LLM inference | llama.cpp |
| OCR and text recognition | Vision framework |
| Sentiment analysis, NER, tokenization | Natural Language framework |
| Training custom classifiers on device | Create ML |
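The table can be read as a lookup with one guard: Foundation Models applies only on iOS 26+ / macOS 26+ with Apple Intelligence enabled. The Python sketch below encodes that reading. The scenario keys and the pick_framework name are hypothetical. Using MLX Swift or llama.cpp as the fallback is an assumption, mirroring the Foundation Models + MLX fallback pattern described later.

```python
# Hypothetical encoding of the scenario table; keys are illustrative labels, not an API.
FRAMEWORK_BY_SCENARIO = {
    "text_generation": "Foundation Models",
    "structured_output": "Foundation Models (@Generable)",
    "image_classification": "Core ML",
    "custom_pytorch_or_tensorflow_model": "Core ML + coremltools",
    "open_source_llm": "MLX Swift or llama.cpp",
    "max_throughput_apple_silicon": "MLX Swift",
    "cross_platform_llm": "llama.cpp",
    "ocr": "Vision framework",
    "sentiment_ner_tokenization": "Natural Language framework",
    "on_device_classifier_training": "Create ML",
}


def pick_framework(scenario, os_major, apple_intelligence_enabled):
    """Look up the scenario, then fall back when Foundation Models cannot run."""
    choice = FRAMEWORK_BY_SCENARIO[scenario]
    needs_system_model = choice.startswith("Foundation Models")
    if needs_system_model and (os_major < 26 or not apple_intelligence_enabled):
        # Assumed fallback: run an open-source LLM in-process instead.
        return "MLX Swift or llama.cpp"
    return choice
```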
Foundation Models exposes an on-device language model optimized for Apple Silicon. It is available on devices that support Apple Intelligence (iOS 26+, macOS 26+).
- Read contextSize for the context window limit.
- Check supportsLocale(_:) against Locale.current and preferred fallbacks; do not raw-match supportedLanguages.

Always check availability before using the model, and never crash when it is unavailable.
import FoundationModels

switch SystemLanguageModel.default.availability {
case .available:
    guard SystemLanguageModel.default.supportsLocale(Locale.current) else {
        // Use locale fallback before generating
        break
    }
    // Proceed with model usage
case .unavailable(.appleIntelligenceNotEnabled):
    // Guide user to enable Apple Intelligence in Settings
    break
case .unavailable(.modelNotReady):
    // System model assets are not ready; show loading state
    break
case .unavailable(.deviceNotEligible):
    // Device cannot run Apple Intelligence; use fallback
    break
case .unavailable(let reason):
    // Unknown or future unavailable reason; use fallback and log reason
    print("Foundation Models unavailable: \(reason)")
}
// Basic session
let session = LanguageModelSession()

// Session with instructions
let cookingSession = LanguageModelSession {
    "You are a helpful cooking assistant."
}

// Session with tools (weatherTool and recipeTool are your own Tool values)
let toolSession = LanguageModelSession(
    tools: [weatherTool, recipeTool]
) {
    "You are a helpful assistant with access to tools."
}
Key rules:
- A session handles one request at a time (check session.isResponding).
- Call session.prewarm() before user interaction for a faster first response.
- Resume a saved conversation with LanguageModelSession(model: model, tools: [], transcript: savedTranscript).

The @Generable macro creates compile-time schemas for type-safe structured output:
@Generable
struct Recipe {
@Guide(description: "The recipe name")
var name: String
@Guide(description: "Cooking steps", .count(3))
var steps: [String]
@Guide(description: "Prep time in minutes", .range(1...120))
var prepTime: Int
}
let response = try await session.respond(
to: "Suggest a quick pasta recipe",
generating: Recipe.self
)
print(response.content.name)
@Guide constraints:

| Constraint | Purpose |
|---|---|
| description: | Natural language hint for generation |
| .anyOf([values]) | Restrict to enumerated string values |
| .count(n) | Fixed array length |
| .range(min...max) | Numeric range |
| .minimum(n) / .maximum(n) | One-sided numeric bound |
| .minimumCount(n) / .maximumCount(n) | Array length bounds |
| .constant(value) | Always returns this value |
| .pattern(regex) | String format enforcement |
| .element(guide) | Guide applied to each array element |
Properties generate in declaration order. Place foundational data before dependent data for better results.
let stream = session.streamResponse(
to: "Suggest a recipe",
generating: Recipe.self
)
for try await snapshot in stream {
// snapshot.content is Recipe.PartiallyGenerated (all properties optional)
if let name = snapshot.content.name { updateNameLabel(name) }
}
struct WeatherTool: Tool {
let name = "weather"
let description = "Get current weather for a city."
@Generable
struct Arguments {
@Guide(description: "The city name")
var city: String
}
func call(arguments: Arguments) async throws -> String {
let weather = try await fetchWeather(arguments.city)
return weather.description
}
}
Register only the tools a session needs, at session creation. Tool conforms to Sendable, so tool types must be safe to share across concurrency domains. Tool descriptions and @Generable argument schemas consume the shared context window. The model decides when to call tools, so put deterministic, required data directly into the prompt and reserve autonomous tools for dynamic lookups.
do {
    let response = try await session.respond(to: prompt)
    print(response.content)
} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation:
        break  // Content triggered safety filters
    case .exceededContextWindowSize:
        break  // Too many tokens; summarize and retry
    case .concurrentRequests:
        break  // Another request is in progress on this session
    case .unsupportedLanguageOrLocale:
        break  // Current locale not supported
    case .unsupportedGuide:
        break  // A @Guide constraint is not supported
    case .assetsUnavailable:
        break  // Model assets not available on device
    case .refusal:
        break  // Model refused; the associated Refusal's explanation says why
    case .rateLimited:
        break  // Too many requests; back off and retry
    case .decodingFailure:
        break  // Response could not be decoded into the expected type
    default:
        break  // Future error cases
    }
} catch {
    print("Generation failed: \(error)")  // Other errors, such as cancellation
}
let options = GenerationOptions(
sampling: .random(top: 40),
temperature: 0.7,
maximumResponseTokens: 512
)
let response = try await session.respond(to: prompt, options: options)
Sampling modes: .greedy, .random(top:seed:), .random(probabilityThreshold:seed:).
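The three sampling modes differ in which tokens stay eligible at each step. .greedy always takes the most likely token. .random(top:) keeps a fixed number of the most likely tokens (top-k). .random(probabilityThreshold:) keeps the smallest set whose cumulative probability reaches the threshold (nucleus, or top-p). Temperature rescales logits before any of this. The plain-Python sketch below shows these strategies on a toy 4-token vocabulary. It is not Apple's implementation, and all function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Greedy decoding: always pick the single most likely token id."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k(probs, k):
    """Top-k: keep a fixed number of the most likely token ids."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


def top_p(probs, threshold):
    """Nucleus: keep the smallest prefix of ranked ids whose probability sum reaches threshold."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= threshold:
            break
    return kept


def sample(probs, candidates, seed=None):
    """Draw one id from the candidates, weighted by probability; a fixed seed repeats the draw."""
    rng = random.Random(seed)
    return rng.choices(candidates, weights=[probs[i] for i in candidates], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
probs = softmax(logits)
```

Passing a seed is analogous to the seed: argument of the random modes: it makes a draw repeatable for testing.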
Use tokenCount(for:) to monitor the context window budget.

Foundation Models supports specialized use cases via SystemLanguageModel.UseCase:
- .general -- Default for text generation, summarization, and dialog
- .contentTagging -- Optimized for categorization and labeling tasks

Load fine-tuned adapters for specialized behavior (requires an entitlement):
let adapter = try SystemLanguageModel.Adapter(name: "my-adapter")
try await adapter.compile()
let model = SystemLanguageModel(adapter: adapter, guardrails: .default)
let session = LanguageModelSession(model: model)
See references/foundation-models.md for the complete Foundation Models API reference.
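The context-window guidance (measure with tokenCount(for:), summarize when over budget) comes down to arithmetic. Instructions, conversation history, the prompt, and the reserved maximumResponseTokens must all fit inside contextSize. Below is a minimal Python sketch under two assumptions: a hypothetical word-count tokenizer stands in for tokenCount(for:), and the window is 4,096 tokens. It drops the oldest turns, where a real app would summarize them instead.

```python
CONTEXT_SIZE = 4096  # assumed window; read the real limit from the model at runtime


def count_tokens(text):
    """Hypothetical stand-in for tokenCount(for:): counts whitespace-separated words."""
    return len(text.split())


def fit_history(instructions, turns, prompt, max_response_tokens, context_size=CONTEXT_SIZE):
    """Return the most recent turns that fit alongside instructions, prompt, and response budget."""
    kept = list(turns)

    def used():
        fixed = count_tokens(instructions) + count_tokens(prompt) + max_response_tokens
        return fixed + sum(count_tokens(turn) for turn in kept)

    # Drop the oldest turns first; a real app would summarize them instead of discarding.
    while kept and used() > context_size:
        kept.pop(0)
    if used() > context_size:
        raise ValueError("instructions + prompt + response budget exceed the context window")
    return kept
```

Reserving the response budget up front matters: a prompt that fits on its own can still overflow once the model starts generating.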
Core ML is Apple's framework for deploying trained models. It automatically dispatches work to the best available compute unit (CPU, GPU, or Neural Engine).
| Format | Extension | When to Use |
|---|---|---|
| .mlpackage | Directory (mlprogram) | All new models (iOS 15+) |
| .mlmodel | Single file (neuralnetwork) | Legacy only (iOS 11-14) |
| .mlmodelc | Compiled | Pre-compiled for faster loading |
Always use mlprogram (.mlpackage) for new work.
import coremltools as ct
import torch

# PyTorch conversion (torch.jit.trace); `model` is your torch.nn.Module and
# `example_input` is a sample tensor of shape (1, 3, 224, 224)
model.eval()  # CRITICAL: always call eval() before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 3, 224, 224), name="image")],
    minimum_deployment_target=ct.target.iOS18,
    convert_to="mlprogram",
)
mlmodel.save("Model.mlpackage")
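Why eval() is critical: torch.jit.trace records the operations that actually run for the example input. A model left in training mode therefore bakes training-time behavior, such as dropout and batch-norm batch statistics, into the converted model. The pure-Python sketch below shows inverted dropout, the scheme torch.nn.Dropout uses. In training mode it randomly zeros values; in eval mode it passes them through unchanged. The dropout function is illustrative, not PyTorch code.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout.

    Training: zero each value with probability p and scale survivors by 1 / (1 - p).
    Eval: return the values unchanged, which is what inference should see.
    """
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


x = [1.0, 2.0, 3.0, 4.0]
eval_out = dropout(x, 0.5, training=False, rng=random.Random(0))  # always equal to x
train_out = dropout(x, 0.5, training=True, rng=random.Random(0))  # random zeros, survivors doubled
```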
| Technique | Size Reduction | Accuracy Impact | Best Compute Unit |
|---|---|---|---|
| INT8 per-channel | ~4x | Low | CPU/GPU |
| INT4 per-block | ~8x | Medium | GPU |
| Palettization 4-bit | ~8x | Low-Medium | Neural Engine |
| W8A8 (weights+activations) | ~4x | Low | ANE (A17 Pro/M4+) |
| Pruning 75% | ~4x | Medium | CPU/ANE |
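To see where the INT8 per-channel figure comes from, here is a pure-Python sketch of symmetric per-channel quantization. Each output channel (row) gets its own float scale, so the largest magnitude in each row maps to ±127. The ~4x ratio assumes a float32 baseline; passing baseline_bits=16 shows the ratio against float16 weights. The function names are hypothetical. In practice coremltools applies compression through its ct.optimize.coreml module rather than code like this.

```python
def quantize_int8_per_channel(weights):
    """Symmetric per-channel INT8: each row (output channel) gets its own float scale."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Recover approximate float weights as q * scale, per channel."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


def compression_ratio(rows, cols, baseline_bits=32):
    """Original size divided by INT8 weights plus one 32-bit scale per row."""
    original_bits = rows * cols * baseline_bits
    compressed_bits = rows * cols * 8 + rows * 32
    return original_bits / compressed_bits


w = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
q, s = quantize_int8_per_channel(w)
restored = dequantize(q, s)
```

Per-channel scales keep a single large row from crushing the resolution of every other row, which is why this technique's accuracy impact is low.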
This skill owns Python-side conversion, compression, profiling, and framework
selection. Use the sibling coreml skill for Swift app integration, prediction
APIs, runtime configuration, Vision request wiring, and detailed model loading.
See references/coreml-conversion.md for the full conversion pipeline and references/coreml-optimization.md for optimization techniques.
MLX Swift is the Swift API for MLX, Apple's ML framework for Apple Silicon. It offers the highest sustained generation throughput of the options here, thanks to the unified memory architecture.
import MLX
import MLXLLM
import MLXLMCommon
import MLXLMHFAPI
let container = try await LLMModelFactory.shared.loadContainer(
from: HubClient.default,
using: TokenizersLoader(),
configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)
let session = ChatSession(container)
print(try await session.respond(to: "Hello"))
| Device | RAM | Recommended Model | RAM Usage |
|---|---|---|---|
| iPhone 12-14 | 4-6 GB | SmolLM2-135M or Qwen 2.5 0.5B | ~0.3 GB |
| iPhone 15 Pro+ | 8 GB | Gemma 3n E4B 4-bit | ~3.5 GB |
| Mac 8 GB | 8 GB | Llama 3.2 3B 4-bit | ~3 GB |
| Mac 16 GB+ | 16 GB+ | Mistral 7B 4-bit | ~6 GB |
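A rough way to sanity-check the table: raw weight storage is parameters × bits ÷ 8 bytes. A 7B model at 4 bits is therefore about 3.5 GB of weights alone. Measured usage runs higher because of the KV cache, activations, quantization scales, and runtime buffers. The sketch below adds a budget check. Its app_share and runtime_overhead_gb defaults are assumed, deliberately conservative values, not Apple figures. They reflect the fact that iOS apps get only a fraction of physical RAM.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Raw weight storage in decimal GB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_on_device(num_params, bits_per_weight, device_ram_gb,
                   app_share=0.5, runtime_overhead_gb=1.0):
    """Hypothetical budget check.

    app_share: assumed fraction of device RAM the app can safely use.
    runtime_overhead_gb: assumed room for KV cache, activations, and buffers.
    """
    needed = weight_memory_gb(num_params, bits_per_weight) + runtime_overhead_gb
    return needed <= device_ram_gb * app_share
```

Tune both assumptions against measured memory on real devices before relying on them.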
Memory management:
- Cap the MLX buffer cache, for example Memory.cacheLimit = 512 * 1024 * 1024 (512 MB).
- Call Memory.clearCache() after generation-heavy phases.

See references/mlx-swift.md for full MLX Swift patterns and llama.cpp integration.
When an app needs multiple AI backends (e.g., Foundation Models + MLX fallback):
func respond(to prompt: String) async throws -> String {
if SystemLanguageModel.default.isAvailable {
return try await foundationModelsRespond(prompt)
} else if canLoadMLXModel() {
return try await mlxRespond(prompt)
} else {
throw AIError.noBackendAvailable
}
}
Serialize all model access through a coordinator actor to prevent contention:
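actor ModelCoordinator {
    // Actors are reentrant: awaiting work() directly lets calls interleave, so chain each job after the last.
    private var tail: Task<Void, Never>?
    func withExclusiveAccess<T: Sendable>(_ work: @escaping @Sendable () async throws -> T) async throws -> T {
        let previous = tail
        let job = Task { _ = await previous?.value; return try await work() }
        tail = Task { _ = try? await job.value }
        return try await job.value
    }
}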
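Serializing model access only works if exclusivity is held across the suspension point. Swift actors are reentrant: while one call awaits work(), another call can enter the actor. The same one-job-at-a-time policy is easy to see in Python's asyncio, where a lock is held across the await. This is a sketch of the policy, not a Swift implementation. ModelCoordinator, with_exclusive_access, and the fake_generation job are hypothetical names.

```python
import asyncio


class ModelCoordinator:
    """Lets one model job run at a time; waiting callers resume in FIFO order."""

    def __init__(self):
        self._lock = asyncio.Lock()

    async def with_exclusive_access(self, work):
        # Holding the lock across the await is what makes access exclusive.
        async with self._lock:
            return await work()


async def run_demo(job_count=5):
    coordinator = ModelCoordinator()
    active = 0
    peak = 0

    async def fake_generation(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a slow model call
        active -= 1
        return i

    results = await asyncio.gather(
        *(coordinator.with_exclusive_access(lambda i=i: fake_generation(i)) for i in range(job_count))
    )
    return results, peak
```

Running asyncio.run(run_demo()) returns all five results in order, and the peak concurrency is 1.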
For custom Core ML models, this skill covers only the conversion and optimization
handoff. Route Swift app integration, model loading, Vision wiring, and the
prediction lifecycle to the coreml skill. Keep private user content, such as
journals, on device unless the product explicitly opts into a non-local fallback.
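Performance tips:
- Call session.prewarm() for Foundation Models before user interaction.
- Use pre-compiled .mlmodelc models for faster loading.

Common mistakes:
- Skipping the availability check. Not checking SystemLanguageModel.default.availability leaves unsupported devices with failures instead of fallback UI.
- Ignoring the context window. Monitor usage with tokenCount(for:) and summarize when needed.
- Sending concurrent requests to one session. LanguageModelSession supports one request at a time. Check session.isResponding or serialize access.
- Forgetting model.eval() before Core ML tracing. PyTorch models must be in eval mode before torch.jit.trace; otherwise training-mode behavior such as dropout corrupts the traced output.
- Using the legacy format. Always use mlprogram (.mlpackage) for new Core ML models; the legacy neuralnetwork format is deprecated.
- Letting the MLX cache grow. Release cached buffers with Memory.clearCache().

Review checklist:
- @Generable properties in logical generation order
- Prompts and schemas fit the context window (check contextSize)
- Shared AI types are Sendable-conformant or @MainActor-isolated

References: references/foundation-models.md covers @Generable, tool calling, and prompt design.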