skills/cloud/aws/sagemaker-mlops/SKILL.md
Use when building ML training/serving pipelines on AWS SageMaker, implementing MLOps with SageMaker Pipelines and Model Registry, monitoring models in production, or optimizing training costs with Spot instances. Covers AWS MLA-C01 exam domains.
npx skillsauth add kienbui1995/magic-powers sagemaker-mlops
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
| Option | Cost | Best For |
|--------|------|---------|
| On-Demand instances | Full price | Short jobs, time-critical, no interruption risk |
| Spot training | Up to 90% savings | Long batch jobs; must use checkpointing |
| SageMaker Training Warm Pools | Reserve compute between runs | Iterative development (reduces startup time) |
Spot training requirements:
s3://bucket/checkpoints/job-name/
Managed Spot Training code:
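```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    # ... other Estimator arguments (elided in the original)
    use_spot_instances=True,
    max_run=3600,   # max training time (seconds)
    max_wait=7200,  # max total time incl. waiting for Spot capacity; must be >= max_run
    checkpoint_s3_uri="s3://bucket/checkpoints/",  # S3 location checkpoints are synced to
    checkpoint_local_path="/opt/ml/checkpoints",   # local directory the training script writes to
)
```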
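On the training-script side, a Spot interruption is only recoverable if the script resumes from the newest checkpoint that SageMaker restored into the local checkpoint directory. A minimal sketch of that resume logic, assuming checkpoints are saved as files named `epoch-<N>.ckpt`; the naming scheme and the helper names `find_latest_checkpoint` and `starting_epoch` are hypothetical, not something SageMaker prescribes:

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical naming scheme for this sketch: epoch-<N>.ckpt
CHECKPOINT_PATTERN = re.compile(r"^epoch-(\d+)\.ckpt$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[Tuple[int, Path]]:
    """Return (epoch, path) of the newest checkpoint, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    found = []
    for entry in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(entry.name)
        if match:
            found.append((int(match.group(1)), entry))
    # Compare by epoch number so that epoch-10 ranks above epoch-2.
    return max(found, key=lambda item: item[0]) if found else None


def starting_epoch(checkpoint_dir: str = "/opt/ml/checkpoints") -> int:
    """Epoch to start from: 0 on a fresh run, last saved epoch + 1 after a restart."""
    latest = find_latest_checkpoint(checkpoint_dir)
    return 0 if latest is None else latest[0] + 1
```

A training loop would call `starting_epoch()` once at startup, load the matching checkpoint if one exists, and write a new `epoch-<N>.ckpt` file at the end of each epoch.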
Built-in algorithms vs custom containers:
| Approach | Use Case | Example |
|----------|---------|---------|
| Built-in algorithms | Common ML tasks, fast start | XGBoost, Linear Learner, K-Means, BlazingText |
| Script mode | Familiar framework (TF/PyTorch/sklearn), custom code | Bring your own training script |
| Custom container | Exotic runtime, custom dependencies | Custom C++ inference, specialized research |
| Pre-trained model (JumpStart) | Fine-tune foundation models | LLMs, BERT, ResNet |
| Endpoint Type | Latency | Payload Size | Use Case |
|--------------|---------|-------------|---------|
| Real-time endpoint | Synchronous, milliseconds | < 6 MB | Interactive APIs, recommendations, fraud detection |
| Serverless endpoint | Cold start possible | < 4 MB (request), < 20MB (model) | Infrequent traffic (cost savings, no idle cost) |
| Async endpoint | Minutes (result to S3) | Up to 1 GB | Large payloads, long processing (NLP, video) |
| Batch Transform | Offline, hours | Entire dataset | Offline scoring, pre-computation, bulk inference |
Async endpoint: the request is placed in a queue that SageMaker manages internally; the result is written to S3; success and error notifications can be published to SNS topics.
Batch Transform: no endpoint needed; input from S3; output to S3; best for periodic bulk scoring.
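The endpoint comparison above reduces to a small decision procedure. A minimal sketch, using the payload limits quoted in this section (check them against current AWS quotas before relying on them); the function name, its parameters, and the order of the checks are illustrative assumptions:

```python
MB = 1024 * 1024
GB = 1024 * MB


def choose_inference_option(
    payload_bytes: int,
    offline_bulk: bool = False,
    sporadic_traffic: bool = False,
) -> str:
    """Pick an inference option from payload size and traffic pattern."""
    if offline_bulk:
        # No caller is waiting on a response: score the dataset from S3.
        return "batch-transform"
    if sporadic_traffic and payload_bytes < 4 * MB:
        # Infrequent traffic and a small request: avoid paying for idle capacity.
        return "serverless"
    if payload_bytes < 6 * MB:
        return "real-time"
    if payload_bytes <= 1 * GB:
        # Too large for a synchronous call: queue it and read the result from S3.
        return "async"
    raise ValueError("payload exceeds the async limit; consider Batch Transform")
```

For example, `choose_inference_option(100 * MB)` selects the async endpoint because the payload is too large for a synchronous request.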
Multi-model endpoint (MME): host thousands of models on a single endpoint; SageMaker loads/unloads models from S3 to GPU/CPU memory dynamically. Cost-effective for many similar models.
Multi-container endpoint: run different models/containers on one endpoint; invoke a specific container. Use for A/B testing or ensemble inference.
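Both endpoint variants are addressed at invocation time through extra parameters of the SageMaker Runtime `InvokeEndpoint` call: `TargetModel` selects the model artifact on a multi-model endpoint, and `TargetContainerHostname` selects the container on a multi-container endpoint. A minimal sketch that only builds the request arguments; the helper name `build_invoke_args` and the example endpoint and artifact names are hypothetical:

```python
from typing import Any, Dict, Optional


def build_invoke_args(
    endpoint_name: str,
    body: bytes,
    content_type: str = "application/json",
    target_model: Optional[str] = None,
    target_container: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the sagemaker-runtime invoke_endpoint call."""
    args: Dict[str, Any] = {
        "EndpointName": endpoint_name,
        "Body": body,
        "ContentType": content_type,
    }
    if target_model is not None:
        # Multi-model endpoint: artifact path relative to the endpoint's S3 prefix.
        args["TargetModel"] = target_model
    if target_container is not None:
        # Multi-container endpoint: hostname of the container to invoke directly.
        args["TargetContainerHostname"] = target_container
    return args


# Hypothetical names, for illustration only.
mme_args = build_invoke_args(
    "demo-mme", b'{"x": 1}', target_model="customer-42.tar.gz"
)
```

With boto3, the resulting dictionary would be passed as `boto3.client("sagemaker-runtime").invoke_endpoint(**mme_args)`.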
Supported step types:
| Step Type | Purpose |
|-----------|---------|
| ProcessingStep | Data preprocessing, feature engineering, evaluation |
| TrainingStep | Model training job |
| TuningStep | Hyperparameter optimization (HPO) |
| TransformStep | Batch inference |
| RegisterModel | Register model version in Model Registry |
| ConditionStep | Branch pipeline based on evaluation metrics |
| CreateModelStep | Create SageMaker model from training artifacts |
| LambdaStep | Invoke Lambda function (custom logic) |
| ClarifyCheckStep | Bias/explainability analysis |
Example pipeline flow:
ProcessingStep (feature engineering)
↓
TrainingStep (train XGBoost)
↓
ProcessingStep (evaluate on test set)
↓
ConditionStep (accuracy > 0.9?)
├── Yes → RegisterModel (Approved)
└── No → RegisterModel (Rejected)
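The control flow of this example can be sketched in plain Python. This is a stand-in for the branching logic only, not SageMaker SDK code: each step is passed in as an ordinary callable, and the name `run_flow` and the shape of the metrics dictionary are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def run_flow(
    preprocess: Callable[[], Any],
    train: Callable[[Any], Any],
    evaluate: Callable[[Any, Any], Dict[str, float]],
    threshold: float = 0.9,
) -> Dict[str, Any]:
    """Run the steps in order and decide the registration status."""
    features = preprocess()             # ProcessingStep (feature engineering)
    model = train(features)             # TrainingStep
    metrics = evaluate(model, features)  # ProcessingStep (evaluation)
    # ConditionStep: strictly greater than the threshold, as in the diagram.
    status = "Approved" if metrics["accuracy"] > threshold else "Rejected"
    return {"model": model, "metrics": metrics, "approval_status": status}


result = run_flow(
    preprocess=lambda: [1, 2, 3],
    train=lambda features: {"weights": sum(features)},
    evaluate=lambda model, features: {"accuracy": 0.93},
)
```

In a real pipeline the same decision is made by a `ConditionStep` reading the evaluation report, and the model version is registered with the corresponding approval status.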
SageMaker Pipelines vs Step Functions:
Workflow:
Continuously monitors deployed endpoint data for:
| Monitor Type | What It Detects | Baseline |
|-------------|----------------|---------|
| Data quality | Feature distribution drift (input data statistics change) | Baseline from training data |
| Model quality | Accuracy/precision drift (compare predictions vs ground truth) | Baseline from training evaluation |
| Bias drift | Fairness metric changes (demographic parity, etc.) | Baseline from Clarify bias analysis |
| Feature attribution drift | SHAP value changes (important features changing) | Baseline from Clarify explainability analysis |
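To make the data quality row concrete, the idea of comparing live inputs against a training baseline can be sketched as follows. This is a deliberately simplified stand-in (mean shift measured in baseline standard deviations), not the statistics that Model Monitor itself computes; the function name and the `max_shift` threshold are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def drifted_features(
    baseline: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
    max_shift: float = 3.0,
) -> List[str]:
    """Return features whose current mean moved too far from the baseline mean."""
    flagged = []
    for name, base_values in baseline.items():
        if name not in current:
            flagged.append(name)  # a missing feature is itself a violation
            continue
        base_std = pstdev(base_values)
        shift = abs(mean(current[name]) - mean(base_values))
        if base_std == 0:
            # Constant baseline feature: any change at all counts as drift.
            if shift > 0:
                flagged.append(name)
        elif shift / base_std > max_shift:
            flagged.append(name)
    return flagged


baseline = {"age": [30, 32, 34, 36, 38], "income": [50, 52, 48, 51, 49]}
current = {"age": [31, 33, 35, 35, 37], "income": [90, 95, 92, 91, 94]}
violations = drifted_features(baseline, current)
```

Here only `income` is flagged: its mean moved far outside the baseline spread, while `age` stayed within it.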
Setup requirements:
| Store Type | Latency | Backed By | Best For |
|-----------|---------|----------|---------|
| Online store | Milliseconds | In-memory cache | Real-time inference (serving) |
| Offline store | Seconds-minutes | S3 (Parquet, Iceberg) | Model training, batch queries |
Feature reuse: compute features once, store them in Feature Store, and reuse them across multiple models and teams.
Point-in-time queries: the offline store supports time-travel queries (get feature values as of a specific timestamp), which prevents training/serving skew.
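The semantics of a point-in-time query can be illustrated with an in-memory stand-in for the offline store: for a given timestamp, return the most recent feature record written at or before it, and never a later one. The record layout and the name `features_as_of` are hypothetical; a real query would run against the offline store's tables in S3.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# (event_time, feature values), sorted by event_time in ascending order.
FeatureHistory = List[Tuple[float, Dict[str, float]]]


def features_as_of(
    history: FeatureHistory, as_of: float
) -> Optional[Dict[str, float]]:
    """Return the latest feature values with event_time <= as_of, if any."""
    event_times = [event_time for event_time, _ in history]
    # Index of the first record strictly after as_of.
    index = bisect_right(event_times, as_of)
    return None if index == 0 else history[index - 1][1]


history: FeatureHistory = [
    (100.0, {"avg_spend": 10.0}),
    (200.0, {"avg_spend": 25.0}),
    (300.0, {"avg_spend": 40.0}),
]
row = features_as_of(history, as_of=250.0)
```

A training example labeled at time 250 therefore sees `avg_spend` of 25.0, not the value of 40.0 that was only written later.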