ml-engineer-skill/SKILL.md
Expert in building scalable ML systems, from data pipelines and model training to production deployment and monitoring.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add 404kidwiz/claude-supercode-skills ml-engineer
Provides MLOps and production ML engineering expertise specializing in end-to-end ML pipelines, model deployment, and infrastructure automation. Bridges data science and production engineering with robust, scalable machine learning systems.
Need to serve predictions?
│
├─ Real-time (Low Latency)?
│ │
│ ├─ High Throughput? → **Kubernetes (KServe/Seldon)**
│ ├─ Low/Medium Traffic? → **Serverless (Lambda/Cloud Run)**
│ └─ Ultra-low latency (<10ms)? → **C++/Rust Inference Server (Triton)**
│
├─ Batch Processing?
│ │
│ ├─ Large Scale? → **Spark / Ray**
│ └─ Scheduled Jobs? → **Airflow / Prefect**
│
└─ Edge / Client-side?
  │
  ├─ Mobile? → **TFLite / CoreML**
  └─ Browser? → **TensorFlow.js / ONNX Runtime Web**
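The serving tree above can also be read as a small lookup function. The sketch below is only an illustration: the function name, the parameter names, the mode labels, and the choice to check the ultra-low-latency branch before throughput are assumptions, not part of the original tree.

```python
def choose_serving_stack(mode, *, high_throughput=False, large_scale=False,
                         latency_budget_ms=None, target=None):
    """Return the recommendation from the serving decision tree.

    mode is one of "real-time", "batch", or "edge" (hypothetical labels).
    """
    if mode == "real-time":
        # Assumption: the ultra-low-latency branch takes priority over throughput.
        if latency_budget_ms is not None and latency_budget_ms < 10:
            return "C++/Rust Inference Server (Triton)"
        if high_throughput:
            return "Kubernetes (KServe/Seldon)"
        return "Serverless (Lambda/Cloud Run)"
    if mode == "batch":
        # Assumption: anything that is not large scale is treated as a scheduled job.
        return "Spark / Ray" if large_scale else "Airflow / Prefect"
    if mode == "edge":
        if target == "mobile":
            return "TFLite / CoreML"
        if target == "browser":
            return "TensorFlow.js / ONNX Runtime Web"
        raise ValueError(f"unknown edge target: {target!r}")
    raise ValueError(f"unknown serving mode: {mode!r}")
```

For example, `choose_serving_stack("real-time", high_throughput=True)` follows the High Throughput branch, while passing `latency_budget_ms=5` follows the ultra-low-latency branch.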
Training Environment?
│
├─ Single Node?
│ │
│ ├─ Interactive? → **JupyterHub / SageMaker Notebooks**
│ └─ Automated? → **Docker Container on VM**
│
└─ Distributed?
  │
  ├─ Data Parallelism? → **Ray Train / PyTorch DDP**
  └─ Pipeline orchestration? → **Kubeflow / Airflow / Vertex AI**
| Need | Recommendation | Rationale |
|------|----------------|-----------|
| Simple / MVP | No Feature Store | Use SQL/Parquet files. Overhead of FS is too high. |
| Team Consistency | Feast | Open source, manages online/offline consistency. |
| Enterprise / Managed | Tecton / Hopsworks | Full governance, lineage, managed SLA. |
| Cloud Native | Vertex/SageMaker FS | Tight integration if already in that cloud ecosystem. |
Red Flags → Escalate to oracle:
Goal: Automate model training, validation, and registration using MLflow.
Steps:
Setup Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-prediction-prod")
Training Script (train.py)
from mlflow.models.signature import infer_signature
def train(max_depth, n_estimators):
    # X_train, X_test, y_train, y_test are assumed to be loaded at module level
    # before train() is called; the data-loading step is not shown here.
    with mlflow.start_run():
        # Log hyperparameters
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("n_estimators", n_estimators)
        # Train
        model = RandomForestClassifier(
            max_depth=max_depth,
            n_estimators=n_estimators,
            random_state=42,
        )
        model.fit(X_train, y_train)
        # Evaluate on the held-out set
        preds = model.predict(X_test)
        acc = accuracy_score(y_test, preds)
        prec = precision_score(y_test, preds)
        # Log metrics
        mlflow.log_metric("accuracy", acc)
        mlflow.log_metric("precision", prec)
        # Log the model artifact with a signature inferred from matching
        # inputs and outputs (preds were produced from X_test)
        signature = infer_signature(X_test, preds)
        mlflow.sklearn.log_model(
            model,
            "model",
            signature=signature,
            registered_model_name="churn-model",
        )
        print(f"Run ID: {mlflow.active_run().info.run_id}")
if __name__ == "__main__":
    train(max_depth=5, n_estimators=100)
Pipeline Orchestration (Bash/Airflow)
#!/bin/bash
# Run training
python train.py
# Check if model passed threshold (e.g. via MLflow API)
# If yes, transition to Staging
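The threshold check left as comments in the script above could be sketched in Python as follows. This is a minimal sketch under stated assumptions: `failed_checks`, `promote_if_passing`, and `DryRunClient` are hypothetical names, the threshold values are placeholders, and `client` is expected to be an `mlflow.tracking.MlflowClient` (only its `transition_model_version_stage` method is used). Recent MLflow releases mark model stages as deprecated in favor of aliases, so check the version in use.

```python
def failed_checks(metrics, thresholds):
    """Return the metrics that are missing or below their minimum threshold."""
    return {
        name: metrics.get(name)
        for name, minimum in thresholds.items()
        if metrics.get(name) is None or metrics[name] < minimum
    }


class DryRunClient:
    """Hypothetical stand-in that records transitions instead of calling MLflow."""

    def __init__(self):
        self.transitions = []

    def transition_model_version_stage(self, name, version, stage):
        self.transitions.append((name, version, stage))


def promote_if_passing(client, model_name, version, metrics, thresholds):
    """Move a model version to Staging only if every threshold is met."""
    failures = failed_checks(metrics, thresholds)
    if failures:
        print(f"Not promoting {model_name} v{version}; failed checks: {failures}")
        return False
    client.transition_model_version_stage(
        name=model_name, version=version, stage="Staging"
    )
    return True
```

With a real tracking server, the metrics dictionary could come from `MlflowClient().get_run(run_id).data.metrics`, using the run ID printed by train.py. `DryRunClient` lets the gate be exercised in CI without a server.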
Goal: Detect if production data distribution has shifted from training data.
Steps:
Baseline Generation (During Training)
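# These import paths follow the Report/metric_preset API used by older Evidently
# releases; newer releases may expose these classes under different paths.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
# Build a baseline drift report: training data is the reference, the held-out test set is "current".
# train_df and test_df are assumed to be pandas DataFrames with the same columns.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=test_df)
report.save_json("baseline_drift.json")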
Production Monitoring Job
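# Scheduled daily job
# load_production_logs, load_training_data, trigger_alert, and trigger_retraining
# are placeholders for project-specific helpers.
def check_drift():
    # Load production logs (last 24h) and the training-time reference data
    current_data = load_production_logs()
    reference_data = load_training_data()
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference_data, current_data=current_data)
    result = report.as_dict()
    # The first metric in the preset carries the dataset-level drift flag
    dataset_drift = result['metrics'][0]['result']['dataset_drift']
    if dataset_drift:
        trigger_alert("Data Drift Detected!")
        trigger_retraining()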
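To make the idea of "distribution shift" concrete for a single numeric feature, here is a dependency-free sketch of the Population Stability Index (PSI). This is a swapped-in illustration, not what the Evidently preset computes by default; the function name, the bin count, and the `eps` floor for empty bins are assumptions.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two numeric samples, using equal-width bins over the reference range."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the reference range
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is defined
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the cut-offs should be tuned per feature.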
Goal: Build a production retrieval pipeline using Pinecone/Weaviate and LangChain.
Steps:
Ingestion (Chunking & Embedding)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
# Chunking (raw_documents is assumed to be a list of LangChain Document objects loaded earlier)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(raw_documents)
# Embedding & Indexing
embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore.from_documents(
    docs,
    embeddings,
    index_name="knowledge-base"
)
Retrieval & Generation
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)
response = qa_chain.invoke("How do I reset my password?")
print(response['result'])
Optimization (Hybrid Search)
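Hybrid search generally means merging a keyword (sparse) result list with a vector (dense) result list. One common way to merge them is Reciprocal Rank Fusion (RRF), sketched below in plain Python. This is an assumption about how the step could be done, not the method this skill prescribes; the function name, the `k=60` constant, and the document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for determinism
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical result lists from a keyword index and a vector index
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits], top_n=4)
```

Documents that appear in both lists ("a" and "b" here) rise to the top, which is the behavior hybrid search is after: exact keyword matches and semantic matches reinforce each other.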
What it looks like:
Why it fails:
Correct approach:
What it looks like:
.pkl file to an engineer.
Why it fails:
Correct approach:
What it looks like:
200 OK but prediction is garbage because input data was corrupted (e.g., all Nulls).
0 for everything.
Why it fails:
Correct approach:
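One way to avoid returning 200 OK for corrupted input is to validate the payload before it reaches the model. The following is a minimal sketch under stated assumptions: `find_input_problems`, the field names, and the `max_null_fraction` default are hypothetical, and a real service would likely pair this with a schema library and an alert.

```python
import math


def _is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_input_problems(features, required_fields, max_null_fraction=0.5):
    """Return a list of human-readable problems; an empty list means the input is usable."""
    problems = []
    missing = [f for f in required_fields if f not in features]
    if missing:
        problems.append(f"missing fields: {missing}")
    present = [f for f in required_fields if f in features]
    nulls = [f for f in present if _is_null(features[f])]
    if present and len(nulls) / len(present) > max_null_fraction:
        problems.append(f"too many null values: {nulls}")
    return problems
```

A prediction handler could call this first and respond with a 4xx status (and log the problems) whenever the list is non-empty, instead of scoring the request.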
Reliability:
/health endpoint implemented (liveness/readiness).
Performance:
Reproducibility:
requirements.txt / conda.yaml).
Monitoring: