.claude/skills/ai-ml-expert/SKILL.md
AI and ML expert covering PyTorch, TensorFlow, Hugging Face, scikit-learn, LLM integration, RAG pipelines, MLOps, and production ML systems
When reviewing or writing PyTorch code, apply these guidelines:
- Use `torch.nn.Module` for all model definitions; avoid raw function-based models
- Move models and tensors to the target device explicitly: `model.to(device)`, `tensor.to(device)`
- Use `model.train()` and `model.eval()` context switches appropriately
- Call `optimizer.zero_grad()` at the top of the training loop
- Use `torch.no_grad()` or `@torch.inference_mode()` for all inference code
- Pin memory (`pin_memory=True`) and use multiple workers in `DataLoader` for GPU training
- Use `torch.compile()` (PyTorch 2.x) for production inference speedups
- Prefer `F.cross_entropy` over manual softmax + `NLLLoss` (numerically stable)

When reviewing or writing TensorFlow code, apply these guidelines:
- Prefer `tf.data.Dataset` pipelines over manual batching for scalability
- Use `tf.function` for graph execution on performance-critical paths
- Enable mixed precision with `tf.keras.mixed_precision.set_global_policy('mixed_float16')`
- Use `tf.saved_model` for portable model export; avoid pickling

When reviewing or writing Hugging Face code, apply these guidelines:
- Set `padding=True` and `truncation=True` when tokenizing batches
- Use `AutoModel`, `AutoTokenizer`, and `AutoConfig` for checkpoint portability
- Call `model.gradient_checkpointing_enable()` to reduce memory for large models
- Use the `Trainer` API for standard fine-tuning; use custom loops only when `Trainer` is insufficient
- Set the `TRANSFORMERS_CACHE` environment variable in CI/CD pipelines

When reviewing or writing scikit-learn code, apply these guidelines:
- Use `Pipeline` to chain preprocessing and model steps; prevents data leakage
- Use `StratifiedKFold` for classification tasks with class imbalance
- Use `GridSearchCV` or `RandomizedSearchCV` for hyperparameter tuning
- Call `.fit()` only on training data; transform test data with the fitted transformer
- Persist models with `joblib.dump` / `joblib.load` (faster than pickle for large arrays)

When integrating LLMs, apply these guidelines:
- Use chain-of-thought prompting (`"Think step by step..."`) for complex reasoning tasks
- Set `temperature=0` for deterministic, fact-based outputs; increase for creative tasks

# Standard RAG pipeline components
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS # or Chroma, Pinecone, Weaviate
from langchain.chains import RetrievalQA
# 1. Embed and index documents
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectorstore = FAISS.from_documents(documents, embeddings)
# 2. Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# 3. Generate with retrieved context
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
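The retrieval step above (`k=4`) can be pictured as a nearest-neighbour search over embedding vectors. The following is a brute-force sketch under stated assumptions: it uses cosine similarity on random toy vectors, and `top_k_cosine` is a hypothetical helper, not how FAISS or LangChain implement search internally.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 4):
    """Return indices and scores of the k documents most similar to the query."""
    # Normalise so that the dot product equals cosine similarity
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = doc_unit @ query_unit
    top = np.argsort(-scores)[:k]  # highest similarity first
    return top, scores[top]


# Toy data (assumption): 10 "documents" with 16-dimensional embeddings
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10, 16))
# A query that is a slightly perturbed copy of document 3
query_vec = doc_vecs[3] + 0.01 * rng.normal(size=16)

indices, scores = top_k_cosine(query_vec, doc_vecs, k=4)
print(indices, scores.round(3))
```

A vector store replaces the exhaustive scan with an index, but the contract is the same: a query vector goes in, the `k` closest chunks come out.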
RAG best practices:
- Use LCEL (LangChain Expression Language) for composable chains
- Use `RunnableParallel` for concurrent retrieval steps
- Use `LangGraph` for stateful multi-agent workflows with cycles
- Use `RunnableRetry` for unreliable external calls

# Standard PyTorch training loop with best practices
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs, labels = batch["input_ids"].to(device), batch["labels"].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()  # per-step schedule; move after the inner loop for per-epoch schedulers
    # Validation loop
    model.eval()
    with torch.no_grad():
        for batch in val_dataloader:
            # evaluate...
            pass
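The validation loop above leaves the evaluation step as a comment. One way it might be filled in is sketched below, assuming a classifier that returns logits and a dataloader that yields `(inputs, labels)` tuples. The `evaluate` helper and the toy model and data are hypothetical, for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model: nn.Module, dataloader: DataLoader, device: torch.device) -> dict:
    """Return mean loss and accuracy over a dataloader (hypothetical helper)."""
    model.eval()  # disable dropout, use running batch-norm statistics
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():  # no autograd graph is built during evaluation
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)
            # Sum per-sample losses so the mean stays exact with a smaller last batch
            total_loss += F.cross_entropy(logits, labels, reduction="sum").item()
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return {"loss": total_loss / seen, "accuracy": correct / seen}


# Toy usage (assumptions): random data and an untrained linear classifier
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = nn.Linear(8, 3).to(device)
toy_data = TensorDataset(torch.randn(50, 8), torch.randint(0, 3, (50,)))
toy_loader = DataLoader(toy_data, batch_size=16)

metrics = evaluate(toy_model, toy_loader, device)
print(metrics)
```

Returning a dictionary keeps the result easy to pass to an experiment tracker and to compare across epochs when selecting a checkpoint.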
Key standards:
- Clip gradients (`max_norm=1.0`) for stability in Transformer training
- Apply label smoothing (`smoothing=0.1`) to reduce overconfidence

from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,  # LoRA rank
    lora_alpha=32,  # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters() # verify < 1% parameters trainable
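The "< 1% trainable" check can be estimated before loading any weights. For a linear layer of shape `d_out x d_in`, LoRA adds two matrices, `A` (`r x d_in`) and `B` (`d_out x r`), so `r * (d_in + d_out)` parameters. The sketch below works through that arithmetic. The hidden size, layer count, square projection shapes and 7B base size are assumptions chosen for illustration, not properties of any specific model.

```python
def lora_params_per_linear(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed architecture (hypothetical values for illustration)
hidden_size = 4096
num_layers = 32
base_params = 7_000_000_000

# Values taken from the LoraConfig above
rank = 16
lora_alpha = 32
modules_per_layer = 2  # q_proj and v_proj

per_module = lora_params_per_linear(hidden_size, hidden_size, rank)
total_lora = per_module * modules_per_layer * num_layers
fraction = total_lora / (base_params + total_lora)
scaling = lora_alpha / rank  # factor applied to the low-rank update

print(f"{total_lora:,} trainable LoRA parameters ({fraction:.2%} of all parameters)")
# prints: 8,388,608 trainable LoRA parameters (0.12% of all parameters)
print(f"update scaling: {scaling}")
```

Doubling `r` doubles the adapter size, which is the capacity versus memory trade-off behind the choice of rank.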
PEFT guidelines:
- Choose a LoRA rank from `r=8` to `r=64`; higher rank = more capacity, more memory

import mlflow
with mlflow.start_run():
    mlflow.log_params({"learning_rate": lr, "batch_size": bs, "epochs": epochs})
    mlflow.log_metrics({"train_loss": loss, "val_accuracy": acc}, step=epoch)
    mlflow.pytorch.log_model(model, "model")
import wandb
wandb.init(project="my-project", config={"lr": 1e-4, "epochs": 10})
wandb.log({"train_loss": loss, "val_f1": f1_score})
wandb.finish()
MLOps standards:
- Fix all random seeds: `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)`

| Task                  | Primary Metrics                      | Secondary Metrics         |
| --------------------- | ------------------------------------ | ------------------------- |
| Binary Classification | AUC-ROC, F1, Precision/Recall        | Calibration (Brier Score) |
| Multi-class           | Macro F1, Weighted F1, Cohen's Kappa | Confusion Matrix          |
| Regression            | RMSE, MAE, R²                        | Residual Analysis         |
| NLP Generation        | BLEU, ROUGE, BERTScore               | Human Evaluation          |
| Ranking/Retrieval     | NDCG@k, MRR, MAP                     | Hit Rate@k                |
| LLM Evaluation        | LLM-as-judge, exact match, pass@k    | Hallucination Rate        |
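The ranking metrics in the table are less commonly hand-checked than F1 or RMSE, so a small worked example may help. This sketch computes the reciprocal rank and NDCG@k for one query using only the standard library. It assumes linear gain (`rel`, not `2**rel - 1`) and takes the ideal ordering from the retrieved list itself; the helper names are hypothetical.

```python
import math


def reciprocal_rank(relevances: list) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg_at_k(relevances: list, k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: list, k: int) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of the results in ranked order for one query (toy data)
ranked_relevance = [0, 1, 0, 1]

rr = reciprocal_rank(ranked_relevance)  # first relevant hit is at rank 2
ndcg = ndcg_at_k(ranked_relevance, k=4)
print(f"RR={rr:.3f}, NDCG@4={ndcg:.3f}")
# prints: RR=0.500, NDCG@4=0.651
```

MRR is the mean of `reciprocal_rank` over all evaluation queries; NDCG@k is averaged the same way.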
- Export to ONNX with `torch.onnx.export(model, ...)`
- Apply dynamic quantization with `torch.quantization.quantize_dynamic(model, ...)`

# Example: data drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
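To make concrete what a drift check compares, here is a library-free sketch of one common statistic, the Population Stability Index (PSI), for a single numeric feature. This is a swapped-in illustration and not what Evidently's `DataDriftPreset` computes. The function name, the bin count and the toy distributions are assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample and a current sample of one numeric feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from reference quantiles, so each bin holds ~1/bins of the reference
    interior_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(interior_edges, reference, side="right"), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(interior_edges, current, side="right"), minlength=bins)
    # Clip to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / reference.size, eps, None)
    cur_frac = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Toy data (assumption): production values either match or are shifted by one standard deviation
rng = np.random.default_rng(0)
reference_values = rng.normal(0.0, 1.0, 5000)
same_values = rng.normal(0.0, 1.0, 5000)
shifted_values = rng.normal(1.0, 1.0, 5000)

psi_same = population_stability_index(reference_values, same_values)
psi_shifted = population_stability_index(reference_values, shifted_values)
print(f"PSI same distribution: {psi_same:.4f}, shifted: {psi_shifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but alert thresholds should be tuned per feature.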
Monitoring standards:
# Proper train/test split to avoid leakage
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify for classification
)
# Fit scaler ONLY on training data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # transform only, never fit_transform
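The same fit-on-train-only rule has to hold inside cross-validation, where it is easy to scale the full dataset before splitting by mistake. Wrapping the scaler and model in a `Pipeline` makes each fold refit the scaler on its own training portion. The sketch below assumes a synthetic imbalanced dataset and a logistic regression model, both chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): roughly 80/20 class imbalance
X, y = make_classification(n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42)

# The scaler is refit inside every fold, so validation rows never influence its statistics
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
```

The fitted pipeline can also be saved as a single object, which keeps preprocessing and model in sync at inference time.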
Standards:
- Fix all random seeds with `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)` and log them via MLflow/W&B.
- Only `.transform()` the test set; fitting on test data causes data leakage and inflated performance estimates.

| Anti-Pattern                   | Problem                           | Fix                                                  |
| ------------------------------ | --------------------------------- | ---------------------------------------------------- |
| Ignoring class imbalance       | Model biased to majority class    | Stratified sampling, class weights, SMOTE            |
| No validation set              | Overfitting undetected            | Hold out 10-20% for validation                       |
| Optimizing a single metric     | Missing failure modes             | Multiple metrics (precision, recall, F1, AUC)        |
| No baseline comparison         | Cannot assess model quality       | Establish heuristic baseline before ML               |
| Accuracy on imbalanced data    | Misleading performance estimate   | Use F1, precision-recall curve, ROC-AUC              |
| Data leakage (test in train)   | Inflated performance estimates    | Fit on train only; transform test with fitted object |
| No error analysis              | Cannot improve strategically      | Analyze failure cases by error type                  |
| Training without checkpoints   | Lost progress on failure          | Save best model by validation metric                 |
| Mutable global random state    | Non-reproducible experiments      | Fix all seeds; log in experiment metadata            |
| Embedding model in application | Cannot update model independently | Serve model via API (REST, gRPC)                     |
| No latency budget              | Inference too slow for production | Profile and set SLO before deployment                |
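The "accuracy on imbalanced data" row is easy to demonstrate with numbers. In this sketch the labels and the always-negative "model" are toy assumptions: a classifier that never predicts the minority class still scores 95% accuracy while finding none of the positives.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels (assumption): 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0.0 without a warning when no positives are predicted
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# prints: accuracy=0.95, recall=0.00, f1=0.00
```

Reporting recall and F1 next to accuracy exposes the failure that accuracy alone hides.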
</instructions>
<examples>
Training a Transformer classifier:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

def tokenize(batch):
    # Pad and truncate each batch so sequences share one length (at most 512 tokens)
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

# `dataset` is assumed to be a DatasetDict with "train" and "validation" splits
dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # must match a key returned by compute_metrics
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
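The example passes `compute_metrics` to `Trainer` without defining it. A minimal sketch of such a function follows, assuming the model outputs one logit per class and that macro-averaged F1 is the intended `"f1"`. The averaging choice and the toy arrays are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    """Turn (logits, labels) from an evaluation pass into named metrics."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {
        "accuracy": accuracy_score(labels, predictions),
        # Macro averaging weights every class equally (an assumption for this sketch)
        "f1": f1_score(labels, predictions, average="macro"),
    }


# Toy check with hand-made logits for 4 examples and 3 classes
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 3.0],
    [1.0, 0.9, 0.2],
])
toy_labels = np.array([0, 1, 2, 1])

result = compute_metrics((toy_logits, toy_labels))
print(result)
```

The `"f1"` key is what `metric_for_best_model="f1"` refers to when the best checkpoint is selected.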
Minimal RAG pipeline:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(ChatOpenAI(model="gpt-4o"), retriever=retriever)
answer = qa.run("What is the refund policy?")
</examples>
This skill is used by:
- developer: Implements ML models, data pipelines, and LLM integrations
- researcher: Investigates novel architectures and evaluates research papers
- architect: Designs ML system architecture and deployment topology
- security-architect: Reviews data privacy, model security, and inference safety
- python-backend-expert: NumPy, Pandas, async Python patterns
- code-analyzer: Static analysis and complexity metrics for ML code
- debugging: Systematic debugging for training failures and inference errors

Before starting:
cat .claude/context/memory/learnings.md
Check for:
After completing:
- .claude/context/memory/learnings.md
- .claude/context/memory/issues.md
- .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.