devops/ai/model-registry-governance/SKILL.md
Establish model registry standards, governance controls, metadata schemas, approvals, and lifecycle policies for enterprise AI deployments.
npx skillsauth add bagelhole/devops-security-agent-skills model-registry-governanceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Create a trustworthy system of record for model artifacts, prompts, adapters, and evaluation evidence.
# Install MLflow with required backends
pip install mlflow[extras] psycopg2-binary boto3
# Start MLflow tracking server with PostgreSQL backend and S3 artifact store
mlflow server \
--backend-store-uri postgresql://mlflow:password@db:5432/mlflow \
--default-artifact-root s3://mlflow-artifacts/models \
--host 0.0.0.0 \
--port 5000 \
--serve-artifacts
# docker-compose.yaml for MLflow
services:
mlflow:
image: ghcr.io/mlflow/mlflow:2.12.0
command: >
mlflow server
--backend-store-uri postgresql://mlflow:${DB_PASSWORD}@db:5432/mlflow
--default-artifact-root s3://mlflow-artifacts/models
--host 0.0.0.0
--port 5000
--serve-artifacts
ports:
- "5000:5000"
environment:
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
depends_on:
- db
db:
image: postgres:16-alpine
environment:
POSTGRES_DB: mlflow
POSTGRES_USER: mlflow
POSTGRES_PASSWORD: ${DB_PASSWORD}
volumes:
- pgdata:/var/lib/postgresql/data
volumes:
pgdata:
# model_metadata_schema.py
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
from enum import Enum
class LifecycleState(str, Enum):
DRAFT = "draft"
CANDIDATE = "candidate"
APPROVED = "approved"
DEPRECATED = "deprecated"
RETIRED = "retired"
class RiskRating(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
CRITICAL = "critical"
class ModelMetadata(BaseModel):
"""Required metadata for every registered model."""
# Identity
name: str = Field(description="Model name matching registry key")
version: str = Field(description="Semantic version")
checksum: str = Field(description="SHA-256 of model artifact")
storage_uri: str = Field(description="Artifact store path")
# Lineage
base_model: str = Field(description="Parent model identifier")
fine_tune_method: Optional[str] = Field(default=None)
training_dataset: Optional[str] = Field(default=None)
training_date: Optional[datetime] = Field(default=None)
source_commit: str = Field(description="Git SHA of training code")
# Evaluation
eval_datasets: List[str] = Field(description="Evaluation dataset IDs")
eval_report_uri: str = Field(description="Path to evaluation results")
quality_score: float = Field(ge=0, le=1)
safety_score: float = Field(ge=0, le=1)
# Governance
license: str = Field(description="SPDX license identifier")
allowed_use_cases: List[str]
prohibited_use_cases: List[str]
risk_rating: RiskRating
security_controls: List[str]
# Ownership
owner: str = Field(description="Primary owner email")
backup_owner: str = Field(description="Backup owner email")
escalation_contact: str
team: str
# Lifecycle
state: LifecycleState = LifecycleState.DRAFT
created_at: datetime = Field(default_factory=datetime.utcnow)
approved_at: Optional[datetime] = None
approved_by: Optional[str] = None
expires_at: Optional[datetime] = None
# register_model.py
import mlflow
from mlflow.tracking import MlflowClient
import json
import hashlib
def register_model(
model_path: str,
model_name: str,
metadata: dict,
mlflow_uri: str = "http://mlflow:5000"
):
"""Register a model with full metadata and governance tags."""
mlflow.set_tracking_uri(mlflow_uri)
client = MlflowClient()
# Compute artifact checksum
with open(model_path, "rb") as f:
checksum = hashlib.sha256(f.read()).hexdigest()
metadata["checksum"] = checksum
# Log model with metadata
with mlflow.start_run(run_name=f"register-{model_name}-{metadata['version']}") as run:
# Log all metadata as params
mlflow.log_params({
"model_name": model_name,
"version": metadata["version"],
"base_model": metadata["base_model"],
"risk_rating": metadata["risk_rating"],
"owner": metadata["owner"],
"license": metadata["license"],
})
# Log quality metrics
mlflow.log_metrics({
"quality_score": metadata["quality_score"],
"safety_score": metadata["safety_score"],
})
# Log full metadata as artifact
with open("metadata.json", "w") as f:
json.dump(metadata, f, indent=2, default=str)
mlflow.log_artifact("metadata.json")
# Log model artifact
mlflow.log_artifact(model_path)
# Register in model registry
model_uri = f"runs:/{run.info.run_id}/model"
result = mlflow.register_model(model_uri, model_name)
# Set lifecycle tags
client.set_model_version_tag(
model_name, result.version, "state", "draft"
)
client.set_model_version_tag(
model_name, result.version, "risk_rating", metadata["risk_rating"]
)
client.set_model_version_tag(
model_name, result.version, "checksum", checksum
)
return result
# promote_model.py
import mlflow
from mlflow.tracking import MlflowClient
from datetime import datetime
import sys
def promote_model(
model_name: str,
version: str,
target_stage: str,
approver: str,
mlflow_uri: str = "http://mlflow:5000"
):
"""Promote a model version after governance checks pass."""
mlflow.set_tracking_uri(mlflow_uri)
client = MlflowClient()
# Verify current state allows promotion
mv = client.get_model_version(model_name, version)
current_state = mv.tags.get("state", "draft")
valid_transitions = {
"draft": ["candidate"],
"candidate": ["approved", "draft"],
"approved": ["deprecated"],
"deprecated": ["retired"],
}
if target_stage not in valid_transitions.get(current_state, []):
raise ValueError(
f"Invalid transition: {current_state} -> {target_stage}. "
f"Allowed: {valid_transitions.get(current_state, [])}"
)
# Verify required eval scores for production promotion
if target_stage == "approved":
run = client.get_run(mv.run_id)
quality = float(run.data.metrics.get("quality_score", 0))
safety = float(run.data.metrics.get("safety_score", 0))
if quality < 0.85:
raise ValueError(f"Quality score {quality} below threshold 0.85")
if safety < 0.95:
raise ValueError(f"Safety score {safety} below threshold 0.95")
# Record promotion
now = datetime.utcnow().isoformat()
client.set_model_version_tag(model_name, version, "state", target_stage)
client.set_model_version_tag(model_name, version, f"promoted_to_{target_stage}_at", now)
client.set_model_version_tag(model_name, version, f"promoted_to_{target_stage}_by", approver)
# Transition MLflow stage alias
stage_map = {
"candidate": "Staging",
"approved": "Production",
"deprecated": "Archived",
}
if target_stage in stage_map:
client.transition_model_version_stage(
model_name, version, stage_map[target_stage]
)
print(f"Model {model_name} v{version}: {current_state} -> {target_stage}")
print(f"Approved by: {approver} at {now}")
if __name__ == "__main__":
promote_model(
model_name=sys.argv[1],
version=sys.argv[2],
target_stage=sys.argv[3],
approver=sys.argv[4],
)
| State | Description | Serving Allowed | New Usage |
|-------|-------------|-----------------|-----------|
| draft | Internal experimentation | Dev only | Dev only |
| candidate | Passed baseline tests | Staging | Staging |
| approved | Authorized for production | All environments | Yes |
| deprecated | Replacement announced | Existing only | Blocked |
| retired | Archived for audit | None | None |
# lifecycle_policy.py
from mlflow.tracking import MlflowClient
from datetime import datetime, timedelta
def enforce_lifecycle_policies(mlflow_uri: str = "http://mlflow:5000"):
"""Run periodic lifecycle enforcement."""
client = MlflowClient()
for rm in client.search_registered_models():
for mv in client.search_model_versions(f"name='{rm.name}'"):
tags = mv.tags
state = tags.get("state", "draft")
# Auto-deprecate models with expired approvals (90 days)
if state == "approved":
approved_at = tags.get("promoted_to_approved_at")
if approved_at:
approved_date = datetime.fromisoformat(approved_at)
if datetime.utcnow() - approved_date > timedelta(days=90):
print(f"Auto-deprecating {rm.name} v{mv.version}: approval expired")
client.set_model_version_tag(rm.name, mv.version, "state", "deprecated")
client.set_model_version_tag(
rm.name, mv.version, "auto_deprecated_reason", "approval_expired"
)
# Auto-retire deprecated models after 30 days
if state == "deprecated":
deprecated_at = tags.get("promoted_to_deprecated_at")
if deprecated_at:
deprecated_date = datetime.fromisoformat(deprecated_at)
if datetime.utcnow() - deprecated_date > timedelta(days=30):
print(f"Auto-retiring {rm.name} v{mv.version}")
client.set_model_version_tag(rm.name, mv.version, "state", "retired")
client.transition_model_version_stage(
rm.name, mv.version, "Archived"
)
# Flag drafts with no activity for 14 days
if state == "draft":
created = datetime.fromisoformat(mv.creation_timestamp / 1000)
if datetime.utcnow() - created > timedelta(days=14):
print(f"Stale draft: {rm.name} v{mv.version}")
# policy/model_governance.rego
package model.governance
# Reject artifacts without SBOM
deny[msg] {
not input.metadata.sbom_uri
msg := "Model must include SBOM artifact URI"
}
# Block promotion if critical CVEs remain
deny[msg] {
input.target_state == "approved"
input.security_scan.critical_cves > 0
msg := sprintf("Cannot promote: %d critical CVEs unresolved", [input.security_scan.critical_cves])
}
# Require refreshed evals after prompt changes
deny[msg] {
input.target_state == "approved"
input.prompt_changed
not input.eval_refreshed_after_prompt_change
msg := "Evaluation must be re-run after prompt template changes"
}
# Require minimum eval scores for production
deny[msg] {
input.target_state == "approved"
input.metadata.quality_score < 0.85
msg := sprintf("Quality score %.2f below threshold 0.85", [input.metadata.quality_score])
}
# Require dual approval for high-risk models
deny[msg] {
input.target_state == "approved"
input.metadata.risk_rating == "high"
count(input.approvals) < 2
msg := "High-risk models require at least 2 approvals"
}
Maintain immutable records of:
| Issue | Diagnosis | Resolution | |-------|-----------|------------| | Model registration fails | Check MLflow server connectivity and artifact store permissions | Verify S3/GCS credentials and bucket policy | | Promotion blocked by policy | Review OPA deny messages in CI output | Fix metadata gaps or request policy exception | | Stale models not auto-retiring | Lifecycle cron job not running | Check CronJob status in Kubernetes | | Duplicate model versions | Race condition in CI pipeline | Add locking via registry API or database | | Missing eval evidence | Eval pipeline skipped or failed | Re-run eval suite and re-register |
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.