pjt222/run-ab-test-models

Name: run-ab-test-models
Author: pjt222

i18n/de/skills/run-ab-test-models/SKILL.md

npx skillsauth add pjt222/agent-almanac run-ab-test-models

Run A/B Tests for Models

See Extended Examples for complete configuration files and templates.

Run controlled experiments that compare model versions using traffic splitting and statistical analysis.

When to Use

Deploying a new model version and wanting to validate the improvement before full rollout
Comparing multiple candidate models trained with different algorithms or features
Testing the impact of hyperparameter changes on business metrics
Measuring model performance in production without risking full traffic
Meeting regulatory requirements for gradual rollout (e.g., medical ML systems)
Evaluating cost-performance tradeoffs between model sizes

Inputs

Required: Champion model (current production version)
Required: Challenger model(s) (new version to test)
Required: Traffic allocation percentage (e.g., 5% to challenger)
Required: Success metrics (business and ML metrics)
Required: Minimum sample size or test duration
Optional: Guardrail metrics (latency, error rate thresholds)
Optional: User segments for stratified testing

Procedure

Step 1: Design the Experiment

Define test parameters, success criteria, and statistical requirements.

# ab_test/experiment_config.py
from dataclasses import dataclass
from typing import List, Dict
import numpy as np
from scipy.stats import norm


@dataclass
# ... (see EXAMPLES.md for complete implementation)

Expected: Experiment configuration with a statistically sound sample size calculation, typically 5-10k samples per variant for a 5-10% minimum detectable effect (MDE).

On failure: If the required sample size is too large, increase traffic allocation, extend the test duration, or accept a larger MDE; verify that the baseline metric estimate is accurate; consider sequential testing for continuous monitoring.
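
As a rough illustration of the sample size calculation this step calls for, the sketch below applies the standard two-proportion formula for a two-sided test. The class name `ExperimentConfig`, its fields, and the defaults (alpha 0.05, power 0.80) are assumptions made for illustration and are not taken from EXAMPLES.md. It uses the standard library's `NormalDist` rather than `scipy.stats.norm` so that it runs without extra dependencies.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass
class ExperimentConfig:
    """Hypothetical experiment settings for a conversion-style metric."""
    baseline_rate: float   # champion's current success rate, e.g. 0.10
    relative_mde: float    # smallest relative lift worth detecting, e.g. 0.10
    alpha: float = 0.05    # two-sided significance level (assumed default)
    power: float = 0.80    # desired statistical power (assumed default)

    def sample_size_per_variant(self) -> int:
        """Samples needed in EACH variant for a two-proportion z-test."""
        p1 = self.baseline_rate
        p2 = p1 * (1 + self.relative_mde)
        if not (0 < p1 < 1 and 0 < p2 < 1 and self.relative_mde > 0):
            raise ValueError("rates must lie in (0, 1) and the MDE must be positive")
        z_alpha = NormalDist().inv_cdf(1 - self.alpha / 2)
        z_beta = NormalDist().inv_cdf(self.power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
        return math.ceil(n)
```

With a 10% baseline rate and a 10% relative MDE this comes out to roughly 15k samples per variant. The result depends strongly on the baseline rate, so metrics with a low baseline need considerably more traffic.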

Step 2: Implement Traffic Splitting

Set up routing logic to randomly assign requests to models.

# ab_test/traffic_router.py
import hashlib
import random
from typing import Dict, Optional
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Expected: Consistent user-to-variant assignment, an accurate traffic split matching the configured percentages, and all assignments logged for analysis.

On failure: Verify that the hash function produces a uniform distribution (test with 10k user IDs), check that user_id is stable across requests (not session_id), ensure logs capture all prediction events, and validate the traffic split in the first 1000 requests.
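
One common way to get consistent user-to-variant assignment is to hash a stable user ID together with an experiment ID and map the digest to a bucket. The sketch below assumes that approach. The function name `assign_variant`, the 10,000-bucket resolution, and the `experiment_id:user_id` key format are hypothetical choices, not the router from EXAMPLES.md.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, challenger_pct: float) -> str:
    """Deterministically map a user to 'champion' or 'challenger'.

    Including experiment_id in the hashed key keeps assignments independent
    across experiments. challenger_pct is a fraction, e.g. 0.05 for 5%.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an integer and reduce to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "challenger" if bucket < round(challenger_pct * 10_000) else "champion"


def challenger_share(n_users: int, experiment_id: str, challenger_pct: float) -> float:
    """Uniformity check: observed challenger share over synthetic user IDs."""
    hits = sum(
        assign_variant(f"user-{i}", experiment_id, challenger_pct) == "challenger"
        for i in range(n_users)
    )
    return hits / n_users
```

Calling `challenger_share(10_000, "exp-1", 0.05)` is one way to run the 10k-ID uniformity check; the observed share should land close to the configured 5%.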

Step 3: Implement Shadow Deployment (Optional)

Run the challenger model in parallel without affecting users (shadow mode).

# ab_test/shadow_deployment.py
import asyncio
from typing import Dict, Any
import logging
from concurrent.futures import ThreadPoolExecutor
import time

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Expected: Champion predictions served with normal latency, challenger predictions logged asynchronously without blocking, and prediction differences captured for analysis.

On failure: Set the challenger timeout below the champion SLA to avoid blocking, handle challenger errors gracefully without affecting the champion, monitor memory usage (two models are loaded), and consider sampling (log only 10% of shadow predictions).

Step 4: Collect and Analyze Metrics

Collect experiment data and perform statistical tests.

# ab_test/analysis.py
import pandas as pd
import numpy as np
from scipy import stats
from typing import Dict, Tuple
import logging

logger = logging.getLogger(__name__)
# ... (see EXAMPLES.md for complete implementation)

Expected: Statistical test results with p-values, confidence intervals, and a clear decision (rollout/keep/inconclusive), typically after 7-14 days or once the sample size is reached.

On failure: Verify that ground truth labels are available (analysis may need to be delayed), check for sample ratio mismatch (SRM) indicating assignment bugs, ensure a sufficient sample size has been reached, look for novelty/primacy effects in early data, and consider sequential testing if a fixed-horizon test is too slow.
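
For a binary success metric, the analysis could look something like the two-proportion z-test below. This is a sketch under stated assumptions: the function names `two_proportion_ztest` and `decide` are hypothetical, the test is two-sided, and the confidence interval uses the unpooled standard error. The decision labels follow the rollout/keep/inconclusive wording above.

```python
import math
from statistics import NormalDist
from typing import Dict


def two_proportion_ztest(
    conv_champion: int, n_champion: int,
    conv_challenger: int, n_challenger: int,
    alpha: float = 0.05,
) -> Dict[str, float]:
    """Two-sided z-test for the difference challenger rate minus champion rate."""
    p_a = conv_champion / n_champion
    p_b = conv_challenger / n_challenger
    diff = p_b - p_a

    # Pooled standard error for the hypothesis test (assumes equal rates under H0).
    pooled = (conv_champion + conv_challenger) / (n_champion + n_challenger)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_champion + 1 / n_challenger))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_champion + p_b * (1 - p_b) / n_challenger)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
    }


def decide(result: Dict[str, float], alpha: float = 0.05) -> str:
    """Map a test result to 'rollout', 'keep', or 'inconclusive'."""
    if result["p_value"] >= alpha:
        return "inconclusive"
    return "rollout" if result["diff"] > 0 else "keep"
```

The sketch assumes at least one success and one failure across the two variants; otherwise the pooled standard error is zero and the z statistic is undefined.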

Step 5: Monitor Guardrail Metrics

Continuously check that the challenger does not violate safety thresholds.

# ab_test/guardrails.py
import pandas as pd
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


# ... (see EXAMPLES.md for complete implementation)

Expected: Guardrail violations detected within 5-15 minutes, the experiment stopped automatically if critical thresholds are breached (latency, errors), and alerts sent to the team.

On failure: Verify that guardrail thresholds are realistic (not too tight), ensure the monitoring loop is running continuously, check that the stop_experiment() function actually updates routing, and test the alert delivery channels.
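
A guardrail check can be reduced to comparing observed challenger metrics against upper bounds. The sketch below assumes that shape; `Guardrail`, `check_guardrails`, and the metric names in the usage note are hypothetical, and it only reports whether a stop is warranted rather than calling `stop_experiment()` itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Guardrail:
    """Upper bound on one metric; critical guardrails should halt the test."""
    metric: str
    max_value: float
    critical: bool = False


def check_guardrails(observed: Dict[str, float], guardrails: List[Guardrail]) -> Dict[str, object]:
    """Return violated metrics, metrics with no data, and a stop flag."""
    violations: List[str] = []
    missing: List[str] = []
    stop = False
    for rail in guardrails:
        value = observed.get(rail.metric)
        if value is None:
            # No data is reported separately so it cannot pass silently.
            missing.append(rail.metric)
        elif value > rail.max_value:
            violations.append(rail.metric)
            stop = stop or rail.critical
    return {"violations": violations, "missing": missing, "stop_experiment": stop}
```

For example, with `Guardrail("p95_latency_ms", 200.0, critical=True)` and an observed value of 250.0, the result lists `p95_latency_ms` as a violation and sets `stop_experiment` to `True`; the caller would then alert the team and update routing.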

Step 6: Make Rollout Decision

Based on the experiment results, decide whether to roll out the challenger.

# ab_test/rollout_decision.py
import logging
from typing import Dict
from dataclasses import dataclass

logger = logging.getLogger(__name__)


# ... (see EXAMPLES.md for complete implementation)

Expected: A clear decision (full/gradual rollout, keep champion, or extend test) with justification and action items.

On failure: If the decision is unclear, perform subgroup analysis (by user segment, time of day, device type), check for interaction effects, review the business context (e.g., is a 2% lift worth the engineering cost?), and consult with stakeholders.
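
The four outcomes named above can be expressed as a small rule table. The following is only a sketch: `RolloutDecision`, `decide_rollout`, and in particular the 5% lift threshold separating full from gradual rollout are illustrative assumptions, and real thresholds would come from the business context.

```python
from dataclasses import dataclass


@dataclass
class RolloutDecision:
    action: str          # full_rollout, gradual_rollout, keep_champion, extend_test
    justification: str


def decide_rollout(
    p_value: float,
    relative_lift: float,
    samples_collected: int,
    samples_required: int,
    guardrails_ok: bool,
    alpha: float = 0.05,
    full_rollout_lift: float = 0.05,  # assumed threshold, tune per use case
) -> RolloutDecision:
    """Turn experiment results into a documented decision."""
    if not guardrails_ok:
        return RolloutDecision("keep_champion", "a guardrail was violated during the test")
    if samples_collected < samples_required:
        return RolloutDecision(
            "extend_test",
            f"only {samples_collected} of {samples_required} samples collected",
        )
    if p_value >= alpha:
        return RolloutDecision("keep_champion", f"no significant difference (p={p_value:.3f})")
    if relative_lift <= 0:
        return RolloutDecision(
            "keep_champion", f"challenger is significantly worse ({relative_lift:+.1%})"
        )
    if relative_lift >= full_rollout_lift:
        return RolloutDecision("full_rollout", f"significant lift of {relative_lift:+.1%}")
    return RolloutDecision(
        "gradual_rollout", f"significant but modest lift of {relative_lift:+.1%}"
    )
```

Checking guardrails first and sample size second means a test that is both unsafe and underpowered is stopped rather than extended.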

Validation

[ ] Traffic split matches configured percentages (within 1%)
[ ] Same user is always assigned to the same variant (consistency check)
[ ] Sample size calculation produces reasonable numbers (5-50k per variant)
[ ] Statistical tests produce p-values consistent with manual calculation
[ ] Guardrail violations trigger alerts within 5 minutes
[ ] Shadow deployment shows <5% prediction divergence between models
[ ] Experiment reports include confidence intervals
[ ] Rollout decision is documented with justification

Common Pitfalls

Sample ratio mismatch (SRM): If the observed traffic split differs from the configured one (e.g., 95/5 becomes 92/8), this indicates an assignment bug; check hash function uniformity
Peeking: Checking results before reaching the sample size inflates Type I error; use sequential testing or wait for the pre-determined end date
Novelty effect: Users respond differently to a new model initially; run for 2+ weeks to see steady-state behavior
Carryover effects: Previous variant exposure affects current behavior; use new users or a sufficient washout period
Multiple testing: Testing many metrics increases false positive risk; correct with Bonferroni or focus on a single primary metric
Insufficient power: A small traffic allocation may require months to detect realistic effects; balance statistical power with risk tolerance
Ignoring segments: Aggregate lift may hide negative impact on important user segments; perform subgroup analysis
Attribution errors: Ensure outcome metrics are correctly attributed to model predictions (not other system changes)
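
The sample ratio mismatch pitfall is usually checked with a chi-square goodness-of-fit test on the assignment counts. A minimal sketch, assuming two variants and using only the standard library (with one degree of freedom, the chi-square tail probability equals `erfc(sqrt(x / 2))`); the function name `srm_p_value` is hypothetical.

```python
import math


def srm_p_value(n_champion: int, n_challenger: int, expected_challenger_share: float) -> float:
    """Chi-square goodness-of-fit p-value for a two-variant traffic split.

    A very small p-value suggests the observed split is unlikely under the
    configured allocation, which points to an assignment or logging bug.
    """
    total = n_champion + n_challenger
    expected_challenger = total * expected_challenger_share
    expected_champion = total - expected_challenger
    chi2 = (
        (n_champion - expected_champion) ** 2 / expected_champion
        + (n_challenger - expected_challenger) ** 2 / expected_challenger
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))
```

For the 95/5 versus 92/8 example with 10,000 users, `srm_p_value(9200, 800, 0.05)` is vanishingly small, so the split would be flagged. A strict cutoff is often used for SRM alerts because the check runs repeatedly, though the exact threshold is a team choice.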

Design and execute A/B tests for ML models in production using traffic splitting, statistical significance testing, and canary/shadow deployment strategies. Measure performance differences and make data-driven decisions about model rollout. Use when validating a new model version before full rollout, comparing candidate models trained with different algorithms, measuring the business metric impact of model changes, or when regulatory requirements mandate gradual rollout.

9 stars

development

Updated Apr 15, 2026

$ install --global

skillsauth

npx skillsauth add pjt222/agent-almanac run-ab-test-models

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy, container and dependency vulnerability scanner: Clean, 95%
Semgrep, static code analysis for vulnerabilities: Clean, 95%
mcp-scan (Snyk), Model Context Protocol security validation: Clean, 95%
Snyk (dep), open source security scanning: Skipped, 50%
Socket.dev, supply chain security analysis: Skipped, 50%
VirusTotal, multi-engine malware detection: Skipped, 50%
CrowdStrike, advanced threat intelligence: Skipped, 50%
OSV-Scanner, Open Source Vulnerability database check: Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 15, 2026, 9:25 AM (9.6s, 1 file scanned)

SKILL.md

name: run-ab-test-models
description: >
license: MIT
allowed-tools: Read Write Edit Bash Grep Glob
author: Philipp Thoss
version: 1.0
domain: mlops
complexity: intermediate
language: multi
tags: ab-testing, canary, shadow-deployment, traffic-splitting, statistical-significance, experimentation
locale: de
source_locale: en
source_commit: 6f65f316
translator: claude
translation_date: 2026-03-17

Related Skills

pjt222/unleash-the-agents

testing

Verified, Trusted, Community

Launch all available agents in parallel waves for open-ended hypothesis generation on problems where the correct domain is unknown. Use when facing a cross-domain problem with no clear starting point, when single-agent approaches have stalled, or when diverse perspectives are more valuable than deep expertise. Produces a ranked hypothesis set with convergence analysis and adversarial refinement.

9 stars, SKILL.md, Updated Apr 15, 2026

pjt222/test-cli-application

tools

Verified, Trusted, Community

Write integration tests for a Node.js CLI application using the built-in node:test module. Covers the exec helper pattern, output assertions, filesystem state verification, cleanup hooks, JSON output parsing, error case testing, and state restoration after destructive tests. Use when adding tests to an existing CLI, testing a new command, verifying adapter behavior across frameworks, or setting up CI for a CLI tool.

9 stars, SKILL.md, Updated Apr 15, 2026

pjt222/screen-trademark

development

Verified, Trusted, Community

Screen a proposed trademark for conflicts and distinctiveness before filing. Covers trademark database searches (TMview, WIPO Global Brand Database, USPTO TESS), distinctiveness analysis using the Abercrombie spectrum, likelihood of confusion assessment using DuPont factors and EUIPO relative grounds, common law rights evaluation, and goods/services overlap analysis. Produces a conflict report with a risk matrix. Use before adopting a new brand name, logo, or slogan; this is distinct from patent prior art search, which uses different databases, legal frameworks, and analysis methods.

9 stars, SKILL.md, Updated Apr 15, 2026

pjt222/scaffold-cli-command

tools

Verified, Trusted, Community

Scaffold a new CLI command using Commander.js with options, action handler, three output modes (human-readable, quiet, JSON), and optional ceremony variant. Covers command naming, option design, shared context patterns, error handling, and integration testing. Use when adding a command to an existing Commander.js CLI, designing a new CLI tool from scratch, or standardizing command structure across a multi-command CLI.

9 stars, SKILL.md, Updated Apr 15, 2026

Install manually

# Clone the repo
git clone https://github.com/pjt222/agent-almanac.git

# Copy into Claude Code skills folder (global)
cp -r agent-almanac/i18n/de/skills/run-ab-test-models ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

pjt222/agent-almanac

9 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
