003-skills/.claude/skills/nixtla-benchmark-reporter/SKILL.md
Generate comprehensive markdown benchmark reports from forecast accuracy metrics with model comparisons, statistical analysis, and regression detection. Use when analyzing baseline performance, comparing forecast models, or validating model quality. Trigger with 'generate benchmark report', 'analyze forecast metrics', or 'create performance summary'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add intent-solutions-io/plugins-nixtla nixtla-benchmark-reporter
Generate production-ready benchmark reports from forecasting accuracy metrics, enabling systematic model comparison and regression detection for Nixtla forecasting workflows.
This skill transforms raw forecast metrics (sMAPE, MASE, MAE, RMSE) into actionable insights. It:
Key Benefits:
Required columns: series_id, model, sMAPE, MASE (minimum)
Expected CSV Structure:
series_id,model,sMAPE,MASE,MAE,RMSE
D1,SeasonalNaive,15.23,1.05,12.5,18.3
D1,AutoETS,13.45,0.92,10.2,15.1
D1,AutoTheta,12.34,0.87,9.8,14.5
D2,SeasonalNaive,18.67,1.23,15.1,22.4
...
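A minimal sketch of how the input could be checked before any report is built, assuming the four minimum columns listed above. `load_results` and `REQUIRED_COLUMNS` are hypothetical names chosen for illustration; the internals of `generate_benchmark_report.py` are not shown in this document and may differ.

```python
import csv
import io

# Assumed minimum columns, taken from the list above.
REQUIRED_COLUMNS = ("series_id", "model", "sMAPE", "MASE")


def load_results(handle):
    """Read benchmark rows from an open CSV file and check required columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("Required columns missing: " + ", ".join(missing))
    rows = []
    for row in reader:
        # Convert every metric column that is present to float.
        for col in header:
            if col not in ("series_id", "model"):
                row[col] = float(row[col])
        rows.append(row)
    if not rows:
        raise ValueError("No metrics found in CSV file")
    return rows


# Two rows copied from the expected CSV structure.
sample = io.StringIO(
    "series_id,model,sMAPE,MASE,MAE,RMSE\n"
    "D1,SeasonalNaive,15.23,1.05,12.5,18.3\n"
    "D1,AutoETS,13.45,0.92,10.2,15.1\n"
)
rows = load_results(sample)
print(len(rows), rows[0]["model"], rows[0]["sMAPE"])
```

Reading from a file handle rather than a path keeps the sketch easy to try with in-memory data; a real run would pass `open(path, newline="")`.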
The script automatically:
Usage:
python {baseDir}/scripts/generate_benchmark_report.py \
--results /path/to/benchmark_results.csv \
--output /path/to/report.md
For each model, the script calculates summary statistics.
It then creates a markdown table comparing all models:
## Model Comparison (sMAPE)
| Model | Mean | Median | Std Dev | Min | Max | Wins |
|-------|------|--------|---------|-----|-----|------|
| AutoTheta | 12.3% | 11.8% | 4.2% | 5.1% | 28.9% | 32/50 (64%) |
| AutoETS | 13.5% | 12.9% | 5.1% | 6.2% | 31.2% | 18/50 (36%) |
| SeasonalNaive | 15.2% | 14.5% | 6.3% | 7.8% | 35.4% | 0/50 (0%) |
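The columns of the table above could be computed along these lines. This is a sketch under stated assumptions, not the script's actual code: `summarize` is a hypothetical helper, a "win" is assumed to mean the lowest error on a series, and the standard deviation is assumed to be the sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarize(rows, metric="sMAPE"):
    """Return per-model summary statistics and win counts for one metric."""
    by_model = defaultdict(list)
    by_series = defaultdict(dict)
    for row in rows:
        by_model[row["model"]].append(row[metric])
        by_series[row["series_id"]][row["model"]] = row[metric]

    # Assumed rule: a model wins a series when it has the lowest error there.
    wins = defaultdict(int)
    for scores in by_series.values():
        wins[min(scores, key=scores.get)] += 1

    summary = {}
    for model, values in by_model.items():
        summary[model] = {
            "mean": mean(values),
            "median": median(values),
            # Sample standard deviation needs at least two values.
            "std": stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
            "wins": wins[model],
        }
    return summary


# The four sMAPE rows shown in the expected CSV structure.
rows = [
    {"series_id": "D1", "model": "SeasonalNaive", "sMAPE": 15.23},
    {"series_id": "D1", "model": "AutoETS", "sMAPE": 13.45},
    {"series_id": "D1", "model": "AutoTheta", "sMAPE": 12.34},
    {"series_id": "D2", "model": "SeasonalNaive", "sMAPE": 18.67},
]
summary = summarize(rows)

print("| Model | Mean | Median | Wins |")
print("|-------|------|--------|------|")
for model, s in sorted(summary.items(), key=lambda kv: kv[1]["mean"]):
    print(f"| {model} | {s['mean']:.1f}% | {s['median']:.1f}% | {s['wins']} |")
```

Sorting by mean puts the best model first, matching the ordering of the table.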
Determines overall best model based on:
Generates recommendations:
If baseline results are provided, the script compares current vs. baseline metrics:
python {baseDir}/scripts/generate_benchmark_report.py \
--results current_results.csv \
--baseline baseline_results.csv \
--output regression_report.md \
--threshold 5.0 # Alert if sMAPE degrades >5%
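The threshold check can be sketched as a relative-change comparison. This assumes `--threshold` is a percentage change relative to the baseline value, which is consistent with the sample outputs in this document (13.5% → 14.8% reported as +9.6%). `relative_change` and `find_regressions` are hypothetical names.

```python
def relative_change(baseline, current):
    """Percent change of current relative to baseline (positive = worse error)."""
    return (current - baseline) / baseline * 100.0


def find_regressions(baseline_means, current_means, threshold=5.0):
    """Return {model: pct_change} for models whose metric degraded past threshold."""
    regressions = {}
    for model, base in baseline_means.items():
        if model not in current_means:
            continue  # model absent from the current run; nothing to compare
        change = relative_change(base, current_means[model])
        if change > threshold:
            regressions[model] = change
    return regressions


# Mean sMAPE values taken from the regression example in this document.
baseline = {"AutoETS": 13.5, "AutoTheta": 12.3}
current = {"AutoETS": 14.8, "AutoTheta": 12.7}

strict = find_regressions(baseline, current, threshold=3.0)
default = find_regressions(baseline, current, threshold=5.0)
for model, change in strict.items():
    print(f"{model}: +{change:.1f}%")
```

With a 3.0 threshold both models are flagged; with 5.0 only AutoETS is.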
Regression Report Includes:
Supports multiple output formats:
Standard Report (default):
python {baseDir}/scripts/generate_benchmark_report.py --results metrics.csv
Executive Summary (1-page):
python {baseDir}/scripts/generate_benchmark_report.py \
--results metrics.csv \
--format executive \
--output summary.md
GitHub Issue Template:
python {baseDir}/scripts/generate_benchmark_report.py \
--results metrics.csv \
--format github \
--output .github/ISSUE_TEMPLATE/regression.md
The script generates:
Standard Report (report.md):
Regression Report (if baseline provided):
GitHub Issue Template:
---
title: "Performance Regression Detected: {model_name}"
labels: ["regression", "performance"]
assignees: ["team-lead"]
---
## Regression Summary
Model: {model_name}
Metric: sMAPE degraded by {X}%
Baseline: {baseline_value}%
Current: {current_value}%
## Affected Series
- {series_1}: {baseline}% → {current}% ({delta}%)
- {series_2}: {baseline}% → {current}% ({delta}%)
...
## Acceptance Criteria
- [ ] Investigate root cause
- [ ] Restore performance to within 2% of baseline
- [ ] Add regression test to CI/CD
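One way the placeholders in the template above might be filled is plain `str.format`. This is an illustrative sketch only: the template is abridged, `render_issue` is a hypothetical helper, and the degradation is assumed to be a relative percentage.

```python
# Abridged copy of the issue template; the {X} placeholder is renamed {delta}.
ISSUE_TEMPLATE = """\
---
title: "Performance Regression Detected: {model_name}"
labels: ["regression", "performance"]
---
## Regression Summary
Model: {model_name}
Metric: sMAPE degraded by {delta:.1f}%
Baseline: {baseline_value:.1f}%
Current: {current_value:.1f}%
"""


def render_issue(model_name, baseline_value, current_value):
    """Fill the template for one regressed model."""
    # Assumed definition: percent change relative to the baseline value.
    delta = (current_value - baseline_value) / baseline_value * 100.0
    return ISSUE_TEMPLATE.format(
        model_name=model_name,
        delta=delta,
        baseline_value=baseline_value,
        current_value=current_value,
    )


# Values from the regression example in this document.
issue_text = render_issue("AutoETS", 13.5, 14.8)
print(issue_text)
```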
Missing Metrics File:
Error: Benchmark results not found at /path/to/results.csv
Solution: Verify path and ensure CSV file exists
Invalid CSV Structure:
Error: Required columns missing: series_id, model, sMAPE
Solution: Ensure CSV has minimum required columns
Empty Results:
Warning: No metrics found in CSV file
Solution: Verify CSV has data rows (not just headers)
Regression Threshold Exceeded:
🚨 REGRESSION DETECTED: AutoTheta sMAPE degraded by 12.2%
Baseline: 12.3%
Current: 13.8%
Threshold: 5.0%
Solution: Review recent model changes, check data quality
python {baseDir}/scripts/generate_benchmark_report.py \
--results nixtla_baseline_m4/results_M4_Daily_h14.csv \
--output reports/m4_daily_baseline.md \
--verbose
Output:
✓ Loaded 150 results (50 series × 3 models)
✓ Calculated summary statistics
✓ Identified winner: AutoTheta (mean sMAPE: 12.3%)
✓ Generated report: reports/m4_daily_baseline.md (1,245 words)
python {baseDir}/scripts/generate_benchmark_report.py \
--results current_run/results.csv \
--baseline baseline/v1.0_results.csv \
--output regression_report.md \
--threshold 3.0
Output:
⚠️ REGRESSION DETECTED in 2/3 models:
- AutoETS: sMAPE 13.5% → 14.8% (+9.6%)
- AutoTheta: sMAPE 12.3% → 12.7% (+3.3%)
✓ Generated regression report with GitHub issue template
python {baseDir}/scripts/generate_benchmark_report.py \
--results quarterly_benchmark.csv \
--format executive \
--output Q1_summary.md
Output:
# Q1 2025 Forecast Baseline Report
**Winner**: AutoTheta with 12.3% sMAPE (vs. 13.5% AutoETS, 15.2% SeasonalNaive)
**Key Findings**:
- AutoTheta won 64% of series (32/50)
- Most consistent performance (std dev 4.2%)
- Recommended for production baseline
**Action Items**:
- Deploy AutoTheta as default model
- Use AutoETS for highly seasonal data (criteria: seasonal_strength > 0.8)
- Investigate 3 failure cases (sMAPE > 30%)
python {baseDir}/scripts/generate_benchmark_report.py \
--results results.csv \
--primary-metric MASE \
--output mase_focused_report.md
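The flags used across the examples could be declared with `argparse` roughly as follows. The flag names come from this document; the defaults, the choice lists, and `build_parser` itself are assumptions, since the script's real argument handling is not shown here.

```python
import argparse


def build_parser():
    """Declare the command-line flags that appear in the examples."""
    parser = argparse.ArgumentParser(description="Generate a benchmark report (sketch).")
    parser.add_argument("--results", required=True, help="CSV with current metrics")
    parser.add_argument("--baseline", help="optional CSV with baseline metrics")
    parser.add_argument("--output", help="path of the markdown report")
    # Default of 5.0 is an assumption, not a documented value.
    parser.add_argument("--threshold", type=float, default=5.0,
                        help="percent degradation that triggers a regression alert")
    parser.add_argument("--format", choices=("standard", "executive", "github"),
                        default="standard", help="report layout")
    # Default of sMAPE is an assumption based on the comparison table.
    parser.add_argument("--primary-metric", choices=("sMAPE", "MASE", "MAE", "RMSE"),
                        default="sMAPE", help="metric used to rank models")
    parser.add_argument("--verbose", action="store_true", help="print progress")
    return parser


# Same arguments as the MASE-focused example.
args = build_parser().parse_args([
    "--results", "results.csv",
    "--primary-metric", "MASE",
    "--output", "mase_focused_report.md",
])
print(args.primary_metric, args.format, args.threshold)
```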
Use --threshold to catch regressions early (recommend 3-5%)
{baseDir}/scripts/generate_benchmark_report.py
{baseDir}/assets/templates/report_template.md
{baseDir}/references/EXAMPLE_REPORT.md