.claude/skills/nixtla-baseline-review/SKILL.md
Analyze Nixtla baseline forecasting results (sMAPE/MASE on M4 or other benchmark datasets). Use when the user asks about baseline performance, model comparisons, or metric interpretation for Nixtla time-series experiments. Trigger with "baseline review", "interpret sMAPE/MASE", or "compare AutoETS vs AutoTheta".
npx skillsauth add intent-solutions-io/plugins-nixtla nixtla-baseline-review
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Analyze baseline forecasting results from the nixtla-baseline-m4 workflow. Interpret metrics, compare models, surface patterns, and recommend next steps.
Activate this skill when the user:
- `nixtla_baseline_m4/` directory must exist
- `results_*.csv` file must be present
- Expected columns: `series_id`, `model`, `sMAPE`, `MASE`

Use the Read tool to find baseline results:
# Check for results directory (use Bash tool)
ls -la nixtla_baseline_m4/
# Identify most recent results file
ls -t nixtla_baseline_m4/results_*.csv | head -1
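Where a shell is not available, the same lookup can be sketched in Python. This is a minimal sketch, not part of the skill itself: the helper name `latest_results_file` is hypothetical, and it assumes "most recent" means newest modification time, as `ls -t` does.

```python
from pathlib import Path
from typing import Optional


def latest_results_file(results_dir: str = "nixtla_baseline_m4") -> Optional[Path]:
    """Return the most recently modified results_*.csv, or None if there is none.

    Hypothetical helper; mirrors `ls -t nixtla_baseline_m4/results_*.csv | head -1`.
    """
    directory = Path(results_dir)
    if not directory.is_dir():
        return None
    candidates = list(directory.glob("results_*.csv"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

A `None` result corresponds to the "files are missing" case, where the user should be told to run `/nixtla-baseline-m4` first.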
Expected files:
- `results_M4_Daily_h{horizon}.csv` - Full metrics table
- `summary_M4_Daily_h{horizon}.txt` - Text summary (optional)

If files are missing, inform the user they need to run `/nixtla-baseline-m4` first.
Read the metrics CSV file:
# View first few rows to confirm format
head -10 nixtla_baseline_m4/results_M4_Daily_h*.csv
# Or use Read tool to load the full file
Expected CSV structure:
series_id,model,sMAPE,MASE
D1,SeasonalNaive,15.23,1.05
D1,AutoETS,13.45,0.92
D1,AutoTheta,12.34,0.87
D2,SeasonalNaive,18.67,1.23
...
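To make the expected structure concrete, here is a small sketch that parses rows of this shape into typed records. The `MetricRow` class and `parse_results` function are illustrative names, not part of the workflow, and the sample text simply repeats the rows shown above.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class MetricRow:
    """One (series, model) result row; field names are illustrative."""
    series_id: str
    model: str
    smape: float
    mase: float


def parse_results(csv_text: str) -> List[MetricRow]:
    """Parse CSV text with the columns series_id, model, sMAPE, MASE."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        MetricRow(row["series_id"], row["model"], float(row["sMAPE"]), float(row["MASE"]))
        for row in reader
    ]


# Sample rows copied from the expected structure above.
SAMPLE = """series_id,model,sMAPE,MASE
D1,SeasonalNaive,15.23,1.05
D1,AutoETS,13.45,0.92
D1,AutoTheta,12.34,0.87
D2,SeasonalNaive,18.67,1.23
"""

rows = parse_results(SAMPLE)
```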
Calculate summary statistics manually or with bash:
- Count unique series (the count includes the header row): `cut -d, -f1 results_*.csv | sort -u | wc -l`
- Extract the rows for one model: `grep "AutoTheta" results_*.csv`

For each model (SeasonalNaive, AutoETS, AutoTheta), calculate:
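The per-model statistics could be computed along these lines. This is a sketch under assumptions: the choice of mean, median, and standard deviation of sMAPE plus mean MASE is inferred from the output format used later in this skill, and `summarize_by_model` is a hypothetical name.

```python
import csv
import io
import statistics
from collections import defaultdict
from typing import Dict


def summarize_by_model(csv_text: str) -> Dict[str, Dict[str, float]]:
    """Aggregate sMAPE and MASE per model from CSV text (hypothetical helper)."""
    smape_values = defaultdict(list)
    mase_values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        smape_values[row["model"]].append(float(row["sMAPE"]))
        mase_values[row["model"]].append(float(row["MASE"]))

    summary = {}
    for model, values in smape_values.items():
        summary[model] = {
            "n_series": len(values),
            "mean_smape": statistics.mean(values),
            "median_smape": statistics.median(values),
            # Sample standard deviation needs at least two values.
            "std_smape": statistics.stdev(values) if len(values) > 1 else 0.0,
            "mean_mase": statistics.mean(mase_values[model]),
        }
    return summary
```

Reading the file with `csv.DictReader` also avoids the header-row miscount that the `cut | sort -u | wc -l` pipeline has.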
sMAPE (Symmetric Mean Absolute Percentage Error):
MASE (Mean Absolute Scaled Error):
- MASE > 1.0: Worse than seasonal naive
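For reference, the two metrics can be sketched as follows. These are the definitions commonly used for the M4 benchmark; the source does not state the exact formulas used by the `nixtla-baseline-m4` workflow, so treat this as an assumption. The zero-denominator handling and the `season_length` default are also illustrative choices.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE in percent: mean of 2*|y - f| / (|y| + |f|), times 100."""
    terms = []
    for y, f in zip(actual, forecast):
        denom = abs(y) + abs(f)
        # Assumed convention: a point where both values are zero contributes 0.
        terms.append(0.0 if denom == 0 else 2.0 * abs(y - f) / denom)
    return 100.0 * sum(terms) / len(terms)


def mase(
    actual: Sequence[float],
    forecast: Sequence[float],
    insample: Sequence[float],
    season_length: int = 1,
) -> float:
    """Forecast MAE divided by the in-sample MAE of a seasonal naive forecast."""
    naive_errors = [
        abs(insample[i] - insample[i - season_length])
        for i in range(season_length, len(insample))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("in-sample seasonal naive error is zero; MASE is undefined")
    mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return mae / scale
```

Because MASE divides by the seasonal naive error, a value of 1.0 is the break-even point against that baseline.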
Model Characteristics:
Look for:
Provide analysis in this format:
## Baseline Performance Analysis
Based on {N} series from M4-Daily with horizon={H}:
**Overall Winner**: {ModelName}
- Mean sMAPE: {X.XX}% (vs {Y.YY}% for second place)
- Wins on {N} out of {Total} series ({Percentage}%)
- Consistent performance (std dev {X.X}%)
**Key Insights**:
1. {Insight about winner's strengths}
2. {Insight about other models or patterns}
3. {Insight about failure cases or opportunities}
**Recommendation**: Use {ModelName} as production baseline. {Additional context about when to consider alternatives}
**Next Steps**:
- {Actionable recommendation 1}
- {Actionable recommendation 2}
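The "Wins on {N} out of {Total} series" figure in this template can be derived by picking the lowest-sMAPE model for each series. The sketch below assumes that lower sMAPE defines a win and breaks exact ties alphabetically by model name; both are assumptions, and `count_wins` is a hypothetical name.

```python
import csv
import io
from collections import Counter
from typing import Dict, Tuple


def count_wins(csv_text: str) -> Counter:
    """Count, per model, the number of series on which it has the lowest sMAPE."""
    best: Dict[str, Tuple[float, str]] = {}  # series_id -> (sMAPE, model)
    for row in csv.DictReader(io.StringIO(csv_text)):
        candidate = (float(row["sMAPE"]), row["model"])
        series_id = row["series_id"]
        # Tuple comparison: lower sMAPE wins, ties fall back to model name.
        if series_id not in best or candidate < best[series_id]:
            best[series_id] = candidate
    return Counter(model for _, model in best.values())
```

The total number of series is the sum of the counter's values, which gives the denominator for the win percentage.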
User: "Which model was best?"
Response:
Based on the baseline results (50 series, horizon=14):
**AutoTheta won** with average sMAPE of 12.3%
- Outperformed AutoETS (13.5%) and SeasonalNaive (15.2%)
- Most consistent with std dev of 4.2%
- Recommended for production use
User: "Compare AutoETS vs AutoTheta performance"
Response:
## AutoETS vs AutoTheta Comparison
| Metric | AutoETS | AutoTheta | Winner |
|---------------|---------|-----------|----------|
| Mean sMAPE | 13.5% | 12.3% | Theta ✓ |
| Median sMAPE | 12.9% | 11.8% | Theta ✓ |
| Std Dev | 5.1% | 4.2% | Theta ✓ |
| Series Won | 18/50 | 32/50 | Theta ✓ |
**Key Differences**:
- Theta handles trends better (exponential smoothing with drift)
- ETS excels on strong seasonality (state space framework)
- Theta more consistent across diverse patterns
**Recommendation**: Use AutoTheta as default, fallback to AutoETS for highly seasonal data.
User: "What does sMAPE of 12.3% mean?"
Response:
sMAPE of 12.3% is **good performance** for forecasting:
- This means on average, predictions are off by about 12% from actual values
- For M4 Daily benchmark, anything under 15% is considered good
- This is in the "acceptable" range (10-20%)
Context: If you're forecasting daily sales of 100 units, a 12.3% sMAPE means your forecast will typically be within ±12 units of the true value.
MASE provides additional context: if MASE < 1.0, this forecast beats the seasonal naive baseline, which simply repeats the last observed seasonal cycle.
If results files are missing:
I don't see baseline results in nixtla_baseline_m4/.
Please run the baseline command first:
Run: `/nixtla-baseline-m4 horizon=14 series_limit=50`
This will generate the metrics files I need to analyze.
If CSV is malformed:
The results file exists but appears malformed. Expected columns:
- series_id, model, sMAPE, MASE
Please re-run /nixtla-baseline-m4 to regenerate clean results.
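The two error cases above can be told apart with a check like the following. This is a sketch, not the skill's actual logic: `check_results` and its three return values are hypothetical, and "malformed" is taken to mean a missing expected column or a non-numeric metric value.

```python
import csv
from pathlib import Path

EXPECTED_COLUMNS = {"series_id", "model", "sMAPE", "MASE"}


def check_results(results_dir: str = "nixtla_baseline_m4") -> str:
    """Return "missing", "malformed", or "ok" for the results in results_dir."""
    directory = Path(results_dir)
    files = sorted(directory.glob("results_*.csv")) if directory.is_dir() else []
    if not files:
        return "missing"
    for path in files:
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            # An empty file has no header; a wrong header lacks expected columns.
            if reader.fieldnames is None or not EXPECTED_COLUMNS <= set(reader.fieldnames):
                return "malformed"
            for row in reader:
                try:
                    float(row["sMAPE"])
                    float(row["MASE"])
                except (TypeError, ValueError):
                    # Short rows yield None values; text values fail to parse.
                    return "malformed"
    return "ok"
```

A "missing" result maps to the first message above, and "malformed" to the second.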
For complete technical details, see:
- `000-docs/6767-a-OD-ARCH-nixtla-claude-plugin-poc-baseline-lab.md`
- `000-docs/6767-b-PP-PLAN-nixtla-claude-plugin-poc-baseline-lab.md`
- `000-docs/6767-d-OD-OVRV-nixtla-baseline-lab-overview.md`