src/datapro/data/skills/clustering-toolkit/SKILL.md
Advanced clustering and grouping toolkit using PCA and DBSCAN. Provides a complete pipeline for identifying homogeneous groups in high-dimensional data with built-in quality diagnostics and stability metrics. Use for: (1) Grouping similar entities (assets, products, clients) based on multi-dimensional features, (2) Principal Component Analysis for dimensionality reduction, (3) DBSCAN clustering with noise filtering, (4) Diagnosing clustering pathologies like giant cluster ratio or configuration instability.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
`npx skillsauth add pablodiegoo/data-pro-skill clustering-toolkit`
This skill provides a specialized pipeline for identifying homogeneous groups within high-dimensional datasets. It combines dimensionality reduction (PCA) with density-based clustering (DBSCAN) to find natural patterns while filtering noise.
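As a rough illustration of the PCA plus DBSCAN combination described above, here is a minimal sketch built directly on scikit-learn. The function name `pca_dbscan_labels`, the standardization step, and the synthetic data are assumptions made for this example; they are not taken from the skill's own scripts.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_dbscan_labels(X, n_components=2, eps=0.5, min_samples=5):
    """Standardize, project with PCA, then cluster the projection with DBSCAN.

    Hypothetical helper for illustration only.
    """
    X_scaled = StandardScaler().fit_transform(X)
    X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
    # DBSCAN labels noise points as -1; clusters are numbered from 0.
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_reduced)


# Synthetic data: two tight, well-separated groups in 6 dimensions plus one outlier.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(50, 6))
group_b = rng.normal(loc=5.0, scale=0.1, size=(50, 6))
outlier = np.full((1, 6), 50.0)
X = np.vstack([group_a, group_b, outlier])

labels = pca_dbscan_labels(X, n_components=2, eps=0.5, min_samples=5)
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"clusters={n_clusters}, noise points={n_noise}")
```

Because DBSCAN works on distances in the reduced space, a sensible `eps` depends on how the features are scaled and how many components are kept, so the values here only make sense for this toy data.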
- `pca_dbscan_grouping`: A hybrid pipeline that uses Principal Component Analysis to extract features and DBSCAN to group entities.
- `basic_clustering`: Standard K-Means clustering pipeline for rapid entity grouping, parameterized by the number of clusters (`k`).
- `residual_segmentation`: Advanced behavioral segmentation using regression residuals (Actual vs. Predicted).
- `gower_distance`: Similarity metric for mixed data types (numerical + categorical).
- `dbscan_cluster_quality`: Utilities to detect common clustering pathologies.
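To make the Gower distance entry above more concrete, the following sketch computes a pairwise Gower distance matrix with NumPy only. It uses the textbook definition (range-scaled absolute difference for numeric columns, simple mismatch for categorical columns, averaged over all columns). The function `gower_distance_matrix` and the sample rows are hypothetical and may differ from what the skill's `gower_distance` script does.

```python
import numpy as np


def gower_distance_matrix(numeric, categorical):
    """Pairwise Gower distance for rows with numeric and categorical columns.

    numeric: float array of shape (n, p_num)
    categorical: array of shape (n, p_cat) holding category labels
    Hypothetical helper for illustration only.
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    n = numeric.shape[0]
    total = np.zeros((n, n))

    # Numeric part: absolute difference scaled by the column range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    for j in range(numeric.shape[1]):
        col = numeric[:, j]
        total += np.abs(col[:, None] - col[None, :]) / ranges[j]

    # Categorical part: 0 when labels match, 1 otherwise.
    for j in range(categorical.shape[1]):
        col = categorical[:, j]
        total += (col[:, None] != col[None, :]).astype(float)

    # Average over all features so every distance lies in [0, 1].
    return total / (numeric.shape[1] + categorical.shape[1])


# Made-up rows: two numeric features and one categorical feature.
numeric = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
categorical = np.array([["retail"], ["retail"], ["bank"]])
D = gower_distance_matrix(numeric, categorical)
print(np.round(D, 3))
```

A matrix like `D` can be passed to scikit-learn's `DBSCAN(metric="precomputed")`, which is one way to cluster mixed-type data without one-hot encoding.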
```python
from scripts.pca_dbscan_grouping import PCA_DBSCAN_Pipeline
from scripts.dbscan_cluster_quality import calculate_cluster_metrics

# 1. Run clustering pipeline (df is a pandas DataFrame of entity features)
pipeline = PCA_DBSCAN_Pipeline(n_components=5, eps=0.015)
clusters = pipeline.fit_predict(df)

# 2. Diagnose quality
metrics = calculate_cluster_metrics(clusters)
if metrics['Giant_Ratio'] > 0.5:
    print("Warning: Pathological giant cluster detected. Reduce EPS.")
```
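The source does not spell out how `Giant_Ratio` is computed. One plausible reading, sketched below, is the share of all points that land in the largest non-noise cluster. This definition and the helper names `giant_cluster_ratio` and `noise_ratio` are assumptions for illustration, not the toolkit's actual implementation.

```python
import numpy as np


def giant_cluster_ratio(labels):
    """Share of all points in the largest non-noise cluster (assumed definition)."""
    labels = np.asarray(labels)
    clustered = labels[labels != -1]  # DBSCAN marks noise as -1
    if clustered.size == 0:
        return 0.0
    _, counts = np.unique(clustered, return_counts=True)
    return float(counts.max() / labels.size)


def noise_ratio(labels):
    """Share of points that DBSCAN left unassigned (label -1)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    return float(np.mean(labels == -1))


# Made-up labels: six points in cluster 0, two in cluster 1, two noise points.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, -1, -1])
giant = giant_cluster_ratio(labels)
noise = noise_ratio(labels)
print(f"giant ratio={giant:.2f}, noise ratio={noise:.2f}")
```

Under this reading, a ratio above 0.5 means a single cluster has absorbed more than half of the data, which usually points to an `eps` that is too large for the feature scale.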
- Use `sector_weight` (or equivalent) to balance statistical similarity with domain knowledge.
- Small changes in `eps` can have drastic effects. Use `grid_search_checkpoint` for tuning.
- Dependencies: scikit-learn, pandas, numpy.
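The sensitivity to `eps` noted above can be seen with a small sweep. The sketch below is not the skill's `grid_search_checkpoint` utility, whose interface is not shown in the source. It is a hypothetical `sweep_eps` helper that reruns scikit-learn's DBSCAN over a few `eps` values on synthetic data and records the cluster count, noise share, and largest-cluster share (using an assumed definition of the latter).

```python
import numpy as np
from sklearn.cluster import DBSCAN


def sweep_eps(X, eps_values, min_samples=5):
    """Run DBSCAN once per eps value and summarize each result.

    Hypothetical helper for illustration only.
    """
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        clustered = labels[labels != -1]  # drop noise points (label -1)
        counts = np.bincount(clustered) if clustered.size else np.array([0])
        rows.append({
            "eps": float(eps),
            "n_clusters": int(np.count_nonzero(counts)),
            "noise_ratio": float(np.mean(labels == -1)),
            # Assumed definition: largest cluster size over all points.
            "giant_ratio": float(counts.max() / labels.size),
        })
    return rows


# Synthetic data: two compact 2-D blobs centered at (0, 0) and (1, 1).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.05, size=(40, 2)),
    rng.normal(1.0, 0.05, size=(40, 2)),
])

results = sweep_eps(X, eps_values=[0.001, 0.2, 5.0])
for row in results:
    print(row)
```

On this toy data, the smallest `eps` leaves every point as noise and the largest merges everything into one cluster, while the middle value separates the two blobs. Real datasets rarely show such a clean middle ground, which is why a systematic search is recommended.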