skills/AI/AI-Model-Data-Preparation-and-Evaluation/SKILL.md
Prepare and evaluate machine learning data. Use this skill whenever the user needs to clean, transform, or split datasets for ML training, or evaluate model performance with metrics like accuracy, precision, recall, F1, ROC-AUC, MAE, or confusion matrices. Trigger for any data preprocessing task, feature engineering, handling missing values, encoding categorical variables, normalization, or model evaluation requests.
npx skillsauth add abelrguezr/hacktricks-skills ml-data-prep-eval
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill helps you prepare raw data for machine learning and evaluate model performance. Follow the workflow below for systematic data preparation.
# Clean and prepare your data
python scripts/data_cleaning.py --input data.csv --output cleaned_data.csv
# Transform features
python scripts/data_transformation.py --input cleaned_data.csv --output transformed_data.csv
# Split for training
python scripts/data_splitting.py --input transformed_data.csv --train-ratio 0.7 --val-ratio 0.15
# Evaluate model predictions
python scripts/model_evaluation.py --actual actual.csv --predicted predictions.csv
| Source | Method | Example |
|--------|--------|---------|
| CSV/JSON files | pandas.read_csv() / pandas.read_json() | pd.read_csv('data.csv') |
| SQL databases | sqlalchemy | pd.read_sql(query, connection) |
| APIs | requests | requests.get(url).json() |
| Web scraping | beautifulsoup4 | BeautifulSoup(html, 'html.parser') |
Strategies by data type:
| Type | Strategy | When to use |
|------|----------|-------------|
| Numeric | Mean/Median imputation | Small gaps, normal distribution |
| Numeric | KNN imputation | Complex relationships between features |
| Categorical | Mode (most frequent) | When category matters |
| Categorical | New category "Unknown" | When missingness is meaningful |
| Any | Drop rows/columns | When >50% missing or not critical |
Use the cleaning script:
python scripts/data_cleaning.py \
--input data.csv \
--numeric-strategy median \
--categorical-strategy most_frequent \
--remove-duplicates \
--filter-outliers zscore:3
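The internals of `data_cleaning.py` are not shown here. As a rough pandas sketch of what median and most-frequent imputation amount to, assuming a hypothetical `impute_missing` helper and made-up column names:

```python
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the mode.

    Hypothetical helper for illustration; not the skill's actual script.
    """
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() may return several values; take the first one
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Made-up toy data
raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})
clean = impute_missing(raw).drop_duplicates()
print(clean)
```

A column that is entirely missing has no mode, so this sketch would fail on it; such columns are better dropped first.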
df = df.drop_duplicates()
Outlier detection methods:
| Method | Use case | Threshold |
|--------|----------|----------|
| Z-score | Normal distribution | \|z\| > 3 |
| IQR | Skewed distribution | Q1 - 1.5×IQR, Q3 + 1.5×IQR |
| Box plot | Visual inspection | Whisker bounds |
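The z-score and IQR rules can be sketched in pandas as boolean masks. The function names and sample values below are illustrative assumptions, not part of the skill's scripts:

```python
import pandas as pd


def zscore_mask(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """True where |z| exceeds the threshold (population std, ddof=0)."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold


def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Made-up data: 20 ordinary readings plus one extreme value
values = pd.Series([10, 12, 11, 13, 12] * 4 + [100])

print(values[zscore_mask(values)])  # rows flagged by the z-score rule
print(values[iqr_mask(values)])     # rows flagged by the IQR rule
```

With very small samples the z-score rule can miss obvious outliers, because a single extreme value inflates the standard deviation it is measured against.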
Decision framework:
| Method | Formula | Range | Use when |
|--------|---------|-------|----------|
| Min-Max | (X - min) / (max - min) | [0, 1] | Neural networks, distance-based algorithms |
| Z-Score | (X - μ) / σ | Mean=0, Std=1 | Linear models, when outliers exist |
| Robust | (X - median) / IQR | - | Heavy outliers |
Script usage:
python scripts/data_transformation.py \
--input cleaned_data.csv \
--normalize zscore \
--columns "feature1,feature2,feature3"
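For reference, the three formulas in the table can be written directly in pandas. This is a minimal sketch with hypothetical function names; how `data_transformation.py` implements them is not shown in this document:

```python
import pandas as pd


def min_max(s: pd.Series) -> pd.Series:
    """(X - min) / (max - min), mapped to [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())


def z_score(s: pd.Series) -> pd.Series:
    """(X - mean) / std, using the population standard deviation."""
    return (s - s.mean()) / s.std(ddof=0)


def robust(s: pd.Series) -> pd.Series:
    """(X - median) / IQR, less sensitive to extreme values."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return (s - s.median()) / iqr


x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(min_max(x).tolist())
print(z_score(x).tolist())
print(robust(x).tolist())
```

In a real pipeline the statistics (min, max, mean, median, IQR) would normally be computed on the training split only and then reused for validation and test data, to avoid leaking information.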
| Method | Output | Use when |
|--------|--------|----------|
| One-Hot | Binary columns | Low cardinality (<10 categories) |
| Label | Integer 0,1,2... | Ordinal data or tree models |
| Ordinal | Ordered integers | Natural ordering exists |
| Target | Mean of target | High cardinality, supervised learning |
| Hashing | Fixed-size vector | Very high cardinality |
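As an illustration of the first and third rows, one-hot and ordinal encoding might look like this in pandas. The column names and category order are made up for the example:

```python
import pandas as pd

# Made-up toy data
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot: one binary column per category (columns come out sorted by name)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: map categories to integers that respect a stated order
size_order = ["small", "medium", "large"]  # assumed natural ordering
df["size_code"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

print(one_hot)
print(df)
```

Values that are not in `size_order` receive the code -1, which is worth checking for before training.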
Text encoding:
Common patterns:
# Date/time features
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])
# Ratios and combinations
df['price_per_sqft'] = df['price'] / df['sqft']
df['total_value'] = df['quantity'] * df['unit_price']
# Binning
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                         labels=['child', 'young', 'middle', 'senior'])
| Dataset Size | Train | Validation | Test |
|--------------|-------|------------|------|
| Small (<10K) | 70% | 15% | 15% |
| Medium (10K-100K) | 80% | 10% | 10% |
| Large (>100K) | 90% | 5% | 5% |
Stratified Split (classification with imbalanced classes):
python scripts/data_splitting.py \
--input data.csv \
--stratify target_column \
--train-ratio 0.7 \
--val-ratio 0.15
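A comparable 70/15/15 stratified split can be sketched with scikit-learn's `train_test_split`, applied twice. This is an assumption about what such a split looks like in code, not the contents of `data_splitting.py`; the data is synthetic:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 80 negatives, 20 positives
df = pd.DataFrame({"feature": range(100), "target": [0] * 80 + [1] * 20})

train_ratio, val_ratio = 0.7, 0.15
n_train = round(train_ratio * len(df))
n_val = round(val_ratio * len(df))

# First cut: training set vs. everything else, keeping class proportions
train, rest = train_test_split(
    df, train_size=n_train, stratify=df["target"], random_state=42
)
# Second cut: split the remainder into validation and test
val, test = train_test_split(
    rest, train_size=n_val, stratify=rest["target"], random_state=42
)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(part["target"].mean(), 3))
```

Each split should show roughly the same positive rate as the full dataset, which is the point of stratifying.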
Time Series Split (temporal data):
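For temporal data the key constraint is that training rows always precede test rows. One way to sketch this is scikit-learn's `TimeSeriesSplit`, which produces expanding training windows; the ten observations below are placeholders assumed to be sorted by time:

```python
from sklearn.model_selection import TimeSeriesSplit

observations = list(range(10))  # placeholder data, already in time order
splitter = TimeSeriesSplit(n_splits=3)

folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in splitter.split(observations)
]
for train_idx, test_idx in folds:
    # Every training index is earlier than every test index
    print("train:", train_idx, "test:", test_idx)
```

Shuffling before splitting would defeat the purpose here, since future observations would leak into training.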
K-Fold Cross-Validation (small datasets):
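With small datasets, K-fold lets every row serve as test data exactly once. A minimal sketch using scikit-learn's `KFold` on a made-up array (the array shape and seed are arbitrary choices for the example):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    # Each fold holds out a different 20% of the rows
    test_indices.extend(test_idx.tolist())
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

In practice a model would be fitted on `X[train_idx]` and scored on `X[test_idx]` inside the loop, and the per-fold scores averaged.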
| Metric | Formula | Best for |
|--------|---------|----------|
| Accuracy | (TP+TN) / Total | Balanced classes |
| Precision | TP / (TP+FP) | Costly false positives |
| Recall | TP / (TP+FN) | Costly false negatives |
| F1 Score | 2×(P×R)/(P+R) | Imbalanced classes |
| ROC-AUC | Area under ROC curve | Threshold-independent comparison |
| MCC | Correlation coefficient | Imbalanced, all confusion matrix cells |
| Specificity | TN / (TN+FP) | Costly false positives |
Script usage:
python scripts/model_evaluation.py \
--actual actual_labels.csv \
--predicted predictions.csv \
--metrics "accuracy,precision,recall,f1,roc_auc,mcc"
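To make the formulas in the table concrete, here is a small pure-Python sketch that computes them from the four confusion-matrix counts. The function name and the counts are hypothetical, and the MCC expression is the standard formula rather than something stated in this document:

```python
from math import sqrt


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common metrics from confusion-matrix counts.

    Illustrative only: no guard against zero denominators.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Made-up counts
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

ROC-AUC is left out because it needs predicted scores or probabilities, not just hard labels.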
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| MAE | mean(\|y - ŷ\|) | Average error in original units |
| MSE | mean((y - ŷ)²) | Penalizes large errors |
| RMSE | sqrt(MSE) | Error in original units |
| R² | 1 - SS_res/SS_tot | Proportion of variance explained |
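The regression formulas translate directly into a few lines of Python. The helper name and the sample values are assumptions made for this sketch:

```python
from math import sqrt


def regression_metrics(y_true: list, y_pred: list) -> dict:
    """MAE, MSE, RMSE and R² computed from paired actual/predicted values."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return {"mae": mae, "mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Made-up values
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
scores = regression_metrics(actual, predicted)
print(scores)
```

R² is undefined when all actual values are identical (`ss_tot` is zero), which this sketch does not handle.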
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
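The four cells can be counted directly from paired labels. A minimal sketch for binary labels, assuming 1 marks the positive class; the function name and label lists are made up:

```python
def confusion_counts(actual: list, predicted: list, positive=1) -> dict:
    """Count TP, FN, FP, TN for binary labels."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts


# Made-up labels
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
cm = confusion_counts(actual, predicted)

print(f"            Pred +  Pred -")
print(f"Actual +    {cm['TP']:>6}  {cm['FN']:>6}")
print(f"Actual -    {cm['FP']:>6}  {cm['TN']:>6}")
```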
Key insights:
| Scenario | Primary Metric | Secondary Metric |
|----------|---------------|------------------|
| Balanced classification | Accuracy | F1 Score |
| Imbalanced classification | F1 Score | ROC-AUC |
| Medical diagnosis | Recall | Precision |
| Fraud detection | Precision | Recall |
| Spam filtering | Recall | Specificity |
| Regression | MAE or RMSE | R² |
| Small dataset | MCC | F1 Score |
# Full pipeline
python scripts/data_cleaning.py -i raw.csv -o clean.csv --remove-duplicates --filter-outliers zscore:3
python scripts/data_transformation.py -i clean.csv -o prep.csv --normalize zscore --encode onehot
python scripts/data_splitting.py -i prep.csv --stratify target --train-ratio 0.8
python scripts/model_evaluation.py -a actual.csv -p pred.csv --metrics all
After data preparation:
For model training and deployment, consider using specialized ML frameworks (scikit-learn, TensorFlow, PyTorch).