skills/AI/AI-Unsupervised-Learning-Algorithms/SKILL.md
Apply unsupervised machine learning algorithms to security data for anomaly detection, clustering, and dimensionality reduction. Use this skill whenever the user needs to analyze unlabeled security data, detect unknown threats, cluster network events, reduce feature dimensions, or identify outliers in logs, traffic, or behavioral data. Trigger for tasks involving K-Means, DBSCAN, HDBSCAN, Isolation Forest, GMM, PCA, t-SNE, or any unsupervised pattern discovery in cybersecurity contexts.
Install: `npx skillsauth add abelrguezr/hacktricks-skills unsupervised-learning-security`
Installs this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill helps you apply unsupervised machine learning algorithms to security data. Unlike supervised learning, these methods work with unlabeled data to discover hidden patterns, detect anomalies, and cluster similar events.
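Use this skill when you need to analyze unlabeled security data, detect unknown threats, cluster network events, reduce feature dimensions, or identify outliers in logs, traffic, or behavioral data.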
Choose the right algorithm based on your use case:
| Use Case | Recommended Algorithm | Why |
|----------|----------------------|-----|
| Anomaly detection | Isolation Forest, DBSCAN, HDBSCAN | Designed to flag outliers without labels |
| Clustering with known K | K-Means, GMM | Fast, works well when you know the cluster count |
| Clustering with unknown K | DBSCAN, HDBSCAN, Hierarchical | No need to fix the cluster count up front |
| Varying density clusters | HDBSCAN | Handles clusters of different densities (DBSCAN's single `eps` struggles here) |
| Soft clustering | GMM | Provides probability of cluster membership |
| Dimensionality reduction | PCA (linear), t-SNE (nonlinear) | Compress features for visualization |
| Large datasets | Isolation Forest, K-Means | Roughly O(n log n) (Isolation Forest) or O(n) per iteration (K-Means) |
| Visual exploration | t-SNE, PCA | Create 2D/3D plots for human analysis |
K-Means. Best for: Quick clustering when you know the number of groups and the clusters are roughly spherical and of similar size.
How it works:
Security use cases:
Key parameters:
- `n_clusters`: Number of clusters (use the Elbow Method or Silhouette Score to determine)
- `random_state`: For reproducibility
- `n_init`: Number of initializations (higher = more stable)

Limitations: Assumes spherical, equally-sized clusters. Sensitive to initialization. Requires feature normalization.
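As a rough illustration of choosing `n_clusters` for K-Means with the Silhouette Score, the sketch below fits one model per candidate value and keeps the best-scoring one. The synthetic blobs, the helper name `pick_k_by_silhouette`, and the candidate range 2 to 8 are assumptions made for the example, not part of the skill.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for scaled security features: four well-separated groups.
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=42,
)
X_scaled = StandardScaler().fit_transform(X)


def pick_k_by_silhouette(X, k_values, random_state=42):
    """Fit K-Means for each k and return (best_k, {k: mean silhouette})."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = pick_k_by_silhouette(X_scaled, range(2, 9))
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Selected k: {best_k}")
```

On real telemetry the silhouette curve is usually flatter than on this toy data, so it is worth reading the whole curve rather than trusting the maximum alone.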
DBSCAN. Best for: Finding arbitrarily shaped clusters, detecting outliers as noise, and cases where the cluster count is unknown.
How it works:
- Core points are points with at least `min_samples` neighbors within `eps` distance

Security use cases:
Key parameters:
- `eps`: Maximum distance between points in the same cluster
- `min_samples`: Minimum points to form a dense region

Limitations: Struggles with varying densities. A single `eps` may not work for all clusters. Can be slow on large datasets.
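One common way to pick a starting `eps` for DBSCAN is to look at each point's distance to its `min_samples`-th nearest neighbor (the k-distance curve). The sketch below is a minimal version of that idea on synthetic data. Taking the 90th percentile of the curve is an assumed shortcut for the example; in practice the "elbow" of the sorted curve is usually inspected by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two dense groups of events plus scattered outliers.
rng = np.random.default_rng(0)
blobs, _ = make_blobs(n_samples=400, centers=[[0, 0], [8, 8]], cluster_std=0.5, random_state=0)
outliers = rng.uniform(low=-6, high=14, size=(20, 2))
X_scaled = StandardScaler().fit_transform(np.vstack([blobs, outliers]))

min_samples = 5

# Distance from every point to its min_samples-th nearest neighbor.
# Querying the training data returns the point itself first, which matches
# DBSCAN's convention that min_samples counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

# Assumed heuristic: use a high percentile of the k-distance curve as eps.
eps = float(np.percentile(k_distances, 90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"eps={eps:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

Points labeled `-1` are the noise points, which are the natural candidates for analyst review.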
HDBSCAN. Best for: Clusters of varying density, modern threat-hunting pipelines, and outlier scoring.
How it works:
Security use cases:
Key parameters:
- `min_cluster_size`: Minimum points per cluster (sensible defaults available)
- `prediction_data=True`: Cache the extra data needed to assign or score new points after fitting

Advantages over DBSCAN: Handles varying densities, has only one main hyperparameter, and provides outlier probabilities.
Isolation Forest. Best for: Fast anomaly detection on large, high-dimensional datasets.
How it works:
Security use cases:
Key parameters:
- `n_estimators`: Number of trees (100-200 typical)
- `contamination`: Expected fraction of anomalies (e.g., 0.05 for 5%)

Advantages: O(n log n) complexity, works on high-dimensional data, no distribution assumptions.
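The effect of `contamination` in scikit-learn's `IsolationForest` can be seen by comparing the raw scores with the thresholded output. The sketch below uses invented data, and the 950/50 split is an assumption of the example. It shows that `decision_function` is `score_samples` shifted by the fitted `offset_`, and that roughly the `contamination` fraction of the training data ends up flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 950 ordinary events and 50 unusual ones, 4 features each.
rng = np.random.default_rng(7)
normal = rng.normal(0, 1, size=(950, 4))
odd = rng.uniform(6, 10, size=(50, 4))
X = np.vstack([normal, odd])

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42).fit(X)

raw = iso.score_samples(X)          # lower = more anomalous
shifted = iso.decision_function(X)  # raw score minus offset_; negative = anomaly
labels = iso.predict(X)             # -1 = anomaly, 1 = normal

flagged_fraction = float(np.mean(labels == -1))
print(f"offset_={iso.offset_:.3f}, flagged fraction={flagged_fraction:.3f}")
```

Because `contamination` only moves the threshold, the ranking given by `score_samples` stays the same whatever value is chosen; analysts who prefer to triage a ranked list can ignore the labels entirely.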
Gaussian Mixture Models (GMM). Best for: Soft clustering, probabilistic anomaly detection, and ellipsoidal clusters.
How it works:
Security use cases:
Key parameters:
- `n_components`: Number of Gaussian components
- `covariance_type`: One of 'full', 'tied', 'diag', 'spherical'

Advantages: Soft assignments, handles different cluster shapes, provides likelihood scores.
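The soft assignments and likelihood scores of a GMM map onto `predict_proba` and `score_samples` in scikit-learn's `GaussianMixture`. The following is a minimal sketch on synthetic data. The two ellipsoidal groups, the 1st-percentile cutoff, and the two test points are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: two ellipsoidal groups with different orientations.
rng = np.random.default_rng(0)
group_a = rng.normal([0, 0], [1.0, 0.3], size=(300, 2))
group_b = rng.normal([5, 5], [0.3, 1.0], size=(300, 2))
X = np.vstack([group_a, group_b])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=42).fit(X)

# Soft clustering: each row holds the membership probability per component.
probs = gmm.predict_proba(X)

# Likelihood-based anomaly flag: per-sample log-likelihood under the mixture.
log_lik = gmm.score_samples(X)
threshold = np.percentile(log_lik, 1)  # assumed cutoff: lowest 1% of training data

new_points = np.array([[0.2, 0.1], [10.0, -5.0]])  # one typical, one far from both groups
is_anomaly = gmm.score_samples(new_points) < threshold
print("membership of first training point:", np.round(probs[0], 3))
print("anomaly flags for new points:", is_anomaly.tolist())
```

Choosing `covariance_type="full"` lets each component take its own orientation, at the cost of more parameters than 'diag' or 'spherical'.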
PCA. Best for: Linear dimensionality reduction, noise reduction, and feature decorrelation.
How it works:
Security use cases:
Key parameters:
- `n_components`: Number of components to keep
- `whiten=True`: Scale components to unit variance

Limitations: Linear only, components may be hard to interpret, prioritizes variance rather than "interestingness".
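When it is unclear how many PCA components to keep, scikit-learn accepts a fraction for `n_components` and keeps just enough components to explain that share of the variance. The sketch below assumes a 95% target and uses synthetic features built from three hidden factors. Both choices are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 correlated features driven by 3 hidden factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep enough components to explain that
# fraction of the variance (requires the full SVD solver).
pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
X_reduced = pca.transform(X_scaled)

print(f"components kept: {pca.n_components_}")
print("cumulative explained variance:", np.round(cumulative, 3))
print("reduced shape:", X_reduced.shape)
```

The reduced matrix can then be passed to a clustering or anomaly detection step in place of the original features.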
t-SNE. Best for: Nonlinear visualization of high-dimensional data.
How it works:
Security use cases:
Key parameters:
- `perplexity`: Effective number of neighbors (5-50 typical)
- `learning_rate`: Step size for optimization
- `n_iter`: Number of iterations (1000+ for convergence); renamed `max_iter` in scikit-learn 1.5+

Limitations: Computationally heavy (O(n²) for the exact method), distances are not globally meaningful, and new points cannot be projected easily.
Unsupervised learners are not immune to active attackers:
score_samples support

```python
# Anomaly detection with Isolation Forest.
# Assumes `df` is a pandas DataFrame containing the columns used below.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# 1. Prepare features
X = df[['duration', 'bytes', 'packets', 'errors']].values
X_scaled = StandardScaler().fit_transform(X)

# 2. Train detector
iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_scaled)

# 3. Score and flag
scores = iso.decision_function(X_scaled)  # lower = more anomalous
labels = iso.predict(X_scaled)  # -1 = anomaly, 1 = normal
df['score'] = scores  # store scores so the rows can be sorted by them

# 4. Extract anomalies, most anomalous first
anomalies = df[labels == -1].sort_values('score')
```
```python
# Compare clustering algorithms on the same scaled features (X_scaled).
from sklearn.cluster import KMeans, DBSCAN
from hdbscan import HDBSCAN

# Try multiple algorithms
kmeans = KMeans(n_clusters=4, random_state=42).fit(X_scaled)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
hdb = HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X_scaled)

# Compare results (label -1 marks noise for DBSCAN and HDBSCAN)
print(f"K-Means: {len(set(kmeans.labels_))} clusters")
print(f"DBSCAN: {len(set(dbscan.labels_) - {-1})} clusters, {sum(dbscan.labels_ == -1)} noise")
print(f"HDBSCAN: {len(set(hdb.labels_) - {-1})} clusters, {sum(hdb.labels_ == -1)} noise")
```
```python
# Dimensionality reduction for visualization (X_scaled as prepared earlier).
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA for quick linear reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA explained variance: {pca.explained_variance_ratio_}")

# t-SNE for nonlinear visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
```
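```python
import numpy as np
from pyod.models.ecod import ECOD
from pyod.models.iforest import IForest

# Train multiple detectors on the same training data
models = [ECOD(), IForest(contamination=0.05)]
for m in models:
    m.fit(X_train)

# Ensemble: average the standardized scores (PyOD scores are higher for outliers)
raw = np.column_stack([m.decision_function(X_test) for m in models])
scores = ((raw - raw.mean(axis=0)) / raw.std(axis=0)).mean(axis=1)
threshold = np.quantile(scores, 0.95)  # assumed cutoff: top 5% of ensemble scores
anomalies = X_test[scores > threshold]
```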
After running an algorithm: