qa-aman/feature-engineering

Name: feature-engineering
Author: qa-aman

skills/by-role/data-scientist/feature-engineering/SKILL.md

npx skillsauth add qa-aman/claude-skills feature-engineering

Overview

Based on "Feature Engineering for Machine Learning" by Alice Zheng and Amanda Casari. The core principle: features are the interface between raw data and the model. In many real-world settings, the right transformation of an existing variable improves performance more than adding data or tuning hyperparameters. Feature engineering is domain knowledge encoded into math, and it is often the highest-leverage step in the ML pipeline.

Workflow

Step 1: Audit raw columns for engineering opportunity

Before transforming anything, catalog what you have and what problems each column has.

For each column, note:

Type: numeric continuous, numeric discrete, ordinal categorical, nominal categorical, datetime, text, ID
Problem: skewed, high cardinality, missing, mixed type, free text, leaky (contains target signal from the future)
Engineering opportunity: log transform, binning, encoding, extraction, embedding

Create a feature engineering plan as a table before writing any code:

| Column      | Type       | Problem          | Planned Transform         |
|-------------|-----------|------------------|---------------------------|
| revenue     | numeric    | Right-skewed     | log1p transform           |
| country     | nominal    | 150 unique vals  | Frequency encoding        |
| signup_date | datetime   | Raw timestamp    | Extract day-of-week, hour |
| description | text       | Unstructured     | TF-IDF, 500 features      |
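The audit in this step can be partly automated. The following is a minimal sketch, assuming pandas is available; the helper name `audit_columns` and the toy `demo` frame are hypothetical and only illustrate the kind of summary that feeds the plan table.

```python
import pandas as pd

def audit_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, missing rate, cardinality, and skew."""
    rows = []
    for col in df.columns:
        s = df[col]
        is_numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "missing_rate": s.isna().mean(),
            "n_unique": s.nunique(dropna=True),
            # Skew is only meaningful for numeric columns
            "skew": s.skew() if is_numeric else None,
        })
    return pd.DataFrame(rows).set_index("column")

# Hypothetical toy data for illustration only
demo = pd.DataFrame({
    "revenue": [0.0, 1.0, 2.0, 3.0, 500.0],
    "country": ["US", "US", "IN", None, "DE"],
})
audit = audit_columns(demo)
print(audit)
```

A large positive skew suggests a log transform, a high `n_unique` on a categorical column suggests frequency or target encoding, and a high `missing_rate` flags a column for imputation or removal.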

Step 2: Apply numeric transformations

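Skewed distributions: Apply a log transform to right-skewed variables. Use np.log1p so that zeros are handled; it is undefined for values at or below -1, so check for negative values first.

import numpy as np
df["log_revenue"] = np.log1p(df["revenue"])  # log(1 + x), so 0 maps to 0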
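To see what the transform does, here is a small sketch on made-up values (the `revenue` series is hypothetical). It compares skew before and after and shows that `np.expm1` inverts `np.log1p`, which is useful if predictions need to be mapped back to the original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, including a zero
revenue = pd.Series([0.0, 10.0, 100.0, 1000.0, 100000.0])

log_revenue = np.log1p(revenue)    # log(1 + x); 0 stays 0 instead of becoming -inf
recovered = np.expm1(log_revenue)  # inverse transform: exp(x) - 1

print("skew before:", revenue.skew())
print("skew after: ", log_revenue.skew())
```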

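Scaling: Standardize for distance-based and gradient-based models (KNN, SVM, neural nets). Tree-based models (XGBoost, Random Forest) do not require scaling.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# In practice, fit on the training split only and call scaler.transform on the test split
# fit_transform returns an (n, 1) array; ravel() flattens it into a single column
df["revenue_scaled"] = scaler.fit_transform(df[["revenue"]]).ravel()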
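The scaler is itself a fitted object, so it is subject to the same train-only rule as the encoders discussed later. A minimal sketch, assuming a simple random split (the toy `df`, `train_df`, and `test_df` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: revenue values 1..100
df = pd.DataFrame({"revenue": np.arange(1.0, 101.0)})
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train_df["revenue_scaled"] = scaler.fit_transform(train_df[["revenue"]]).ravel()
# Reuse the training statistics on the test rows; do not refit
test_df["revenue_scaled"] = scaler.transform(test_df[["revenue"]]).ravel()
```

The test column will generally not have mean 0 and variance 1, which is expected: it is expressed in units of the training distribution.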

Binning: Convert continuous to ordinal when the relationship to the target is non-linear.

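# right=False makes bins left-inclusive: [0, 18), [18, 35), [35, 55), [55, inf)
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 35, 55, float("inf")], right=False, labels=["<18", "18-34", "35-54", "55+"])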

Capping outliers: Cap at a percentile rather than removing rows.

p99 = df["session_time"].quantile(0.99)
df["session_time_capped"] = df["session_time"].clip(upper=p99)
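When the data is split into train and test sets, the cap is a learned parameter like any other. The sketch below, on hypothetical `train` and `test` frames, derives the 99th percentile from the training rows and reuses it on the test rows.

```python
import pandas as pd

# Hypothetical data: training values 1..100, test set with one extreme value
train = pd.DataFrame({"session_time": [float(v) for v in range(1, 101)]})
test = pd.DataFrame({"session_time": [5.0, 50.0, 5000.0]})

# Learn the cap from the training split only, then reuse it everywhere
cap = train["session_time"].quantile(0.99)
train["session_time_capped"] = train["session_time"].clip(upper=cap)
test["session_time_capped"] = test["session_time"].clip(upper=cap)
```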

Step 3: Encode categorical variables

Match the encoding method to the cardinality and the model type.

| Cardinality    | Model Type | Method                       |
|----------------|------------|------------------------------|
| Low (< 10)     | Any        | One-hot encoding             |
| Medium (10-50) | Tree-based | Ordinal / Label encoding     |
| High (> 50)    | Any        | Frequency or target encoding |
| High (> 50)    | Neural net | Embedding layer              |

# One-hot
df = pd.get_dummies(df, columns=["color"], drop_first=True)

# Frequency encoding
freq = df["country"].value_counts() / len(df)
df["country_freq"] = df["country"].map(freq)

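# Target encoding (fit inside cross-validation folds, or on training data only, to avoid leakage)
from category_encoders import TargetEncoder
encoder = TargetEncoder(cols=["city"])
# fit_transform returns a DataFrame, so select the encoded column explicitly
df["city_encoded"] = encoder.fit_transform(df[["city"]], df["target"])["city"]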
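The comment about cross-validation can be made concrete with an out-of-fold scheme, in which each row is encoded using target means computed on the other folds only. This is a minimal sketch using pandas and scikit-learn's `KFold` rather than `category_encoders`; the helper `oof_target_encode` and the toy data are hypothetical, and no smoothing is applied.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Encode `col` with target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        fold_means = fit_part.groupby(col)[target].mean()
        values = df.iloc[enc_idx][col].map(fold_means)
        # Categories unseen in the fitting folds fall back to the fitting-fold mean
        encoded.iloc[enc_idx] = values.fillna(fit_part[target].mean()).to_numpy()
    return encoded

# Hypothetical toy data for illustration only
demo = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "target": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})
demo["city_encoded"] = oof_target_encode(demo, "city", "target", n_splits=5)
print(demo)
```

Because a row's own label never contributes to its encoded value, the feature cannot simply memorize the target for rare categories.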

Step 4: Extract features from dates and text

Datetime columns: Models rarely learn much from a raw timestamp. Extract structured components.

df["signup_hour"] = df["signup_ts"].dt.hour
df["signup_dow"] = df["signup_ts"].dt.dayofweek    # 0=Mon, 6=Sun
df["signup_is_weekend"] = df["signup_dow"].isin([5, 6]).astype(int)
# Timestamp.now() changes on every run; pin a snapshot date if features must be reproducible
df["days_since_signup"] = (pd.Timestamp.now() - df["signup_ts"]).dt.days
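One way to keep the recency feature reproducible is to pass the reference date in explicitly. The sketch below wraps the same extractions in a function; `add_signup_features`, `reference_date`, and the sample timestamps are hypothetical names and values chosen for illustration.

```python
import pandas as pd

def add_signup_features(df: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Return a copy of df with hour, day-of-week, weekend, and recency features."""
    out = df.copy()
    ts = pd.to_datetime(out["signup_ts"])
    out["signup_hour"] = ts.dt.hour
    out["signup_dow"] = ts.dt.dayofweek  # 0=Mon, 6=Sun
    out["signup_is_weekend"] = (out["signup_dow"] >= 5).astype(int)
    # Whole days between signup and the fixed reference date
    out["days_since_signup"] = (reference_date - ts).dt.days
    return out

# Hypothetical sample: a Saturday morning and a Monday evening
demo = pd.DataFrame({"signup_ts": ["2024-01-06 09:30:00", "2024-01-08 18:00:00"]})
features = add_signup_features(demo, reference_date=pd.Timestamp("2024-01-31"))
print(features)
```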

Text columns: Start with TF-IDF for structured text. Move to embeddings when semantic meaning matters.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=200, ngram_range=(1, 2), stop_words="english")
text_features = tfidf.fit_transform(df["description"])
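The vectorizer learns its vocabulary and IDF weights from whatever it is fitted on, so in a train/test setting it should see only the training text. A minimal sketch with made-up documents (`train_text` and `test_text` are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents for illustration only
train_text = [
    "fast shipping great price",
    "great quality fast delivery",
    "poor quality slow shipping",
]
test_text = ["slow delivery unknownword"]

tfidf = TfidfVectorizer(max_features=200, ngram_range=(1, 2), stop_words="english")
X_train_text = tfidf.fit_transform(train_text)  # learns vocabulary and IDF weights
X_test_text = tfidf.transform(test_text)        # reuses them; unseen terms are ignored

print(X_train_text.shape, X_test_text.shape)
```

Both matrices share the same columns, so a model trained on `X_train_text` can score `X_test_text` directly.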

Step 5: Create interaction and aggregate features

Zheng and Casari's insight: the most predictive features are often combinations of existing ones, not raw columns.

Interaction features: Multiply or divide two numeric columns when their ratio or product has domain meaning.

df["revenue_per_session"] = df["revenue"] / (df["session_count"] + 1)
df["cost_per_click"] = df["spend"] / (df["clicks"] + 1)  # +1 avoids division by zero
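Adding 1 to the denominator avoids the division error but also changes every ratio (10 spend over 4 clicks becomes 2.0 instead of 2.5). An alternative sketch, assuming the downstream model or an imputation step can handle missing values, leaves the ratio undefined where the denominator is zero and records that fact in a separate flag. The `demo` frame and the `has_clicks` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one normal row and two rows with zero clicks
demo = pd.DataFrame({"spend": [10.0, 5.0, 0.0], "clicks": [4, 0, 0]})

# Replace zero denominators with NaN so the ratio is missing rather than distorted
demo["cost_per_click"] = demo["spend"] / demo["clicks"].replace(0, np.nan)
# Keep the "no clicks" information as its own binary feature
demo["has_clicks"] = (demo["clicks"] > 0).astype(int)
print(demo)
```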

Aggregate features (group-level statistics): Capture behavior relative to a group.

group_stats = df.groupby("user_id")["purchase_amount"].agg(["mean", "std", "count"]).reset_index()
group_stats.columns = ["user_id", "user_avg_purchase", "user_std_purchase", "user_purchase_count"]
df = df.merge(group_stats, on="user_id", how="left")

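Lag features for time series (row-based shifts; these equal 7-day and 30-day windows only when each user has exactly one row per day with no gaps):

df = df.sort_values(["user_id", "date"])
df["revenue_lag_7d"] = df.groupby("user_id")["revenue"].shift(7)  # value from 7 rows earlier
# rolling(30) includes the current row and yields NaN until 30 rows are available
df["revenue_rolling_30d"] = df.groupby("user_id")["revenue"].transform(lambda x: x.rolling(30).mean())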
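The lag code above counts rows, not days. When users have missing days, a calendar-aware variant may be closer to what the column names promise. The sketch below is one possible approach on hypothetical data: the lag is built by joining each row to the same user's row exactly 7 days earlier, and the rolling mean uses a time-based window that excludes the current row.

```python
import pandas as pd

# Hypothetical data: one user with a gap between Jan 2 and Jan 8
demo = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-08", "2024-01-09"]),
    "revenue": [10.0, 20.0, 30.0, 40.0],
}).sort_values(["user_id", "date"]).reset_index(drop=True)

# Lag by calendar time: shift the dates forward 7 days and join back on (user, date)
lagged = demo[["user_id", "date", "revenue"]].copy()
lagged["date"] = lagged["date"] + pd.Timedelta(days=7)
lagged = lagged.rename(columns={"revenue": "revenue_lag_7d"})
demo = demo.merge(lagged, on=["user_id", "date"], how="left")

# Mean revenue over the previous 30 days; closed="left" excludes the current row
rolled = (
    demo.set_index("date")
        .groupby("user_id")["revenue"]
        .rolling("30D", closed="left")
        .mean()
)
# demo is sorted by user_id and date, so positions line up with the grouped result
demo["revenue_rolling_30d"] = rolled.to_numpy()
print(demo)
```

Excluding the current row matters when the rolled column is related to the target, since including it would let the feature see the value being predicted.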

Step 6: Validate features and check for leakage

Before training, run these checks:

Leakage check: Can any feature be computed from the target or from future data? Flag and remove.
Correlation check: For any pair of features with absolute correlation > 0.95, drop one of the two.
Importance check: Train a simple Random Forest and check feature importances. Drop zero-importance features.
Missing rate check: Any feature > 30% missing should be dropped or imputed before inclusion.

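import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X_train (a DataFrame) and y_train come from the training split
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Rank features by impurity-based importance, highest first
importance = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importance.head(20))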
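The correlation and missing-rate checks from the list above can be sketched in a few lines as well. The helper names `high_correlation_pairs` and `high_missing_columns` and the toy frame are hypothetical; the thresholds default to the 0.95 and 30% values stated in the checks.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(X: pd.DataFrame, threshold: float = 0.95):
    """Return (feature_a, feature_b, |corr|) for numeric pairs above the threshold."""
    corr = X.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, float(v)) for (a, b), v in upper.stack().items() if v > threshold]

def high_missing_columns(X: pd.DataFrame, threshold: float = 0.30):
    """Return the columns whose missing rate exceeds the threshold."""
    rates = X.isna().mean()
    return rates[rates > threshold].index.tolist()

# Hypothetical toy features for illustration only
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.1],       # nearly a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "d": [np.nan, np.nan, 2.0, 2.0, 5.0],  # 40% missing
})
pairs = high_correlation_pairs(demo)
too_missing = high_missing_columns(demo)
print(pairs, too_missing)
```

Which member of a correlated pair to drop is a judgment call; keeping the one that is easier to explain or cheaper to compute is a reasonable default.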

Document the final feature set with a name, source column(s), transform applied, and rationale.

Anti-Patterns

1. One-hot encoding high-cardinality categoricals
Bad: One-hot encoding a "city" column with 500 unique values, which creates 500 sparse binary columns.
Good: Use frequency encoding or target encoding. Reserve one-hot for columns with fewer than 10 unique values.

2. Using the full dataset to fit encoders
Bad: Fitting a TF-IDF vectorizer or TargetEncoder on the entire dataset before splitting into train/test.
Good: Fit only on training data, then apply (transform) to test data. This keeps test-set information, including the test labels a target encoder would otherwise see, out of the fitted encoder.

3. Adding features without checking importance
Bad: Engineering 50 new features and feeding all of them to the model.
Good: Run a quick feature importance check. Remove zero-importance features before final training.

4. Ignoring domain knowledge
Bad: Letting the model figure out that revenue/sessions is meaningful by including both raw columns.
Good: Create the ratio feature explicitly. Domain-derived features often outperform raw features in tabular settings.

Quality Checklist

[ ] Feature engineering plan documented as a table before any code is written
[ ] Skewed numeric columns log-transformed or capped
[ ] Categorical encoding method matched to cardinality and model type
[ ] Datetime columns decomposed into structured components (hour, day-of-week, recency)
[ ] Interaction and aggregate features created for high-signal relationships
[ ] All encoders fitted on training data only - no test data leakage
[ ] Feature importance check run and zero-importance features removed
[ ] Final feature set documented with source, transform, and rationale

Design and implement features from raw data for machine learning models. Use when the user says "engineer features", "feature engineering for X", "create features from this dataset", "transform these variables", "encode categoricals", "handle skew", "create interaction features", "lag features", "extract features from text or dates", "improve model performance with better features", or wants to move from raw columns to model-ready inputs.

13 stars

development

Updated Apr 23, 2026

$ install --global

skillsauth

npx skillsauth add qa-aman/claude-skills feature-engineering

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner         | Description                                       | Status       |
|-----------------|---------------------------------------------------|--------------|
| Trivy           | Container and dependency vulnerability scanner    | Clean, 95%   |
| Semgrep         | Static code analysis for vulnerabilities          | Clean, 95%   |
| mcp-scan (Snyk) | Model Context Protocol security validation        | Clean, 95%   |
| Snyk (dep)      | Open source security scanning                     | Skipped, 50% |
| Socket.dev      | Supply chain security analysis                    | Skipped, 50% |
| VirusTotal      | Multi-engine malware detection                    | Skipped, 50% |
| CrowdStrike     | Advanced threat intelligence                      | Skipped, 50% |
| OSV-Scanner     | Open Source Vulnerability database check          | Skipped, 50% |
| OWASP Dep-Check |                                                   | Skipped, 50% |

Last scanned: Apr 23, 2026, 1:56 PM (168.9s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/qa-aman/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/skills/by-role/data-scientist/feature-engineering ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

qa-aman/claude-skills

13 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT