
brycewang-stanford/data-cog-guide

Name: data-cog-guide
Author: brycewang-stanford

skills/43-wentorai-research-plugins/skills/analysis/wrangling/data-cog-guide/SKILL.md

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research data-cog-guide


Data Cog Guide

Data Cog is a data analysis assistant that accepts messy, poorly documented CSV files, infers their structure, cleans anomalies, and produces analytical reports with minimal user prompting. It is designed for researchers who need quick insights from unfamiliar or inherited datasets without spending hours on manual data preparation.

Overview

Researchers frequently receive datasets from collaborators, public repositories, or legacy systems that lack documentation, use inconsistent formatting, and vary in data quality. Traditional analysis requires significant upfront effort to understand and prepare such data. Data Cog automates this preparation by applying heuristic inference, pattern recognition, and iterative cleaning, producing analysis-ready data together with a profile report.

The skill follows a "zero-configuration" approach: given a CSV file path and an optional research question, it handles encoding detection, delimiter inference, type casting, missingness assessment, and initial exploratory statistics automatically.

Automated Ingestion Pipeline

Smart Loading

import csv

import chardet
import pandas as pd

COMMENT_PREFIXES = ('#', '//')

def smart_load_csv(filepath: str) -> tuple[pd.DataFrame, dict]:
    """
    Load a CSV file, auto-detecting encoding, delimiter,
    and any leading comment or blank lines before the header row.
    """
    # Step 1: Detect encoding from the first 100 kB
    with open(filepath, 'rb') as f:
        raw = f.read(100000)
    # chardet reports None when it cannot decide; fall back to UTF-8
    encoding = chardet.detect(raw)['encoding'] or 'utf-8'

    # Step 2: Count leading comment and blank lines, then sample the data
    # that follows them so the sniffer never sees commentary
    skip_rows = 0
    with open(filepath, 'r', encoding=encoding, errors='replace') as f:
        line = f.readline()
        while line and (line.startswith(COMMENT_PREFIXES) or line.strip() == ''):
            skip_rows += 1
            line = f.readline()
        sample = line + f.read(8192)

    # Step 3: Detect delimiter, defaulting to a comma if sniffing fails
    try:
        delimiter = csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        delimiter = ','

    # Step 4: Load with inferred parameters
    df = pd.read_csv(
        filepath, encoding=encoding, delimiter=delimiter,
        skiprows=skip_rows, low_memory=False
    )

    metadata = {
        'encoding': encoding,
        'delimiter': repr(delimiter),
        'skipped_rows': skip_rows,
        'shape': df.shape
    }
    return df, metadata
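
As a dependency-free illustration of the delimiter step, the sketch below drops leading comment and blank lines before sniffing, so the sample handed to csv.Sniffer contains data rather than commentary. The helper name sniff_delimiter, the candidate set ",;\t|", and the toy text are assumptions made for this example, not part of the skill.

```python
import csv

COMMENT_PREFIXES = ("#", "//")

def sniff_delimiter(text: str, candidates: str = ",;\t|", default: str = ",") -> str:
    """Guess the delimiter of CSV text, ignoring leading comment and blank lines."""
    lines = text.splitlines(keepends=True)
    start = 0
    # Advance past commentary and blank lines that precede the header row
    while start < len(lines) and (
        lines[start].startswith(COMMENT_PREFIXES) or not lines[start].strip()
    ):
        start += 1
    sample = "".join(lines[start:])[:8192]
    try:
        # Restricting the candidates keeps letters or digits from being chosen
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return default

sample_text = "# exported file\n\nid;name;score\n1;alice;3.5\n2;bob;4.0\n"
detected = sniff_delimiter(sample_text)
print(repr(detected))
```

Passing an explicit candidate set is a design choice: it trades generality for fewer surprising guesses on short samples.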

Automatic Type Inference

def auto_cast_columns(df: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """
    Return a copy of df with columns cast to their most appropriate types.
    Handles dates, numerics stored as strings, booleans, and categories.
    """
    bool_map = {'true': True, 'false': False, 'yes': True, 'no': False,
                '1': True, '0': False, 'y': True, 'n': False}
    df = df.copy()  # leave the caller's frame untouched
    for col in df.columns:
        # Only text-like columns need inference; typed columns are kept as-is
        if not (pd.api.types.is_object_dtype(df[col])
                or isinstance(df[col].dtype, pd.StringDtype)):
            continue
        non_null = df[col].dropna()
        if non_null.empty:
            continue  # nothing to infer from an all-missing column

        # Try numeric conversion; the success rate ignores missing values
        if pd.to_numeric(non_null, errors='coerce').notna().mean() > threshold:
            df[col] = pd.to_numeric(df[col], errors='coerce')
            continue

        # Try datetime conversion (infer_datetime_format is deprecated in pandas 2.x)
        if pd.to_datetime(non_null, errors='coerce').notna().mean() > threshold:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            continue

        # Try boolean detection
        unique_lower = non_null.astype(str).str.lower().unique()
        if set(unique_lower).issubset(bool_map):
            df[col] = df[col].astype(str).str.lower().map(bool_map)
            continue

        # Convert low-cardinality strings to category
        n_unique = non_null.nunique()
        if n_unique / len(df) < 0.05 and n_unique < 50:
            df[col] = df[col].astype('category')

    return df
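
The 0.85 cutoff used above is a share of values that survive coercion. A minimal sketch of that measurement follows, computed here over non-missing values only so that sparse columns are not penalised; the helper name parse_share and the toy data are illustrative assumptions.

```python
import pandas as pd

def parse_share(series: pd.Series) -> float:
    """Share of non-missing values that convert cleanly to numbers."""
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    converted = pd.to_numeric(non_null, errors="coerce")
    return float(converted.notna().mean())

mostly_numeric = pd.Series(["1", "2.5", "n/a", "4", None])
share = parse_share(mostly_numeric)  # 3 of the 4 non-missing values parse
is_numeric = share > 0.85            # below the cutoff, so the column stays text
print(share, is_numeric)
```

A column like this one is a good candidate for manual review: a single stray token is enough to keep it from being cast.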

Deep Automated Profiling

Profile Report Generation

The profiling stage produces a structured report covering:

Schema overview: Column names, inferred types, semantic roles (ID, feature, target, timestamp).
Univariate statistics: Mean, median, mode, std, skewness, kurtosis for numeric columns; frequency tables for categoricals.
Missing data matrix: Heatmap-style report of missingness patterns across all columns.
Correlation analysis: Pairwise Pearson, Spearman, and Cramér's V correlations.
Distribution flags: Columns that are heavily skewed, zero-inflated, or constant.
Duplicate detection: Exact row duplicates and near-duplicate clusters.
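
Of the measures listed, Cramér's V is the one pandas does not provide directly. A minimal sketch, assuming the plain (not bias-corrected) definition V = sqrt(chi2 / (n * (min(r, c) - 1))), is shown below; the function name cramers_v is hypothetical, and the source does not say which variant the skill uses.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    rows, cols = observed.shape
    if n == 0 or min(rows, cols) < 2:
        return float("nan")  # undefined when either variable has one level
    # Expected counts under independence: outer product of the margins over n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(rows, cols) - 1))))

region = pd.Series(["a", "a", "b", "b"])
linked = pd.Series(["u", "u", "v", "v"])    # fully determined by region
unlinked = pd.Series(["u", "v", "u", "v"])  # independent of region

v_linked = cramers_v(region, linked)
v_unlinked = cramers_v(region, unlinked)
print(v_linked, v_unlinked)
```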

| Metric | Numeric Columns | Categorical Columns |
|--------|-----------------|---------------------|
| Central tendency | Mean, median, mode | Mode, frequency |
| Dispersion | Std, IQR, range, CV | Unique count, entropy |
| Shape | Skewness, kurtosis | Imbalance ratio |
| Quality | Missing %, zero %, outlier % | Missing %, rare labels % |
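
The quality and dispersion metrics for a numeric column can be sketched in a few lines of pandas. The function name numeric_quality is hypothetical, and the 1.5 × IQR fence used to count outliers is an assumption; the source does not state how the skill defines an outlier.

```python
import pandas as pd

def numeric_quality(series: pd.Series) -> dict:
    """Missing %, zero %, outlier %, IQR, and CV for one numeric column."""
    values = pd.to_numeric(series, errors="coerce")
    valid = values.dropna()
    nan = float("nan")
    if valid.empty:
        return {"missing_pct": 100.0, "zero_pct": nan,
                "outlier_pct": nan, "iqr": nan, "cv": nan}
    q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
    iqr = q3 - q1
    # Assumed rule: values beyond 1.5 * IQR from the quartiles are outliers
    is_outlier = (valid < q1 - 1.5 * iqr) | (valid > q3 + 1.5 * iqr)
    mean = valid.mean()
    return {
        "missing_pct": 100.0 * float(values.isna().mean()),
        "zero_pct": 100.0 * float((valid == 0).mean()),      # share of valid values
        "outlier_pct": 100.0 * float(is_outlier.mean()),     # share of valid values
        "iqr": float(iqr),
        "cv": float(valid.std() / mean) if mean != 0 else nan,
    }

report = numeric_quality(pd.Series([0, 1, 2, 3, 4, 100, None, None]))
print(report)
```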

Interactive Analysis Workflow

Minimal-Prompt Usage Pattern

The recommended workflow takes three inputs, one of which is optional:

File path: The CSV to analyze.
Research question (optional): A one-sentence description of what you want to learn.
Output format: "summary", "full_report", or "cleaned_csv".

User: Analyze /data/survey_results_2025.csv
      Question: What factors predict participant satisfaction?
      Output: full_report

Data Cog will:
  1. Load and profile the dataset (auto-detect everything)
  2. Clean and transform (handle missing data, encode categoricals)
  3. Run correlation analysis focused on satisfaction-related columns
  4. Generate regression models predicting satisfaction
  5. Produce a structured report with findings and visualizations
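
For readers who want to script such requests, the three inputs can be modelled as a small validated record. This is only a sketch: the AnalysisRequest class, its field names, and the "summary" default are assumptions made for illustration and are not an interface defined by the skill.

```python
from dataclasses import dataclass
from typing import Optional

# The three output formats named in the text
OUTPUT_FORMATS = ("summary", "full_report", "cleaned_csv")

@dataclass(frozen=True)
class AnalysisRequest:
    file_path: str
    output_format: str = "summary"   # assumed default, not stated in the source
    question: Optional[str] = None   # the optional research question

    def __post_init__(self) -> None:
        if not self.file_path:
            raise ValueError("file_path must not be empty")
        if self.output_format not in OUTPUT_FORMATS:
            raise ValueError(
                f"output_format must be one of {OUTPUT_FORMATS}, "
                f"got {self.output_format!r}"
            )

request = AnalysisRequest(
    file_path="/data/survey_results_2025.csv",
    output_format="full_report",
    question="What factors predict participant satisfaction?",
)
print(request)
```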

Iterative Refinement

After the initial automated analysis, you can refine the results with targeted follow-up instructions:

"Focus only on respondents from Group A"
"Exclude the first 50 rows (pilot data)"
"Treat column X as ordinal with levels: low < medium < high"
"Run the same analysis but with log-transformed income"

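In pandas terms, follow-ups like these correspond to a handful of ordinary operations. The sketch below is one possible translation, not the skill's own code: the column names group, level, and income are hypothetical, and log1p is used in place of a plain logarithm so that zero incomes do not produce negative infinity.

```python
import numpy as np
import pandas as pd

LEVELS = ["low", "medium", "high"]  # ordinal order: low < medium < high

def apply_refinements(df: pd.DataFrame, pilot_rows: int = 50) -> pd.DataFrame:
    out = df.iloc[pilot_rows:]             # exclude the pilot rows by position
    out = out[out["group"] == "A"].copy()  # focus on Group A only
    # Treat the column as ordinal so comparisons and sorting respect the order
    out["level"] = out["level"].astype(pd.CategoricalDtype(LEVELS, ordered=True))
    out["log_income"] = np.log1p(out["income"])  # log(1 + income)
    return out

toy = pd.DataFrame({
    "group": ["A", "A", "B", "A"],
    "level": ["low", "high", "medium", "medium"],
    "income": [0.0, 9.0, 99.0, 999.0],
})
refined = apply_refinements(toy, pilot_rows=1)
print(refined)
```

Dropping pilot rows before filtering matters: the positional cut refers to the original file order, which a prior filter would change.
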
Best Practices

Always review the auto-generated profile before trusting downstream results.
Verify that automatic type inference made sensible choices, especially for ambiguous columns.
Provide a research question when possible to guide feature selection and analysis focus.
Save the cleaning audit log alongside your results for reproducibility.
For datasets over 1 million rows, consider sampling for the initial profile to save time.
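
The last two practices can be combined in a short sketch: sample large frames for the first profile and record what was done in a JSON audit log. The function names, the 100,000-row sample size, the log fields, and the file name cleaning_audit.json are all assumptions chosen for illustration; only the 1 million row threshold comes from the text.

```python
import json
import os
import tempfile

import pandas as pd

def sample_for_profile(df: pd.DataFrame, max_rows: int = 1_000_000,
                       sample_rows: int = 100_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=min(sample_rows, len(df)), random_state=seed)

def save_audit_log(entries: list, path: str) -> None:
    """Write the list of cleaning steps next to the results as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)

# Toy thresholds so the example runs instantly
frame = pd.DataFrame({"x": range(10)})
subset = sample_for_profile(frame, max_rows=5, sample_rows=3)
audit = [{"step": "sample", "rows_before": len(frame),
          "rows_after": len(subset), "seed": 0}]
log_path = os.path.join(tempfile.mkdtemp(), "cleaning_audit.json")
save_audit_log(audit, log_path)
```

Fixing the seed and writing it to the log is what makes the sampled profile repeatable later.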



Upload messy CSVs with minimal prompting for deep automated analysis

1,232 stars

documentation

Updated May 26, 2026


npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research data-cog-guide

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%
Semgrep: Static code analysis for vulnerabilities. Clean, 95%
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%
Snyk (dep): Open source security scanning. Skipped, 50%
Socket.dev: Supply chain security analysis. Skipped, 50%
VirusTotal: Multi-engine malware detection. Skipped, 50%
CrowdStrike: Advanced threat intelligence. Skipped, 50%
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 26, 2026, 4:49 AM · 81.8s · 1 file scanned

SKILL.md

name: data-cog-guide
description: Upload messy CSVs with minimal prompting for deep automated analysis
emoji: 🧠
category: analysis
subcategory: wrangling
keywords: ["automated analysis", "data wrangling", "CSV upload", "data profiling", "smart analysis", "minimal prompting"]
source: wentor-research-plugins




Install manually


# Clone the repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research.git

# Copy into Claude Code skills folder (global)
cp -r Awesome-Agent-Skills-for-Empirical-Research/skills/43-wentorai-research-plugins/skills/analysis/wrangling/data-cog-guide ~/.claude/skills/


Repository

brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research


Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT