skills/can-clean-up-mess/SKILL.md
LLM-driven data preparation pipeline for cleaning, integrating, and enriching messy datasets. Use when the user says 'clean this data', 'fix this CSV', 'match these schemas', 'deduplicate these records', 'impute missing values', or 'annotate this table'.
This skill enables Claude to act as an agentic data preparation system that applies prompt-driven techniques from the survey "Can LLMs Clean Up Your Mess?" (arXiv:2601.17058). Instead of writing bespoke rule-based cleaning scripts, Claude uses structured serialization, few-shot prompting, chain-of-thought reasoning, and iterative detect-verify-repair loops to standardize formats, detect and correct errors, impute missing values, match entities across datasets, align schemas, and annotate columns with semantic types. The approach replaces fragile regex/rule pipelines with context-aware LLM reasoning grounded in the actual data.
The paper identifies a paradigm shift from deterministic rule-based data preparation to prompt-driven, context-aware, agentic workflows. The core insight is that LLMs can understand data semantics — they recognize that "NYC", "New York City", and "new york" are the same entity, that a zip code constrains a city name, and that a column of dates in mixed formats should converge to ISO 8601 — without needing hand-coded rules for each case.
Three interlocking strategies make this practical. First, structured serialization: tabular data is converted into natural-language or semi-structured representations (row-as-sentence, column-sample lists, or JSON records) that fit within the prompt context. Selective context keeps prompts focused and cost-efficient by including only the columns most correlated with the target, as measured by Pearson correlation, Cramér's V, or the correlation ratio (eta). Second, iterative detect-verify-repair loops: rather than making a single pass, the LLM cycles through detection (flag suspicious values), self-verification (confirm that the flagged values are truly errors given the full row context), and repair (generate corrected values), which reduces hallucinated corrections. Third, code synthesis over direct generation: for repeatable transformations, the LLM generates executable cleaning functions (Python/SQL) that can be validated, tested, and reused, rather than producing corrected values one by one. This is both cheaper and auditable.
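To make the selective-context idea concrete, here is a minimal sketch of Cramér's V for two categorical columns, using only the standard library. The function name `cramers_v` and the list-based inputs are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length categorical sequences (0.0 to 1.0).

    Hypothetical helper: pairs with a missing value on either side are skipped.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    observed = Counter(pairs)
    k = min(len(row_totals), len(col_totals)) - 1
    if k < 1:
        return 0.0  # a constant column carries no association signal
    # Chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, row_total in row_totals.items():
        for y, col_total in col_totals.items():
            expected = row_total * col_total / n
            chi2 += (observed.get((x, y), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))
```

A value near 1 (as one would expect for zip code versus city) suggests the column is worth including in the prompt; a value near 0 suggests it can be left out.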
For integration tasks like entity matching, the paper highlights batch-clustering prompts (process groups of candidate pairs together so the LLM can reason about inter-pair relationships) and retrieval-augmented matching (retrieve similar resolved pairs as few-shot examples). For enrichment, self-reflection annotation — where the LLM annotates, then reviews its own annotations for consistency — improves label quality without human review.
Profile the data: Load the dataset and generate a structural profile — column names, inferred types, cardinality, null rates, sample values (10-20 per column), and basic statistics. This is the foundation for every subsequent step.
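A profiling pass along these lines could look like the sketch below. It assumes the data is available as a list of dict rows (for example from `csv.DictReader`); the name `profile_records` and the shape of the returned dictionary are illustrative choices, not part of the skill.

```python
from collections import Counter


def profile_records(records, sample_size=10):
    """Build a per-column profile from a list of dict rows (hypothetical helper)."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in records:
        for name in row:
            if name not in columns:
                columns.append(name)

    profile = {}
    for name in columns:
        values = [row.get(name) for row in records]
        # Treat None and blank strings as missing.
        present = [v for v in values if v is not None and str(v).strip() != ""]
        type_counts = Counter(type(v).__name__ for v in present)
        distinct = list(dict.fromkeys(str(v) for v in present))  # order-preserving dedupe
        profile[name] = {
            "inferred_type": type_counts.most_common(1)[0][0] if type_counts else "unknown",
            "cardinality": len(distinct),
            "null_rate": 1 - len(present) / len(values) if values else 0.0,
            "samples": distinct[:sample_size],  # first distinct values, not a random sample
        }
    return profile
```

The samples here are simply the first distinct values seen; a stratified or clustered sample would be a reasonable refinement for skewed columns.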
Serialize strategically: Convert the relevant portion of data into a prompt-friendly format. For row-level tasks (error detection, imputation), serialize individual rows as key-value pairs with column context. For column-level tasks (standardization, annotation), serialize sampled values from the target column grouped as a list. Keep total token count under control by sampling representative values via clustering or stratified sampling.
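The two serialization styles described here can be sketched as follows. The function names, the `<missing>` marker, and the exact output wording are assumptions made for this example.

```python
import json


def serialize_row(row, columns=None):
    """Row-level form: 'col: value; col: value', with an explicit marker for gaps."""
    names = columns or list(row)
    parts = []
    for name in names:
        value = row.get(name)
        shown = "<missing>" if value is None or str(value).strip() == "" else str(value)
        parts.append(f"{name}: {shown}")
    return "; ".join(parts)


def serialize_column(name, values, max_values=20):
    """Column-level form: a capped list of distinct sample values as JSON."""
    distinct = list(dict.fromkeys(str(v) for v in values if v is not None))
    return f"Column '{name}' sample values: " + json.dumps(distinct[:max_values])
```

Passing `columns` to `serialize_row` is one way to apply selective context: only the listed columns reach the prompt.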
Detect issues with chain-of-thought prompting: Ask the LLM to examine serialized data and identify problems — inconsistent formats, likely errors, missing patterns — using explicit step-by-step reasoning. Prompt the LLM to state what the expected format/range is before flagging deviations. This reduces false positives compared to zero-shot detection.
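One possible shape for such a detection prompt is sketched below. The wording of the instructions and the name `build_detection_prompt` are illustrative assumptions; the key property is that the model is asked to state the expected format before it flags anything.

```python
def build_detection_prompt(column, values, row_context=None):
    """Assemble a step-by-step detection prompt for one column (hypothetical helper)."""
    lines = [
        f"You are checking the column '{column}' for data quality problems.",
        "Work step by step:",
        "1. State the expected format or value range for this column.",
        "2. Compare each sample value against that expectation.",
        "3. List only the values that deviate, with a one-line reason each.",
        "",
        "Sample values:",
    ]
    lines += [f"- {value!r}" for value in values]
    if row_context:
        # Optional serialized row(s) to ground the judgment.
        lines += ["", "Row context:", row_context]
    return "\n".join(lines)
```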
Verify detections against row context: For each flagged issue, re-prompt with the full row (and optionally neighboring rows or correlated columns) to confirm the detection. Use a self-consistency check: proceed to repair only if the LLM confirms the error in at least 2 of 3 independent verification prompts.
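The 2-of-3 vote can be expressed as a small wrapper around whatever function issues the verification prompt. In this sketch `verify` is a placeholder callable that returns `True` when the model confirms the error; how it talks to the LLM is left out and is an assumption of the example.

```python
def confirm_by_vote(verify, flagged, row, votes=3, threshold=2):
    """Run up to `votes` independent checks; confirm once `threshold` of them agree.

    `verify(flagged, row) -> bool` is a hypothetical callable wrapping one
    verification prompt.
    """
    yes = 0
    for attempt in range(votes):
        if verify(flagged, row):
            yes += 1
        if yes >= threshold:
            return True  # enough confirmations, no need for further calls
        remaining = votes - attempt - 1
        if yes + remaining < threshold:
            return False  # threshold can no longer be reached
    return False
```

The early exits save one model call whenever the first two votes already agree.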
Generate repair as executable code when possible: Instead of asking the LLM to output corrected values directly, ask it to generate a Python function or SQL expression that performs the transformation. For example: `def standardize_date(val): ...` or `UPDATE t SET city = 'New York' WHERE zip = '10001' AND city = 'NYC'`. Validate the generated code against the sample data before applying it to the full dataset.
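Validating a generated repair function against sample data might look like this sketch. `validate_repair` and `is_iso_date` are hypothetical names, and the ISO-date check is just one example of an output predicate.

```python
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """Example predicate: True only for strings shaped like YYYY-MM-DD."""
    return isinstance(value, str) and ISO_DATE.fullmatch(value) is not None


def validate_repair(fn, samples, is_valid):
    """Run a candidate repair function over samples and collect failures."""
    failures = []
    for value in samples:
        try:
            result = fn(value)
        except Exception as exc:  # generated code may raise on unexpected input
            failures.append((value, f"raised {type(exc).__name__}"))
            continue
        if not is_valid(result):
            failures.append((value, f"invalid output {result!r}"))
    return failures
```

An empty failure list is the signal to apply the function to the full dataset; a non-empty one can be fed back to the LLM as concrete counterexamples.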
For imputation, select context columns by correlation: Compute pairwise correlation between the column with missing values and all other columns. Include only the top-k most correlated columns in the imputation prompt, plus the target column. Serialize the k nearest complete rows (by similarity on the correlated columns) as few-shot context.
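For numeric columns, the top-k selection can be sketched with a hand-rolled Pearson correlation. This example assumes list-of-dict rows that all share the same keys, and it ignores non-numeric values; `pearson` and `top_k_context_columns` are illustrative names. Categorical columns would need a different measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation over positions where both values are numeric."""
    pairs = [
        (x, y)
        for x, y in zip(xs, ys)
        if isinstance(x, (int, float)) and isinstance(y, (int, float))
    ]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def top_k_context_columns(records, target, k=3):
    """Rank the other columns by |Pearson r| against the target and keep the top k."""
    candidates = [name for name in records[0] if name != target]  # assumes shared keys
    target_values = [row.get(target) for row in records]
    scored = [
        (abs(pearson([row.get(name) for row in records], target_values)), name)
        for name in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:k]]
```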
For entity matching, use batch-clustering prompts: Serialize candidate pairs in batches of 5-10 per prompt. Include a brief instruction specifying matching criteria and 2-3 labeled examples (one match, one non-match, one ambiguous). Ask the LLM to classify each pair and explain its reasoning, then cluster matched entities.
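The batching and clustering steps can be sketched as below. The prompt wording, the function names, and the use of union-find for clustering are assumptions of this example; the LLM call that turns each prompt into match decisions is deliberately left out.

```python
import json


def batch_pair_prompts(pairs, examples, batch_size=5):
    """Yield one prompt per batch of candidate pairs, each with the same few-shot examples.

    `pairs` is a list of (record_a, record_b); `examples` is a list of
    (record_a, record_b, label). Both shapes are assumptions of this sketch.
    """
    header = "Decide whether each pair refers to the same entity. Explain briefly.\n"
    shots = "\n".join(
        f"Example: A={json.dumps(a)} B={json.dumps(b)} -> {label}"
        for a, b, label in examples
    )
    for start in range(0, len(pairs), batch_size):
        lines = [
            f"Pair {start + i + 1}: A={json.dumps(a)} B={json.dumps(b)}"
            for i, (a, b) in enumerate(pairs[start:start + batch_size])
        ]
        yield header + shots + "\n" + "\n".join(lines)


def cluster_matches(matched_pairs):
    """Group IDs connected by matches (union-find); returns sorted clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())
```

Clustering is transitive here: if A matches B and B matches C, all three land in one cluster, so a single false-positive match can merge two groups. Reviewing low-confidence matches before clustering is a sensible precaution.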
For schema matching, provide both schemas with sample data: Serialize both schemas side-by-side with 3-5 sample values per column. Prompt the LLM to propose column mappings with confidence scores and brief justifications. Cross-validate by asking it to generate a test query that joins the two tables on the proposed mapping.
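The test-join idea can be approximated locally with an in-memory SQLite database: load the two candidate columns and measure how many values in one find a partner in the other. `join_match_rate` is a hypothetical helper, and comparing values as text is a simplifying assumption.

```python
import sqlite3


def join_match_rate(rows_a, rows_b, col_a, col_b):
    """Fraction of non-null values in A's column that find a join partner in B's column."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE a (v TEXT)")
        conn.execute("CREATE TABLE b (v TEXT)")
        conn.executemany(
            "INSERT INTO a VALUES (?)",
            [(str(r[col_a]),) for r in rows_a if r.get(col_a) is not None],
        )
        conn.executemany(
            "INSERT INTO b VALUES (?)",
            [(str(r[col_b]),) for r in rows_b if r.get(col_b) is not None],
        )
        total = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
        matched = conn.execute(
            "SELECT COUNT(*) FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.v = a.v)"
        ).fetchone()[0]
    finally:
        conn.close()
    return matched / total if total else 0.0
```

Running this for each candidate mapping gives a rough, comparable score. It only makes sense for columns whose values are expected to overlap, such as shared keys or dates.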
Validate all outputs against the source data: After any transformation, run automated checks — row count preservation, type consistency, value range constraints, referential integrity. Surface any violations back to the LLM for a second repair pass.
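A few of these checks can be sketched in plain Python over list-of-dict rows. `check_output` and its parameters are illustrative; referential-integrity checks are omitted here because they depend on the specific tables involved.

```python
def check_output(before, after, expected_types=None, ranges=None):
    """Return a list of human-readable violations (empty list means all checks passed).

    expected_types: {column: type}; ranges: {column: (low, high)} for numeric columns.
    """
    violations = []
    if len(before) != len(after):
        violations.append(f"row count changed: {len(before)} -> {len(after)}")
    for i, row in enumerate(after):
        for name, expected in (expected_types or {}).items():
            value = row.get(name)
            if value is not None and not isinstance(value, expected):
                violations.append(f"row {i}: {name} has type {type(value).__name__}")
        for name, (low, high) in (ranges or {}).items():
            value = row.get(name)
            # Only numeric values are range-checked; type issues are reported above.
            if isinstance(value, (int, float)) and not (low <= value <= high):
                violations.append(f"row {i}: {name}={value!r} outside [{low}, {high}]")
    return violations
```

The violation strings are written so they can be pasted directly into a follow-up repair prompt.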
Emit the final pipeline as reproducible code: Consolidate all generated cleaning functions, mappings, and transformations into a single executable script (Python or SQL) with comments explaining each step. This makes the pipeline auditable and re-runnable on future data batches.
Example 1: Standardize inconsistent date formats in a CSV
User: "This CSV has dates in column created_at with mixed formats — some are MM/DD/YYYY, some are YYYY-MM-DD, some are 'Jan 5, 2024'. Standardize them all to ISO 8601."
Approach:
- [...] `created_at` to identify the format variants
- [...] `standardize_date(val: str) -> str` that converts any of these formats to 'YYYY-MM-DD'. Handle edge cases like two-digit years."

Output:
```python
import pandas as pd
from dateutil import parser

def standardize_date(val):
    if pd.isna(val) or str(val).strip() == "":
        return None
    try:
        return parser.parse(str(val)).strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        return None  # flag for manual review

df["created_at"] = df["created_at"].apply(standardize_date)
failed = df[df["created_at"].isna()]
print(f"Standardized {len(df) - len(failed)} rows, {len(failed)} need review")
```
Example 2: Impute missing city names using zip codes and state
User: "My customer table has 15% missing values in the city column but zip_code and state are mostly complete. Can you fill in the missing cities?"
Approach:
- [...] `zip_code` has near-perfect correlation with `city`; use it as primary context
- [...] `zip_code`, take the most frequent city for each

Output:
```python
# Step 1: Deterministic lookup from existing data
zip_city_map = (
    df.dropna(subset=["city"])
    .groupby("zip_code")["city"]
    .agg(lambda x: x.mode().iloc[0])
    .to_dict()
)

# Step 2: Fill from lookup
mask = df["city"].isna()
df.loc[mask, "city"] = df.loc[mask, "zip_code"].map(zip_city_map)

# Step 3: LLM fallback for remaining gaps
still_missing = df[df["city"].isna()]
print(f"Filled {mask.sum() - len(still_missing)} via lookup, "
      f"{len(still_missing)} need LLM imputation")
# ... LLM imputation loop for remaining rows ...
```
Example 3: Entity matching across two product catalogs
User: "I have two product CSVs from different suppliers. I need to find which rows in catalog A refer to the same product as rows in catalog B. They use different naming conventions."
Approach:
```text
Pair 1: A={"name": "Sony WH-1000XM5 Wireless NC Headphones", "brand": "Sony", "price": 348}
        B={"name": "SONY WH1000XM5 Noise Cancelling", "brand": "SONY", "price": 349.99}
→ Match? (explain reasoning)
```
Output:
| catalog_a_id | catalog_b_id | confidence | reasoning |
|-------------|-------------|------------|----------------------------------------|
| A-1042 | B-887 | 0.95 | Same model number WH-1000XM5, ~$1 diff |
| A-2210 | B-1103 | 0.82 | Same brand+category, similar specs |
| A-3301 | B-2045 | 0.60 | Similar name but different capacity |
[...] `created_date` could map to `date_added` or `registration_date`). Generate test join queries for each candidate mapping and compare result quality.

Paper: "Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs", Zhou et al., arXiv:2601.17058 (2026). Read for: the full task taxonomy (cleaning/integration/enrichment), comparison of prompt-based vs. fine-tuned vs. agentic approaches per task, evaluation datasets and metrics, and the cost-quality tradeoff analysis.

Repository: https://github.com/weAIDB/awesome-data-llm