skills/benchmarking-text-to-python-against-text-to-sql/SKILL.md
Generate correct Python/Pandas code from natural language questions over tabular data, applying the Logic Completion Framework (LCF) to resolve ambiguities that SQL handles implicitly. Use when: 'query this CSV with Python', 'convert this SQL to pandas', 'answer this question from the database using Python', 'write pandas code for this data question', 'translate SQL query to Python', 'analyze this table and answer questions about it'.
npx skillsauth add ndpvt-web/arxiv-claude-skills benchmarking-text-to-python-against-text-to-sqlInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill enables Claude to generate reliable Python/Pandas code from natural language questions over tabular data by applying the Logic Completion Framework (LCF) from the BIRD-Python benchmark. The core insight: SQL databases silently handle NULL propagation, type coercion, case-insensitive matching, and aggregate semantics, but Python requires every one of these behaviors to be coded explicitly. When users ask data questions in natural language, the ambiguity that SQL resolves automatically becomes a source of bugs in Python. LCF systematically identifies and fills these gaps by injecting the missing domain logic before code generation.
The Paradigmatic Divergence Problem. SQL is declarative: a SELECT ... WHERE col = value implicitly skips NULLs, coerces types, and applies database-default collation. Python is procedural: df[df['col'] == value] will include NaN comparisons (returning False but not filtering predictably), won't coerce "42" to 42, and is case-sensitive by default. The BIRD-Python benchmark (Hu et al., 2026) demonstrates that most Text-to-Python failures stem not from faulty code generation but from these unspecified implicit behaviors. Models generate syntactically correct Pandas that silently produces wrong answers.
The Logic Completion Framework (LCF) addresses this by inserting an explicit logic resolution step before code generation. For every natural language question, LCF identifies which SQL-implicit behaviors are relevant (NULL handling, type casting, string collation, aggregation semantics, sort-order NULL placement) and specifies them as concrete constraints. For example, if a question asks "how many employees have a department?", LCF determines that this maps to COUNT(department) (excluding NULLs), not COUNT(*) (including all rows), and injects that specification into the prompt context. This transforms an ambiguous question into an unambiguous logical specification that the code generator can faithfully translate.
Why it works. The paper's experiments show that when these latent domain rules are made explicit, Text-to-Python achieves parity with Text-to-SQL across all tested LLMs. The gap was never about Python being harder to generate -- it was about missing context. This means any system generating Pandas from natural language can dramatically improve accuracy by systematically auditing for implicit SQL behaviors and resolving them before writing code.
Parse the data schema. Read the table(s) involved -- column names, dtypes, sample rows. Identify columns with NULLs/NaNs, mixed types (strings that look numeric), and text columns that may need case-insensitive treatment.
Decompose the natural language question into logical operations. Map the question to a sequence of relational operations: filter, join, group, aggregate, sort, limit. Write these out in plain English before touching Pandas.
Audit for implicit SQL behaviors. For each operation identified in step 2, check whether any of these six categories apply:
WHERE col = X silently excludes NULLs; Pandas does not."42" vs integer 42)? SQL auto-casts; Pandas raises errors or silently mismatches.LIKE-style patterns? SQL defaults to case-insensitive in most databases; Pandas str.contains() is case-sensitive.COUNT mean count-all-rows or count-non-null? Does AVG skip NULLs? Pandas .mean() skips NaN by default (matching SQL), but .count() also skips NaN (matching COUNT(col), not COUNT(*)).Resolve each ambiguity with an explicit decision. For each implicit behavior found, write a one-line specification: e.g., "Filter out rows where department is NaN before counting" or "Compare names case-insensitively". These are your logic completions.
Generate the Pandas code with logic completions embedded. Write the code so that each resolved behavior is visible as an explicit operation -- a .dropna(), a .str.lower(), a na=False parameter. Do not bury these in chained operations where they're invisible.
Add defensive type handling. Before comparisons or joins, explicitly cast columns to their expected types: df['col'] = pd.to_numeric(df['col'], errors='coerce') or df['col'] = df['col'].astype(str).str.strip().
Test against edge cases. Mentally (or actually) verify: What happens if the filtered column is all NaN? What if the join key has type mismatches? What if the string column has leading/trailing whitespace?
Return the result in the format the user expects. If they asked a question, return the scalar answer. If they asked for a table, return the DataFrame. Include a brief comment explaining any non-obvious logic completions applied.
Example 1: Counting with NULL-sensitive semantics
User: "How many employees have a manager assigned?"
Approach:
1. Parse schema: employees table has columns [id, name, manager_id, department].
manager_id has NaN for top-level employees.
2. Logical operation: COUNT where manager_id is present.
3. Audit: NULL handling -- "have a manager assigned" means COUNT(manager_id), not COUNT(*).
4. Logic completion: Exclude rows where manager_id is NaN.
Code:
import pandas as pd
df = pd.read_csv("employees.csv")
# Logic completion: count only non-null manager_id values (matches SQL COUNT(manager_id))
result = df['manager_id'].notna().sum()
print(result)
Compare to the SQL equivalent that implicitly does this:
SELECT COUNT(manager_id) FROM employees;
Example 2: Case-insensitive string filtering with type coercion
User: "Find all products where the category contains 'electronics' and price is above 100"
Approach:
1. Parse schema: products table has [id, name, category (str), price (mixed: some stored as str)].
2. Logical operations: string LIKE filter on category + numeric comparison on price.
3. Audit:
- String collation: "contains 'electronics'" should be case-insensitive (SQL LIKE default).
- Type coercion: price column has some string values like "150.00" that need numeric cast.
- NULL handling: skip rows where category or price is NaN.
4. Logic completions:
- Use case-insensitive matching for category.
- Coerce price to numeric, treating non-parseable values as NaN.
- Exclude NaN rows from both filters.
Code:
import pandas as pd
df = pd.read_csv("products.csv")
# Logic completion: coerce price to numeric (SQL auto-casts string-to-number)
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Logic completion: case-insensitive match (SQL LIKE is case-insensitive by default)
# Logic completion: na=False excludes NaN rows (SQL WHERE skips NULLs)
mask = (
df['category'].str.contains('electronics', case=False, na=False)
& (df['price'] > 100)
)
result = df[mask]
print(result)
Example 3: SQL-to-Pandas conversion with aggregation and sorting
User: "Convert this SQL to Pandas:
SELECT department, COUNT(*) as cnt FROM employees WHERE salary > 50000 GROUP BY department ORDER BY cnt DESC LIMIT 5;"
Approach:
1. Decompose: filter -> group -> aggregate -> sort -> limit.
2. Audit:
- NULL handling: WHERE salary > 50000 in SQL excludes NULLs implicitly. Pandas comparison
with NaN returns False, which happens to match SQL here -- but only by accident.
- COUNT(*) counts all rows in each group (including NULLs in other columns). Use .size(), not .count().
- ORDER BY DESC with NULLs: not applicable here since count is always non-null.
3. Logic completions:
- Explicitly drop NaN salary before filtering to match SQL WHERE behavior.
- Use .size() for COUNT(*) semantics, not .count() which would give per-column non-null counts.
Code:
import pandas as pd
df = pd.read_csv("employees.csv")
# Logic completion: explicitly exclude NaN salary (matches SQL WHERE NULL exclusion)
filtered = df[df['salary'].notna() & (df['salary'] > 50000)]
# Logic completion: .size() matches COUNT(*) -- counts all rows per group
result = (
filtered
.groupby('department')
.size()
.reset_index(name='cnt')
.sort_values('cnt', ascending=False)
.head(5)
)
print(result)
Do:
.isna().sum() audit prevents silent wrong answers.na=False in all .str.contains(), .str.startswith(), and .str.match() calls. The default behavior (returning NaN for NaN inputs) causes filtering bugs..size() for COUNT(*) semantics (count all rows in group) and .count() for COUNT(col) semantics (count non-null values). These are not interchangeable.Avoid:
.mean() skips NaN like SQL AVG), but relying on coincidence is fragile. Be explicit.df['col'] == value on columns with NaN without first deciding whether NaN rows should be included or excluded. The comparison returns False for NaN, which silently excludes them -- correct for SQL-style WHERE, but wrong if NaN is meaningful.| Problem | Symptom | Fix |
|---------|---------|-----|
| Mixed types in column | TypeError on comparison or silent wrong filter | Cast with pd.to_numeric(errors='coerce') or .astype(str) before comparing |
| NaN in join key | Unexpected row drops or duplications after merge | Filter NaN from join keys before merging: df[df['key'].notna()] |
| Case mismatch in string filter | Zero results when data exists | Add case=False to string methods, or .str.lower() both sides |
| COUNT(*) vs COUNT(col) confusion | Wrong aggregate counts | Use .size() for all-row count, .count() for non-null count |
| Whitespace in string columns | Join/filter misses on matching values | Apply .str.strip() before comparison |
| Pandas .groupby() drops NaN keys | Missing groups in output | Pass dropna=False to .groupby() if NULL groups matter |
Hu, H., Hou, C., Cao, B., & Li, R. (2026). Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity. arXiv:2601.15728v2. https://arxiv.org/abs/2601.15728v2
Key takeaway: The performance gap between Text-to-SQL and Text-to-Python is not about code generation quality -- it is about unresolved implicit logic. Systematically auditing for six categories of SQL-implicit behavior (NULLs, types, collation, aggregation, sort order, deduplication) and making them explicit closes the gap entirely.
development
Audit LLM-based automatic short answer grading (ASAG) systems for adversarial vulnerabilities using token-level and prompt-level attack strategies from the GradingAttack framework. Triggers: 'test grading robustness', 'adversarial attack on grading', 'audit LLM grader', 'red-team answer grading', 'ASAG vulnerability assessment', 'grading fairness attack'
development
Build structured information-seeking agents that decompose complex queries into multi-turn search-and-browse workflows, aggregate results from multiple web sources, and return answers in typed structured formats (items, sets, lists, tables). Applies the GISA benchmark's ReAct-based agent architecture and evaluation methodology. Trigger phrases: "build an information-seeking agent", "search agent pipeline", "multi-turn web research agent", "structured web search workflow", "aggregate information from multiple sources", "web research with structured output"
data-ai
Optimize LLM prompts using GFlowPO's iterative generate-evaluate-refine loop with diversity-preserving exploration and dynamic memory. Use when: 'optimize this prompt', 'find a better prompt for this task', 'prompt engineering with examples', 'auto-tune my system prompt', 'improve prompt accuracy', 'generate prompt variations'.
development
Constrain LLM generation with executable Pydantic schemas and multi-agent pipelines to produce structurally valid, domain-rich artifacts. Uses ontology-as-grammar to eliminate hallucinated structures while preserving creative output. Trigger phrases: "generate a valid game design", "schema-constrained generation", "build a multi-agent pipeline with Pydantic validation", "ontology-driven content generation", "structured creative generation with DSPy", "generate artifacts that pass domain validation".