skills/birdturk-adaptation-bird-text-to-sql/SKILL.md
Adapt Text-to-SQL systems and benchmarks for non-English, morphologically rich languages using controlled translation pipelines and agentic multi-stage reasoning. Triggers: 'translate SQL benchmark to Turkish', 'cross-lingual text-to-SQL', 'adapt BIRD dataset to another language', 'build multilingual SQL generation', 'text-to-SQL for low-resource language', 'non-English database querying'
npx skillsauth add ndpvt-web/arxiv-claude-skills birdturk-adaptation-bird-text-to-sql
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to adapt English Text-to-SQL benchmarks and systems to morphologically rich, low-resource languages using the BIRDTurk methodology. The core technique is a controlled translation pipeline that adapts natural language questions, schema identifiers, database values, and hints to a target language while strictly preserving SQL logical structure and execution semantics. It also applies agentic multi-stage reasoning, which the paper demonstrates is the most robust approach for cross-lingual SQL generation, outperforming both direct prompting and standard fine-tuning.
BIRDTurk introduces a controlled translation pipeline with a critical constraint: SQL queries and database execution semantics must remain identical before and after translation. This means you translate the natural language question, the schema identifiers (table names, column names), database cell values, and evidence hints, but the SQL query structure stays unchanged. The mapping between original and translated identifiers is maintained as a bijective function so every translated column name maps back to exactly one original column. Translation quality is validated statistically using Central Limit Theorem sampling (n >= 30 per stratum) to achieve 95% confidence, reaching 98.15% accuracy on human evaluation.
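The one-to-one requirement on the identifier mapping can be checked mechanically. The following is a minimal sketch, not the paper's tooling: the function name `mapping_problems` is hypothetical, and restricting identifiers to ASCII letters, digits, and underscores is an assumption chosen to match the transliterated names used in this skill's examples.

```python
import re

# Assumption: translated identifiers are plain ASCII, unquoted SQL names.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def mapping_problems(mapping: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the mapping is one-to-one."""
    problems = []
    owner: dict[str, str] = {}  # translated name -> first original name that claimed it
    for source, target in mapping.items():
        if not IDENTIFIER.fullmatch(target):
            problems.append(f"{target!r} is not a plain SQL identifier")
        if target in owner:
            problems.append(f"{source!r} and {owner[target]!r} both map to {target!r}")
        owner.setdefault(target, source)
    return problems


if __name__ == "__main__":
    columns = {"name": "isim", "department": "departman", "salary": "maas"}
    print(mapping_problems(columns))
```

Because dictionary keys are already unique, checking that no two keys share a translated name is enough to guarantee that every translated name maps back to exactly one original.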
The paper evaluates three approaches for generating SQL from translated inputs: (1) inference-based prompting (zero-shot and few-shot with schema context), (2) agentic multi-stage reasoning (decomposing the problem into schema understanding, intermediate SQL generation, execution-based validation, and iterative refinement), and (3) supervised fine-tuning on translated training data. The key finding is that agentic multi-stage reasoning demonstrates the strongest cross-lingual robustness. By breaking the problem into discrete stages — schema analysis, query planning, SQL generation, execution feedback — each stage can compensate for linguistic ambiguity introduced by morphological complexity. Direct prompting suffers the largest performance drop because the model must handle linguistic parsing and SQL generation in a single pass.
Supervised fine-tuning on standard multilingual baselines (e.g., mBERT-family models) struggles with Turkish due to underrepresentation in pretraining data. However, modern instruction-tuned models (GPT-4, Qwen2) scale effectively with fine-tuning, showing that the base model's multilingual capacity is a prerequisite. The consistent finding across all approaches: Turkish introduces 10-25% execution accuracy degradation compared to English, driven by agglutinative morphology, free word order, and vowel harmony creating token distributions unseen during pretraining.
Inventory the source benchmark structure. Catalog all translatable components: natural language questions, schema identifiers (table names, column names), column descriptions, database cell values containing natural language text, and evidence/hint fields. Do NOT translate SQL keywords, operators, or structural tokens.
Build a schema translation dictionary. For each database in the benchmark, create a bijective mapping from English identifiers to target-language identifiers. Ensure every translated name is a valid SQL identifier (no spaces unless quoted, no reserved words). For agglutinative languages, prefer root forms over inflected forms for column names.
Translate natural language questions with schema grounding. Translate each question while ensuring that references to schema elements use the exact translated identifier from your dictionary. Preserve the question's logical intent — the same SQL query must answer both the English and translated question.
Translate database cell values selectively. Only translate cell values that contain natural language text (e.g., city names, descriptions). Leave numeric values, codes, and IDs untouched. Update any WHERE clause string literals in the SQL to match the translated cell values.
Translate evidence hints and descriptions. These provide domain context for SQL generation. Translate them while preserving technical accuracy — a hint about date formats or business rules must remain semantically identical.
Validate translation quality statistically. Sample at least 30 translated question-SQL pairs per domain stratum. Have human evaluators check: (a) the translated question is natural in the target language, (b) the SQL query still correctly answers the translated question against the translated database, (c) execution results are identical.
Run execution-based validation. Execute every SQL query against the translated database and compare results to the original. Any execution mismatch indicates a translation error in cell values or schema identifiers.
Implement agentic multi-stage reasoning. Structure the pipeline as: Stage 1 (Schema Analysis) — parse the translated schema, identify relevant tables/columns, resolve morphological variants; Stage 2 (Query Planning) — decompose the natural language question into sub-goals mapped to SQL operations; Stage 3 (SQL Generation) — produce candidate SQL; Stage 4 (Execution Validation) — run the SQL, check for errors or empty results, refine if needed.
Design prompts with explicit schema context. Include the full translated schema (table names, column names, types, sample values) in the prompt. For morphologically rich languages, also include the original English identifier as a disambiguation aid in parentheses.
Evaluate with execution accuracy (EX). The primary metric is whether the generated SQL produces the same result set as the gold SQL when executed against the database. Do not rely on exact-match string comparison of SQL, as valid queries can have many equivalent forms.
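Execution accuracy can be sketched with the standard-library `sqlite3` module. This is an illustration under stated assumptions rather than the benchmark's official evaluator: the function name `execution_match` is hypothetical, and rows are compared as order-insensitive multisets, whereas some evaluators compare plain sets.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True when both queries return the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute counts as a miss.
        return False
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calisanlar (isim TEXT, maas INTEGER)")
    conn.executemany(
        "INSERT INTO calisanlar VALUES (?, ?)",
        [("Ayse", 100), ("Can", 200), ("Ece", 300)],
    )
    # Two differently written but equivalent queries should match.
    print(execution_match(
        conn,
        "SELECT isim FROM calisanlar WHERE NOT maas <= 100",
        "SELECT isim FROM calisanlar WHERE maas > 100",
    ))
```

Averaging `execution_match` over all question-SQL pairs gives the EX score for a system.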
Example 1: Adapting a database schema from English to Turkish
User: "I have a BIRD-format database with tables employees(id, name, department, salary) and departments(id, dept_name, location). Adapt this schema for Turkish Text-to-SQL evaluation."
Approach:
Output:
```sql
-- Schema Translation Dictionary
-- employees -> calisanlar
--   id -> id (universal, no translation needed)
--   name -> isim
--   department -> departman
--   salary -> maas
-- departments -> departmanlar
--   dept_name -> departman_adi
--   location -> konum

-- Original English question:
-- "What is the average salary of employees in the Sales department?"
-- Translated Turkish question:
-- "Satis departmanindaki calisanlarin ortalama maasi nedir?"

-- SQL (unchanged structure, translated identifiers):
SELECT AVG(maas) FROM calisanlar
WHERE departman = 'Satis';
-- Note: 'Sales' cell value translated to 'Satis' in the database
-- and in the WHERE clause literal
```
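Applying the dictionary to a gold query can be automated. The sketch below is one possible approach, not the paper's implementation: `translate_sql` is a hypothetical name, the English source query is inferred from the example above, and the regex handles only unquoted identifiers and single-quoted string literals. Quoted identifiers, comments, and identifiers that collide with SQL keywords would need a real SQL parser.

```python
import re

# Match a single-quoted literal (with '' escapes) or a bare word.
TOKEN = re.compile(r"'(?:[^']|'')*'|[A-Za-z_][A-Za-z0-9_]*")


def translate_sql(sql: str, identifiers: dict[str, str], values: dict[str, str]) -> str:
    """Swap identifiers and string literals; everything else is left untouched."""

    def swap(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("'"):
            raw = token[1:-1].replace("''", "'")  # unescape the literal
            translated = values.get(raw, raw)
            return "'" + translated.replace("'", "''") + "'"  # re-escape
        # SQL keywords are absent from the dictionary, so they pass through.
        return identifiers.get(token, token)

    return TOKEN.sub(swap, sql)


if __name__ == "__main__":
    identifiers = {"employees": "calisanlar", "salary": "maas", "department": "departman"}
    values = {"Sales": "Satis"}
    english = "SELECT AVG(salary) FROM employees WHERE department = 'Sales';"
    print(translate_sql(english, identifiers, values))
```

Keeping identifiers and cell values in separate dictionaries means a string literal that happens to equal a column name is never rewritten as an identifier.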
Example 2: Agentic multi-stage pipeline for Turkish Text-to-SQL
User: "Build a pipeline that takes a Turkish natural language question and generates SQL against a translated schema."
Approach:
Output:
```python
import openai  # client library that the llm_call helper is assumed to wrap


def text_to_sql_agentic(question_tr: str, schema: dict, db_connection) -> str:
    """Multi-stage agentic Text-to-SQL for Turkish.

    Assumes three helpers defined elsewhere: format_schema(schema) -> str,
    llm_call(prompt) -> str, and
    refine_with_feedback(sql, feedback, schema, question) -> str.
    """
    # Stage 1: Schema Analysis
    schema_prompt = f"""Given this database schema:
{format_schema(schema)}
Turkish question: {question_tr}
Identify which tables and columns are relevant to answering
this question. For each column, note its data type and any
value constraints. Turkish morphology note: the question may
use agglutinative suffixes (-daki, -nin, -larin) that modify
base words matching column names."""
    relevant_schema = llm_call(schema_prompt)

    # Stage 2: Query Planning
    plan_prompt = f"""Relevant schema elements:
{relevant_schema}
Turkish question: {question_tr}
Decompose this question into SQL operations:
- What aggregation is needed? (COUNT, AVG, SUM, etc.)
- What filtering conditions apply?
- Are JOINs required? Between which tables?
- What is the expected result shape?"""
    query_plan = llm_call(plan_prompt)

    # Stage 3: SQL Generation
    sql_prompt = f"""Based on this query plan:
{query_plan}
Full schema: {format_schema(schema)}
Generate the SQL query. Use exact column and table names
from the schema. Do not invent column names."""
    candidate_sql = llm_call(sql_prompt)

    # Stage 4: Execution Validation
    try:
        # Fetch rows explicitly: cursor.rowcount is unreliable for SELECT
        # (sqlite3, for example, reports -1), so it cannot detect empty results.
        rows = db_connection.execute(candidate_sql).fetchall()
    except Exception as e:
        # Refine: execution error indicates structural problem
        return refine_with_feedback(candidate_sql, str(e), schema, question_tr)
    if not rows:
        # Refine: empty result may indicate wrong filter
        return refine_with_feedback(
            candidate_sql, "Query returned 0 rows", schema, question_tr
        )
    return candidate_sql
```
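The pipeline above calls `format_schema` without defining it. One possible shape for that helper is sketched here, following the guideline of listing the original English identifier in parentheses next to each translated name. The nested dictionary layout (`english`, `columns`, `name`, `type`, `samples`) is an assumption made for illustration; the source does not specify the schema structure.

```python
def format_schema(schema: dict) -> str:
    """Render a translated schema as prompt text with English names in parentheses."""
    lines = []
    for table, info in schema.items():
        lines.append(f"Table {table} ({info['english']})")
        for col in info["columns"]:
            line = f"  - {col['name']} ({col['english']}): {col['type']}"
            samples = ", ".join(repr(v) for v in col.get("samples", []))
            if samples:
                line += f", e.g. {samples}"  # sample values help ground filters
            lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical schema layout, using the identifiers from Example 1.
    schema = {
        "calisanlar": {
            "english": "employees",
            "columns": [
                {"name": "isim", "english": "name", "type": "TEXT"},
                {"name": "departman", "english": "department", "type": "TEXT",
                 "samples": ["Satis"]},
                {"name": "maas", "english": "salary", "type": "REAL"},
            ],
        }
    }
    print(format_schema(schema))
```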
Example 3: Validating translation quality with statistical sampling
User: "I translated 12,000 question-SQL pairs to Korean. How do I validate the translation quality?"
Approach:
Output:
```python
import math
import random


def compute_sample_size(population_size: int, confidence_z: float = 1.96,
                        margin_error: float = 0.05, estimated_p: float = 0.5) -> int:
    """Compute required sample size for translation validation.

    BIRDTurk uses 95% confidence (z=1.96), 5% margin of error.
    """
    n0 = (confidence_z ** 2 * estimated_p * (1 - estimated_p)) / (margin_error ** 2)
    # Finite population correction
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)


# For 12,000 pairs: sample size ~= 373
sample_size = compute_sample_size(12000)


# Stratify by domain to ensure coverage
def stratified_sample(pairs_by_domain: dict, total_sample: int) -> list:
    """Sample proportionally from each domain."""
    total = sum(len(v) for v in pairs_by_domain.values())
    sample = []
    for domain, pairs in pairs_by_domain.items():
        k = max(1, round(total_sample * len(pairs) / total))
        sample.extend(random.sample(pairs, min(k, len(pairs))))
    return sample


# Evaluation criteria per sample:
# 1. Is the translated question natural and grammatical? (Y/N)
# 2. Does the SQL still answer the translated question? (Y/N)
# 3. Does execution produce identical results? (Y/N)
# Target: >= 95% accuracy across all three criteria
```
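Once evaluators have labeled the sample, the observed accuracy can be reported with an interval rather than a single number. The sketch below uses the normal-approximation (CLT-based) interval, which is an assumption about how the 95% confidence figure would be applied; `accuracy_interval` is a hypothetical name, and for accuracies very close to 100% a Wilson interval may behave better.

```python
import math


def accuracy_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for an observed accuracy."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because the approximation can overshoot near the edges.
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Hypothetical labeling outcome: 366 of 373 sampled pairs judged correct.
    low, high = accuracy_interval(366, 373)
    print(f"accuracy between {low:.3f} and {high:.3f} at 95% confidence")
```

If the lower bound of the interval falls below the 95% target, the sample does not yet support the claim that the translation meets it.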
Do:
[…] maas (salary)) to help the model bridge linguistic gaps during early development.
Avoid:
[…] SELECT, WHERE, JOIN remain in English regardless of target language.
[…] departman_adi vs proje_adi instead of both becoming ad).
[…] WHERE filter uses the English cell value against a translated database. The refinement loop should detect and correct the language mismatch.
Paper: BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish (EACL 2026 SIGTURK)
Dataset: metunlp/birdturk on HuggingFace
Key takeaway: Agentic multi-stage reasoning (schema analysis → query planning → SQL generation → execution validation) provides the strongest cross-lingual robustness for Text-to-SQL, outperforming both direct prompting and fine-tuning on morphologically rich, low-resource languages.