skills/agentsm-semantic-memory-agentic/SKILL.md
Agentic Text-to-SQL with semantic memory that captures and reuses structured execution traces. Use when: 'write SQL for this database', 'query this schema', 'generate SQL from natural language', 'text to SQL with complex schema', 'help me explore this database and answer questions', 'build a reusable SQL generation pipeline'.
npx skillsauth add ndpvt-web/arxiv-claude-skills agentsm-semantic-memory-agentic
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to convert natural language questions into SQL by building and leveraging semantic memory -- structured, reusable records of prior database exploration and query execution. Rather than treating each question independently, Claude captures execution traces as interpretable structured programs (markdown with semantic headers), stores them per-database, and retrieves relevant traces to guide future reasoning. This eliminates redundant schema exploration, reduces token usage by ~25%, and produces more accurate SQL by systematically reusing proven reasoning paths. The technique is drawn from AgentSM (Biswal et al., 2026), which achieved state-of-the-art 44.8% on Spider 2.0 Lite.
Structured Semantic Memory over Raw Traces. Traditional agentic Text-to-SQL systems either discard execution history or store it as raw scratchpads that degrade retrieval quality ("lost-in-the-middle" effect). AgentSM instead parses each execution trajectory into structured sections with descriptive headers -- separating data exploration (schema inspection, column sampling), query execution (CTE construction, joins, filtering), and validation (result checking, error correction). Each section gets an LLM-generated natural language summary. This structured format achieves ~2x the accuracy of unstructured traces in retrieval experiments.
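As a rough illustration of what such a structured trace could look like, the sketch below models a trace as sections tagged with one of the three phases and renders it to markdown with descriptive headers. The class names, field names, and the exact header layout are assumptions made for this example; the source only specifies that sections carry descriptive headers and natural language summaries.

```python
from dataclasses import dataclass, field

# The three phases named in the text above.
PHASES = ("exploration", "execution", "validation")


@dataclass
class TraceSection:
    phase: str    # one of PHASES
    header: str   # descriptive header for this section
    summary: str  # natural language summary (LLM-generated in AgentSM)
    steps: list[str] = field(default_factory=list)


@dataclass
class Trace:
    database: str
    question: str
    sections: list[TraceSection] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the trace with one header per section, grouped by phase."""
        lines = [f"# {self.question}", f"Database: {self.database}", ""]
        for phase in PHASES:
            for section in self.sections:
                if section.phase != phase:
                    continue
                # Hypothetical header layout: phase tag plus descriptive title.
                lines.append(f"## [{phase}] {section.header}")
                lines.append(section.summary)
                lines.extend(f"- {step}" for step in section.steps)
                lines.append("")
        return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    trace = Trace("ecommerce", "total revenue by state last quarter")
    trace.sections.append(
        TraceSection("exploration", "Orders table",
                     "created_at is the order timestamp.",
                     ["sampled orders.created_at"])
    )
    print(trace.to_markdown())
```

Grouping output by phase means a later reader, human or agent, can pull out only the exploration sections without parsing the rest.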
Database-Scoped Retrieval with Semantic Similarity. When a new question arrives, the system filters the memory store to traces from the same database, then ranks by semantic similarity between the new question and stored questions. This works because semantically similar questions tend to probe the same tables and require similar exploration paths. The retrieved trace's exploration phase is injected into the agent's context before it begins work, allowing it to skip redundant schema discovery and jump directly to query construction.
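The two-stage lookup described above (filter by database, then rank by similarity) can be sketched as follows. This is a minimal illustration, not the AgentSM implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and the `min_score` threshold and the dictionary layout of stored traces are assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(store, database, question, min_score=0.3):
    """Return (score, trace) for the best same-database match, or None.

    `store` is a list of dicts with "database" and "question" keys
    (a hypothetical layout chosen for this sketch).
    """
    query = embed(question)
    # Stage 1: keep only traces recorded against the same database.
    candidates = [t for t in store if t["database"] == database]
    # Stage 2: rank the survivors by question similarity.
    scored = [(cosine(query, embed(t["question"])), t) for t in candidates]
    scored = [pair for pair in scored if pair[0] >= min_score]
    return max(scored, key=lambda pair: pair[0], default=None)
```

Filtering before ranking keeps a highly similar question from a different database out of the results, since its exploration notes would describe the wrong tables.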
Synthetic Memory Bootstrapping. For databases with no prior query history, AgentSM synthesizes curated questions using the schema and any available documentation, then executes them to populate the memory store. Each database gets at least one synthetic question, with additional questions allocated proportionally to schema complexity. This ensures the memory is useful from the first real query.
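One way to read "at least one question, with more allocated proportionally" is the allocation sketched below. It is an assumption-laden illustration: table count is used as the measure of schema complexity, a fixed total `budget` is assumed, and rounding uses the largest-remainder method. None of these details come from the source.

```python
def allocate_questions(table_counts: dict[str, int], budget: int) -> dict[str, int]:
    """Split `budget` synthetic questions across databases.

    Every database gets one question; the rest are shared in proportion
    to table count (an assumed proxy for schema complexity).
    """
    if budget < len(table_counts):
        raise ValueError("budget must cover at least one question per database")

    alloc = {db: 1 for db in table_counts}
    extra = budget - len(table_counts)
    total = sum(table_counts.values())
    if extra == 0 or total == 0:
        return alloc

    # Exact fractional share of the extra questions for each database.
    shares = {db: extra * n / total for db, n in table_counts.items()}
    for db, share in shares.items():
        alloc[db] += int(share)

    # Hand out what rounding left over, largest fractional part first.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda d: shares[d] - int(shares[d]), reverse=True)
    for db in by_remainder[:leftover]:
        alloc[db] += 1
    return alloc
```

For example, `allocate_questions({"ecommerce": 47, "itsm": 23, "tiny": 2}, budget=10)` gives the 47-table schema the largest share while the 2-table schema still receives its guaranteed single question.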
Catalog the target database schema. Collect all table names, column names, data types, primary/foreign keys, and any documentation or external knowledge files. For large schemas, use vector search (e.g., embeddings with FAISS) to index schema elements for later retrieval.
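A cataloging pass might look like the sketch below. It uses SQLite from the Python standard library purely as a stand-in so the example is runnable; on Snowflake, PostgreSQL, or BigQuery the same information would typically come from each system's own metadata views. The function name and the shape of the returned dictionary are assumptions.

```python
import sqlite3


def catalog_schema(conn: sqlite3.Connection) -> dict:
    """Collect columns, types, primary keys, and foreign keys per table."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        catalog[table] = {
            "columns": {c[1]: c[2] for c in cols},
            "primary_key": [c[1] for c in cols if c[5] > 0],
            # (local column, referenced table, referenced column)
            "foreign_keys": [(fk[3], fk[2], fk[4]) for fk in fks],
        }
    return catalog


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, billing_state TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            created_at TEXT
        );
        """
    )
    print(catalog_schema(conn))
```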
Check semantic memory for relevant prior traces. Filter stored traces to those from the same database. Rank by semantic similarity between the new natural language question and stored questions. If a high-similarity match exists (same tables likely needed), retrieve its structured trace.
Inject the exploration phase from the retrieved trace. Load the "data exploration" sections of the matched trace into context. This gives the agent pre-built knowledge of relevant tables, column meanings, data distributions, and edge cases -- without re-executing exploratory queries.
Perform schema linking. Using a dedicated schema-linking pass (with vector search over schema embeddings), identify the specific tables and columns relevant to the current question. Budget this step tightly (aim for ~5 tool calls maximum) to avoid runaway exploration. Cross-reference with the retrieved trace to validate table selections.
Generate the SQL query. With the linked schema and exploration context in hand, construct the SQL. Use CTEs for complex multi-step logic. Match the target SQL dialect (Snowflake, PostgreSQL, BigQuery, etc.). The agent should integrate SQL generation directly rather than delegating to a sub-agent, to avoid context loss across handoffs.
Execute and validate the query. Run the generated SQL against the database. Check that results are non-empty, column types match expectations, and aggregation logic is sound. If execution fails, diagnose the error (syntax, missing table, type mismatch) and retry with corrections.
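The execute, diagnose, and retry loop can be sketched as below, again with SQLite standing in for the real warehouse. The `repair` callback is hypothetical: in an agent it would be the model proposing a corrected query from the error text, and here it is any function taking `(sql, problem)` and returning new SQL. Only the non-empty check is shown; type and aggregation checks would slot in at the same point.

```python
import sqlite3


def run_with_retry(conn, sql, repair, max_attempts=3):
    """Execute `sql`; on an error or an empty result, ask `repair` for a fix.

    Returns (rows, attempts), where attempts is a list of (sql, problem)
    pairs that can later be stored as the validation part of a trace.
    """
    attempts = []
    for _ in range(max_attempts):
        try:
            rows = conn.execute(sql).fetchall()
            problem = None if rows else "empty result"
        except sqlite3.Error as exc:
            rows, problem = [], f"{type(exc).__name__}: {exc}"
        attempts.append((sql, problem))
        if problem is None:
            return rows, attempts
        # Hypothetical hook: the agent rewrites the query using the diagnosis.
        sql = repair(sql, problem)
    raise RuntimeError(
        f"no valid result after {max_attempts} attempts: {attempts[-1][1]}"
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, created_at TEXT)")
    conn.execute("INSERT INTO orders VALUES (1, '2026-02-01'), (2, '2026-03-05')")
    bad_sql = "SELECT COUNT(*) FROM orders WHERE order_date >= '2026-01-01'"
    fix = lambda sql, problem: sql.replace("order_date", "created_at")
    print(run_with_retry(conn, bad_sql, fix))
```

Capping the number of attempts keeps a query that cannot be repaired from consuming the whole tool-call budget.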
Classify and store the execution trace. Parse the full trajectory into phases using pattern matching: file/schema operations map to "exploration," CTE/JOIN/WHERE clauses map to "execution," result inspection maps to "validation." Assign each section a descriptive header and natural language summary.
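A pattern-matching classifier along these lines could look like the sketch below. The tool names in `SCHEMA_TOOLS` and `RESULT_TOOLS` are hypothetical, as is the fallback that treats a plain `SELECT` without `WITH`, `JOIN`, or `WHERE` as an exploratory probe. Headers and summaries would be attached to each resulting section afterwards.

```python
import re
from itertools import groupby

# Hypothetical tool names; substitute whatever the agent actually exposes.
SCHEMA_TOOLS = {"read_file", "list_tables", "get_ddl", "sample_column"}
RESULT_TOOLS = {"inspect_result", "compare_rows"}
QUERY_PATTERN = re.compile(r"\b(WITH|JOIN|WHERE)\b", re.IGNORECASE)


def classify(tool: str, payload: str) -> str:
    """Map one trajectory step to exploration, execution, or validation."""
    if tool in SCHEMA_TOOLS:
        return "exploration"
    if tool in RESULT_TOOLS:
        return "validation"
    if QUERY_PATTERN.search(payload):
        return "execution"
    # Assumption: unfiltered SELECTs are sampling probes, not the final query.
    return "exploration"


def segment(steps):
    """Group consecutive steps that share a phase into sections.

    `steps` is an ordered list of (tool, payload) pairs; the result is a
    list of (phase, steps_in_section) pairs in trajectory order.
    """
    labelled = [(classify(tool, payload), tool, payload) for tool, payload in steps]
    return [
        (phase, [(tool, payload) for _, tool, payload in group])
        for phase, group in groupby(labelled, key=lambda item: item[0])
    ]
```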
Build composite tools from recurring patterns. If certain tool sequences repeat frequently across traces for this database (e.g., "get file extension then get DDL" always co-occur), merge them into a single composite tool. Only combine tools within the same reasoning phase to preserve semantic coherence.
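One simple way to find merge candidates is to count adjacent tool pairs that fall in the same phase, as sketched below. The `min_count` threshold, the restriction to pairs rather than longer sequences, and the naming scheme for the composite are all assumptions for illustration.

```python
from collections import Counter


def propose_composites(traces, min_count=3):
    """Suggest composite tools from tool pairs that recur within one phase.

    `traces` is a list of trajectories, each an ordered list of
    (phase, tool) steps. Pairs that cross a phase boundary are ignored.
    """
    pair_counts = Counter()
    for trace in traces:
        for (phase_a, tool_a), (phase_b, tool_b) in zip(trace, trace[1:]):
            if phase_a == phase_b and tool_a != tool_b:
                pair_counts[(phase_a, tool_a, tool_b)] += 1
    return [
        # Hypothetical naming scheme for the merged tool.
        {"phase": phase, "name": f"{a}_then_{b}", "sequence": [a, b]}
        for (phase, a, b), count in pair_counts.most_common()
        if count >= min_count
    ]
```

With three traces that each call `get_file_extension` followed by `get_ddl` during exploration, the function proposes a single composite for that pair and ignores the `get_ddl` to `run_sql` transition because it crosses from exploration into execution.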
Synthesize coverage questions for new databases. When encountering a database with no stored traces, generate at least one synthetic natural language question that exercises the schema's key tables and relationships. Execute these questions to populate the memory store before handling real user queries.
Return the final answer with provenance. Present the SQL query, execution results, and a brief explanation of which memory traces (if any) informed the reasoning. This transparency helps users trust and debug the output.
Example 1: Repeated queries against the same e-commerce database
User: "How many orders were placed last month by customers in California?"
(Database: e-commerce with 47 tables, Snowflake dialect)
Approach:
1. Check semantic memory -- find a prior trace for "total revenue by state
last quarter" on this same database.
2. Load the exploration phase from that trace. It already documents:
- orders table uses `created_at` (TIMESTAMP_NTZ), not `order_date`
- customer state is in `customers.billing_state`, not `customers.state`
- orders join to customers via `orders.customer_id = customers.id`
3. Skip schema exploration entirely. Go straight to query construction:
    SELECT COUNT(DISTINCT o.id) AS order_count
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE c.billing_state = 'CA'
      AND o.created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
      AND o.created_at < DATE_TRUNC('month', CURRENT_DATE);
4. Execute, validate (non-zero result, reasonable magnitude), store trace.
Output:
| order_count |
|-------------|
| 3,847 |
Memory reuse saved ~8 exploration steps and ~40% of token budget.
Example 2: First query on an unfamiliar database
User: "What's the average response time for critical tickets this year?"
(Database: IT service management, PostgreSQL, no prior traces)
Approach:
1. No semantic memory exists for this database. Trigger synthetic bootstrapping:
- Read schema: find tables `tickets`, `ticket_priorities`, `ticket_events`,
`sla_policies`, `agents`, `departments` (23 tables total).
- Generate synthetic question: "List all ticket categories with their
average resolution time."
- Execute synthetic question to populate memory with exploration of
ticket lifecycle tables.
2. Now handle the real question. Retrieve the synthetic trace's exploration:
- `tickets.priority_id` links to `ticket_priorities.id`
- `ticket_priorities.name` contains 'Critical', 'High', 'Medium', 'Low'
- Response time = `tickets.first_response_at - tickets.created_at`
- Timestamps are `TIMESTAMPTZ`
3. Schema-link: tickets, ticket_priorities. Generate SQL:
    SELECT AVG(EXTRACT(EPOCH FROM (t.first_response_at - t.created_at)) / 3600)
             AS avg_response_hours
    FROM tickets t
    JOIN ticket_priorities tp ON t.priority_id = tp.id
    WHERE tp.name = 'Critical'
      AND t.created_at >= DATE_TRUNC('year', CURRENT_DATE);
4. Execute, validate, store full trace for future reuse.
Output:
| avg_response_hours |
|--------------------|
| 2.34 |
Example 3: Multi-step analytical query with trace reuse
User: "Which product categories have declining month-over-month revenue
for 3+ consecutive months?"
(Database: retail analytics, BigQuery)
Approach:
1. Retrieve prior trace for "monthly revenue by category" on this database.
Exploration phase reveals:
- Revenue = `order_items.quantity * order_items.unit_price`
- Category via `products.category_id -> categories.name`
- Use `orders.order_date` for time bucketing
2. Construct multi-CTE query (the execution pattern from memory guides
CTE structure):
    WITH monthly_rev AS (
      SELECT c.name AS category,
             DATE_TRUNC(o.order_date, MONTH) AS month,
             SUM(oi.quantity * oi.unit_price) AS revenue
      FROM order_items oi
      JOIN orders o ON oi.order_id = o.id
      JOIN products p ON oi.product_id = p.id
      JOIN categories c ON p.category_id = c.id
      GROUP BY 1, 2
    ),
    with_lag AS (
      SELECT *, LAG(revenue) OVER (PARTITION BY category ORDER BY month) AS prev_rev
      FROM monthly_rev
    ),
    declining AS (
      SELECT category, month,
             CASE WHEN revenue < prev_rev THEN 1 ELSE 0 END AS is_decline
      FROM with_lag
    ),
    streaks AS (
      SELECT category, month, is_decline,
             SUM(CASE WHEN is_decline = 0 THEN 1 ELSE 0 END)
               OVER (PARTITION BY category ORDER BY month) AS streak_group
      FROM declining
    )
    SELECT DISTINCT category
    FROM (
      SELECT category, streak_group, COUNT(*) AS consecutive_declines
      FROM streaks WHERE is_decline = 1
      GROUP BY 1, 2
      HAVING COUNT(*) >= 3
    );
3. Execute, validate against known business context, store trace with
"multi-month trend analysis" header for future similar questions.
Output:
| category |
|--------------|
| Electronics |
| Home & Garden|
DATE_TRUNC argument order differs between PostgreSQL and BigQuery; Snowflake requires ILIKE for case-insensitive matching; BigQuery uses backtick-quoted table names.

Biswal, A., Lei, C., Qin, X., Li, A., & Narayanaswamy, B. (2026). AgentSM: Semantic Memory for Agentic Text-to-SQL. arXiv:2601.15709. https://arxiv.org/abs/2601.15709
Key takeaway: Section 3 (Semantic Memory Construction) and Section 4 (Trajectory Reading) detail how to structure, store, and retrieve execution traces. Table 1 shows the 2x accuracy gain of structured vs. unstructured trace formats. Algorithm 1 provides the synthetic question generation procedure for cold-start databases.