skills/datacross-unified-benchmark-agent/SKILL.md
Cross-modal data analysis agent that unifies structured sources (SQL, CSV, JSON) with unstructured visual documents (scanned PDFs, invoice images, chart screenshots) using divide-and-conquer sub-agents and iterative code generation. Triggers: 'analyze data from these mixed sources', 'combine CSV with scanned PDF table', 'extract table from image and join with database', 'cross-modal data analysis', 'zombie data activation', 'analyze heterogeneous data files together'
npx skillsauth add ndpvt-web/arxiv-claude-skills datacross-unified-benchmark-agent
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to perform unified analysis across heterogeneous data modalities — combining directly queryable structured sources (SQL databases, CSV files, JSON) with high-value information locked inside unstructured visual documents (scanned reports, invoice images, chart screenshots). It implements the DataCrossAgent divide-and-conquer framework: assign specialized sub-agents per data source, profile each source independently via iterative code generation (reReAct), score sources by priority, then cross-pollinate findings across modalities to produce factually grounded, insight-driven analysis.
Divide-and-Conquer with Specialized Sub-Agents. Rather than feeding all heterogeneous data into a single prompt, DataCrossAgent assigns each data source its own specialized "sub-agent" context. Each sub-agent performs deep exploration of its assigned source — profiling schemas, generating heuristic questions, and writing executable Python code to extract insights. This isolation prevents errors in one modality (e.g., OCR noise from a scanned PDF) from contaminating analysis of clean structured data.
reReAct (Recursive Reasoning-Act). Standard ReAct chains a single Thought-Action-Observation loop. reReAct adds a dual-loop structure: an outer reasoning layer decomposes the full task into a tree of sub-problems, while an inner loop handles iterative code generation, execution, error detection, and self-repair for each sub-problem. When code fails (wrong column name, type mismatch, empty DataFrame), the inner loop retries with corrected code rather than abandoning the entire chain. Ablation studies show this mechanism alone accounts for ~20% of factuality improvement.
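The inner repair loop described above can be sketched as a bounded retry around code execution. This is a minimal illustration rather than the paper's implementation: `generate_code` is a hypothetical callable standing in for the LLM, and `run_with_repair`, `max_attempts`, and the `result` variable convention are names assumed for this sketch.

```python
import traceback
from typing import Callable, Optional


def run_with_repair(
    generate_code: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> dict:
    """Generate code for `task`, execute it, and retry with the traceback on failure."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Hypothetical LLM call: receives the task plus the previous error, if any.
        code = generate_code(task, last_error)
        namespace: dict = {}
        try:
            exec(code, namespace)  # only run generated code inside a sandbox
            return {"ok": True, "attempts": attempt, "result": namespace.get("result")}
        except Exception:
            # Feed the full traceback into the next generation attempt.
            last_error = traceback.format_exc()
    return {"ok": False, "attempts": max_attempts, "error": last_error}


# Stand-in generator: the first attempt misspells a key, the repair fixes it.
def fake_generator(task: str, error: Optional[str]) -> str:
    if error is None:
        return "row = {'amount': 10}\nresult = row['amout']"
    return "row = {'amount': 10}\nresult = row['amount']"


outcome = run_with_repair(fake_generator, "read the amount")
```

The point of the design is that a failure in one sub-problem consumes one retry instead of aborting the whole reasoning chain.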
Priority Scoring and Cross-Pollination. After independent exploration, sources are ranked using a hybrid priority score: S_priority = 0.4 * S_obj + 0.3 * S_sem + 0.3 * S_LLM, where S_obj measures data richness (completeness, column count, temporal indicators), S_sem measures keyword overlap between the analysis goal and source schema, and S_LLM is a high-level relevance judgment. High-scoring sources become "Primary Data" (the analytical pivot), while others become "Auxiliary Data." A Cross-Analysis Agent then generates executable code to physically merge, correlate, and statistically test hypotheses across sources.
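A small sketch of the hybrid score, assuming each component has already been normalized to the range 0 to 1. The weights come from the formula above; the `primary_threshold` cutoff and the sample component values are assumptions made for illustration, since the text does not say how the Primary/Auxiliary boundary is chosen.

```python
from dataclasses import dataclass

# Weights taken from S_priority = 0.4 * S_obj + 0.3 * S_sem + 0.3 * S_LLM.
WEIGHTS = {"obj": 0.4, "sem": 0.3, "llm": 0.3}


@dataclass
class SourceScore:
    name: str
    s_obj: float  # data richness
    s_sem: float  # keyword overlap between goal and schema
    s_llm: float  # high-level relevance judgment

    @property
    def priority(self) -> float:
        return (
            WEIGHTS["obj"] * self.s_obj
            + WEIGHTS["sem"] * self.s_sem
            + WEIGHTS["llm"] * self.s_llm
        )


def rank_sources(scores: list, primary_threshold: float = 0.6) -> list:
    """Sort sources by priority and label them (threshold is an assumed cutoff)."""
    ranked = sorted(scores, key=lambda s: s.priority, reverse=True)
    return [
        (s.name, round(s.priority, 3),
         "Primary" if s.priority >= primary_threshold else "Auxiliary")
        for s in ranked
    ]


# Illustrative component values, not measured ones.
ranking = rank_sources([
    SourceScore("competitor_prices.png", 0.3, 0.6, 0.5),
    SourceScore("sales.db", 0.9, 0.8, 0.9),
    SourceScore("customer_feedback.json", 0.5, 0.7, 0.6),
])
```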
Inventory all data sources. List every file the user provides. Classify each as structured (CSV, SQL, JSON, Excel) or unstructured-visual (scanned PDF, image of table, chart screenshot, invoice photo). Record file paths, formats, and apparent domains.
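As a rough sketch, the inventory step could start from file extensions. The extension sets and the `classify_source` helper below are assumptions for illustration; a PDF is marked for inspection because an extension alone cannot tell a scanned PDF from a text-based one.

```python
from pathlib import Path

# Assumed extension sets; extend them to match the files actually provided.
STRUCTURED = {".csv", ".json", ".xlsx", ".xls", ".db", ".sqlite", ".sql"}
VISUAL = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def classify_source(path: str) -> str:
    """Classify a file as structured, unstructured-visual, or needing a closer look."""
    suffix = Path(path).suffix.lower()
    if suffix in STRUCTURED:
        return "structured"
    if suffix in VISUAL:
        return "unstructured-visual"
    if suffix == ".pdf":
        return "needs-inspection"  # may be scanned (visual) or text-based
    return "unknown"


inventory = {
    path: classify_source(path)
    for path in ["Q3_transactions.csv", "scanned_annual_report.png", "sales.db", "report.pdf"]
}
```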
Extract tables from visual documents. For each image or scanned PDF containing tabular data, use a classify-then-extract strategy:
Profile each structured source independently. For each CSV/SQL/JSON file, generate and execute Python code to:
- Encoding fallback chain (utf-8 -> latin1 -> gbk)
Score sources by priority. Compute a hybrid priority score for each source:
- S_obj = (1 - missing_rate) * column_richness * row_richness * temporal_bonus
Deep-explore primary sources via reReAct. For each primary source, iteratively:
Generate a Cross-Source Analysis Checklist. Based on findings from each source, formulate specific hypotheses that require cross-source verification. Example: "Does the revenue decline in Q3 (from CSV) correlate with the shipping delays mentioned in the scanned logistics report (from image)?"
Execute cross-source merges and tests. Write Python code to:
Synthesize findings into a structured report. Combine results into a coherent analysis covering:
Validate factuality. Re-check all numerical claims against source data. Ensure no hallucinated statistics by tracing each number back to executed code output.
Present results with reproducible code. Provide the user with both the narrative analysis and the complete Python code used, so they can re-run, modify, or extend the analysis.
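The factuality validation step above can be approximated by checking that every number in the narrative matches a value produced by executed code. The sketch below is one simple way to do that under stated assumptions: `extract_numbers`, `untraced_numbers`, and the tolerance are hypothetical names and choices, and the narrative string is a made-up example.

```python
import re


def extract_numbers(text: str) -> list[float]:
    """Pull numeric literals out of narrative text, ignoring thousands separators."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return [float(m.replace(",", "")) for m in matches]


def untraced_numbers(narrative: str, computed: dict[str, float], tol: float = 0.005) -> list[float]:
    """Return numbers in the narrative that match no computed value within `tol`."""
    known = list(computed.values())
    return [
        n for n in extract_numbers(narrative)
        if not any(abs(n - k) <= tol for k in known)
    ]


# Values that came out of executed analysis code.
computed = {"csv_revenue": 2341500.0, "report_revenue": 2350000.0, "delta_pct": 0.36}
# Made-up narrative containing one figure (12) with no computed source.
narrative = "Revenue was $2,341,500 in the CSV vs $2,350,000 reported (+0.36%), up 12% year over year."
missing = untraced_numbers(narrative, computed)
```

Any value left in `missing` is a claim to re-derive from the data or remove from the report.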
Example 1: Financial Report + Transaction CSV
User: I have Q3_transactions.csv with daily sales data and this scanned_annual_report.png
which shows a summary table of quarterly performance. Can you tell me whether the
transaction-level data is consistent with the annual report figures?
Approach:
1. Extract the quarterly summary table from scanned_annual_report.png into a DataFrame
with columns [Quarter, Revenue, Expenses, Net_Profit].
2. Load Q3_transactions.csv, handle encoding, inspect schema.
3. Aggregate Q3 transactions: total revenue = sum of 'amount' where type='sale',
total expenses = sum of 'amount' where type='expense'.
4. Compare aggregated Q3 totals against the Q3 row extracted from the image.
5. Report discrepancies with exact figures and percentage differences.
Output:
## Consistency Check: Q3 Transactions vs. Annual Report
| Metric | CSV Aggregated | Report (extracted) | Delta | Status |
|------------|---------------:|-------------------:|--------:|--------|
| Revenue | $2,341,500 | $2,350,000 | +0.36% | Match |
| Expenses | $1,890,200 | $1,875,000 | -0.80% | Review |
| Net Profit | $451,300 | $475,000 | +5.25% | Flag |
Revenue aligns within rounding tolerance. The $15,200 expense discrepancy
suggests the scanned report may exclude certain line items. The $23,700 net
profit divergence follows from the revenue and expense gaps combined.
Recommend auditing expense categories that appear in the CSV but may be
missing from the report.
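The comparison in steps 3 to 5 can be sketched as below, starting from the aggregated totals in the example. Deltas are computed relative to the CSV figure. The `match_tol` and `review_tol` thresholds are assumptions chosen for illustration, not values stated by the skill.

```python
def pct_delta(csv_value: float, report_value: float) -> float:
    """Percentage difference of the report figure relative to the CSV figure."""
    return (report_value - csv_value) / csv_value * 100


def status(delta_pct: float, match_tol: float = 0.5, review_tol: float = 1.0) -> str:
    """Label a delta using assumed tolerance bands (in percent)."""
    magnitude = abs(delta_pct)
    if magnitude <= match_tol:
        return "Match"
    if magnitude <= review_tol:
        return "Review"
    return "Flag"


csv_totals = {"Revenue": 2_341_500, "Expenses": 1_890_200}
report_totals = {"Revenue": 2_350_000, "Expenses": 1_875_000}
for totals in (csv_totals, report_totals):
    totals["Net Profit"] = totals["Revenue"] - totals["Expenses"]

rows = []
for metric in ("Revenue", "Expenses", "Net Profit"):
    delta = round(pct_delta(csv_totals[metric], report_totals[metric]), 2)
    rows.append((metric, csv_totals[metric], report_totals[metric], delta, status(delta)))
```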
Example 2: Healthcare — Patient Records + Lab Report Images
User: I have patient_visits.csv and three photos of lab result printouts
(lab_001.jpg, lab_002.jpg, lab_003.jpg). Which patients have abnormal lab
values and how do those correlate with their visit frequency?
Approach:
1. Extract structured tables from each lab image:
- Identify patient_id, test_name, result_value, reference_range, flag
2. Load patient_visits.csv, profile schema (patient_id, visit_date, diagnosis).
3. Score sources: patient_visits.csv is Primary (richer, more rows),
lab images are Auxiliary (fewer records but critical clinical values).
4. Merge on patient_id after normalizing ID formats (strip leading zeros).
5. Flag abnormal results where result_value falls outside reference_range.
6. Compute visit frequency per patient, correlate with abnormal flag count.
Output:
## Cross-Modal Patient Analysis
Extracted 47 lab results from 3 images (42 parsed successfully, 5 flagged
for low-confidence extraction).
| Patient | Abnormal Labs | Visit Freq (6mo) | Correlation |
|---------|--------------|-------------------|-------------|
| P-1042 | 3 (glucose, HbA1c, triglycerides) | 8 visits | High-freq + metabolic flags |
| P-0887 | 1 (WBC elevated) | 12 visits | Highest freq, possible chronic condition |
| P-1205 | 2 (ALT, AST) | 2 visits | Low freq despite liver markers — follow up |
[Full code provided below for reproducibility]
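A minimal sketch of the merge in steps 4 to 6, assuming pandas is available. The toy rows, the `normalize_id` helper, and the `ref_low`/`ref_high` columns (a pre-split form of `reference_range`) are assumptions for illustration, not real patient data.

```python
import pandas as pd


def normalize_id(raw) -> str:
    """Reduce IDs such as 'P-0887' or ' p-887 ' to the bare number '887'."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return digits.lstrip("0") or "0"


# Toy data standing in for patient_visits.csv and the extracted lab tables.
visits = pd.DataFrame({
    "patient_id": ["P-1042", "P-0887", "P-0887", "P-1205"],
    "visit_date": ["2024-01-05", "2024-01-09", "2024-02-11", "2024-03-02"],
})
labs = pd.DataFrame({
    "patient_id": ["1042", "887", "p-1205 "],
    "test_name": ["glucose", "WBC", "ALT"],
    "result_value": [140.0, 12.5, 30.0],
    "ref_low": [70.0, 4.0, 7.0],
    "ref_high": [99.0, 11.0, 56.0],
})

# Normalize join keys on both sides before combining.
for df in (visits, labs):
    df["pid"] = df["patient_id"].map(normalize_id)

# A result is abnormal when it falls outside its reference range.
labs["abnormal"] = ~labs["result_value"].between(labs["ref_low"], labs["ref_high"])

visit_counts = visits.groupby("pid").size().rename("visit_count")
abnormal_counts = labs.groupby("pid")["abnormal"].sum().rename("abnormal_labs")
summary = pd.concat([visit_counts, abnormal_counts], axis=1).fillna(0).astype(int)
```

With real data, `summary` is the table on which a correlation between visit frequency and abnormal counts would be computed; three toy rows are too few for that to mean anything.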
Example 3: Multi-Source Business Intelligence
User: Analyze these together: sales.db (SQLite), customer_feedback.json,
and this screenshot of a competitor pricing table (competitor_prices.png).
What pricing opportunities exist?
Approach:
1. Extract competitor pricing table from image → DataFrame with
[product, competitor_price, features].
2. Query sales.db: SELECT product, avg(price), sum(quantity), sum(revenue)
GROUP BY product.
3. Parse customer_feedback.json: extract sentiment scores and
price-related complaint frequency per product.
4. Priority scoring: sales.db is Primary (richest, most granular),
competitor image is Auxiliary, feedback is Auxiliary.
5. Cross-pollinate: join all three on product name (fuzzy match where needed).
6. Identify products where our_price > competitor_price AND sentiment < 0.5.
7. Identify products where our_price < competitor_price AND high volume
(potential margin opportunity).
Output:
## Pricing Opportunity Analysis
### Overpriced (risk of churn):
- Widget Pro: Ours $49.99 vs Competitor $39.99 (+25%), sentiment 0.31,
18% of complaints mention "too expensive"
→ Recommend price reduction to $42.99
### Underpriced (margin opportunity):
- Basic Plan: Ours $9.99 vs Competitor $14.99 (-33%), sentiment 0.82,
highest volume product, zero price complaints
→ Recommend test increase to $11.99 (projected +$48K/quarter)
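The fuzzy join and price comparison in steps 5 to 7 might look like the sketch below, which uses `difflib.get_close_matches` from the standard library. The `match_product` helper, the 0.8 cutoff, and the misspelled competitor keys are assumptions for illustration; sales volume is left out, so the "underpriced" label here only reflects the price gap.

```python
from difflib import get_close_matches
from typing import Optional


def match_product(name: str, candidates: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the candidate closest to `name`, or None if nothing clears the cutoff."""
    normalized = {c.lower().strip(): c for c in candidates}
    hits = get_close_matches(name.lower().strip(), list(normalized), n=1, cutoff=cutoff)
    return normalized[hits[0]] if hits else None


# (our_price, sentiment) per product; competitor names differ slightly on purpose.
ours = {"Widget Pro": (49.99, 0.31), "Basic Plan": (9.99, 0.82)}
competitor = {"widget pro ": 39.99, "Basic-Plan": 14.99}

findings = {}
for product, (our_price, sentiment) in ours.items():
    key = match_product(product, list(competitor))
    if key is None:
        continue  # no reliable match; leave for manual mapping
    gap_pct = (our_price - competitor[key]) / competitor[key] * 100
    if gap_pct > 0 and sentiment < 0.5:
        label = "overpriced"
    elif gap_pct < 0:
        label = "underpriced"
    else:
        label = "no action"
    findings[product] = (key, round(gap_pct), label)
```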
- Use an encoding fallback chain (utf-8 -> latin1 -> gbk). Real-world files frequently have encoding issues.

| Error | Cause | Recovery |
|-------|-------|----------|
| KeyError on column access | Column name mismatch between assumed schema and actual data | Re-inspect actual columns with df.columns.tolist(), use fuzzy matching or manual mapping |
| Empty DataFrame after merge | Join keys don't align (format mismatch) | Normalize join keys: strip whitespace, standardize case, convert date formats before retry |
| Image extraction returns garbled text | Low-resolution scan, complex table layout, or watermarks | Fall back to describing the image semantically; ask user to provide higher-quality scan if critical |
| Encoding error on CSV load | Non-UTF-8 file encoding | Try encoding fallback chain; if all fail, read as binary and detect with chardet |
| Type mismatch in aggregation | Numeric columns stored as strings (e.g., "$1,234") | Strip currency symbols and commas, cast to float before aggregation |
| Code execution timeout | Large dataset or expensive join | Sample data first to validate logic, then run on full dataset; use chunked processing if needed |
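Two of the recoveries in the table, the encoding fallback and the currency cleanup, can be sketched with the standard library alone. The helper names `read_text_with_fallback` and `parse_amount` are hypothetical, and the sample file is generated on the fly for the demonstration.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "latin1", "gbk")):
    """Decode a file with the first encoding that works; return (text, encoding)."""
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


def parse_amount(value: str) -> float:
    """Turn a string such as '$1,234' into a float."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    return float(cleaned)


# Demo: a latin1-encoded CSV that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes('name,amount\ncafé,"$1,234"\n'.encode("latin1"))
    text, used = read_text_with_fallback(sample)

amount = parse_amount("$1,234")
```

One caveat about the chain order: latin1 can decode any byte sequence, so with utf-8 -> latin1 -> gbk the gbk step is never reached. If GBK files are expected, trying gbk before latin1 or using a detector such as chardet is worth considering.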
Paper: DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis — Qi, Liu, Zhang (2026). Look for: the reReAct dual-loop mechanism (Section 4), the hybrid priority scoring formula (Section 4.2), and the cross-pollination checklist pattern (Section 4.3) for the core implementation details that drive the 29.7% factuality improvement over single-agent baselines.