skills/decoupling-skeleton-flesh-multimodal/SKILL.md
Disentangled structure-content reasoning for table images and structured data. Separates table skeleton (layout/structure) from flesh (cell content) to answer questions accurately. Use when: 'analyze this table image', 'answer questions about this spreadsheet screenshot', 'extract data from this table photo', 'reason over this financial table', 'compare values in this table image', 'what does this table show'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills decoupling-skeleton-flesh-multimodal
This skill enables Claude to reason over table images and structured tabular data by applying the DiSCo + Table-GLS framework from Zhu et al. (2026). The core idea: instead of trying to understand a table all at once, explicitly decouple structure (skeleton) from content (flesh) first, then use a global-to-local reasoning pipeline that narrows focus from the full table to a minimal evidence sub-table before answering. This targets the errors that come from misaligned rows/columns, merged cells, and complex hierarchical headers.
Large vision-language models tend to entangle structure and content when reading tables. They might correctly OCR individual cell values but misattribute which row or column a value belongs to--especially in tables with merged headers, irregular spacing, or dense numeric data. This leads to confidently wrong answers.
The DiSCo framework decouples table understanding into two alignment phases:
Structural Abstraction (Skeleton): Recognize the table's layout independent of content. Identify row/column boundaries, header hierarchy, span markers, and cell grid coordinates. Think of this as producing an anonymized template: Row 1, Col 1: [CELL] | Row 1, Col 2: [CELL] | ... with headers and merge spans preserved but values blanked out. This forces attention to layout geometry.
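Semantic Grounding (Flesh): Bind actual cell content to structural coordinates at two granularities: globally, by summarizing the semantic role of each column header and row label, and locally, by binding each cell value to its grid coordinate.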
This separation ensures the model builds a reliable structural map before filling in content, preventing the common failure of reading correct values from incorrect cells.
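To make the two phases concrete, here is a minimal Python sketch, assuming the table is already available as a list of rows. The function names `extract_skeleton` and `ground_flesh` are hypothetical illustrations and are not taken from the paper; the values are the first two net income figures from Example 1.

```python
# Illustrative table: the first row is the header, the rest are data rows.
table = [
    ["Quarter", "Net Income"],
    ["Q1", "12.3"],
    ["Q2", "15.1"],
]

def extract_skeleton(rows):
    """Phase 1: keep only the layout, one placeholder per 1-indexed coordinate."""
    return {
        (r, c): "[CELL]"
        for r, row in enumerate(rows, start=1)
        for c in range(1, len(row) + 1)
    }

def ground_flesh(rows, skeleton):
    """Phase 2: bind each value to a coordinate the skeleton already defines."""
    flesh = {}
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            if (r, c) not in skeleton:
                raise KeyError(f"({r}, {c}) is not in the skeleton")
            flesh[(r, c)] = value
    return flesh

skeleton = extract_skeleton(table)
flesh = ground_flesh(table, skeleton)
for (r, c), value in sorted(flesh.items()):
    print(f"Row {r}, Col {c}: {value}")
```

Building the skeleton first means a value can only be attached to a coordinate that the layout pass has already accepted, which is the ordering the separation relies on.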
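Once structure and content are disentangled, Table-GLS performs reasoning in three stages: global structure exploration, self-refinement with sub-table extraction, and evidence-grounded reasoning.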
1. Receive the table input. Accept a table image (PNG/JPG/PDF screenshot) or a structured table (HTML/Markdown/CSV). If the input is an image, read it visually; if structured text, parse it directly.
2. Extract the skeleton (structural abstraction). Map out the table's geometry: number of rows and columns, header rows, header columns, any merged/spanning cells, and hierarchical header levels. Produce a coordinate grid like (row_index, col_index) -> cell_boundary. Do NOT read cell values yet--focus only on layout.
3. Annotate semantic roles globally. For each column header and row label, summarize its semantic role: what kind of data it contains, units, time periods, categories. Example: "Col 2 = 'Revenue (USD, millions)', Col 3 = 'YoY Growth (%)'."
4. Ground content locally. For each cell in the grid, bind the actual value to its structural coordinate. Produce entries like: Row 3, Col 2: 14.7. Verify a sample of bindings by cross-referencing headers--e.g., confirm that the value at (Row 3, Col 2) makes sense given "Row 3 = Q3 2025" and "Col 2 = Revenue."
5. Parse the user's question. Identify what the question asks for: specific lookup, comparison, aggregation, trend analysis, or multi-step computation. Determine which columns and rows are relevant.
6. Global structure exploration. Based on the question, select the target column headers and row label conditions that scope the relevant region. Output this as a structured plan: { "reasoning": "...", "target_columns": [...], "target_rows": [...] }.
7. Self-refine and extract sub-table. Critically review the plan: are the selected columns/rows sufficient? Are any missing? Revise if needed, then extract a minimal sub-table containing only the relevant cells in Row R, Col C: value format.
8. Evidence-grounded reasoning. Using ONLY the extracted sub-table, perform step-by-step reasoning to answer the question. Every arithmetic operation or comparison must cite specific cell coordinates. Do not hallucinate values outside the sub-table.
9. Format and return the answer. Present the answer clearly, followed by the supporting evidence (the sub-table and reasoning chain). If the question requires a specific format (number, percentage, name), comply exactly.
10. Handle uncertainty. If the table image is ambiguous (blurry cells, unclear boundaries), flag which specific cells have uncertain readings and state confidence levels rather than guessing silently.
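For structured text input, the plan, sub-table, and evidence steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helpers `parse_markdown_table` and `extract_subtable` are hypothetical names, the parser handles only simple pipe tables without merged cells, and the table is cut down to two columns, so Net Income sits in Col 2 here.

```python
from decimal import Decimal

MARKDOWN_TABLE = """
| Quarter | Net Income |
|---|---|
| Q1 | 12.3 |
| Q2 | 15.1 |
| Q3 | 14.7 |
| Q4 | 18.2 |
"""

def parse_markdown_table(text):
    """Split a simple pipe table (no merged cells) into a header and data rows."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    rows = [[cell.strip() for cell in ln.strip("|").split("|")] for ln in lines]
    return rows[0], rows[2:]  # rows[1] is the |---|---| separator

def extract_subtable(header, body, plan):
    """Keep only the cells the plan names, keyed by 1-indexed (row, col)."""
    cols = [header.index(name) + 1 for name in plan["target_columns"]]
    return {
        (r, c): row[c - 1]
        for r, row in enumerate(body, start=1)
        if row[0] in plan["target_rows"]
        for c in cols
    }

header, body = parse_markdown_table(MARKDOWN_TABLE)

# The structured plan uses the same keys as the plan format in the steps.
plan = {
    "reasoning": "Highest net income needs the Net Income column for every quarter.",
    "target_columns": ["Net Income"],
    "target_rows": ["Q1", "Q2", "Q3", "Q4"],
}
sub_table = extract_subtable(header, body, plan)

# Evidence-grounded step: the answer is chosen from the sub-table only.
best = max(sub_table, key=lambda coord: Decimal(sub_table[coord]))
print(f"Highest value {sub_table[best]} at Row {best[0]}, Col {best[1]}")
```

Keeping the sub-table as a coordinate-keyed dictionary makes it straightforward to cite the cell behind each comparison.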
Example 1: Financial Table Question Answering
User: [Provides a screenshot of a quarterly earnings table] "Which quarter had the highest net income, and by how much did it exceed the next highest?"
Approach:
Row 1, Col 5 (Q1, Net Income): 12.3
Row 2, Col 5 (Q2, Net Income): 15.1
Row 3, Col 5 (Q3, Net Income): 14.7
Row 4, Col 5 (Q4, Net Income): 18.2
Output:
Q4 2025 had the highest net income at $18.2M, exceeding Q2 ($15.1M)
by $3.1M.
Evidence sub-table:
Q1: $12.3M | Q2: $15.1M | Q3: $14.7M | Q4: $18.2M
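The comparison in this example can be checked with a short Python sketch that works only from the extracted sub-table. The dictionary layout is an assumption made for illustration; `Decimal` is used so the subtraction is exact.

```python
from decimal import Decimal

# Evidence sub-table keyed by (row, col), values in millions of dollars.
sub_table = {
    (1, 5): ("Q1", Decimal("12.3")),
    (2, 5): ("Q2", Decimal("15.1")),
    (3, 5): ("Q3", Decimal("14.7")),
    (4, 5): ("Q4", Decimal("18.2")),
}

# Rank the cells by value, highest first, and keep the top two.
ranked = sorted(sub_table.items(), key=lambda item: item[1][1], reverse=True)
(top_coord, (top_label, top_value)), (next_coord, (next_label, next_value)) = ranked[:2]

margin = top_value - next_value
print(
    f"{top_label} at Row {top_coord[0]}, Col {top_coord[1]} exceeds "
    f"{next_label} at Row {next_coord[0]}, Col {next_coord[1]} by ${margin}M"
)
```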
Example 2: Complex Table with Merged Headers
User: [Provides a table image with two-level headers: "2024" spanning cols 2-3, "2025" spanning cols 4-5, with sub-headers "H1" and "H2" under each year] "What was the total sales across all periods for Product B?"
Approach:
Row 4, Col 2 (Product B, 2024/H1): 340
Row 4, Col 3 (Product B, 2024/H2): 410
Row 4, Col 4 (Product B, 2025/H1): 385
Row 4, Col 5 (Product B, 2025/H2): 460
Output:
Total sales for Product B across all periods: 1,595
Breakdown:
2024 H1: 340 | 2024 H2: 410 | 2025 H1: 385 | 2025 H2: 460
Sum: 340 + 410 + 385 + 460 = 1,595
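One way to handle the two-level header is to expand each spanning cell across its columns and join it with the sub-header beneath it. The sketch below assumes the spans are already known from skeleton extraction; `composite_headers` is a hypothetical helper, and the "Product" label for the first column is an assumption.

```python
# Top-level header cells as (label, column span); the first column has no year.
top_header = [("", 1), ("2024", 2), ("2025", 2)]
sub_header = ["Product", "H1", "H2", "H1", "H2"]

def composite_headers(top, sub):
    """Repeat each spanning label across its columns, then join with the sub-header."""
    expanded = [label for label, span in top for _ in range(span)]
    return [f"{t}/{s}" if t else s for t, s in zip(expanded, sub)]

headers = composite_headers(top_header, sub_header)

# Row 4 of the table (Product B), bound to the composite headers.
product_b = ["Product B", 340, 410, 385, 460]
evidence = dict(zip(headers[1:], product_b[1:]))

total = sum(evidence.values())
print(evidence)
print(f"Total: {total:,}")
```

Composite keys such as `2024/H1` keep the two H1 columns distinct, which is exactly where a flat reading of the sub-header row would go wrong.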
Example 3: Table Image to Structured Data Conversion
User: [Provides a photo of a handwritten lab results table] "Convert this to a CSV I can use in Python."
Approach:
Output:
Sample ID,pH,Temp (C),Result
A-001,7.2,23.5,Positive
A-002,6.8,24.1,Negative
A-003,7.4,22.9,Positive
A-004,6.9,23.8,Negative
A-005,7.1,24.3,Positive
Note: Cell (Row 3, Col 2) was difficult to read -- interpreted as
"7.4" but could be "7.1". Please verify against the original.
| Problem | Solution |
|---|---|
| Blurry or low-resolution table image | Extract skeleton first (grid lines are often still visible). For unclear cell values, provide best-guess with explicit uncertainty markers: "Row 2, Col 3: ~47 (uncertain, possibly 41)" |
| Merged cells break grid assumptions | During skeleton extraction, explicitly record span information: "Cell at (1,2) spans columns 2-4". Adjust coordinate system to use composite headers. |
| Question requires data outside the visible table | State clearly which cells are needed but not present. Do not fabricate values. Suggest what additional information the user should provide. |
| Inconsistent/contradictory cell values | Flag the inconsistency with coordinates: "Row 3 total (Col 6) shows 150, but summing Cols 2-5 gives 148. Proceeding with individual cell values." |
| Table is rotated or skewed in image | Note the orientation issue, attempt to read with adjusted orientation, and flag reduced confidence in structural mapping. |
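The inconsistency check in the table can be sketched as a small Python function. The name `check_row_total` is hypothetical, and the four cell values are made up for illustration so that they sum to 148, matching the wording of the example message.

```python
def check_row_total(row, cells, stated_total):
    """Return None if the cells sum to the stated total, else a warning string.

    cells maps column index -> numeric value; stated_total is (col, value).
    """
    col, stated = stated_total
    computed = sum(cells.values())
    if computed == stated:
        return None
    first, last = min(cells), max(cells)
    return (
        f"Row {row} total (Col {col}) shows {stated}, but summing "
        f"Cols {first}-{last} gives {computed}. "
        "Proceeding with individual cell values."
    )

# Made-up values for Cols 2-5 (they sum to 148); Col 6 holds the printed total.
cells = {2: 30, 3: 40, 4: 38, 5: 40}
warning = check_row_total(3, cells, (6, 150))
if warning:
    print(warning)
```

Returning the message instead of raising lets the reasoning continue with the individual cell values while still surfacing the discrepancy to the user.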
Paper: Zhu, Y., Bai, X., Chen, K., Xiang, Y., & Pan, Y. (2026). Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance. arXiv:2602.03491v1. https://arxiv.org/abs/2602.03491v1
What to look for: Section 3 details the DiSCo alignment framework (structural abstraction via anonymized templates, global/local content grounding). Section 4 describes the Table-GLS three-stage reasoning pipeline (GSE, SSE, EGR). Appendix C contains the exact prompt templates used at each stage.