skills/fmbench-adaptive-output-formatting/SKILL.md
Adaptive Markdown output formatting that balances semantic fidelity with structural correctness. Applies the FMBench methodology to detect and fix broken lists, malformed tables, inconsistent headings, and invalid code blocks in LLM-generated Markdown. Use this skill when asked to:

- "Format this output as clean Markdown"
- "Fix the formatting in this document"
- "Generate a well-structured report with tables and code blocks"
- "Convert this content into properly nested Markdown"
- "Clean up this Markdown: the lists and headings are broken"
- "Reformat this response so it renders correctly"
npx skillsauth add ndpvt-web/arxiv-claude-skills fmbench-adaptive-output-formatting

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to produce and repair Markdown output that is both semantically faithful to the intended content and structurally correct according to the CommonMark specification. Drawing on the FMBench framework (Wang, Zhou & Ding, 2026), the core insight is that formatting errors in LLM output (broken nested lists, malformed tables, inconsistent heading hierarchies, unbalanced code fences) are not cosmetic annoyances but structural defects that break downstream rendering, tool integration, and automated parsing. The skill implements a systematic validation-and-correction pipeline that treats semantic preservation and structural correctness as dual objectives to be balanced, not traded off.
The Semantic-Structural Dual Objective. FMBench reveals that LLM outputs suffer from an inherent tension: optimizing for content accuracy (semantic fidelity) can produce structurally malformed Markdown, while enforcing rigid structural templates can distort or omit content. The paper measures this with two independent scores — a semantic score (BERTScore-F1 between output and reference content) and a structure score (similarity of abstractive structural summaries). The practical takeaway: never sacrifice content to fix formatting, and never ignore formatting to preserve content. Both must be addressed in sequence.
Atomic Unit Segmentation + Structure-Constrained Construction. The FMBench pipeline works by first decomposing content into atomic units (individual headings, sentences, metadata fields, code snippets) using conservative rule-based splitting. It then consumes these units strictly in their original order while fitting them into a target Markdown structure specification. This "content-order-preserving" constraint prevents the common failure mode where reformatting scrambles or silently drops content. The structural specification defines section hierarchy depth, permitted block types, and nesting rules — all determined before content placement begins.
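The segmentation step can be sketched as follows. This is a minimal illustration, not the FMBench implementation: the function name `split_atomic_units` and the specific splitting rules (one unit per heading, list item, table row, or fenced block, plus regex-based sentence splitting for prose) are assumptions chosen for brevity.

```python
import re
from typing import List


def split_atomic_units(text: str) -> List[str]:
    """Split text into atomic units while keeping their original order.

    Hypothetical helper: headings, list items and table rows become one unit
    each, a fenced code block becomes one unit, and prose is split into
    sentences on '.', '!' or '?' followed by whitespace.
    """
    units: List[str] = []
    in_fence = False
    fence_lines: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            fence_lines.append(line)
            if in_fence:  # this line closes the current fence
                units.append("\n".join(fence_lines))
                fence_lines = []
            in_fence = not in_fence
            continue
        if in_fence:
            fence_lines.append(line)
            continue
        if not stripped:
            continue
        if re.match(r"(#{1,6}\s|[-*+]\s|\d+\.\s|\|)", stripped):
            units.append(stripped)
        else:
            # Conservative sentence split; abbreviations such as "e.g." will over-split.
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", stripped) if s)
    if fence_lines:
        units.append("\n".join(fence_lines))  # unclosed fence kept as a single unit
    return units
```

Because units are appended in reading order, consuming the returned list front to back keeps the content-order-preserving constraint described above.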
Complexity-Aware Validation. The paper finds that formatting difficulty is primarily driven by list nesting depth and item count — not by section count or blockquote usage. This means validation effort should concentrate on nested lists and tables (the highest error-rate structures) rather than spreading attention uniformly across all Markdown elements.
Audit the input content for atomic units. Identify every discrete content element: headings, paragraph sentences, list items, table rows, code blocks, blockquotes, and metadata. Do not merge or reorder these units — preserve their original sequence as a strict constraint.
Define the target structural specification. Before writing any Markdown, establish the document skeleton: heading hierarchy depth (H1 through H4 max), which block types are permitted (ordered lists, unordered lists, tables, fenced code blocks, blockquotes), and nesting limits (e.g., lists nested no more than 3 levels deep).
Validate heading hierarchy consistency. Ensure heading levels never skip on the way down: each heading may be at most one level deeper than the heading before it, although it may return to any shallower level. An H1 must precede any H2; an H2 must precede any H3. Never jump from H1 to H3. If the input violates this, insert or adjust intermediate headings.
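A heading check along these lines could look like the sketch below. The function name `find_heading_skips` is hypothetical, and the sketch only recognises ATX headings (`#` prefixes) outside backtick fences; setext headings and tilde fences are ignored.

```python
import re
from typing import List, Tuple

HEADING_RE = re.compile(r"^(#{1,6})\s+\S")


def find_heading_skips(markdown: str) -> List[Tuple[int, int, int]]:
    """Return (line_number, previous_level, level) for each heading that skips a level.

    A heading is flagged when it is more than one level deeper than the
    heading before it; the first heading is compared against level 0.
    """
    problems: List[Tuple[int, int, int]] = []
    previous = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code are not headings
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level > previous + 1:
            problems.append((number, previous, level))
        previous = level
    return problems
```

Returning positions rather than rewriting the headings leaves the choice between inserting an intermediate heading and promoting the offending one to the caller.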
Normalize list structures. For each list, verify: (a) consistent marker usage (all - or all * for unordered; all 1. for ordered), (b) correct indentation at each nesting level (4 spaces or 1 tab per level), (c) blank lines before and after the list block, (d) no orphaned list items that break the parent-child nesting tree.
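Check (a), consistent marker usage, can be approximated with a small scan. This is a sketch under stated assumptions: `find_mixed_markers` is a hypothetical name, list levels are identified purely by indentation width (tabs expanded to 4 columns), and fenced code blocks are not skipped.

```python
import re
from typing import Dict, List, Tuple

ITEM_RE = re.compile(r"^(\s*)([-*+]|\d+\.)\s+\S")


def find_mixed_markers(markdown: str) -> List[Tuple[int, str, str]]:
    """Return (line_number, expected_kind, found_kind) where one list level mixes markers."""
    problems: List[Tuple[int, str, str]] = []
    seen: Dict[int, str] = {}  # indent width -> marker kind in the current list block
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = ITEM_RE.match(line)
        if not match:
            if not line.strip():
                continue  # a blank line alone does not end the list
            if not line.startswith((" ", "\t")):
                seen.clear()  # unindented non-list text ends the list block
            continue
        indent = len(match.group(1).expandtabs(4))
        marker = match.group(2)
        kind = "ordered" if marker[0].isdigit() else marker
        # Returning to a shallower level forgets markers used at deeper levels.
        for deeper in [width for width in seen if width > indent]:
            del seen[deeper]
        expected = seen.setdefault(indent, kind)
        if expected != kind:
            problems.append((number, expected, kind))
    return problems
```

The first marker seen at a given indent is treated as the expected one, so the report points at the lines to change rather than choosing a house style.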
Validate table formatting. Ensure every table has: (a) a header row, (b) a separator row with --- alignment markers matching column count, (c) consistent column count across all rows, (d) pipe characters | at start and end of every row. Pad cells to equalize column widths for readability.
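The four table checks can be combined into one rebuild pass, sketched below. `split_row` and `normalize_table` are hypothetical helpers; the sketch assumes the first non-blank line is the header, pads short rows with empty cells, discards any existing separator row (so `:---` alignment markers are lost), and does not handle escaped pipes inside cells.

```python
from typing import List


def split_row(line: str) -> List[str]:
    """Split a pipe-table row into trimmed cells, ignoring outer pipes."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def normalize_table(lines: List[str]) -> List[str]:
    """Rebuild a pipe table with a header, separator, equal column counts and padded cells."""
    rows = [split_row(line) for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    # Drop an existing separator row (cells made only of dashes and colons).
    if body and all(cell and set(cell) <= set("-:") for cell in body[0]):
        body = body[1:]
    width = max(len(row) for row in [header] + body)
    # Pad short rows with empty cells instead of truncating long ones.
    table = [row + [""] * (width - len(row)) for row in [header] + body]
    col_widths = [max(3, *(len(row[i]) for row in table)) for i in range(width)]

    def render(row: List[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, col_widths)) + " |"

    separator = "| " + " | ".join("-" * w for w in col_widths) + " |"
    return [render(table[0]), separator] + [render(row) for row in table[1:]]
```

Rebuilding every row from parsed cells, rather than patching individual lines, means a missing pipe or a short separator row is fixed by the same code path.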
Verify code block integrity. Check that every opening ``` fence has a matching closing fence. Validate that language identifiers are present and correct (e.g., ```python not ```py). Ensure no content outside the fences is accidentally captured inside, and no code inside the fences leaks out.
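Fence balance can be checked with a single pass, as in this sketch. The name `find_unclosed_fence` is hypothetical, only backtick fences are considered (tilde fences and indentation rules are ignored), and the closing rule follows the CommonMark convention that a closing fence carries no info string and has at least as many backticks as the opener.

```python
from typing import Optional, Tuple


def find_unclosed_fence(markdown: str) -> Optional[Tuple[int, str]]:
    """Return (line_number, info_string) of a fence left open, or None if all are closed."""
    open_fence: Optional[Tuple[int, str, int]] = None  # (line, info string, backtick count)
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("```"):
            continue
        ticks = len(stripped) - len(stripped.lstrip("`"))
        info = stripped[ticks:].strip()
        if open_fence is None:
            open_fence = (number, info, ticks)
        elif not info and ticks >= open_fence[2]:
            open_fence = None  # valid closing fence
        # Otherwise the line is content of the block that is already open.
    return None if open_fence is None else (open_fence[0], open_fence[1])
```

A second opening fence such as a `markdown` block that starts before the previous `bash` block was closed is reported against the first opener, which is usually where the missing closing fence belongs.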
Enforce blank line boundaries. Insert blank lines before and after every block-level element (headings, lists, tables, code blocks, blockquotes). Missing blank lines are a frequent source of rendering failures in mixed-content Markdown: in renderers that do not let a list interrupt a paragraph, a list with no blank line before it is merged into the preceding paragraph.
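One way to apply this rule mechanically is to insert a blank line wherever the block type changes between adjacent lines. The sketch below is an assumption-laden simplification: `kind_of` and `add_blank_lines` are hypothetical names, block types are guessed from the first characters of each line, and indented continuation text inside a list item is treated as a paragraph.

```python
import re
from typing import List


def kind_of(line: str) -> str:
    """Classify a non-blank line by the block type it appears to start."""
    stripped = line.lstrip()
    if re.match(r"#{1,6}\s", stripped):
        return "heading"
    if re.match(r"([-*+]|\d+\.)\s", stripped):
        return "list"
    if stripped.startswith("|"):
        return "table"
    if stripped.startswith(">"):
        return "quote"
    return "paragraph"


def add_blank_lines(markdown: str) -> str:
    """Insert a blank line wherever the block type changes; fence contents are untouched."""
    out: List[str] = []
    in_fence = False
    previous = "blank"
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            if not in_fence and previous != "blank":
                out.append("")  # gap before an opening fence
            out.append(line)
            in_fence = not in_fence
            previous = "fence-open" if in_fence else "fence-close"
            continue
        if in_fence:
            out.append(line)
            continue
        kind = "blank" if not line.strip() else kind_of(line)
        changed = kind != previous or kind == "heading"
        if kind != "blank" and previous != "blank" and changed:
            out.append("")
        out.append(line)
        previous = kind
    return "\n".join(out)
```

The pass only adds lines and never edits existing ones, so running it a second time on its own output makes no further changes.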
Perform a semantic preservation check. After structural correction, compare the reformatted output against the original content. Verify that no sentences, list items, table rows, or code snippets were dropped, duplicated, or reordered. Content fidelity is non-negotiable.
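The comparison can be approximated by reducing both versions to their text content and diffing the results. This is a rough sketch, not the paper's BERTScore-based metric: `content_units` and `preservation_report` are hypothetical names, matching is exact after whitespace normalisation, and each non-blank line counts as one unit.

```python
import re
from collections import Counter
from typing import Dict, List


def content_units(markdown: str) -> List[str]:
    """Reduce each non-blank line to its text, dropping markers, pipes and padding."""
    units: List[str] = []
    for line in markdown.splitlines():
        text = re.sub(r"^\s*(#{1,6}|[-*+]|\d+\.|>)\s+", "", line)
        text = re.sub(r"\s+", " ", text.replace("|", " ")).strip()
        if text and not set(text) <= set("-: "):  # skip table separator rows
            units.append(text)
    return units


def preservation_report(original: str, reformatted: str) -> Dict[str, object]:
    """Report units that were dropped, added (including duplicates), or reordered."""
    before, after = content_units(original), content_units(reformatted)
    count_before, count_after = Counter(before), Counter(after)
    dropped = list((count_before - count_after).elements())
    added = list((count_after - count_before).elements())
    # Order is only compared when exactly the same units survived.
    reordered = not dropped and not added and before != after
    return {"dropped": dropped, "added": added, "reordered": reordered}
```

Stripping markers before comparing means purely structural repairs, such as re-indenting a list or padding a table, produce an empty report, while any lost or moved content shows up explicitly.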
Stress-test nested structures. Focus validation effort on the highest-risk elements: deeply nested lists (3+ levels) and complex tables (5+ columns or merged conceptual groups). These are where formatting errors concentrate according to FMBench's difficulty analysis.
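To decide where to spend that effort, a document can be scored on the features named above. The sketch below is illustrative only: `structure_risk` is a hypothetical helper, nesting depth is inferred from indentation with an assumed `indent_width` of 4 spaces, and fenced code blocks are not excluded from the scan.

```python
import re
from typing import Dict

ITEM_RE = re.compile(r"^(\s*)([-*+]|\d+\.)\s+\S")


def structure_risk(markdown: str, indent_width: int = 4) -> Dict[str, int]:
    """Count list items, the deepest list nesting level, and the widest table."""
    items = depth = columns = 0
    for line in markdown.splitlines():
        match = ITEM_RE.match(line)
        if match:
            items += 1
            indent = len(match.group(1).expandtabs(indent_width))
            depth = max(depth, indent // indent_width + 1)
        elif line.strip().startswith("|"):
            cells = line.strip().strip("|").split("|")
            columns = max(columns, len(cells))
    return {"list_items": items, "max_list_depth": depth, "max_table_columns": columns}
```

A result with `max_list_depth` of 3 or more, or `max_table_columns` of 5 or more, would match the thresholds given in the step above and could trigger the stricter checks.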
Render-test mentally or explicitly. Walk through the output as a CommonMark parser would. Check that each element would render as intended in a standard Markdown viewer. If any ambiguity exists, add explicit structural cues (blank lines, indentation) rather than relying on parser leniency.
Example 1: Fixing a broken nested list
User: "Fix the formatting in this Markdown — the nested list isn't rendering right"
## Features
- Authentication
- OAuth 2.0 support
- JWT token management
- Access tokens
- Refresh tokens
- Session handling
- Database
- PostgreSQL with connection pooling
- Redis caching layer
- API Layer
- REST endpoints
- GraphQL schema
Approach:
Output:
## Features

- Authentication
    - OAuth 2.0 support
    - JWT token management
        - Access tokens
        - Refresh tokens
    - Session handling
- Database
    - PostgreSQL with connection pooling
    - Redis caching layer
- API Layer
    - REST endpoints
    - GraphQL schema
Example 2: Generating a well-structured report with mixed content
User: "Write a deployment checklist as Markdown with a summary table, nested task lists, and code examples"
Approach:
Output:
# Deployment Checklist
## Summary
| Phase       | Tasks | Owner   | Status  |
| ----------- | ----- | ------- | ------- |
| Pre-deploy  | 4     | Backend | Pending |
| Deploy      | 3     | DevOps  | Pending |
| Post-deploy | 2     | QA      | Pending |
## Pre-deploy Tasks
- Run test suite
    - Unit tests
    - Integration tests
    - Load tests (threshold: p99 < 200ms)
- Database migrations
    - Back up production database
    - Run migrations in staging first
```bash
./scripts/run-migrations.sh --env staging --dry-run
git tag -a v2.4.0 -m "Release 2.4.0" && git push origin v2.4.0
```

Example 3: Reformatting a malformed table

User: "This table is broken, fix it"

```markdown
| Name | Role | Team
| --- | ---
Alice | Engineer | Platform |
| Bob | Designer | Product
| Carol | PM | Platform |
```
Approach:
Output:
| Name  | Role     | Team     |
| ----- | -------- | -------- |
| Alice | Engineer | Platform |
| Bob   | Designer | Product  |
| Carol | PM       | Platform |
Do:
Avoid:
- Inconsistent list markers within one list (- and *, or mixing 1. and - at the same nesting level)

Ambiguous nesting: When input content has ambiguous parent-child relationships in lists, default to the shallowest valid nesting. Ask the user for clarification rather than guessing deep nesting that may misrepresent the hierarchy.
Column count mismatch in tables: When table rows have inconsistent cell counts, pad shorter rows with empty cells rather than truncating longer rows. Never silently drop data.
Unclosed code fences: If a code block lacks a closing fence, close it at the next heading or end-of-document boundary. Flag this to the user — an unclosed fence typically indicates content that was accidentally captured inside the block.
Content loss during reformatting: If any atomic unit from the original cannot be placed within the target structure without distortion, preserve it as-is in a dedicated section and flag the conflict rather than silently dropping it. Semantic fidelity takes precedence over structural perfection.
Heading hierarchy violations in source: When the input has heading level jumps (e.g., H1 directly to H4), insert corrective intermediate headings only if the user approves. Otherwise, normalize to the nearest valid level while noting the adjustment.
Paper: FMBench: Adaptive Large Language Model Output Formatting — Wang, Zhou & Ding (2026). Focus on Section 3 (pipeline architecture and atomic unit segmentation), Table 1 (semantic vs. structural score trade-offs), and Figure 2 (complexity analysis showing nested lists as the primary difficulty driver).
Code: github.com/FudanCVL/FMBench