skills/paper2markdown/SKILL.md
Convert academic papers, theses, and technical documents from .docx to clean Markdown with proper LaTeX formulas, equation numbering, figures, and tables. Triggers when the user mentions converting a .docx paper/thesis/document to Markdown, extracting formulas from Word to LaTeX, or cleaning up pandoc-generated Markdown with formula artifacts. Also triggers when the user says "docx转md", "word转markdown", "论文转markdown", "公式转latex", or mentions that pandoc output has broken/escaped formulas.
Convert academic papers (.docx) to clean, publication-ready Markdown with proper LaTeX formulas, equation numbering, figures, and tables.
Given source file <name>.docx, output is:
<name>.md # Clean Markdown output
<name>.artifacts/ # Extracted images and media
├── image1.name.png
├── image2.name.png
└── ...
All commands below use INPUT=<name>.docx and OUTPUT=<name> as placeholders.
This skill works standalone (using Python zipfile for XML access) but benefits from
the docx skill for more reliable XML analysis and image extraction:
| Scenario | With docx skill | Standalone fallback |
|----------|----------------|--------------------|
| Unpack & read XML | python {docx-skill}/scripts/office/unpack.py INPUT unpacked/ | zipfile.ZipFile(INPUT).read('word/document.xml') |
| Image relationship check | Inspect word/_rels/document.xml.rels for orphan images | Parse rels XML via zipfile |
| Tracked changes | pandoc --track-changes=all | Same (pandoc handles it) |
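The standalone fallback column can be sketched in a few lines of Python. This is a minimal sketch, not part of the skill itself: the helper names `read_docx_parts` and `list_media` are hypothetical, and it assumes the .docx uses the standard `word/` part layout.

```python
import zipfile


def read_docx_parts(docx, names=("word/document.xml", "word/_rels/document.xml.rels")):
    """Return {part name: decoded XML text} for the requested parts of a .docx.

    `docx` may be a filesystem path or a binary file-like object.
    Parts missing from the archive are skipped rather than raising.
    """
    parts = {}
    with zipfile.ZipFile(docx) as z:
        available = set(z.namelist())
        for name in names:
            if name in available:
                parts[name] = z.read(name).decode("utf-8")
    return parts


def list_media(docx):
    """List archive members stored under word/media/ (candidate images)."""
    with zipfile.ZipFile(docx) as z:
        return sorted(
            n for n in z.namelist()
            if n.startswith("word/media/") and not n.endswith("/")
        )
```

Calling `read_docx_parts("model.docx")` would give the same XML text that the unpack script writes to disk, without creating an `unpacked/` directory.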
.docx → pandoc extraction → formula detection → formula cleanup →
table conversion → figure handling → equation numbering → final polish
Important: The formula source may be OMML/MathType (producing pandoc escape artifacts) OR plain Unicode text with no escaping. Detect which case you're in before choosing a cleanup strategy (see Phase 2).
# Set variables
INPUT="model.docx"
OUTPUT="model"
# Convert docx to raw Markdown
pandoc "$INPUT" -t markdown --wrap=none -o "$OUTPUT.raw.md"
This produces raw Markdown with potential escaping artifacts. Do NOT use
+tex_math_dollars — equations in .docx are OMML/MathType, not native LaTeX,
so it has no effect.
Also extract images (even if pandoc fails, try manual unzip):
# Attempt pandoc extraction
mkdir -p "$OUTPUT.artifacts"
pandoc "$INPUT" --extract-media="$OUTPUT.artifacts" -t markdown --wrap=none -o /dev/null
# If pandoc produced no files, manually unzip
if [ -z "$(ls -A "$OUTPUT.artifacts" 2>/dev/null)" ]; then
  unzip -o "$INPUT" "word/media/*" -d "$OUTPUT.artifacts/"
  mv "$OUTPUT.artifacts/word/media/"* "$OUTPUT.artifacts/" 2>/dev/null
  rm -rf "$OUTPUT.artifacts/word"
fi
# Move pandoc's media subdirectory up if needed
mv "$OUTPUT.artifacts/media/"* "$OUTPUT.artifacts/" 2>/dev/null
rmdir "$OUTPUT.artifacts/media" 2>/dev/null
Check the raw Markdown for formula encoding. Not all papers use OMML/MathType — many Chinese academic papers have formulas as plain Unicode text with no LaTeX escaping at all.
# Check for OMML/MathType artifacts (grep -c counts matching lines, not occurrences)
grep -c '\\\[' "$OUTPUT.raw.md" # lines with display math escapes
grep -c '\\\$' "$OUTPUT.raw.md" # lines with inline math escapes
# Check for Unicode math symbols (plain text formulas)
grep -c '[∑∏∫∂ΔΓΘΩαβγ]' "$OUTPUT.raw.md"
| Detection result | Cleanup strategy |
|-----------------|-----------------|
| Many \[ / \$ | Fix pandoc escaping artifacts (OMML path below) |
| Many Unicode math, no \[ | Convert Unicode → LaTeX, wrap in $...$/$$...$$ (Unicode path) |
| Mixed | Process OMML artifacts first, then convert remaining Unicode |
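The decision table above could be automated roughly as follows. This is a sketch under stated assumptions: `detect_formula_source` is a hypothetical helper, the `threshold` that stands in for "many" is an arbitrary cut-off to tune per document, and the Unicode set is the same small sample used by the grep command.

```python
import re

# Pandoc escape artifacts: a literal backslash followed by "[" or "$".
ESCAPE_RE = re.compile(r"\\[\[$]")
# Small, non-exhaustive sample of Unicode math symbols (same set as the grep check).
UNICODE_MATH_RE = re.compile(r"[∑∏∫∂ΔΓΘΩαβγ]")


def detect_formula_source(raw_md, threshold=3):
    """Classify raw pandoc Markdown as 'omml', 'unicode', 'mixed', or 'none'.

    `threshold` is an assumed cut-off for "many" occurrences.
    """
    has_escapes = len(ESCAPE_RE.findall(raw_md)) >= threshold
    has_unicode = len(UNICODE_MATH_RE.findall(raw_md)) >= threshold
    if has_escapes and has_unicode:
        return "mixed"
    if has_escapes:
        return "omml"
    if has_unicode:
        return "unicode"
    return "none"
```

Unlike `grep -c`, this counts occurrences rather than lines, which matters because `--wrap=none` puts each paragraph on a single line.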
| Pandoc output | Correct LaTeX | Notes |
|---|---|---|
| \\[ ... \\] | $$ ... $$ | Display math |
| \$ ... \$ | $ ... $ | Inline math |
| {{X}_{y}} | X_{y} | Double braces → single |
| {{X}\_{y}} | X_{y} | Escaped underscore in braces |
| \~X | \tilde{X} | Tilde commands |
| \~{{X}_{y}} | \tilde{X}_{y} | Combined pattern |
| \\frac, \\sum, \\max, \\min | \frac, \sum, \max, \min | Double backslash → single |
| \\left{ ... \\right} | \{ ... \} or \left\{ ... \right\} | Brace escaping |
| \\text{ } | \; or remove | Excess spacing |
| \\begin{aligned} etc. | \begin{aligned} | Environment commands |
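A few rows of the table above can be expressed as ordered regex rewrites. The sketch below is illustrative only: `clean_omml_artifacts` is a hypothetical name, it works on one formula body at a time, and it deliberately leaves the math delimiters and the `\left{ ... \right}` row alone because those need document-level context.

```python
import re


def clean_omml_artifacts(formula):
    """Apply some of the table's rewrite rules to one formula body."""
    s = formula
    # \\frac, \\sum, \\begin ... -> \frac, \sum, \begin.
    # Caveat: a LaTeX line break "\\" typed directly before a letter would also match.
    s = re.sub(r"\\\\([A-Za-z]+)", r"\\\1", s)
    # \_ -> _ (escaped underscore)
    s = s.replace("\\_", "_")
    # {{X}_{y}} -> X_{y} (double braces around a subscripted symbol)
    s = re.sub(r"\{\{([^{}]+)\}_\{([^{}]*)\}\}", r"\1_{\2}", s)
    # \~X -> \tilde{X} (single-letter symbols only)
    s = re.sub(r"\\~([A-Za-z])", r"\\tilde{\1}", s)
    # \text{ } containing only whitespace -> \;
    s = re.sub(r"\\text\{\s+\}", r"\\;", s)
    return s
```

The order matters: the escaped underscore is fixed before the double-brace rule runs, so `{{X}\_{y}}` and `{{X}_{y}}` both end up as `X_{y}`.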
- .wmf formula images: Inline variables rendered as images in Word. Convert to inline LaTeX: $t$, $r$, etc. Remove .wmf files after conversion.
- \mathcal, \mathbb, \mathbf: Preserve as-is
- \text{...} with Chinese: Preserve and fix spacing
- \begin{cases}: Ensure proper syntax with \text{} for Chinese labels
- \left\{ ... \right\} → \{ ... \} for simpler cases

Verify:
- \tilde, \hat, \bar commands work
- X_{i,j} not {{X}_{i,j}}
- \frac{a}{b} not \\frac{a}{b}
- No leftover \text{ } blocks
- $$ ... $$ pairs are properly closed

When formulas are plain Unicode (common in Chinese academic .docx), the task is:
convert symbols to LaTeX AND wrap in $...$ or $$...$$.
Display formulas (wrap in $$...$$):
- Expressions containing = or advanced operators

Inline formulas (wrap in $...$):
- Variables such as A^{0}, X^{M}, a_{i}
- Function expressions such as F(·), F(A^{0})

| Unicode | LaTeX | Unicode | LaTeX |
|---------|-------|---------|-------|
| Γ | \Gamma | Δ | \Delta |
| Ω | \Omega | θ | \theta |
| ε | \varepsilon | η | \eta |
| ∑ | \sum | ∏ | \prod |
| ∂ | \partial | ∞ | \infty |
| ≤ | \leq | ≥ | \geq |
| ∈ | \in | → | \rightarrow |
| · | \cdot | × | \times |
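The symbol table above maps naturally onto a dictionary applied in a single regex pass, so the output of one replacement is never re-processed by another. This is a minimal sketch: `unicode_to_latex` is a hypothetical helper and the dictionary covers only the symbols listed in the table.

```python
import re

UNICODE_TO_LATEX = {
    "Γ": r"\Gamma", "Δ": r"\Delta", "Ω": r"\Omega", "θ": r"\theta",
    "ε": r"\varepsilon", "η": r"\eta", "∑": r"\sum", "∏": r"\prod",
    "∂": r"\partial", "∞": r"\infty", "≤": r"\leq", "≥": r"\geq",
    "∈": r"\in", "→": r"\rightarrow", "·": r"\cdot", "×": r"\times",
}
_SYMBOLS = re.compile("|".join(map(re.escape, UNICODE_TO_LATEX)))


def unicode_to_latex(expr):
    """Replace the Unicode symbols from the table with LaTeX commands."""
    def repl(m):
        cmd = UNICODE_TO_LATEX[m.group(0)]
        following = expr[m.end():m.end() + 1]
        # Add a space when a letter follows, so "Δx" becomes "\Delta x", not "\Deltax".
        return cmd + " " if following.isalpha() else cmd
    # A function replacement is inserted literally, so backslashes need no extra escaping.
    return _SYMBOLS.sub(repl, expr)
```

For example, `"$" + unicode_to_latex("x ∈ Ω") + "$"` would produce an inline formula with `\in` and `\Omega` in place of the Unicode characters.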
After individual variable wrapping, scan for adjacent $...$ blocks that should be
one expression. Merge patterns like:
- F($A^{0}$) → $F(A^{0})$
- $P_{i}($x_{i}$)$ → $P_{i}(x_{i})$
- $a_{i}$^{E} → $a_{i}^{E}$

When formulas contain both set notation {a, b} (needs \{, \}) AND LaTeX groups
^{X} (needs {, }), the order matters:
1. Fix pandoc escapes (\^0 → ^{0}, \_i → _{i}) # now has { and }
2. Protect ^{...} and _{...} groups (mark as safe) # these { } must survive
3. Escape remaining { } → \{ \} # set notation braces
4. Restore protected groups
NEVER globally replace('{','\\{') after step 1 — it destroys LaTeX group braces.
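The four steps above can be sketched with placeholder markers. This is one possible implementation under assumptions, not the skill's own code: `escape_set_braces` is a hypothetical name, NUL-delimited indices serve as the temporary markers, and only `^{...}` and `_{...}` groups are protected, exactly as the steps describe. Command arguments such as `\frac{a}{b}` would need the same protection before this could be used on arbitrary formulas.

```python
import re

GROUP_RE = re.compile(r"[\^_]\{[^{}]*\}")      # innermost ^{...} or _{...} group
PLACEHOLDER_RE = re.compile(r"\x00(\d+)\x00")  # NUL-delimited index into `saved`


def escape_set_braces(expr):
    # Step 1: fix pandoc escapes (\^0 -> ^{0}, \_i -> _{i}, \_{i} -> _{i}).
    expr = re.sub(r"\\([\^_])([A-Za-z0-9])", r"\1{\2}", expr)
    expr = re.sub(r"\\([\^_])(?=\{)", r"\1", expr)

    # Step 2: protect groups, innermost first, repeating until nested ones are covered.
    saved = []

    def protect(m):
        saved.append(m.group(0))
        return f"\x00{len(saved) - 1}\x00"

    while True:
        expr, count = GROUP_RE.subn(protect, expr)
        if count == 0:
            break

    # Step 3: escape the remaining (set-notation) braces, skipping ones already escaped.
    expr = re.sub(r"(?<!\\)([{}])", r"\\\1", expr)

    # Step 4: restore protected groups; loop because a saved group may hold placeholders.
    while "\x00" in expr:
        expr = PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], expr)
    return expr
```

Restoring through a function replacement avoids the backslash-escape problems that a plain replacement string would hit in `re.sub`.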
Verify:
- All display formulas use $$...$$, all inline use $...$
- No bare A^{0} without $ wrapping
- No split expressions such as F($A^{0}$)
- $$ and $ pairs are balanced

Pandoc may output tables as grid tables (ASCII art with +/-/| separators).
Convert ALL tables to standard Markdown pipe-table format:
| Column A | Column B | Column C |
|----------|----------|----------|
| Value 1 | Value 2 | Value 3 |
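For simple grid tables the conversion can be scripted. The sketch below is an assumption-laden illustration: `grid_to_pipe` is a hypothetical helper, it treats the first row as the header, and it does not handle cell spans or cells that contain a literal `|`.

```python
def grid_to_pipe(grid):
    """Convert a simple pandoc grid table (no cell spans) to a pipe table."""
    rows, current = [], None
    for line in grid.strip().splitlines():
        line = line.strip()
        if line.startswith("+"):
            # A border line closes the row collected so far.
            if current is not None:
                rows.append(current)
                current = None
        elif line.startswith("|"):
            cells = [c.strip() for c in line[1:-1].split("|")]
            if current is None:
                current = cells
            else:
                # Continuation line of a multi-line cell: join with a space.
                current = [f"{a} {b}".strip() for a, b in zip(current, cells)]
    if current is not None:
        rows.append(current)
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    out = ["| " + " | ".join(header) + " |",
           "|" + "|".join("---" for _ in header) + "|"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)
```

Tables with merged cells are better converted by hand, since pipe tables cannot represent spans.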
- Formulas in table cells stay inline: $Y_t$, $\Delta \tilde{Y}_t$
- Use |---| for header separator

Images should already be extracted to $OUTPUT.artifacts/ from Phase 1.
Not all files in word/media/ are referenced in the document. Check:
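python3 -c "
import zipfile, re
with zipfile.ZipFile('$INPUT') as z:
    doc = z.read('word/document.xml').decode('utf-8')
    rels = z.read('word/_rels/document.xml.rels').decode('utf-8')
# Collect image rIds one <Relationship> element at a time, so an Id is never
# paired with the Target of a later element
img_rids = set()
for rel in re.findall(r'<Relationship\b[^>]*>', rels):
    rid = re.search(r'\bId=\"(rId\d+)\"', rel)
    if rid and re.search(r'Target=\"media/', rel):
        img_rids.add(rid.group(1))
# Find rIds actually referenced in the document body (r:embed for drawings, r:id for VML images)
used_rids = set(re.findall(r'r:(?:embed|id)=\"(rId\d+)\"', doc))
orphan = img_rids - used_rids
if orphan:
    print(f'Orphan images (not referenced): {sorted(orphan)}')
"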
Delete orphan images from $OUTPUT.artifacts/ if confirmed.
- Formula fragments (.wmf files): These are inline formulas rendered as images in Word. Convert to LaTeX inline ($...$), then delete the .wmf file.
- Figures (.png, .jpg): Actual charts, diagrams, screenshots.

Rename figures to imageN.figure_name.png format, matching the Word internal
sequence number. Remove spaces from filenames (they break Markdown URLs):
cd "$OUTPUT.artifacts"
for f in *.png; do n="$(printf '%s' "$f" | tr -d ' ')"; [ "$f" = "$n" ] || mv -- "$f" "$n"; done
Use <figure> for semantics. Replace plain Markdown image references with HTML <figure> elements:
<figure>
<img src="./模型.artifacts/image3.图1持续改善型.png" alt="持续改善型">
<figcaption align="center">图 1 持续改善型</figcaption>
</figure>
- src path: ./$OUTPUT.artifacts/imageN.name.png
- alt text: remove the "图N " ("Figure N") prefix, keep only the descriptive name
- <figcaption>: keep the full "图 N 名称" ("Figure N name") with align="center"
- Remove pandoc's {width="..." height="..."} attributes

Word internal image numbering may skip numbers (formula images are also numbered), so the actual figure images will have gaps in their sequence. Either keep the Word numbering with its gaps or renumber the figures sequentially. Ask the user which approach they prefer, or default to renumbering for clarity.
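The figure rules above can be sketched as two small helpers. Both names (`figure_html`, `renumber`) are hypothetical, and the sketch assumes captions follow the "图 N 名称" pattern and filenames follow `imageN.name.ext`.

```python
import html
import re

CAPTION_PREFIX = re.compile(r"^图\s*\d+\s*")  # leading "图 N " (Figure N) prefix


def figure_html(src, caption):
    """Build a <figure> block: alt drops the prefix, figcaption keeps the full caption."""
    alt = CAPTION_PREFIX.sub("", caption)
    return (
        "<figure>\n"
        f'<img src="{html.escape(src)}" alt="{html.escape(alt)}">\n'
        f'<figcaption align="center">{html.escape(caption)}</figcaption>\n'
        "</figure>"
    )


def renumber(filenames):
    """Map imageN.name.ext files to gap-free numbers, keeping their original order."""
    def index(name):
        return int(re.match(r"image(\d+)\.", name).group(1))

    ordered = sorted(filenames, key=index)
    return {
        old: re.sub(r"^image\d+", f"image{i}", old)
        for i, old in enumerate(ordered, start=1)
    }
```

`renumber` only returns the old-to-new mapping; the actual renames and the matching `src` updates in the Markdown would be applied from that mapping in one step so the two never drift apart.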
Word stores equation numbers as SEQ MTEqn field codes, which pandoc cannot
preserve. Since the numbering is sequential, recover it:
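python3 -c "
import zipfile
# Read document.xml straight from the .docx, so no unpacked/ directory is needed
with zipfile.ZipFile('$INPUT') as z:
    xml = z.read('word/document.xml').decode('utf-8')
count = xml.count('SEQ MTEqn \\\\h')  # formula markers
print(f'Formulas: {count}')
"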
Append \qquad (N) before the closing $$ of each display formula.
This format works across ALL Markdown renderers (GitHub, MathJax, KaTeX).
$$ z_{m,t} = \frac{x_{m,t} - \min x_{m,t}}{\max x_{m,t} - \min x_{m,t} + \varepsilon} \qquad (1) $$
- Number only display formulas ($$...$$), not inline ($...$)
- For \begin{aligned}, \begin{cases} etc., place \qquad (N) on the last line before \end{...}

perl -i -0777 -pe '
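my $n = 0;
s{(\$\$)(.+?)(\$\$)}{
my ($open, $body, $close) = ($1, $2, $3); # copy first: a successful match below resets the capture variables
$n++;
if ($body =~ /\\tag\{/ || $body =~ /\\qquad\s*\(/) {
"$open$body$close";
} else {
"$open$body \\qquad ($n)$close";
}
}gse;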
' "$OUTPUT.md"
WARNING: If you need to skip a formula, do it BEFORE batch numbering, then renumber all formulas after the skip point.
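The perl one-liner always appends the number just before the closing $$. A Python variant that also follows the rule for \begin{aligned} and \begin{cases} blocks might look like the sketch below. It is an illustration, not the skill's own method: `number_equations` is a hypothetical name, and like the perl version it advances the counter even for formulas that already carry a number.

```python
import re

DISPLAY_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
NUMBERED_RE = re.compile(r"\\tag\{|\\qquad\s*\(")


def number_equations(md, start=1):
    """Append \\qquad (N) to each display formula that is not numbered yet."""
    n = start - 1

    def repl(m):
        nonlocal n
        n += 1
        body = m.group(1)
        if NUMBERED_RE.search(body):
            return m.group(0)  # already numbered: keep as-is, counter still advances
        label = rf"\qquad ({n})"
        idx = body.rfind("\\end{")
        if idx == -1:
            return f"$${body.rstrip()} {label} $$"
        # Environment: put the label on the last line, before the final \end{...}.
        head, tail = body[:idx], body[idx:]
        stripped = head.rstrip()
        gap = head[len(stripped):] or " "
        return f"$${stripped} {label}{gap}{tail}$$"

    return DISPLAY_RE.sub(repl, md)
```

Passing `start` allows renumbering from a given point, which fits the skip-then-renumber workflow described in the warning above.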
# Check $$ and $ pairs are balanced
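python3 -c "
import re
text = open('$OUTPUT.md', encoding='utf-8').read()
dd = text.count('\$\$')
# The dollar signs in the regex are triple-escaped so that Python receives
# escaped literal dollars instead of end-of-line anchors
s = re.sub(r'\\\$\\\$.*?\\\$\\\$', '', text, flags=re.DOTALL).count('\$')
print(f'\$\$: {dd} (even={dd%2==0}) single \$: {s} (even={s%2==0})')
eqns = re.findall(r'\\\\qquad \((\d+)\)', text)
span = f'{eqns[0]}-{eqns[-1]}' if eqns else 'none'
print(f'Equations numbered: {len(eqns)}, range: {span}')
"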
- Remove {width="..." height="..."} from image references
- Fix \[ \] → [ ] inside $$...$$ blocks (pandoc may escape brackets)
- Verify $$...$$ pairs are balanced
- Verify grep -c '\\qquad (' "$OUTPUT.md" matches expected formula count

Common pitfalls:
- Global brace escaping: replace('{','\\{') after fix_escapes destroys LaTeX group braces ^{X} → ^\{X\}. Always protect groups first.
- Backslashes in re.sub replacements: replacement strings like '$\\Gamma$' cause bad escape \G errors. Use a function instead, e.g. lambda m: '$\\Gamma$', whose return value is inserted literally (inside a double-quoted shell string, write '$\\\\Gamma$').
- Escaped brackets: [ ] may become \[ \] in raw output. Inside $$...$$, these create invalid nested display math. Fix to [ ].
- Chained substitutions: sed -e 's/a/b/' -e 's/b/c/' turns a → c. Use single-pass perl with hash lookup or temporary markers.
- .wmf vs .png: .wmf files are formula fragments, not figures.
- Pipes in table cells: use \| or HTML entities if needed.
- Chinese in \text{}: Pandoc may double-escape; check manually.
- Equation cross-references: SEQ MTEqn \c in fldSimple format are cross-references, not formula counter increments. Don't count them as separate formulas.
- Orphan images: files in word/media/ may have no references in document.xml. Check document.xml.rels before assuming all are figures.
- Split inline math: scan for adjacent $...$ blocks (like F($A^{0}$)) and merge into one expression.

INPUT="model.docx"
OUTPUT="model"
# Convert docx to raw md + extract images
pandoc "$INPUT" -t markdown --wrap=none -o "$OUTPUT.raw.md"
mkdir -p "$OUTPUT.artifacts"
pandoc "$INPUT" --extract-media="$OUTPUT.artifacts" -t markdown --wrap=none -o /dev/null
# Fallback: unzip -o "$INPUT" "word/media/*" -d "$OUTPUT.artifacts/"
# Remove formula image fragments
rm -f "$OUTPUT.artifacts/"*.wmf
# Remove spaces from image filenames
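# (subshell, so later commands still run from the original directory)
(cd "$OUTPUT.artifacts" && for f in *.png; do n="$(printf '%s' "$f" | tr -d ' ')"; [ "$f" = "$n" ] || mv -- "$f" "$n"; done)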
# Count formulas in original docx
python3 -c "
import zipfile
# Read document.xml straight from the .docx (no unpacked/ directory needed)
with zipfile.ZipFile('$INPUT') as z:
    xml = z.read('word/document.xml').decode('utf-8')
print('Formulas:', xml.count('SEQ MTEqn \\\\h'))
"
# Clean Markdown is written to $OUTPUT.md