framework_eng/skills/tool-usage/content-generation/docx-convert/SKILL.md
Use for converting Word documents (.docx) to Markdown with extracted images (requirements, spec, instructions, vendor documentation). Helps obtain GFM text via pandoc with post-processing of HTML tables and image paths.
A thin wrapper around pandoc for converting .docx to GitHub-Flavored Markdown with extraction of embedded images. It also post-processes the result: fixes image paths and turns HTML tables (which pandoc leaves as-is in complex cases) into Markdown pipe tables.
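The table post-processing step can be illustrated with a minimal sketch using only the Python standard library. This is not the actual html_tables_to_md.py; `TableParser` and `html_table_to_pipe` are hypothetical names, and the sketch assumes a simple, well-formed table without merged cells or nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of one simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being filled
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace so the cell fits on one Markdown line.
            text = " ".join("".join(self._cell).split())
            # A literal pipe would break the Markdown row, so escape it.
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_pipe(html):
    """Hypothetical helper: render one HTML table as a GFM pipe table."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return html  # nothing recognisable, keep the HTML as-is
    width = max(len(row) for row in parser.rows)
    # Pad short rows so every row has the same number of columns.
    rows = [row + [""] * (width - len(row)) for row in parser.rows]
    lines = ["| " + " | ".join(rows[0]) + " |"]
    lines.append("|" + "|".join(["---"] * width) + "|")
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)
```

The first row is treated as the header because GFM pipe tables require one. Tables with rowspan or colspan would need extra handling, or could be left as HTML.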
| Situation | Action |
|----------|----------|
| The client sent requirements in .docx, and you need to put them in the repository as md | `docx2md.sh input.docx` |
| Vendor documentation is in Word, and you need to feed it to the agent | `docx2md.sh input.docx output_dir` |
| The document has complex tables and styles that pandoc skips | mammoth (see below) |
| You only need the text content without images | `pandoc input.docx --to=gfm -o out.md` directly |
When the script is not needed:
- For HTML output, run `pandoc --to=html` directly.
- For PDF output, run `pandoc --to=pdf` directly (requires LaTeX).

Dependencies:
- `pandoc` ≥ 3.x: the main converter
- `python3`: post-processing (`html_tables_to_md.py`)
- `mammoth` (Python): an optional alternative for complex tables

# The result is placed next to the file, in a directory with the same name without the extension
bash framework/skills/tool-usage/content-generation/docx-convert/docx2md.sh "/path/to/file.docx"
# With an explicit output directory
bash framework/skills/tool-usage/content-generation/docx-convert/docx2md.sh "/path/to/file.docx" "/path/to/output"
Result:
- output/document.md: text in GFM with pipe tables
- output/images/: all images from the document (png/jpeg/emf/wmf)

When used from a project where the framework is installed via symlinks, the script path is .claude/skills/docx-convert/docx2md.sh.
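The default output location described above (a directory next to the source file, named after it without the extension) can be sketched in Python. `default_output_dir` is a hypothetical helper for illustration, not part of docx2md.sh.

```python
from pathlib import Path


def default_output_dir(docx_path, output_dir=None):
    """Hypothetical helper: pick the directory the results would go to."""
    if output_dir:
        # An explicit second argument wins over the default.
        return Path(output_dir)
    # Drop only the final extension: /path/to/file.docx -> /path/to/file
    return Path(docx_path).with_suffix("")
```

For example, `default_output_dir("/path/to/file.docx")` yields `/path/to/file`, while passing `"/path/to/output"` as the second argument returns that directory unchanged.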
pandoc input.docx --from=docx --to=gfm --wrap=none -o output.md
pandoc input.docx --from=docx --to=gfm --wrap=none \
--extract-media=./images \
-o output.md
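With `--extract-media=./images`, pandoc writes the extracted files into a `media/` subfolder of that directory, and images that carry size attributes may come out as raw `<img>` tags in the GFM output. A minimal sketch of the kind of image post-processing this implies is shown below. The function name, the regular expressions, and the prefix defaults are assumptions for illustration. The path rewrite is only valid if the files have also been moved from `images/media/` to `images/`.

```python
import re

# Matches an <img> tag and captures its src attribute.
IMG_TAG = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"[^>]*>', re.IGNORECASE)
ALT_ATTR = re.compile(r'\balt="([^"]*)"', re.IGNORECASE)


def img_tags_to_markdown(text, old_prefix="./images/media/", new_prefix="images/"):
    """Hypothetical helper: turn <img> tags into Markdown image syntax."""

    def replace(match):
        src = match.group(1)
        # Assumed layout change: images/media/x.png -> images/x.png
        if src.startswith(old_prefix):
            src = new_prefix + src[len(old_prefix):]
        alt = ALT_ATTR.search(match.group(0))
        return f"![{alt.group(1) if alt else ''}]({src})"

    return IMG_TAG.sub(replace, text)
```

Size attributes such as `style="width:2in"` are dropped in this sketch, since plain Markdown image syntax has no place for them.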
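python3 -c "
import mammoth, pathlib
# Convert with mammoth; result.messages holds any conversion warnings
with open('input.docx', 'rb') as docx_file:
    result = mammoth.convert_to_markdown(docx_file)
pathlib.Path('output.md').write_text(result.value, encoding='utf-8')
"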
- .doc (legacy format): pandoc cannot read it; of the Word formats it accepts only .docx. First resave the file as .docx through LibreOffice or Word.
- The post-processing script (html_tables_to_md.py) handles only HTML tables and <img> tags left by pandoc; the rest of the HTML is kept as-is.
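Resaving a legacy .doc file can be automated with LibreOffice in headless mode. The sketch below assumes the `soffice` binary is on PATH; `doc_to_docx_command` and `convert_doc` are hypothetical helper names, not part of this skill.

```python
import subprocess
from pathlib import Path


def doc_to_docx_command(doc_path, out_dir=None):
    """Build the LibreOffice command line that resaves a .doc as .docx."""
    source = Path(doc_path)
    # By default the converted file is written next to the source.
    target_dir = Path(out_dir) if out_dir else source.parent
    return [
        "soffice", "--headless",
        "--convert-to", "docx",
        "--outdir", str(target_dir),
        str(source),
    ]


def convert_doc(doc_path, out_dir=None):
    """Run the conversion and return the expected path of the .docx file."""
    command = doc_to_docx_command(doc_path, out_dir)
    subprocess.run(command, check=True)  # raises if soffice exits non-zero
    return Path(command[5]) / (Path(doc_path).stem + ".docx")
```

The resulting .docx can then be passed to docx2md.sh or to pandoc as usual.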