instances/xiaodazi/skills/mineru-pdf/SKILL.md
Parse PDF documents locally into structured Markdown/JSON using MinerU. CPU-only, privacy-first.
npx skillsauth add malue-ai/dazee-small mineru-pdf
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
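Parses PDF documents locally into structured Markdown or JSON, preserving heading hierarchy, tables, lists, and other structure. Runs on CPU, and data never leaves the machine.

| Tool | Strengths | Limitations |
|---|---|---|
| nano-pdf | Simple text extraction, PDF metadata | Does not preserve structure |
| pdf-toolkit | Merge, split, encrypt, watermark | Does no content parsing |
| mineru-pdf | Structured parsing (headings, tables, lists) | Large install package |

Prefer mineru-pdf for content extraction and pdf-toolkit for file operations.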
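The selection rule above can be written down as a small lookup. This is only an illustrative sketch: the task labels (`extract_content`, `merge`, and so on) are hypothetical names invented here, and only the three tool names come from the comparison.

```python
# Hypothetical task labels mapped to the tools named in the comparison.
# The labels themselves are assumptions made for this sketch.
TOOL_FOR_TASK = {
    "extract_content": "mineru-pdf",   # structured parsing: headings, tables, lists
    "read_metadata": "nano-pdf",       # PDF metadata
    "merge": "pdf-toolkit",
    "split": "pdf-toolkit",
    "encrypt": "pdf-toolkit",
    "watermark": "pdf-toolkit",
}


def choose_pdf_tool(task: str) -> str:
    """Return the tool name for a task label, or raise ValueError if unknown."""
    try:
        return TOOL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"unknown PDF task: {task!r}") from None
```

Raising on an unknown label, rather than falling back to a default tool, keeps a mistyped task from silently running the wrong tool.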
pip install magic-pdf
magic-pdf -p /path/to/document.pdf -o /path/to/output/ -m auto
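When the command is driven from a script, a thin wrapper around `subprocess` may be convenient. The sketch below assumes `magic-pdf` is on `PATH` and uses only the `-p`, `-o`, and `-m` flags shown in the command; the function names `build_magic_pdf_command` and `run_magic_pdf` are hypothetical helpers, not part of magic-pdf.

```python
import subprocess
from pathlib import Path

# Modes accepted by the -m flag, as listed in this document.
VALID_MODES = ("auto", "txt", "ocr")


def build_magic_pdf_command(pdf_path, output_dir, mode="auto"):
    """Build the argv list for the magic-pdf CLI (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", mode]


def run_magic_pdf(pdf_path, output_dir, mode="auto"):
    """Run magic-pdf and return the CompletedProcess; raises on a non-zero exit."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_magic_pdf_command(pdf_path, output_dir, mode)
    # Passing a list (no shell=True) avoids quoting problems with paths that contain spaces.
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Separating command construction from execution makes the argument handling easy to check without having magic-pdf installed.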
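Parameters:
- `-p`: input PDF path
- `-o`: output directory
- `-m`: parsing mode
  - `auto`: detect automatically (recommended)
  - `txt`: text-based PDF
  - `ocr`: scanned PDF

Python API:
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.pipe.UNIPipe import UNIPipe

# Placeholder values (assumptions); replace with your own paths.
pdf_path = "/path/to/document.pdf"
output_dir = "/path/to/output"
image_dir = "images"  # image directory referenced from the generated Markdown

reader = FileBasedDataReader("")
writer = FileBasedDataWriter(output_dir)
pdf_bytes = reader.read(pdf_path)
# The UNIPipe constructor arguments differ between magic-pdf releases; check the installed version.
pipe = UNIPipe(pdf_bytes, model_list=[], image_writer=writer)
pipe.pipe_classify()  # decide whether the PDF is text-based or scanned
pipe.pipe_analyze()   # run layout analysis
pipe.pipe_parse()     # build the structured result
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")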
After parsing, the output directory contains:
- `*.md`: structured content in Markdown format
- `images/`: extracted images
- `*.json`: structured metadata
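To pick up these results programmatically, something like the following could work. It is a sketch under assumptions: `collect_outputs` is a hypothetical helper, and it searches recursively because the exact nesting of files inside the output directory is not stated here.

```python
from pathlib import Path


def collect_outputs(output_dir):
    """Group generated files by kind: Markdown, JSON, and files under images/ folders."""
    root = Path(output_dir)
    # Any directory named "images" below the output root is treated as an image folder.
    image_dirs = [p for p in root.rglob("images") if p.is_dir()]
    return {
        "markdown": sorted(root.rglob("*.md")),
        "json": sorted(root.rglob("*.json")),
        "images": sorted(f for d in image_dirs for f in d.iterdir() if f.is_file()),
    }
```

A caller might then read the first Markdown file with `collect_outputs(output_dir)["markdown"][0].read_text(encoding="utf-8")`, after checking that the list is not empty.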