skills/qianfanocr-document-intelligence/SKILL.md
Analyze image files, image URLs, PDF files, and PDF URLs when the task requires recognizing, understanding, extracting, answering questions about, or locating content from the visual input. Typical uses include document parsing, layout analysis, element recognition, OCR, key information extraction, chart understanding, and document VQA. Do not use for plain text, structured data, code files, image-processing tasks, or cases where the needed information is already available in text form.
npx skillsauth add baidubce/skills qianfanocr-document-intelligence
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill orchestrates visual understanding for images and PDFs. It does not implement a vision model itself. It selects the right analysis mode, prepares inputs, invokes the bundled CLI, and returns a structured result for the upstream agent.
Always follow this order:
1. Check whether QIANFAN_TOKEN is already available.
2. If it is not, obtain the token from the user and save it to <skill-root>/.env as QIANFAN_TOKEN=....

This token preflight takes precedence over all later rules in this skill. Do not read references/*.md, do not select a mode, and do not call any bundled script until the token check has passed.
Before first use, make sure QIANFAN_TOKEN is available either in the process environment or in
<skill-root>/.env.
If the token is missing, ask the user in Chinese:
QIANFAN_TOKEN 环境变量未设置。请提供百度千帆 API Key。
如果您暂时没有 API Key,请到 https://cloud.baidu.com/product-s/qianfan_home 注册获取。
If the user provides the key, persist it to <skill-root>/.env before continuing. Do not rely on
a temporary export QIANFAN_TOKEN=... as the only storage mechanism.
Do not assume a bundled default token exists.
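The preflight described above can be sketched in Python. This is a minimal illustration, not part of the bundled scripts: the helper names `read_env_file`, `find_token`, and `persist_token` are hypothetical, and the `.env` parsing assumes plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path
from typing import Optional

TOKEN_NAME = "QIANFAN_TOKEN"


def read_env_file(env_path: Path) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_token(skill_root: Path, environ=None) -> Optional[str]:
    """Return the token from the environment, else from <skill-root>/.env, else None."""
    environ = os.environ if environ is None else environ
    token = environ.get(TOKEN_NAME) or read_env_file(skill_root / ".env").get(TOKEN_NAME)
    return token or None


def persist_token(skill_root: Path, token: str) -> None:
    """Write the token to <skill-root>/.env, replacing any earlier entry."""
    env_path = skill_root / ".env"
    kept = []
    if env_path.is_file():
        kept = [
            line
            for line in env_path.read_text(encoding="utf-8").splitlines()
            if not line.strip().startswith(f"{TOKEN_NAME}=")
        ]
    kept.append(f"{TOKEN_NAME}={token}")
    env_path.write_text("\n".join(kept) + "\n", encoding="utf-8")
```

If `find_token` returns `None`, stop and ask the user for the key; once the user supplies it, call `persist_token` so the key survives beyond the current shell session.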
- scripts/qianfan_ocr_cli.py: send one or more images to the backend VLM.
- scripts/pdf_to_images.py: convert one or more PDFs into per-page images before calling the VLM.
- scripts/render_doc_markdown.py: replace document-parsing image placeholders with cropped image files.
- scripts/run_document_parsing.py: run document parsing end-to-end and always render image placeholders.
- scripts/run_pdf_document_parsing.py: run PDF document parsing end-to-end and export combined markdown, shared assets, and per-page markdown files.
- scripts/run_document_parsing_with_layout.py: run document parsing with layout and export markdown, layout JSON, and a layout overlay image.
- scripts/run_layout_analysis.py: run layout analysis and export _layout.json plus a layout overlay image.
- scripts/run_element_recognition.py: run element recognition and save the result as a sibling markdown file.

Always call scripts by absolute path. In Codex, use the installed absolute skill path instead of a bare relative path.
Examples:
python3 "<skill-root>/scripts/qianfan_ocr_cli.py" "<prompt>" --image <path_or_url>
python3 "<skill-root>/scripts/pdf_to_images.py" <pdf_or_url> --output-dir <dir>
python3 "<skill-root>/scripts/render_doc_markdown.py" parsed.md --image <page_image> --output-dir <assets_dir> --output-markdown <rendered_md>
python3 "<skill-root>/scripts/run_document_parsing.py" <image_or_pdf>
python3 "<skill-root>/scripts/run_pdf_document_parsing.py" <pdf> --pages all
python3 "<skill-root>/scripts/run_document_parsing_with_layout.py" <image_or_pdf>
python3 "<skill-root>/scripts/run_layout_analysis.py" <image_or_pdf>
python3 "<skill-root>/scripts/run_element_recognition.py" <cropped_image_or_pdf> --element-type <text|formula|table>
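Trigger this skill only when all of the following are true:
- The input is an image file, image URL, PDF file, or PDF URL.
- The task requires recognizing, understanding, extracting, answering questions about, or locating content from that visual input.
- The needed information is not already available in text form.

Do not trigger when:
- The input is plain text, structured data, or a code file.
- The request is an image-processing task.
- The needed information is already available in text form.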
- Send images to qianfan_ocr_cli.py.
- Pass multiple --image flags only when cross-image reasoning is required.
- Convert PDFs to per-page images first with scripts/pdf_to_images.py.

Recommended PDF flow:
python3 "<skill-root>/scripts/pdf_to_images.py" report.pdf --output-dir /tmp/report-pages
python3 "<skill-root>/scripts/qianfan_ocr_cli.py" "<prompt>" --image /tmp/report-pages/report-p001.png
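The same two-step flow can be driven from Python. The sketch below is illustrative only: the function names are hypothetical, and it assumes the converter writes PNG page images into the output directory, as the `report-p001.png` example suggests. Each page is sent in its own request, since joint requests are reserved for cross-image reasoning.

```python
import subprocess
from pathlib import Path
from typing import List


def convert_command(skill_root: Path, pdf: str, pages_dir: str) -> List[str]:
    """argv for the PDF-to-images step, with the script given by absolute path."""
    script = skill_root.resolve() / "scripts" / "pdf_to_images.py"
    return ["python3", str(script), pdf, "--output-dir", pages_dir]


def analyze_command(skill_root: Path, prompt: str, page_image: str) -> List[str]:
    """argv for one single-image request to the CLI."""
    script = skill_root.resolve() / "scripts" / "qianfan_ocr_cli.py"
    return ["python3", str(script), prompt, "--image", page_image]


def run_pdf_flow(skill_root: Path, pdf: str, pages_dir: str, prompt: str) -> None:
    """Convert the PDF, then send each page image in its own request."""
    subprocess.run(convert_command(skill_root, pdf, pages_dir), check=True)
    # Assumption: page images are PNG files, as in the report-p001.png example.
    for page in sorted(Path(pages_dir).glob("*.png")):
        subprocess.run(analyze_command(skill_root, prompt, str(page)), check=True)
```

Passing argv lists rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.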
Select exactly one primary mode per call. If needed, make a second, more specific call after an initial pass.
| Mode | Use When | Goal |
|------|----------|------|
| document parsing | Need document structure, text, formulas, tables, and image placeholders from an image/PDF | Output Markdown parsing result |
| layout analysis | Need all layout elements with positions and categories | Output layout elements with bbox and category |
| element recognition | Need precise recognition on cropped elements such as text blocks, formulas, or tables | Output exact recognition for the cropped element |
| document parsing with layout | Need both structural parsing and layout detection in one workflow | Output Markdown parsing plus layout analysis |
| general ocr | Need all visible text without document structure | Extract all visible text lines |
| key information extraction | Need key fields from cards, forms, receipts, invoices, contracts, or similar documents | Extract key information in structured form |
| chart understanding | Need chart captions, structured chart content, or chart QA | Understand and structure chart content |
| doc vqa | Need answers to specific questions about a document image/PDF | Answer questions grounded in the document |
- Use document parsing for full-page document understanding where the output should be Markdown that preserves hierarchy.
- Use layout analysis when bounding boxes and categories are the main output.
- Use element recognition only after cropping the target region or when the user provides a single focused element image.
- Use scripts/run_element_recognition.py for element recognition so the result is written next to the source file as a single markdown file without any assets directory.
- Use document parsing with layout when both Markdown reconstruction and layout boxes are needed.
- Use general ocr for screenshots, signs, posters, and simple document text extraction where layout is not important.
- Use key information extraction for forms, certificates, IDs, invoices, receipts, contracts, and other field-centric documents.
- For key information extraction, if the user asks for all key-value information or all fields without naming a concrete field list, use the schema-free prompt path instead of inventing an explicit schema.
- Use chart understanding for plots, dashboards, and chart-heavy report pages.
- Use doc vqa for targeted questions such as totals, dates, clauses, page content, or whether a document contains a specific item.

Only after token preflight has passed and after selecting the mode, always read the corresponding file in references/ before composing the prompt or calling any script.
- document parsing -> references/document-parsing.md
- layout analysis -> references/layout-analysis.md
- element recognition -> references/element-recognition.md
- document parsing with layout -> references/document-parsing-with-layout.md
- general ocr -> references/general-ocr.md
- key information extraction -> references/key-information-extraction.md
- chart understanding -> references/chart-understanding.md
- doc vqa -> references/doc-vqa.md

Do not skip this step when a matching reference exists.
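One way to keep the mode-to-reference mapping from drifting is to hold it in a single lookup table. The sketch below is a hypothetical helper, not part of the skill; `REFERENCE_BY_MODE` and `reference_path` are illustrative names.

```python
from pathlib import Path

# Mode names and file names as listed in this skill.
REFERENCE_BY_MODE = {
    "document parsing": "document-parsing.md",
    "layout analysis": "layout-analysis.md",
    "element recognition": "element-recognition.md",
    "document parsing with layout": "document-parsing-with-layout.md",
    "general ocr": "general-ocr.md",
    "key information extraction": "key-information-extraction.md",
    "chart understanding": "chart-understanding.md",
    "doc vqa": "doc-vqa.md",
}


def reference_path(skill_root: Path, mode: str) -> Path:
    """Return the reference file for a mode, or raise for an unknown mode."""
    key = " ".join(mode.lower().split())  # normalize case and whitespace
    if key not in REFERENCE_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    return skill_root / "references" / REFERENCE_BY_MODE[key]
```

Raising on an unknown mode, rather than falling back to a default, makes a missed mode selection visible before any prompt is composed.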
When the selected reference contains a prompt template, prompt rule, fixed prompt, output format, or parameter recommendation, use that reference as the primary source of truth.
Prefer the matching references/*.md file over ad-hoc prompt writing.

If the selected reference defines execution parameters, convert them into actual CLI flags or request fields. Do not leave them as documentation-only notes.
Examples:
- min_dynamic_patch = 8 -> pass --min-dynamic-patch 8
- max_dynamic_patch = 24 -> pass --max-dynamic-patch 24
- thinking mode enabled -> pass --thinking

Before running qianfan_ocr_cli.py, verify that the final command includes the parameter settings required by the selected mode.
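The parameter-to-flag conversion and the final check can be sketched as two small helpers. Both function names are hypothetical, and the underscore-to-hyphen rule is an assumption inferred from the examples above, not a documented contract of the CLI.

```python
from typing import Dict, List


def params_to_flags(params: Dict[str, object]) -> List[str]:
    """Turn reference parameters into CLI arguments.

    Assumed convention: underscores become hyphens, True becomes a bare
    switch, and False or None means the flag is omitted.
    """
    flags: List[str] = []
    for name, value in params.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            flags.append(flag)
        elif value is False or value is None:
            continue
        else:
            flags.extend([flag, str(value)])
    return flags


def missing_flags(argv: List[str], required: List[str]) -> List[str]:
    """Return the required flag names (items starting with --) absent from argv."""
    return [item for item in required if item.startswith("--") and item not in argv]
```

For example, `params_to_flags({"min_dynamic_patch": 8, "max_dynamic_patch": 24, "thinking": True})` yields `["--min-dynamic-patch", "8", "--max-dynamic-patch", "24", "--thinking"]`, and `missing_flags` on the final argv should return an empty list before the command is run.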
- For element recognition, specify the element type: text, formula, table, figure caption, seal, signature block, and so on.
- For document parsing outputs that contain image placeholders, run scripts/render_doc_markdown.py before presenting the Markdown to users who need renderable local images.
- Prefer scripts/run_document_parsing.py over manually chaining qianfan_ocr_cli.py and render_doc_markdown.py when the task is standard document parsing.
- Use scripts/run_pdf_document_parsing.py for PDF document parsing when the user wants one markdown for the whole PDF plus per-page markdown files and a shared assets directory. Use --request-mode joint when selected PDF pages are semantically related and should be sent as one multi-image request. Use --request-mode batch --concurrency <N> when pages can be parsed independently and should run concurrently.
- Use scripts/run_document_parsing_with_layout.py for document parsing with layout so the final output includes markdown, _layout.json, and a rendered layout overlay image.
- Pass multiple --image flags only for cross-page or cross-image reasoning that truly depends on joint context.
- For images that can be processed independently, use scripts/qianfan_ocr_cli.py --batch --concurrency <N> or a dedicated runner instead of sending all images as one joint request.
- Use --thinking only for difficult document understanding tasks with ambiguous reading order or dense field relationships.

Read only the relevant reference file when needed, but do read that file before prompt construction:
- references/document-parsing.md
- references/layout-analysis.md
- references/element-recognition.md
- references/document-parsing-with-layout.md
- references/general-ocr.md
- references/key-information-extraction.md
- references/chart-understanding.md
- references/doc-vqa.md

Return a structured result instead of raw model prose:
=== VISUAL ANALYSIS RESULT ===
mode: <mode>
confidence: <high|medium|low>
input_type: <image|pdf>
image_count: <N>
page_count: <N or n/a>
answer:
<direct answer or summary>
evidence:
- <directly observed fact>
warnings:
- <uncertainty or limitation>
markdown:
<for document parsing modes>
layout:
- page: <n>
  category: <label>
  bbox: [x1, y1, x2, y2]
structured_data:
<for key information extraction / chart understanding>
recognized_elements:
<for element recognition>
=== END VISUAL ANALYSIS ===
- Refer to pages as page_1, page_2, and so on.
- Report uncertainty and limitations under warnings.
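A small renderer can help keep the result envelope consistent across calls. The sketch below is a hypothetical helper covering only the fields common to every mode; the mode-specific sections (markdown, layout, structured_data, recognized_elements) are left out for brevity and would be appended the same way.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_CONFIDENCE = ("high", "medium", "low")


@dataclass
class VisualResult:
    mode: str
    confidence: str
    input_type: str
    image_count: int
    answer: str
    page_count: str = "n/a"
    evidence: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)


def render(result: VisualResult) -> str:
    """Render the common fields of the structured result as plain text."""
    if result.confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(f"confidence must be one of {ALLOWED_CONFIDENCE}")
    lines = [
        "=== VISUAL ANALYSIS RESULT ===",
        f"mode: {result.mode}",
        f"confidence: {result.confidence}",
        f"input_type: {result.input_type}",
        f"image_count: {result.image_count}",
        f"page_count: {result.page_count}",
        "answer:",
        result.answer,
        "evidence:",
    ]
    lines += [f"- {item}" for item in result.evidence]
    lines.append("warnings:")
    lines += [f"- {item}" for item in result.warnings]
    lines.append("=== END VISUAL ANALYSIS ===")
    return "\n".join(lines)
```

Rejecting an unexpected confidence value up front keeps the envelope parseable for the upstream agent.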