public/SKILLS/Document Skills/polaris-datainsight-doc-extract/SKILL.md
Extract structured data from Office documents (DOCX, PPTX, XLSX, HWP, HWPX) using the Polaris AI DataInsight Doc Extract API. Use when the user wants to parse, analyze, or extract text, tables, charts, images, or shapes from document files. Invoke this skill whenever the user mentions extracting content from Word, PowerPoint, Excel, HWP, or HWPX files, wants to parse document structure, needs to convert document data for RAG pipelines, or asks about reading tables, charts, or text from Office-format documents — even if they don't explicitly mention "DataInsight" or "Polaris".
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add eric861129/skills_all-in-one polaris-datainsight-doc-extract
Use the Polaris AI DataInsight Doc Extract API to extract text, images, tables, charts, shapes, equations, and more from Word, PowerPoint, Excel, HWP, and HWPX files, returning everything as a structured unifiedSchema JSON. A single API call gives you the full document structure without any manual parsing.
- Send the document to the API with your key in the x-po-di-apikey header.
- The response is a ZIP file with the unifiedSchema JSON inside.
Get an API Key: Sign up at https://datainsight.polarisoffice.com and generate your API key.
Authentication: Include the API key as a header on every request.
Header: x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY
Set the environment variable:
export POLARIS_DATAINSIGHT_API_KEY="your-api-key-here"
| Item | Limit |
|------|-------|
| Supported formats | HWP, HWPX, DOCX, PPTX, XLSX |
| Max file size | 25 MB |
| Timeout | 10 minutes |
| Rate limit | 10 requests per minute |
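The format and size limits above can be checked locally before a request is spent on a file the API would reject. The following is a minimal sketch, not part of the API itself: the helper name check_upload is hypothetical, and it assumes 1 MB means 1024 * 1024 bytes and that the format is identified by file extension.

```python
import os

# Limits copied from the table above.
SUPPORTED_EXTENSIONS = {".hwp", ".hwpx", ".docx", ".pptx", ".xlsx"}
# Assumption: 25 MB is interpreted as 25 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024


def check_upload(file_path: str) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if not os.path.isfile(file_path):
        problems.append("file not found")
    elif os.path.getsize(file_path) > MAX_FILE_SIZE_BYTES:
        problems.append("file exceeds 25 MB")
    return problems
```

Calling check_upload("report.docx") before uploading returns an empty list when the file exists, has a supported extension, and is within the size limit.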
Endpoint:
POST https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract
Extract a document with Python:
import io
import json
import os
import zipfile

import requests


def extract_document(file_path: str, api_key: str) -> dict:
    """Upload a document and return the parsed unifiedSchema JSON."""
    with open(file_path, "rb") as f:
        response = requests.post(
            "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract",
            headers={"x-po-di-apikey": api_key},
            files={"file": f},
            timeout=600,  # matches the documented 10-minute limit
        )

    if response.status_code != 200:
        raise RuntimeError(f"API error: {response.status_code} - {response.text}")

    # The response body is a ZIP file; read the first JSON file inside it.
    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        json_files = [name for name in z.namelist() if name.endswith(".json")]
        if not json_files:
            raise RuntimeError("No JSON found in ZIP")
        with z.open(json_files[0]) as jf:
            return json.load(jf)


# Example usage
api_key = os.environ["POLARIS_DATAINSIGHT_API_KEY"]
schema = extract_document("report.docx", api_key)
print(f"Extracted {schema['totalPages']} pages")
Extract with curl:
curl -X POST "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract" \
  -H "x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY" \
  -F "file=@report.docx" \
  --output result.zip
unzip result.zip -d result/
cat result/*.json | python -m json.tool
Root:
{
  "docName": "sample.docx",
  "totalPages": 3,
  "pages": [ ... ]
}
Page (pages[]):
{
  "pageNum": 1,
  "pageWidth": 595.3,
  "pageHeight": 842.0,
  "extractionSummary": {
    "text": 5, "image": 2, "table": 1, "chart": 1
  },
  "elements": [ ... ]
}
Element types (elements[].type):
| type | Description |
|------|-------------|
| text | Text block |
| image | Image |
| table | Table |
| chart | Chart |
| shape | Shape |
| equation | Equation |
| header / footer | Header / Footer |
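As a rough illustration of working with the type field, the sketch below groups every element of a parsed schema by its type. The function name group_elements_by_type, the sample data, and the ids te2 and tb1 are hypothetical; only the type values come from the table above.

```python
from collections import defaultdict


def group_elements_by_type(schema: dict) -> dict:
    """Map each element type to the list of elements of that type, across all pages."""
    grouped = defaultdict(list)
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            # Elements without a type field are collected under "unknown".
            grouped[el.get("type", "unknown")].append(el)
    return dict(grouped)


# Hypothetical sample data shaped like the schema described in this document.
sample = {
    "pages": [
        {
            "pageNum": 1,
            "elements": [
                {"type": "text", "id": "te1", "content": {"text": "Intro"}},
                {"type": "table", "id": "tb1", "content": {"csv": "A,B\n1,2"}},
                {"type": "text", "id": "te2", "content": {"text": "Outro"}},
            ],
        }
    ]
}

grouped = group_elements_by_type(sample)
print({t: len(els) for t, els in grouped.items()})  # {'text': 2, 'table': 1}
```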
Common element structure:
{
  "type": "text",
  "id": "te1",
  "boundaryBox": { "left": 40, "top": 80, "right": 300, "bottom": 120 },
  "content": { "text": "Body content here" }
}
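The boundaryBox fields lend themselves to simple layout calculations. The sketch below derives an element's width and height and sorts elements into an approximate reading order. It assumes that top increases downward and left increases rightward, which the source does not state explicitly; the function names box_size and reading_order are hypothetical.

```python
def box_size(el: dict) -> tuple:
    """Return (width, height) of an element from its boundaryBox."""
    box = el.get("boundaryBox", {})
    width = box.get("right", 0) - box.get("left", 0)
    height = box.get("bottom", 0) - box.get("top", 0)
    return width, height


def reading_order(elements: list) -> list:
    """Sort top-to-bottom, then left-to-right.

    Assumption: the origin is the top-left corner of the page.
    """
    return sorted(
        elements,
        key=lambda el: (
            el.get("boundaryBox", {}).get("top", 0),
            el.get("boundaryBox", {}).get("left", 0),
        ),
    )


# The element shown above: 300 - 40 wide, 120 - 80 tall.
example = {
    "type": "text",
    "id": "te1",
    "boundaryBox": {"left": 40, "top": 80, "right": 300, "bottom": 120},
}
print(box_size(example))  # (260, 40)
```

A plain top-then-left sort is only an approximation for multi-column pages, where grouping by column first would be needed.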
Table content:
{
  "content": {
    "html": "<table>...</table>",
    "csv": "Header1,Header2\nValue1,Value2",
    "json": [
      {
        "metrics": { "rowaddr": 0, "coladdr": 0, "rowspan": 1, "colspan": 1 },
        "para": [{ "content": [{ "text": "Cell content" }] }]
      }
    ]
  }
}
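When merged cells matter, the json cell list carries more information than the csv string. The sketch below rebuilds a two-dimensional grid from that list. It assumes rowaddr and coladdr are zero-based and that a merged cell appears once, at its top-left position; neither is confirmed by the source. The names cell_text and table_to_grid and the sample cells are hypothetical.

```python
def cell_text(cell: dict) -> str:
    """Join all text runs of a cell; paragraphs are separated by newlines."""
    paragraphs = []
    for para in cell.get("para", []):
        paragraphs.append("".join(run.get("text", "") for run in para.get("content", [])))
    return "\n".join(paragraphs)


def table_to_grid(cells: list) -> list:
    """Build a rows-by-columns grid of strings from the table's json cell list."""
    if not cells:
        return []
    n_rows = max(c["metrics"]["rowaddr"] + c["metrics"].get("rowspan", 1) for c in cells)
    n_cols = max(c["metrics"]["coladdr"] + c["metrics"].get("colspan", 1) for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        m = c["metrics"]
        # Text goes at the cell's top-left position; spanned positions stay empty.
        grid[m["rowaddr"]][m["coladdr"]] = cell_text(c)
    return grid


# Hypothetical table: a title cell spanning two columns above two plain cells.
cells = [
    {"metrics": {"rowaddr": 0, "coladdr": 0, "rowspan": 1, "colspan": 2},
     "para": [{"content": [{"text": "Title"}]}]},
    {"metrics": {"rowaddr": 1, "coladdr": 0, "rowspan": 1, "colspan": 1},
     "para": [{"content": [{"text": "A"}]}]},
    {"metrics": {"rowaddr": 1, "coladdr": 1, "rowspan": 1, "colspan": 1},
     "para": [{"content": [{"text": "B"}]}]},
]
print(table_to_grid(cells))  # [['Title', ''], ['A', 'B']]
```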
Chart content:
{
  "content": {
    "chart_type": "column",
    "title": "Annual Sales Comparison",
    "x_axis_labels": ["Q1", "Q2", "Q3", "Q4"],
    "series_names": ["2023", "2024"],
    "series_values": [[100, 200, 150, 300], [120, 220, 180, 320]],
    "csv": "Quarter,2023,2024\nQ1,100,120\nQ2,200,220"
  }
}
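The parallel arrays in a chart's content can be combined into one record per x-axis label. This is a sketch under the assumption, suggested by the csv field in the example above, that series_values[k] belongs to series_names[k] and that each series is aligned with x_axis_labels. The function name chart_to_rows is hypothetical.

```python
def chart_to_rows(content: dict) -> list:
    """Return one dict per x-axis label, with one key per series name."""
    labels = content.get("x_axis_labels", [])
    names = content.get("series_names", [])
    values = content.get("series_values", [])
    rows = []
    for i, label in enumerate(labels):
        row = {"label": label}
        for name, series in zip(names, values):
            # A series shorter than the label list yields None for the missing points.
            row[name] = series[i] if i < len(series) else None
        rows.append(row)
    return rows


# The chart content from the example above.
chart = {
    "chart_type": "column",
    "title": "Annual Sales Comparison",
    "x_axis_labels": ["Q1", "Q2", "Q3", "Q4"],
    "series_names": ["2023", "2024"],
    "series_values": [[100, 200, 150, 300], [120, 220, 180, 320]],
}
rows = chart_to_rows(chart)
print(rows[0])  # {'label': 'Q1', '2023': 100, '2024': 120}
```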
Extract all text:
def get_all_text(schema: dict) -> str:
    """Concatenate the text of every text element, one block per line."""
    texts = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el.get("type") == "text" and el.get("content", {}).get("text"):
                texts.append(el["content"]["text"])
    return "\n".join(texts)
Extract tables as CSV:
def get_tables_as_csv(schema: dict) -> list:
    """Return the CSV string of every table element, in document order."""
    tables = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el.get("type") == "table":
                csv_data = el.get("content", {}).get("csv", "")
                if csv_data:
                    tables.append(csv_data)
    return tables
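Each CSV string returned above can be turned into row records with the standard library. The sketch below assumes the first CSV row is a header row, which may not hold for every table. The function name csv_to_records is hypothetical.

```python
import csv
import io


def csv_to_records(csv_data: str) -> list:
    """Parse one table's CSV string into a list of dicts keyed by the header row."""
    # Assumption: the first row of the CSV holds the column names.
    reader = csv.DictReader(io.StringIO(csv_data, newline=""))
    return [dict(row) for row in reader]


records = csv_to_records("Quarter,Revenue,Cost\nQ1,1200,800\nQ2,1500,900")
print(records[0])  # {'Quarter': 'Q1', 'Revenue': '1200', 'Cost': '800'}
```

All values come back as strings; convert numeric columns explicitly where needed.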
Generate RAG chunks:
def make_rag_chunks(schema: dict) -> list:
    """Build one chunk per element that has text or CSV content."""
    chunks = []
    doc_name = schema.get("docName", "")
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            content = el.get("content", {})
            # Prefer plain text; fall back to CSV for tables and charts.
            text = content.get("text") or content.get("csv") or ""
            if text.strip():
                chunks.append({
                    "source": doc_name,
                    "page": page.get("pageNum"),
                    "type": el.get("type"),
                    "text": text.strip()
                })
    return chunks
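Chunks built this way can be arbitrarily long, for example when a single text element holds a whole page. One possible follow-up step, sketched below, splits oversized chunks into fixed-size pieces while keeping the source metadata. The function name split_long_chunks, the part field, and the 1000-character default are all hypothetical choices rather than anything the API prescribes.

```python
def split_long_chunks(chunks: list, max_chars: int = 1000) -> list:
    """Split each chunk's text into pieces of at most max_chars characters."""
    out = []
    for chunk in chunks:
        text = chunk["text"]
        for start in range(0, len(text), max_chars):
            piece = dict(chunk)  # copy so the metadata is kept on every piece
            piece["text"] = text[start:start + max_chars]
            piece["part"] = start // max_chars  # 0-based index within the original chunk
            out.append(piece)
    return out


# Hypothetical chunk shaped like the output of make_rag_chunks.
demo = [{"source": "sample.docx", "page": 1, "type": "text", "text": "abcdefghij"}]
pieces = split_long_chunks(demo, max_chars=4)
print([p["text"] for p in pieces])  # ['abcd', 'efgh', 'ij']
```

Splitting on character count ignores sentence boundaries; a real pipeline would likely split on sentences or tokens instead.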
User: "Extract all table data from this DOCX report as CSV."
Output:
import os
schema = extract_document("report.docx", os.environ["POLARIS_DATAINSIGHT_API_KEY"])
tables = get_tables_as_csv(schema)
for i, csv_data in enumerate(tables):
    print(f"=== Table {i+1} ===")
    print(csv_data)
=== Table 1 ===
Quarter,Revenue,Cost
Q1,1200,800
Q2,1500,900
=== Table 2 ===
Item,Amount
Labor,500
Operations,300
Inspired by: Polaris Office DataInsight API documentation and workflow.
- Do not parse response.content directly as JSON; the response is a ZIP file, so use zipfile.ZipFile to extract it first.
- content.csv is available for both table and chart elements, making it the most convenient format for data extraction.
- The rate limit is 10 requests per minute; when processing several files, add a delay (for example time.sleep(6)) between calls.
- Use boundaryBox to determine where each element sits on the page, which is useful for layout analysis.
- Store the API key in an environment variable (POLARIS_DATAINSIGHT_API_KEY) and never hardcode it.
- This SKILL.md file is provided under the Apache 2.0 license.
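The pause between calls mentioned above can be sketched as a small batch helper that spaces requests at least 6 seconds apart (60 seconds divided by the 10-requests-per-minute limit). The function name extract_many and its parameters are hypothetical; the sleep and clock arguments exist only so the pacing can be tested without waiting.

```python
import time

REQUESTS_PER_MINUTE = 10  # documented rate limit
MIN_INTERVAL_SECONDS = 60 / REQUESTS_PER_MINUTE  # 6 seconds between calls


def extract_many(paths, extract_one, min_interval=MIN_INTERVAL_SECONDS,
                 sleep=time.sleep, clock=time.monotonic):
    """Call extract_one(path) for each path, pausing to respect the rate limit."""
    results = {}
    last_call = None
    for path in paths:
        if last_call is not None:
            # Wait only for whatever part of the interval has not yet elapsed.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[path] = extract_one(path)
    return results
```

With the extract_document function from this document, a call might look like extract_many(paths, lambda p: extract_document(p, api_key)).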