eric861129/polaris-datainsight-doc-extract

Name: polaris-datainsight-doc-extract
Author: eric861129

public/SKILLS/Document Skills/polaris-datainsight-doc-extract/SKILL.md

npx skillsauth add eric861129/skills_all-in-one polaris-datainsight-doc-extract

Polaris AI DataInsight — Doc Extract Skill

Use the Polaris AI DataInsight Doc Extract API to extract text, images, tables, charts, shapes, equations, and more from Word, PowerPoint, Excel, HWP, and HWPX files, returning everything as a structured unifiedSchema JSON. A single API call gives you the full document structure without any manual parsing.

When to Use This Skill

The user wants to extract text, tables, charts, or images from DOCX, PPTX, XLSX, HWP, or HWPX files
The user needs to understand a document's structure (page count, element types, position data, etc.)
The extracted data will be used in a RAG pipeline, data analysis workflow, or automation task
Table data needs to be converted to CSV, or chart data needs to be broken down into series and labels
The user needs to parse special elements like headers, footers, equations, or shapes

What This Skill Does

Authentication — Authenticates with the Polaris DataInsight API via the x-po-di-apikey header.
Upload and extract — Sends the file as a multipart/form-data POST request and extracts the full document structure.
Parse ZIP response — The API returns a ZIP file; extract it and load the unifiedSchema JSON inside.
Deliver structured data — Returns a JSON organized by page and element type (text, table, chart, image, shape, equation, etc.).
Support multiple usage patterns — Handles full text extraction, table-to-CSV conversion, RAG chunk generation, and more.

How to Use

Prerequisites

Get an API Key: Sign up at https://datainsight.polarisoffice.com and generate your API key.

Authentication: Include the API key as a header on every request.

Header: x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY

Set the environment variable:

export POLARIS_DATAINSIGHT_API_KEY="your-api-key-here"
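A minimal sketch of reading that variable and building the request header in Python. The helper name `build_auth_headers` is hypothetical and not part of the API; only the header name and the environment variable come from the text above.

```python
import os


def build_auth_headers(env_var: str = "POLARIS_DATAINSIGHT_API_KEY") -> dict:
    """Return the request headers, failing early if the key is not set.

    Hypothetical helper: only the header name and variable name are documented.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} before calling the API")
    return {"x-po-di-apikey": api_key}
```

Failing before the upload starts gives a clearer message than an authentication error returned by the server.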

Limits

| Item | Limit |
|------|-------|
| Supported formats | HWP, HWPX, DOCX, PPTX, XLSX |
| Max file size | 25 MB |
| Timeout | 10 minutes |
| Rate limit | 10 requests per minute |
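These limits can be checked locally before uploading. The sketch below is one way to do that; `check_upload` is a hypothetical helper, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which the source does not specify.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".hwp", ".hwpx", ".docx", ".pptx", ".xlsx"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" is read as 25 MiB


def check_upload(file_path: str) -> list:
    """Return a list of problems; an empty list means the file looks uploadable."""
    path = Path(file_path)
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {path.suffix or '(none)'}")
    if not path.is_file():
        problems.append("file not found")
    elif path.stat().st_size > MAX_FILE_BYTES:
        problems.append(f"file is {path.stat().st_size} bytes, over the 25 MB limit")
    return problems
```

The check only looks at the file extension, so a renamed file of another type would still pass it and be rejected by the service instead.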

Basic Usage

Endpoint:

POST https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract

Extract a document with Python:

import io
import json
import os
import zipfile

import requests

API_URL = "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract"

def extract_document(file_path: str, api_key: str) -> dict:
    """Upload a document and return the parsed unifiedSchema JSON."""
    with open(file_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"x-po-di-apikey": api_key},
            files={"file": f},
            timeout=600,  # seconds; matches the 10-minute limit listed above
        )

    if response.status_code != 200:
        raise RuntimeError(f"API error: {response.status_code} - {response.text}")

    # The response body is a ZIP archive; the schema is the JSON file inside it
    with zipfile.ZipFile(io.BytesIO(response.content)) as z:
        json_files = [name for name in z.namelist() if name.endswith(".json")]
        if not json_files:
            raise RuntimeError("No JSON found in ZIP")
        with z.open(json_files[0]) as jf:
            return json.load(jf)

# Example usage
api_key = os.environ["POLARIS_DATAINSIGHT_API_KEY"]
schema = extract_document("report.docx", api_key)
print(f"Extracted {schema['totalPages']} pages")

Extract with curl:

curl -X POST "https://datainsight-api.polarisoffice.com/api/v1/datainsight/doc-extract" \
  -H "x-po-di-apikey: $POLARIS_DATAINSIGHT_API_KEY" \
  -F "file=@report.docx" \
  --output result.zip

unzip result.zip -d result/
cat result/*.json | python -m json.tool
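If the archive was saved with curl, the schema can also be loaded in Python without unpacking it to disk. This is a sketch only: `load_schema_from_zip` is a hypothetical helper, and picking the first `.json` entry in sorted order assumes the archive holds a single schema file.

```python
import io
import json
import zipfile


def load_schema_from_zip(zip_bytes: bytes) -> dict:
    """Parse the first .json entry of a ZIP archive held in memory."""
    buffer = io.BytesIO(zip_bytes)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("payload is not a ZIP archive")
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        # Sorted so the choice is deterministic if several JSON files exist
        json_names = sorted(n for n in archive.namelist() if n.lower().endswith(".json"))
        if not json_names:
            raise ValueError("no JSON entry found in ZIP")
        with archive.open(json_names[0]) as fh:
            return json.load(fh)
```

For a file on disk, pass its bytes, for example `load_schema_from_zip(open("result.zip", "rb").read())`.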

Advanced Usage

Response Structure (unifiedSchema)

Root:

{
  "docName": "sample.docx",
  "totalPages": 3,
  "pages": [ ... ]
}

Page (pages[]):

{
  "pageNum": 1,
  "pageWidth": 595.3,
  "pageHeight": 842.0,
  "extractionSummary": {
    "text": 5, "image": 2, "table": 1, "chart": 1
  },
  "elements": [ ... ]
}
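As a sanity check, the element list on a page can be tallied and compared with `extractionSummary`. The sketch assumes the summary keys are the same strings as the `type` values of the elements, which the sample suggests but the source does not state; both function names are hypothetical.

```python
from collections import Counter


def count_element_types(page: dict) -> dict:
    """Count the elements on one page by their "type" field."""
    return dict(Counter(el.get("type", "unknown") for el in page.get("elements", [])))


def summary_mismatches(page: dict) -> dict:
    """Return {type: (reported, counted)} for types where the two counts differ."""
    counted = count_element_types(page)
    reported = page.get("extractionSummary", {})
    return {
        t: (reported.get(t, 0), counted.get(t, 0))
        for t in sorted(set(reported) | set(counted))
        if reported.get(t, 0) != counted.get(t, 0)
    }
```

An empty result from `summary_mismatches` means the summary and the element list agree for that page.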

Element types (elements[].type):

| type | Description |
|------|-------------|
| text | Text block |
| image | Image |
| table | Table |
| chart | Chart |
| shape | Shape |
| equation | Equation |
| header / footer | Header / Footer |

Common element structure:

{
  "type": "text",
  "id": "te1",
  "boundaryBox": { "left": 40, "top": 80, "right": 300, "bottom": 120 },
  "content": { "text": "Body content here" }
}
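The `boundaryBox` values lend themselves to simple layout calculations. The sketch below assumes the origin is the top-left corner of the page and that `top` grows downward; the source does not confirm the coordinate system, and both helper names are hypothetical.

```python
def box_size(element: dict) -> tuple:
    """Return (width, height) computed from the element's boundaryBox."""
    box = element["boundaryBox"]
    return box["right"] - box["left"], box["bottom"] - box["top"]


def reading_order(elements: list) -> list:
    """Sort elements top-to-bottom, then left-to-right (assumes a top-left origin)."""
    return sorted(
        elements,
        key=lambda el: (el["boundaryBox"]["top"], el["boundaryBox"]["left"]),
    )
```

For the sample element above, `box_size` gives a width of 260 and a height of 40.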

Table content:

{
  "content": {
    "html": "<table>...</table>",
    "csv": "Header1,Header2\nValue1,Value2",
    "json": [
      {
        "metrics": { "rowaddr": 0, "coladdr": 0, "rowspan": 1, "colspan": 1 },
        "para": [{ "content": [{ "text": "Cell content" }] }]
      }
    ]
  }
}
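When the CSV form is not enough, the `json` cell list can be turned into a row-by-column grid. This sketch assumes `rowaddr` and `coladdr` are zero-based indexes of each cell's top-left position, as the sample suggests; merged cells are written once and the positions they span stay empty. `cell_text` and `table_to_grid` are hypothetical names.

```python
def cell_text(cell: dict) -> str:
    """Join the text runs of each paragraph in a cell, one paragraph per line."""
    paragraphs = []
    for para in cell.get("para", []):
        paragraphs.append("".join(run.get("text", "") for run in para.get("content", [])))
    return "\n".join(paragraphs)


def table_to_grid(content: dict) -> list:
    """Build a list of rows from content["json"]; spanned positions are left as ""."""
    cells = content.get("json", [])
    if not cells:
        return []
    n_rows = max(c["metrics"]["rowaddr"] + c["metrics"].get("rowspan", 1) for c in cells)
    n_cols = max(c["metrics"]["coladdr"] + c["metrics"].get("colspan", 1) for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        m = c["metrics"]
        grid[m["rowaddr"]][m["coladdr"]] = cell_text(c)
    return grid
```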

Chart content:

{
  "content": {
    "chart_type": "column",
    "title": "Annual Sales Comparison",
    "x_axis_labels": ["Q1", "Q2", "Q3", "Q4"],
    "series_names": ["2023", "2024"],
    "series_values": [[100, 200, 150, 300], [120, 220, 180, 320]],
    "csv": "Quarter,2023,2024\nQ1,100,120\nQ2,200,220"
  }
}
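The series arrays can be reshaped into one row per x-axis label. The sketch assumes `series_values[k]` belongs to `series_names[k]` and that each inner list lines up with `x_axis_labels`, which matches the sample values and its `csv` field; `chart_to_rows` is a hypothetical helper.

```python
def chart_to_rows(content: dict) -> list:
    """Return a header row followed by one row per x-axis label."""
    labels = content.get("x_axis_labels", [])
    names = content.get("series_names", [])
    values = content.get("series_values", [])
    rows = [["label", *names]]
    for i, label in enumerate(labels):
        # None marks a series that is shorter than the label list
        rows.append([label, *(series[i] if i < len(series) else None for series in values)])
    return rows
```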

Usage Patterns

Extract all text:

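def get_all_text(schema: dict) -> str:
    """Join the text of every text element, separated by newlines."""
    texts = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            text = el.get("content", {}).get("text")
            # el.get("type") avoids a KeyError on elements without a type field
            if el.get("type") == "text" and text:
                texts.append(text)
    return "\n".join(texts)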

Extract tables as CSV:

def get_tables_as_csv(schema: dict) -> list:
    """Return the CSV string of every table element that has one."""
    tables = []
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            if el.get("type") == "table":
                csv_data = el.get("content", {}).get("csv", "")
                if csv_data:
                    tables.append(csv_data)
    return tables

Generate RAG chunks:

def make_rag_chunks(schema: dict) -> list:
    """Build one chunk per element that has text or CSV content."""
    chunks = []
    doc_name = schema.get("docName", "")
    for page in schema.get("pages", []):
        for el in page.get("elements", []):
            content = el.get("content", {})
            # Prefer plain text; fall back to CSV for tables and charts
            text = (content.get("text") or content.get("csv") or "").strip()
            if text:
                chunks.append({
                    "source": doc_name,
                    "page": page.get("pageNum"),
                    "type": el.get("type"),
                    "text": text
                })
    return chunks

Example

User: "Extract all table data from this DOCX report as CSV."

Output:

import os
schema = extract_document("report.docx", os.environ["POLARIS_DATAINSIGHT_API_KEY"])
tables = get_tables_as_csv(schema)
for i, csv_data in enumerate(tables):
    print(f"=== Table {i+1} ===")
    print(csv_data)

=== Table 1 ===
Quarter,Revenue,Cost
Q1,1200,800
Q2,1500,900

=== Table 2 ===
Item,Amount
Labor,500
Operations,300

Inspired by: Polaris Office DataInsight API documentation and workflow.

Tips

The response is always a ZIP file. Do not try to parse response.content directly as JSON — use zipfile.ZipFile to extract it first.
content.csv is available for both table and chart elements, making it the most convenient format for data extraction.
The rate limit is 10 requests per minute. When processing multiple files, add a delay (e.g., time.sleep(6)) between calls.
Use boundaryBox to determine where each element sits on the page — useful for layout analysis.
Always store the API key in an environment variable (POLARIS_DATAINSIGHT_API_KEY) and never hardcode it.
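The rate-limit tip can be applied with a small batch loop. This is a sketch under stated assumptions: `process_batch` is a hypothetical helper, the 6-second spacing is derived from the documented 10 requests per minute, and the `sleep` and `clock` parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable, Iterable


def process_batch(
    paths: Iterable[str],
    extract: Callable[[str], dict],
    min_interval: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call extract(path) for each path, starting calls at least min_interval seconds apart."""
    results = {}
    last_start = None
    for path in paths:
        if last_start is not None:
            # Wait only for the part of the interval the previous call did not use up
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results[path] = extract(path)
    return results
```

With the function from Basic Usage, a call would look like `process_batch(files, lambda p: extract_document(p, api_key))`. Measuring from the start of each call means a slow extraction does not add a further delay on top of its own duration.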

Common Use Cases

Document search systems: Extract full text and store it in a vector database for semantic search
Automated report analysis: Collect table and chart data from PPTX/DOCX reports for analysis
HWP digitization: Convert HWP/HWPX documents into structured, machine-readable data
RAG pipeline setup: Split documents into chunks for use in LLM-based Q&A systems
Data migration: Move table and chart data from legacy Office documents into a database

License & Terms

Skill Definition: This SKILL.md file is provided under the Apache 2.0 license.
Service Access: Usage of the DataInsight API requires a valid subscription or license key.
Restrictions: Unauthorized redistribution of the API endpoints or bypassing authentication is strictly prohibited.
Support: For licensing inquiries, visit https://datainsight.polarisoffice.com.

Extract structured data from Office documents (DOCX, PPTX, XLSX, HWP, HWPX) using the Polaris AI DataInsight Doc Extract API. Use when the user wants to parse, analyze, or extract text, tables, charts, images, or shapes from document files. Invoke this skill whenever the user mentions extracting content from Word, PowerPoint, Excel, HWP, or HWPX files, wants to parse document structure, needs to convert document data for RAG pipelines, or asks about reading tables, charts, or text from Office-format documents — even if they don't explicitly mention "DataInsight" or "Polaris".

38 stars

development

Updated Apr 4, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add eric861129/skills_all-in-one polaris-datainsight-doc-extract

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 4, 2026, 8:15 PM (59.6s, 1 file scanned)

SKILL.md

name: polaris-datainsight-doc-extract
description: Extract structured data from Office documents (DOCX, PPTX, XLSX, HWP, HWPX) using the Polaris AI DataInsight Doc Extract API. Use when the user wants to parse, analyze, or extract text, tables, charts, images, or shapes from document files. Invoke this skill whenever the user mentions extracting content from Word, PowerPoint, Excel, HWP, or HWPX files, wants to parse document structure, needs to convert document data for RAG pipelines, or asks about reading tables, charts, or text from Office-format documents — even if they don't explicitly mention "DataInsight" or "Polaris".
license: Apache-2.0 (for skill definition); Service usage subject to Polaris AI DataInsight Terms of Service.

Install manually

# Clone the repo
git clone https://github.com/eric861129/skills_all-in-one.git

# Copy into Claude Code skills folder (global)
cp -r "skills_all-in-one/public/SKILLS/Document Skills/polaris-datainsight-doc-extract" ~/.claude/skills/

Repository

eric861129/skills_all-in-one

38 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT