eliferjunior/data-extractor

Name: data-extractor
Author: eliferjunior

.claude/skills/ts-data-extractor/SKILL.md

npx skillsauth add eliferjunior/Claude data-extractor

Data Extractor

Overview

Extracts structured data from documents in common formats such as PDF, DOCX, HTML, TXT, and images. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats, and handles invoices, forms, reports, and free-text documents.

Instructions

When a user asks you to extract data from a document, follow this process:

Step 1: Identify the document format and install dependencies

# Determine file type
file document.pdf

# Install dependencies based on format (pandas is used for tables in Step 3)
pip install pdfplumber python-docx beautifulsoup4 lxml openpyxl pandas

Library selection by format:

PDF: pdfplumber (text + tables), PyMuPDF (fitz) for complex layouts
DOCX: python-docx
HTML: beautifulsoup4 with lxml
Excel: openpyxl or pandas
Images: pytesseract (OCR) with Pillow
JSON/XML: Python standard library
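
The library list above can be expressed as a small lookup keyed on file extension. This is a minimal sketch, assuming the extension is a reliable hint; `LIBRARY_BY_EXTENSION` and `pick_library` are hypothetical names for illustration and are not part of the skill.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the library list above.
LIBRARY_BY_EXTENSION = {
    ".pdf": "pdfplumber",
    ".docx": "python-docx",
    ".html": "beautifulsoup4",
    ".htm": "beautifulsoup4",
    ".xlsx": "openpyxl",
    ".png": "pytesseract",
    ".jpg": "pytesseract",
    ".jpeg": "pytesseract",
    ".json": "json (standard library)",
    ".xml": "xml.etree.ElementTree (standard library)",
}


def pick_library(path: str) -> str:
    """Return the suggested parsing library for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LIBRARY_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"No extractor configured for {suffix!r}") from None


print(pick_library("Invoice.PDF"))
```

Extensions can be wrong, so the `file` check from this step remains the more trustworthy signal when the two disagree.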

Step 2: Extract raw content

PDF extraction:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""  # guard against pages with no text layer
        print(f"--- Page {i+1} ---")
        print(text)

        # Extract tables if present
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

DOCX extraction:

from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Extract tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

HTML extraction:

from bs4 import BeautifulSoup

with open("document.html", "rb") as f:  # binary mode lets BeautifulSoup detect the encoding
    soup = BeautifulSoup(f, "lxml")

# Extract specific elements
for table in soup.find_all("table"):
    rows = table.find_all("tr")
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all(["td", "th"])]
        print(cells)

Step 3: Parse and structure the data

Once you have raw text, extract the target fields:

Pattern-based extraction:

import re
import json

text = "..."  # extracted text

# Define patterns for common fields
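patterns = {
    # Allow multi-part identifiers such as INV-2025-0042
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+(?:[-/]\w+)*)",
    "date": r"Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    # \b keeps "Total" from matching inside "Subtotal"
    "total": r"\bTotal\s*:?\s*\$?([\d,]+\.?\d*)",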
    "email": r"[\w.-]+@[\w.-]+\.\w+",
}

extracted = {}
for field, pattern in patterns.items():
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        extracted[field] = match.group(1) if match.lastindex else match.group(0)

print(json.dumps(extracted, indent=2))
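
Label patterns are sensitive to anchoring: without a word boundary, `Total` also matches inside `Subtotal`, and `re.search` returns the first hit. The sketch below shows the difference on a hypothetical three-line sample; real invoice text will differ.

```python
import re

# Hypothetical sample text for illustration only.
text = "Subtotal: $1,250.00\nTax: $100.00\nTotal: $1,350.00"

# Unanchored: the first "total" found is the tail of "Subtotal".
loose = re.search(r"Total\s*:?\s*\$?([\d,]+\.?\d*)", text, re.IGNORECASE)

# Anchored with \b: only a standalone "Total" label matches.
strict = re.search(r"\bTotal\s*:?\s*\$?([\d,]+\.?\d*)", text, re.IGNORECASE)

print(loose.group(1))   # 1,250.00
print(strict.group(1))  # 1,350.00
```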

Line-item extraction from tables:

import pandas as pd

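# table_data is a list of rows with the header first, for example one table
# returned by page.extract_tables(). Placeholder rows shown for illustration.
table_data = [
    ["Description", "Qty", "Unit Price", "Amount"],
    ["Widget A", "10", "75.00", "750.00"],
    ["Widget B", "5", "100.00", "500.00"],
]
headers = table_data[0]
rows = table_data[1:]
df = pd.DataFrame(rows, columns=headers)

# Clean up
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
df = df.dropna(how="all")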

Step 4: Validate and clean the output

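from datetime import datetime

# Validate required fields first so the conversions below never hit a missing key
required = ["invoice_number", "date", "total"]
missing = [f for f in required if f not in extracted]
if missing:
    print(f"Warning: missing fields: {missing}")

# Type conversion
if "total" in extracted:
    extracted["total"] = float(extracted["total"].replace(",", ""))

# Date normalization: the date pattern allows "/" or "-" and 2- or 4-digit years.
# Month-first order is assumed, as in the original format string.
if "date" in extracted:
    raw_date = extracted["date"].replace("-", "/")
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            extracted["date"] = datetime.strptime(raw_date, fmt).date().isoformat()
            break
        except ValueError:
            continue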
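
Validation results are easier to report back to the user when they are collected in one place. The following is a sketch of one possible report shape, assuming the record has already been through type conversion; `validation_report` and its `missing`/`suspicious` keys are hypothetical names, not something the skill defines.

```python
from datetime import date

REQUIRED = ("invoice_number", "date", "total")


def validation_report(record: dict) -> dict:
    """Split a record's problems into missing fields and suspicious values."""
    missing = [f for f in REQUIRED if f not in record]
    suspicious = []

    # A total should be a non-negative number after type conversion.
    total = record.get("total")
    if total is not None and (not isinstance(total, (int, float)) or total < 0):
        suspicious.append("total")

    # A normalized date should parse as ISO 8601 (YYYY-MM-DD).
    raw_date = record.get("date")
    if raw_date is not None:
        try:
            date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            suspicious.append("date")

    return {"missing": missing, "suspicious": suspicious}


report = validation_report(
    {"invoice_number": "INV-2025-0042", "date": "03/15/2025", "total": 1350.0}
)
print(report)  # {'missing': [], 'suspicious': ['date']}
```

Here the date is flagged because it was never normalized to ISO format, which is the kind of issue worth surfacing alongside the extracted data.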

Step 5: Output in the desired format

# JSON output
with open("extracted_data.json", "w") as f:
    json.dump(extracted, f, indent=2)

# CSV output
df.to_csv("extracted_items.csv", index=False)

# Pretty print summary
print(f"Extracted {len(extracted)} fields from document")
print(f"Line items: {len(df)} rows")

Examples

Example 1: Extract invoice data from a PDF

User request: "Extract the invoice details from this PDF"

Actions:

Open the PDF with pdfplumber and extract text
Use regex patterns to find invoice number, date, vendor, subtotal, tax, total
Extract the line items table into a DataFrame
Output a JSON file with header fields and a CSV with line items

Output:

{
  "invoice_number": "INV-2025-0042",
  "date": "2025-03-15",
  "vendor": "Acme Corp",
  "subtotal": 1250.00,
  "tax": 100.00,
  "total": 1350.00,
  "line_items": [
    {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
    {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00}
  ]
}
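
The figures in this output can be cross-checked arithmetically: line items should sum to the subtotal, and subtotal plus tax should equal the total. A minimal sketch of such a check follows; `check_totals` and the tolerance value are assumptions for illustration.

```python
import json
import math

invoice = json.loads("""
{
  "invoice_number": "INV-2025-0042",
  "date": "2025-03-15",
  "vendor": "Acme Corp",
  "subtotal": 1250.00,
  "tax": 100.00,
  "total": 1350.00,
  "line_items": [
    {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
    {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00}
  ]
}
""")


def check_totals(inv: dict, tol: float = 0.005) -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for item in inv["line_items"]:
        expected = item["qty"] * item["unit_price"]
        if not math.isclose(expected, item["amount"], abs_tol=tol):
            problems.append(f"{item['description']}: qty x unit_price != amount")
    items_sum = sum(item["amount"] for item in inv["line_items"])
    if not math.isclose(items_sum, inv["subtotal"], abs_tol=tol):
        problems.append("line items do not sum to subtotal")
    if not math.isclose(inv["subtotal"] + inv["tax"], inv["total"], abs_tol=tol):
        problems.append("subtotal + tax != total")
    return problems


print(check_totals(invoice))  # []
```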

Example 2: Extract contacts from a DOCX directory

User request: "Pull all names and email addresses from this company directory document"

Actions:

Parse the DOCX file, iterate through paragraphs and tables
Use regex to find email addresses and associated names
Deduplicate and output as CSV

Output: A CSV file with columns: name, email, department, phone.
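
The email and deduplication steps can be sketched with the standard library alone. This sketch assumes each contact sits on one line with the name before the email address; `extract_contacts` is a hypothetical helper, the sample lines are made up, and the department and phone columns are left out.

```python
import csv
import io
import re

EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")


def extract_contacts(lines):
    """Return name/email dicts, keeping the first occurrence of each email."""
    seen = set()
    contacts = []
    for line in lines:
        match = EMAIL_RE.search(line)
        if not match:
            continue
        email = match.group(0).lower()  # compare case-insensitively
        if email in seen:
            continue
        seen.add(email)
        # Assumption: the name is whatever precedes the email on the same line.
        name = line[: match.start()].strip(" \t-:,<(")
        contacts.append({"name": name, "email": email})
    return contacts


# Hypothetical directory lines, e.g. collected from doc.paragraphs.
lines = [
    "Jane Doe - jane.doe@example.com",
    "John Smith: john.smith@example.com",
    "Jane Doe <Jane.Doe@example.com>",
]
contacts = extract_contacts(lines)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
writer.writeheader()
writer.writerows(contacts)
print(buffer.getvalue())
```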

Example 3: Convert an HTML report to structured data

User request: "Extract the quarterly results table from this HTML page"

Actions:

Parse the HTML with BeautifulSoup
Find the target table by heading or class
Extract headers and rows into a DataFrame
Clean column names and convert numeric values
Export as CSV and provide summary statistics

Output: A clean CSV with quarterly metrics and a summary of key figures.
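
Cleaning column names and converting numeric values can be done with two small helpers. This is a sketch under stated assumptions: `clean_column` and `to_number` are hypothetical names, percentages are returned as plain numbers (12% becomes 12.0), and parenthesized values are treated as negatives in the accounting style.

```python
import re


def clean_column(name: str) -> str:
    """Lowercase a header and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def to_number(value: str):
    """Convert strings like "$1,250.00" or "12%" to float; None if not numeric."""
    cleaned = value.strip().replace(",", "")
    # Accounting-style negatives: "(300)" means -300.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    cleaned = cleaned.lstrip("$").rstrip("%")
    try:
        number = float(cleaned)
    except ValueError:
        return None
    return -number if negative else number


print(clean_column(" Revenue (USD) "))  # revenue_usd
print(to_number("$1,250.00"))           # 1250.0
print(to_number("(300)"))               # -300.0
print(to_number("n/a"))                 # None
```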

Guidelines

Always inspect the raw extracted text before writing parsers. Understanding the layout saves time.
Use pdfplumber for most PDF extraction. Fall back to PyMuPDF for complex multi-column layouts.
For scanned PDFs (image-based), use OCR with pytesseract before parsing.
Validate extracted data types: convert strings to numbers, normalize dates.
Report extraction confidence: note any fields that could not be found or seem incorrect.
Handle multi-page documents by accumulating results across pages.
For batch extraction (many documents of the same type), build a reusable extraction function and apply it across all files.
Always preserve the original document alongside extracted data for verification.
When patterns fail, fall back to positional extraction based on text layout.
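
For the batch guideline, one possible shape is a single-document function applied across a folder. The sketch below assumes the documents have already been converted to `.txt` files; `PATTERNS`, `extract_fields`, `extract_batch`, and the `source_file` key are hypothetical names, and the demo files are throwaway samples.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical patterns; adapt them to the document type being processed.
PATTERNS = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*([\w/-]+)",
    "total": r"\bTotal\s*:?\s*\$?([\d,]+\.?\d*)",
}


def extract_fields(text: str) -> dict:
    """Apply every pattern to one document's text."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            found[field] = match.group(1)
    return found


def extract_batch(folder: Path) -> list:
    """Run extract_fields over every .txt file, recording the source file."""
    results = []
    for path in sorted(folder.glob("*.txt")):
        record = extract_fields(path.read_text(encoding="utf-8"))
        record["source_file"] = path.name  # keeps each record traceable
        results.append(record)
    return results


# Demo with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_text("Invoice #: INV-001\nTotal: $10.00", encoding="utf-8")
    (folder / "b.txt").write_text("Invoice #: INV-002\nSubtotal: $5.00", encoding="utf-8")
    batch = extract_batch(folder)

print(batch)
```

The second record has no `total` key because only a subtotal was present, which makes missing fields easy to spot per file.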

Extract structured data from any document format using unified document processing. Use when a user asks to extract data from a document, parse a PDF, pull structured data from files, convert documents to JSON or CSV, extract fields from invoices or forms, or scrape data from documents.

development

Updated Apr 16, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add eliferjunior/Claude data-extractor

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
| --- | --- | --- | --- |
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 8:59 PM (2.0s, 1 file scanned)

SKILL.md

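name: data-extractor
description: >-
  Extract structured data from any document format using unified document processing. Use when a user asks to extract data from a document, parse a PDF, pull structured data from files, convert documents to JSON or CSV, extract fields from invoices or forms, or scrape data from documents.
license: Apache-2.0
compatibility: Requires Python 3.8+ with appropriate parsing libraries
author: terminal-skills
version: 1.0.0
category: data-ai
tags: ["data-extraction", "document-parsing", "pdf", "structured-data", "etl"]
agents: [claude-code, openai-codex, gemini-cli, cursor]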

Related Skills

eliferjunior/fireworks-ai (development)
Verified, Trusted, Community
Expert guidance for Fireworks AI, the platform for running open-source LLMs (Llama, Mixtral, Qwen, etc.) with enterprise-grade speed and reliability. Helps developers integrate Fireworks' inference API, fine-tune models, and deploy custom model endpoints with function calling and structured output support.
SKILL.md, updated Apr 17, 2026

eliferjunior/firecrawl (development)
Verified, Trusted, Community
Convert any website into clean, structured data with Firecrawl, an API-first web scraping service. Use when someone asks to "turn a website into markdown", "scrape website for LLM", "Firecrawl", "extract website content as clean text", "crawl and convert to structured data", or "scrape website for RAG". Covers single-page scraping, full-site crawling, structured extraction, and LLM-ready output.
SKILL.md, updated Apr 16, 2026

eliferjunior/firebase (tools)
Verified, Trusted, Community
Expert guidance for Firebase, Google's platform for building and scaling web and mobile applications. Helps developers set up authentication, Firestore/Realtime Database, Cloud Functions, hosting, storage, and analytics using Firebase's SDK and CLI.
SKILL.md, updated Apr 16, 2026

eliferjunior/file-upload-processor (development)
Verified, Trusted, Community
Use when the user needs to build file upload functionality for a web application, or mentions "file upload," "image upload," "upload endpoint," "multipart upload," "presigned URL," "S3 upload," "file validation," "upload to cloud storage," or "accept user files." Handles upload endpoints, file validation (type, size, magic bytes), cloud storage integration, and upload status tracking. For image/video processing after upload, see media-transcoder.
SKILL.md, updated Apr 16, 2026

Install manually

# Clone the repo
git clone https://github.com/eliferjunior/Claude.git

# Copy into Claude Code skills folder (global)
cp -r Claude/.claude/skills/ts-data-extractor ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: eliferjunior/Claude

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT