Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

aiskillstore/pdf-page-extract

Name: pdf-page-extract
Author: aiskillstore

skills/abejitsu/pdf-page-extract/SKILL.md

npx skillsauth add aiskillstore/marketplace pdf-page-extract


PDF Page Extract Skill

Purpose

This skill extracts all necessary data from PDF pages to enable accurate AI-driven HTML generation. It produces three critical artifacts:

Rich extraction data - Text spans with font metadata (sizes, styles, positions)
Rendered PNG image - Visual reference for AI to understand page layout
Page mapping - Authoritative mapping of PDF indices to book pages

This is the deterministic, Python-based foundation for the entire pipeline. All extracted data is saved to persistent files for traceability and future processing.

What to Do

1. Validate input parameters
   - Check PDF file exists and is readable
   - Verify page range (PDF indices or book pages)
   - Confirm output directory structure
2. Establish page mapping (if not already done)
   - Run: python3 Calypso/tools/read_page_footers.py
   - Scans page footers to establish PDF index → book page mapping
   - Saves to: analysis/page_mapping.json
3. Extract rich page data using PyMuPDF and pdfplumber
   - Run: python3 Calypso/tools/rich_extractor.py
   - Extracts text spans with font metadata:
     - Font name and size
     - Bold/italic flags
     - Position (bounding box)
     - Color information
   - Analyzes page structure to identify:
     - Likely headings (by size and style)
     - Paragraphs (regular text)
     - Potential lists
   - Detects tables using pdfplumber
   - Saves to: analysis/chapter_XX/rich_extraction.json
4. Render PDF page to PNG
   - Convert page to high-resolution PNG image (300+ DPI)
   - Maintains visual fidelity for AI reference
   - Saves to: output/chapter_XX/page_artifacts/page_YY/02_page_XX.png
5. Extract embedded images (if present)
   - Run: python3 Calypso/tools/extract_images.py
   - Extracts all images from page
   - Saves: output/chapter_XX/images/page_YY_image_*.png
   - Creates metadata: page_YY_images.json
6. Validate extraction completeness
   - Verify all files saved correctly
   - Check JSON files are valid
   - Confirm PNG image is readable
   - Validate page mapping consistency

Input Parameters

chapter: <int>           - Chapter number (1-8)
start_page: <int>        - Starting PDF index (0-based) or page range
end_page: <int>          - Ending PDF index (optional if single page)
pdf_path: <str>          - Path to PDF file (default: Calypso/PREP-AL 4th Ed 9-26-25.pdf)
output_base: <str>       - Output directory (default: Calypso/output)
mapping_file: <str>      - Page mapping file (default: Calypso/analysis/page_mapping.json)
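
The parameters above could be collected and validated in one place before any tool runs. The following is a minimal sketch, not part of the Calypso tools: the `ExtractParams` class and its `page_indices` method are hypothetical names, and `end_page` is assumed to be inclusive.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ExtractParams:
    """Hypothetical container for the skill inputs; field names mirror the list above."""
    chapter: int
    start_page: int
    end_page: Optional[int] = None
    pdf_path: Path = Path("Calypso/PREP-AL 4th Ed 9-26-25.pdf")
    output_base: Path = Path("Calypso/output")
    mapping_file: Path = Path("Calypso/analysis/page_mapping.json")

    def page_indices(self) -> range:
        """Return the 0-based PDF indices to process (end_page assumed inclusive)."""
        if not 1 <= self.chapter <= 8:
            raise ValueError(f"chapter must be 1-8, got {self.chapter}")
        if self.start_page < 0:
            raise ValueError("start_page is a 0-based PDF index and cannot be negative")
        # A missing end_page means a single page.
        end = self.start_page if self.end_page is None else self.end_page
        if end < self.start_page:
            raise ValueError("end_page must not precede start_page")
        return range(self.start_page, end + 1)
```

Checking the range up front means a bad chapter number or a reversed page range fails before any artifact is written.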

Output Structure

Artifact Files Saved

Per-page artifacts (in output/chapter_XX/page_artifacts/page_YY/):

01_rich_extraction.json - Text spans with metadata
02_page_XX.png - Rendered PDF page image
page_mapping.json - Shared mapping file (symlink or copy)

Extraction data (in analysis/chapter_XX/):

rich_extraction.json - Full extraction for all pages in chapter
page_6_pattern_analysis.json - (Optional) Pattern analysis for specific pages

Images (in output/chapter_XX/images/chapter_XX/):

page_XX_image_*.png - Embedded images from page
page_XX_images.json - Metadata for embedded images
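
To keep file names consistent across steps, the per-page artifact paths could be derived by one helper. This is a sketch under assumptions: `page_artifact_paths` is a hypothetical function, and the two-digit zero padding of the chapter and page directories is inferred from paths such as chapter_02, not confirmed by the tools.

```python
from pathlib import Path


def page_artifact_paths(output_base, chapter, page):
    """Build the per-page artifact paths (hypothetical helper, padding assumed)."""
    page_dir = (
        Path(output_base)
        / f"chapter_{chapter:02d}"
        / "page_artifacts"
        / f"page_{page:02d}"
    )
    return {
        "rich_extraction": page_dir / "01_rich_extraction.json",
        "png": page_dir / f"02_page_{page}.png",
        "page_mapping": page_dir / "page_mapping.json",
    }


paths = page_artifact_paths("Calypso/output", chapter=2, page=15)
print(paths["png"].as_posix())
# Calypso/output/chapter_02/page_artifacts/page_15/02_page_15.png
```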

Rich Extraction JSON Format

{
  "page_number": 16,
  "pdf_index": 15,
  "book_page": 17,
  "chapter": 2,
  "dimensions": {
    "width": 612,
    "height": 792
  },
  "text_spans": [
    {
      "text": "Rights in Real Estate",
      "font": "Arial-BoldMT",
      "size": 27.04,
      "bold": true,
      "italic": false,
      "bbox": {
        "x0": 72,
        "y0": 150,
        "x1": 400,
        "y1": 177
      },
      "color": 0,
      "sequence": 1
    }
  ],
  "analysis": {
    "font_sizes": {
      "27.04": 1,
      "11.04": 45
    },
    "font_styles": {
      "bold_27.04": 1,
      "regular_11.04": 45
    },
    "likely_headings": [
      {
        "text": "Rights in Real Estate",
        "level": 1,
        "confidence": 0.95
      }
    ],
    "likely_paragraphs": [
      {
        "text": "Real property consists of...",
        "type": "body_text"
      }
    ]
  },
  "extraction_timestamp": "2025-11-08T14:30:00Z",
  "extraction_tool": "rich_extractor.py v1.0"
}
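
A record in this format can be checked mechanically. The sketch below is illustrative only: `validate_page` is a hypothetical helper, and the lists of required fields are assumptions drawn from the example above, since the source does not say which fields are mandatory.

```python
import json

# Assumed required fields; the example shows them but does not mark them mandatory.
REQUIRED_PAGE_FIELDS = ("pdf_index", "book_page", "chapter", "dimensions", "text_spans")
REQUIRED_SPAN_FIELDS = ("text", "font", "size", "bbox")


def validate_page(page):
    """Return a list of problems; an empty list means the page record passed."""
    problems = [f"missing or null field: {key}"
                for key in REQUIRED_PAGE_FIELDS if page.get(key) is None]
    if problems:
        return problems  # the checks below need these fields to exist
    width = page["dimensions"]["width"]
    height = page["dimensions"]["height"]
    if not page["text_spans"]:
        problems.append("no text spans")
    for i, span in enumerate(page["text_spans"]):
        missing = [key for key in REQUIRED_SPAN_FIELDS if span.get(key) is None]
        if missing:
            problems.append(f"span {i}: missing {', '.join(missing)}")
            continue
        if span["size"] <= 0:
            problems.append(f"span {i}: non-positive font size {span['size']}")
        box = span["bbox"]
        inside = (0 <= box["x0"] <= box["x1"] <= width
                  and 0 <= box["y0"] <= box["y1"] <= height)
        if not inside:
            problems.append(f"span {i}: bbox extends outside the page")
    return problems


sample = json.loads("""
{
  "pdf_index": 15, "book_page": 17, "chapter": 2,
  "dimensions": {"width": 612, "height": 792},
  "text_spans": [
    {"text": "Rights in Real Estate", "font": "Arial-BoldMT", "size": 27.04,
     "bbox": {"x0": 72, "y0": 150, "x1": 400, "y1": 177}}
  ]
}
""")
print(validate_page(sample))  # prints []
```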

Python Commands to Execute

Step 1: Establish Page Mapping

cd Calypso/tools
python3 read_page_footers.py \
  --start 15 \
  --end 28 \
  --pdf "../PREP-AL 4th Ed 9-26-25.pdf" \
  --output "../analysis/page_mapping.json"

Success indicators:

Command exits with code 0
Page mapping JSON created/updated
All pages in range have entries

Step 2: Extract Rich Data

cd Calypso/tools
python3 rich_extractor.py \
  --pdf "../PREP-AL 4th Ed 9-26-25.pdf" \
  --start 15 \
  --end 28 \
  --output "../analysis/chapter_02/rich_extraction.json"

Success indicators:

Command exits with code 0
JSON file created
File contains text_spans array
All pages in range represented
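
The first two success indicators (exit code 0 and an output file on disk) can be checked by a thin wrapper around each command. This is a sketch, assuming a hypothetical `run_step` helper; it does not inspect the JSON contents.

```python
import subprocess
import sys
from pathlib import Path


def run_step(command, expected_output=None, cwd=None):
    """Run one pipeline command and raise if a success indicator is missing."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with code {result.returncode}: {result.stderr.strip()}"
        )
    if expected_output is not None:
        # Relative output paths are resolved against the command's working directory.
        output_path = Path(cwd or ".") / expected_output
        if not output_path.is_file():
            raise RuntimeError(f"expected output file was not created: {output_path}")
    return result.stdout


# Demonstration with a stand-in command; a real call would pass the
# rich_extractor.py arguments shown above.
print(run_step([sys.executable, "-c", "print('ok')"]).strip())  # prints ok
```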

Step 3: Render to PNG

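cd Calypso/tools
python3 -c "
from pathlib import Path
import fitz  # PyMuPDF

ZOOM = 300 / 72  # PDF units are 1/72 inch, so this zoom renders at 300 DPI
pdf = fitz.open('../PREP-AL 4th Ed 9-26-25.pdf')
for page_idx in range(15, 29):  # PDF indices 15-28 inclusive
    out_dir = Path(f'../output/chapter_02/page_artifacts/page_{page_idx:02d}')
    out_dir.mkdir(parents=True, exist_ok=True)  # pix.save does not create directories
    pix = pdf[page_idx].get_pixmap(matrix=fitz.Matrix(ZOOM, ZOOM))
    pix.save(str(out_dir / f'02_page_{page_idx}.png'))
pdf.close()
"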
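
Zoom factor and DPI are related by a fixed ratio, because PDF page units are points (1/72 inch) and an unscaled render is therefore 72 DPI. A zoom of 3 corresponds to 216 DPI, and 300 DPI corresponds to a zoom of 300/72. The helpers below are a small worked example with hypothetical names, not part of the Calypso tools.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def dpi_for_zoom(zoom):
    """Resolution produced by scaling the default 72 DPI render by `zoom`."""
    return PDF_POINTS_PER_INCH * zoom


def zoom_for_dpi(dpi):
    """Zoom factor needed to reach a target resolution."""
    return dpi / PDF_POINTS_PER_INCH


def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at `dpi`."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


print(dpi_for_zoom(3))            # 216
print(pixel_size(612, 792, 300))  # (2550, 3300)
```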

Step 4: Extract Images (if present)

cd Calypso/tools
# For each page with images
python3 extract_images.py \
  --page 17 \
  --pdf "../PREP-AL 4th Ed 9-26-25.pdf" \
  --output "../output" \
  --mapping "../analysis/page_mapping.json"

Quality Checks

Before declaring extraction complete:

File existence
- [ ] 01_rich_extraction.json exists
- [ ] 02_page_XX.png exists and is valid
- [ ] page_mapping.json exists
JSON validity
- [ ] JSON files parse without errors
- [ ] All required fields present
- [ ] No null/undefined values in critical fields
Data completeness
- [ ] All pages in range have text_spans
- [ ] Text content is not empty
- [ ] Font sizes are reasonable (> 0)
- [ ] Bounding boxes are within page dimensions
Image quality
- [ ] PNG files are readable
- [ ] Image dimensions match the PDF page size multiplied by the render zoom factor
- [ ] No corrupted or blank images
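
The image checks can be partly automated with the standard library alone. The sketch below reads the width and height from a PNG header and compares them with the page size in points multiplied by the render zoom. `png_size` and `size_matches` are hypothetical helpers, the one-pixel tolerance is an assumption to absorb rounding, and a header check does not detect blank or truncated images.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) read from the IHDR chunk at the start of PNG bytes."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a valid PNG header")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def size_matches(png_wh, page_wh_pt, zoom, tolerance=1):
    """Compare pixel size against page size in points times the render zoom."""
    return all(abs(px - pt * zoom) <= tolerance
               for px, pt in zip(png_wh, page_wh_pt))
```

For a file on disk, the first 24 bytes are enough: `png_size(open(path, "rb").read(24))`.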

Error Handling

If PDF file not found:

Exit with error message
Do not create partial artifacts

If page mapping fails:

Fall back to default indexing (PDF index = book page - 1)
Log warning
Continue extraction
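
The fallback rule can be expressed as a single lookup. This sketch assumes the mapping file loads as a dictionary keyed by the PDF index as a string (JSON object keys are always strings); the actual layout of page_mapping.json is not shown in the source, and `resolve_book_page` is a hypothetical name.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger("pdf_page_extract")


def resolve_book_page(pdf_index: int, mapping: Optional[Mapping[str, int]]) -> int:
    """Look up the book page, falling back to PDF index = book page - 1."""
    key = str(pdf_index)
    if mapping is not None and key in mapping:
        return int(mapping[key])
    logger.warning("No mapping for PDF index %d; using default indexing", pdf_index)
    return pdf_index + 1
```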

If rich extraction produces no text:

Check if page is image-only
Mark in metadata: "page_type": "image_only"
Continue (ASCII preview will handle image OCR)
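
A minimal sketch of the marking step, assuming a page with no non-blank text spans is treated as image-only (a truly blank page would be tagged the same way). `mark_if_image_only` is a hypothetical helper.

```python
def mark_if_image_only(page):
    """Return a copy of the page record, tagged when it has no extractable text."""
    has_text = any((span.get("text") or "").strip()
                   for span in page.get("text_spans", []))
    if has_text:
        return dict(page)
    return {**page, "page_type": "image_only"}
```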

If PNG rendering fails:

Use fallback: save the raw page as a single-page PDF in place of the PNG
Log warning
Continue to next step

Persistence & Traceability

All artifacts include metadata:

Extraction timestamp
Tool version
Input parameters
Processing status

This enables:

Reproducibility (re-extract with same parameters)
Debugging (trace what data was extracted)
Auditing (track all changes to artifacts)
Caching (skip re-extraction if unchanged)
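
One way to support both traceability and caching is to store a fingerprint of the input parameters alongside the other metadata. The sketch below is an assumption-laden illustration: `extraction_timestamp` and `extraction_tool` follow the JSON example earlier in this document, while `input_parameters`, `params_sha256`, `processing_status`, and the `"complete"` status value are hypothetical. It also ignores changes to the PDF file itself, which a real cache check would need to cover.

```python
import hashlib
import json
from datetime import datetime, timezone


def params_fingerprint(params):
    """Hash the parameters in a key-order-independent canonical form."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_metadata(params, tool_version, status, now=None):
    """Assemble the metadata block stored with each artifact."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "extraction_timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "extraction_tool": tool_version,
        "input_parameters": params,
        "params_sha256": params_fingerprint(params),
        "processing_status": status,
    }


def can_skip(previous, params, tool_version):
    """True when an earlier, completed run used the same tool and parameters."""
    return (
        previous is not None
        and previous.get("processing_status") == "complete"
        and previous.get("extraction_tool") == tool_version
        and previous.get("params_sha256") == params_fingerprint(params)
    )
```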

Success Criteria

✓ All required files created in correct directories
✓ Rich extraction JSON is valid and complete
✓ PNG image renders correctly
✓ Page mapping is accurate
✓ All data persisted and ready for next skill
✓ No extraction errors or warnings

Next Steps

Once extraction completes successfully:

Skill 2 will create ASCII preview from extracted data
Skill 3 will use extraction + PNG + ASCII for HTML generation
All artifacts available for validation and debugging

Troubleshooting

PDF won't open: Verify the file path and ensure the PDF is not corrupted.
No text extracted: The page may be image-only (OCR needed).
Wrong page numbers: Check page_mapping.json for accuracy.
PNG images are blank: Try increasing the zoom factor. PyMuPDF renders at 72 DPI by default, so 3x gives 216 DPI; 300 DPI needs a zoom of 300/72 (about 4.17).

Implementation Notes

This skill is fully deterministic - same inputs always produce same outputs
Python tools ensure data quality and consistency
All files saved to persistent storage for audit trail
No AI involved at this stage - pure data extraction
Ready to support later AI-based HTML generation with complete context


Extract rich data from PDF pages including text spans with metadata, rendered PNG images, and page mapping. Creates persistent artifacts for downstream processing.

230 stars

data-ai

Updated Mar 27, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add aiskillstore/marketplace pdf-page-extract

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy - Container and dependency vulnerability scanner: Clean (95%)
Semgrep - Static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk) - Model Context Protocol security validation: Clean (95%)
Snyk (dep) - Open source security scanning: Skipped (50%)
Socket.dev - Supply chain security analysis: Skipped (50%)
VirusTotal - Multi-engine malware detection: Skipped (50%)
CrowdStrike - Advanced threat intelligence: Skipped (50%)
OSV-Scanner - Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Mar 28, 2026, 4:27 PM (36.1s, 2 files scanned)




Install manually


# Clone the repo
git clone https://github.com/aiskillstore/marketplace.git

# Copy into Claude Code skills folder (global)
cp -r marketplace/skills/abejitsu/pdf-page-extract ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

aiskillstore/marketplace

230 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT