skills/abejitsu/pdf-page-extract/SKILL.md
Extract rich data from PDF pages including text spans with metadata, rendered PNG images, and page mapping. Creates persistent artifacts for downstream processing.
This skill extracts all necessary data from PDF pages to enable accurate AI-driven HTML generation. It produces three critical artifacts: text spans with metadata, a rendered PNG image of each page, and the page mapping.
This is the deterministic, Python-based foundation for the entire pipeline. All extracted data is saved to persistent files for traceability and future processing.
1. Validate input parameters
2. Establish page mapping (if not already done)
   - python3 Calypso/tools/read_page_footers.py
   - analysis/page_mapping.json
3. Extract rich page data using PyMuPDF and pdfplumber
   - python3 Calypso/tools/rich_extractor.py
   - analysis/chapter_XX/rich_extraction.json
4. Render PDF page to PNG
   - output/chapter_XX/page_artifacts/page_YY/02_page_XX.png
5. Extract embedded images (if present)
   - python3 Calypso/tools/extract_images.py
   - output/chapter_XX/images/page_YY_image_*.png
   - page_YY_images.json
6. Validate extraction completeness
chapter: <int> - Chapter number (1-8)
start_page: <int> - Starting PDF index (0-based) or page range
end_page: <int> - Ending PDF index (optional if single page)
pdf_path: <str> - Path to PDF file (default: Calypso/PREP-AL 4th Ed 9-26-25.pdf)
output_base: <str> - Output directory (default: Calypso/output)
mapping_file: <str> - Page mapping file (default: Calypso/analysis/page_mapping.json)
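The parameter rules above can be checked before any extraction runs. The following is a minimal sketch, not the skill's actual validation code: `ExtractParams` and `validate_params` are hypothetical names, and the only rules enforced are the ones stated in the parameter list (chapter 1-8, 0-based start index, optional end index).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractParams:
    """Hypothetical container mirroring the documented parameters."""
    chapter: int
    start_page: int
    end_page: Optional[int] = None  # None means a single page
    pdf_path: str = "Calypso/PREP-AL 4th Ed 9-26-25.pdf"
    output_base: str = "Calypso/output"
    mapping_file: str = "Calypso/analysis/page_mapping.json"


def validate_params(p: ExtractParams) -> List[str]:
    """Return a list of problems; an empty list means the parameters pass."""
    problems = []
    if not 1 <= p.chapter <= 8:
        problems.append(f"chapter must be 1-8, got {p.chapter}")
    if p.start_page < 0:
        problems.append(f"start_page is a 0-based index, got {p.start_page}")
    if p.end_page is not None and p.end_page < p.start_page:
        problems.append(
            f"end_page {p.end_page} precedes start_page {p.start_page}"
        )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every bad parameter at once. File-existence checks for `pdf_path` are left out of this sketch.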
Per-page artifacts (in output/chapter_XX/page_artifacts/page_YY/):
- 01_rich_extraction.json - Text spans with metadata
- 02_page_XX.png - Rendered PDF page image
- page_mapping.json - Shared mapping file (symlink or copy)
Extraction data (in analysis/chapter_XX/):
- rich_extraction.json - Full extraction for all pages in chapter
- page_6_pattern_analysis.json - (Optional) Pattern analysis for specific pages
Images (in output/chapter_XX/images/chapter_XX/):
- page_XX_image_*.png - Embedded images from page
- page_XX_images.json - Metadata for embedded images
{
"page_number": 16,
"pdf_index": 15,
"book_page": 17,
"chapter": 2,
"dimensions": {
"width": 612,
"height": 792
},
"text_spans": [
{
"text": "Rights in Real Estate",
"font": "Arial-BoldMT",
"size": 27.04,
"bold": true,
"italic": false,
"bbox": {
"x0": 72,
"y0": 150,
"x1": 400,
"y1": 177
},
"color": 0,
"sequence": 1
}
],
"analysis": {
"font_sizes": {
"27.04": 1,
"11.04": 45
},
"font_styles": {
"bold_27.04": 1,
"regular_11.04": 45
},
"likely_headings": [
{
"text": "Rights in Real Estate",
"level": 1,
"confidence": 0.95
}
],
"likely_paragraphs": [
{
"text": "Real property consists of...",
"type": "body_text"
}
]
},
"extraction_timestamp": "2025-11-08T14:30:00Z",
"extraction_tool": "rich_extractor.py v1.0"
}
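The `analysis.font_sizes` and `analysis.font_styles` counts in the example above can be derived directly from `text_spans`. The sketch below shows one way to do that; `summarize_fonts` is a hypothetical helper, and the key formats (`"27.04"`, `"bold_27.04"`, `"regular_11.04"`) are inferred from the sample JSON rather than taken from `rich_extractor.py`, whose actual logic may differ (for instance in how italic spans are labeled).

```python
from collections import Counter


def summarize_fonts(text_spans):
    """Count spans per font size and per weight/size combination.

    Key formats are an assumption based on the example JSON.
    """
    font_sizes = Counter()
    font_styles = Counter()
    for span in text_spans:
        size_key = f"{span['size']:.2f}"  # e.g. 27.04 -> "27.04"
        weight = "bold" if span.get("bold") else "regular"
        font_sizes[size_key] += 1
        font_styles[f"{weight}_{size_key}"] += 1
    return {
        "font_sizes": dict(font_sizes),
        "font_styles": dict(font_styles),
    }
```

A histogram like this is what makes heading detection possible downstream: a size that occurs once on a page and is much larger than the dominant body size is a plausible heading candidate.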
cd Calypso/tools
python3 read_page_footers.py \
--start 15 \
--end 28 \
--pdf "../PREP-AL 4th Ed 9-26-25.pdf" \
--output "../analysis/page_mapping.json"
Success indicators:
cd Calypso/tools
python3 rich_extractor.py \
--pdf "../PREP-AL 4th Ed 9-26-25.pdf" \
--start 15 \
--end 28 \
--output "../analysis/chapter_02/rich_extraction.json"
Success indicators:
cd Calypso/tools
python3 -c "
import os
import fitz  # PyMuPDF

pdf = fitz.open('../PREP-AL 4th Ed 9-26-25.pdf')
for page_idx in range(15, 29):  # PDF indices 15-28 inclusive
    page = pdf[page_idx]
    pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))  # 300% zoom for high-res
    out_dir = f'../output/chapter_02/page_artifacts/page_{page_idx:02d}'
    os.makedirs(out_dir, exist_ok=True)  # ensure the page directory exists
    pix.save(f'{out_dir}/02_page_{page_idx}.png')
pdf.close()
"
cd Calypso/tools
# For each page with images
python3 extract_images.py \
--page 17 \
--pdf "../PREP-AL 4th Ed 9-26-25.pdf" \
--output "../output" \
--mapping "../analysis/page_mapping.json"
Before declaring extraction complete:
File existence
- 01_rich_extraction.json exists
- 02_page_XX.png exists and is valid
- page_mapping.json exists
JSON validity
Data completeness
Image quality
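The file-existence and JSON-validity checks can be automated per page directory. This is a minimal sketch under stated assumptions: `check_page_artifacts` is a hypothetical helper, "valid" PNG is approximated by checking the 8-byte PNG signature only, and the data-completeness and image-quality checks are not covered.

```python
import json
from pathlib import Path

# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_page_artifacts(page_dir, png_name):
    """Return a list of problems found in one page_artifacts/page_YY directory."""
    page_dir = Path(page_dir)
    json_path = page_dir / "01_rich_extraction.json"
    png_path = page_dir / png_name
    mapping_path = page_dir / "page_mapping.json"
    problems = []

    # File existence
    for path in (json_path, png_path, mapping_path):
        if not path.is_file():
            problems.append(f"missing: {path.name}")

    # JSON validity
    for path in (json_path, mapping_path):
        if path.is_file():
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"invalid JSON in {path.name}: {exc.msg}")

    # PNG signature check (a cheap stand-in for full image validation)
    if png_path.is_file():
        with png_path.open("rb") as fh:
            if fh.read(8) != PNG_SIGNATURE:
                problems.append(f"not a PNG file: {png_path.name}")

    return problems
```

An empty return value means the directory passed these checks. A stricter version could decode the image to confirm it is not blank.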
If PDF file not found:
If page mapping fails:
If rich extraction produces no text:
"page_type": "image_only"
If PNG rendering fails:
All artifacts include metadata:
This enables:
✓ All required files created in correct directories
✓ Rich extraction JSON is valid and complete
✓ PNG image renders correctly
✓ Page mapping is accurate
✓ All data persisted and ready for next skill
✓ No extraction errors or warnings
Once extraction completes successfully:
PDF won't open: Verify file path, ensure PDF is not corrupted
No text extracted: Page may be image-only (OCR needed)
Wrong page numbers: Check page_mapping.json for accuracy
PNG images are blank: Try increasing zoom factor (3x zoom = 216 DPI, because PDF pages are measured at 72 units per inch)
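When tuning the zoom factor for PNG rendering, it helps to convert between zoom and resolution explicitly. PDF user space is defined at 72 units per inch, so a zoom of `z` yields `72 * z` DPI. The helpers below are a small illustrative sketch (the names `zoom_for_dpi` and `pixel_size` are hypothetical); the resulting zoom is what would be passed to `fitz.Matrix(zoom, zoom)`.

```python
PDF_BASE_DPI = 72  # PDF user space: 72 units per inch


def zoom_for_dpi(target_dpi):
    """Zoom factor that renders a PDF page at the requested DPI."""
    return target_dpi / PDF_BASE_DPI


def pixel_size(width_pt, height_pt, zoom):
    """Approximate rendered size in pixels for a page given in points."""
    return round(width_pt * zoom), round(height_pt * zoom)
```

For the 612 x 792 point page in the example extraction, a zoom of 3 gives an image of about 1836 x 2376 pixels, and a true 300 DPI render needs a zoom of roughly 4.17.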