Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

pr-e/pdf-text-extractor

Name: pdf-text-extractor
Author: pr-e

skills/pdf-text-extractor/SKILL.md

npx skillsauth add pr-e/openclaw-master-skills pdf-text-extractor


PDF-Text-Extractor - Extract Text from PDFs

Vernox Utility Skill - Perfect for document digitization.

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Features

✅ Text Extraction

Extract text from PDFs without external tools
Support for both text-based and scanned PDFs
Preserve document structure and formatting
Fast extraction (milliseconds for text-based)

✅ OCR Support

Use Tesseract.js for scanned documents
Support multiple languages (English, Spanish, French, German)
Configurable OCR quality/speed
Fall back to text extraction when possible

✅ Batch Processing

Process multiple PDFs at once
Batch extraction for document workflows
Progress tracking for large files
Error handling and retry logic

✅ Output Options

Plain text output
JSON output with metadata
Markdown conversion
HTML output (preserving links)

✅ Utility Features

Page-by-page extraction
Character/word counting
Language detection
Metadata extraction (author, title, creation date)

Installation

clawhub install pdf-text-extractor

Quick Start

Extract Text from PDF

const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});

console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);

Batch Extract Multiple PDFs

const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});

// extractBatch returns a summary object; per-file results are in batch.results
console.log(`Extracted ${batch.successCount} PDFs`);

Extract with OCR

const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

// OCR will be used (scanned document detected)

Tool Functions

`extractText`

Extract text content from a single PDF file.

Parameters:

pdfPath (string, required): Path to PDF file
options (object, optional): Extraction options
- outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
- ocr (boolean): Enable OCR for scanned docs
- language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
- preserveFormatting (boolean): Keep headings/structure
- minConfidence (number): Minimum OCR confidence score (0-100)

Returns:

text (string): Extracted text content
pages (number): Number of pages processed
wordCount (number): Total word count
charCount (number): Total character count
language (string): Detected language
metadata (object): PDF metadata (title, author, creation date)
method (string): 'text' or 'ocr' (extraction method)
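
The option constraints listed above can be checked before extraction starts. The following is a minimal sketch, not the skill's own code: `normalizeOptions` is a hypothetical helper, and every default value in it is an assumption for illustration.

```javascript
// Hypothetical helper (not part of the skill): checks the documented
// extractText options and fills in assumed defaults.
const OUTPUT_FORMATS = ['text', 'json', 'markdown', 'html'];
const LANGUAGES = ['eng', 'spa', 'fra', 'deu'];

function normalizeOptions(options = {}) {
  const normalized = {
    outputFormat: 'text',     // assumed default
    ocr: false,               // assumed default
    language: 'eng',          // assumed default
    preserveFormatting: true, // assumed default
    minConfidence: 0,         // assumed default: accept any OCR confidence
    ...options,
  };

  if (!OUTPUT_FORMATS.includes(normalized.outputFormat)) {
    throw new RangeError(`Unsupported outputFormat: ${normalized.outputFormat}`);
  }
  if (!LANGUAGES.includes(normalized.language)) {
    throw new RangeError(`Unsupported language: ${normalized.language}`);
  }
  const { minConfidence } = normalized;
  if (!Number.isFinite(minConfidence) || minConfidence < 0 || minConfidence > 100) {
    throw new RangeError('minConfidence must be a number between 0 and 100');
  }
  return normalized;
}
```

Rejecting bad options up front keeps a typo such as `outputFormat: 'pdf'` from surfacing only after a slow OCR pass.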

`extractBatch`

Extract text from multiple PDF files at once.

Parameters:

pdfFiles (array, required): Array of PDF file paths
options (object, optional): Same as extractText

Returns:

results (array): Array of extraction results
totalPages (number): Total pages across all PDFs
successCount (number): Successfully extracted
failureCount (number): Failed extractions
errors (array): Error details for failures
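
To make the return shape concrete, here is a sketch of how a batch summary like the one above could be assembled. It is an illustration under assumptions, not the skill's implementation: `runBatch` and `extractOne` are hypothetical names, and `extractOne` stands in for whatever function extracts a single file.

```javascript
// Sketch only: `extractOne` is a stand-in for a single-file extractor that
// resolves to an object with a numeric `pages` field, or rejects on failure.
async function runBatch(pdfFiles, extractOne, maxConcurrent = 3) {
  const results = [];
  const errors = [];
  let next = 0;

  // Each worker pulls the next unprocessed file until none remain.
  async function worker() {
    while (next < pdfFiles.length) {
      const file = pdfFiles[next++];
      try {
        results.push(await extractOne(file));
      } catch (err) {
        // A failed file is recorded and skipped; the batch continues.
        errors.push({ file, message: err.message });
      }
    }
  }

  const workerCount = Math.min(maxConcurrent, pdfFiles.length);
  await Promise.all(Array.from({ length: workerCount }, worker));

  return {
    results,
    totalPages: results.reduce((sum, r) => sum + r.pages, 0),
    successCount: results.length,
    failureCount: errors.length,
    errors,
  };
}
```

Because files finish at different times, `results` in this sketch is in completion order rather than input order.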

`countWords`

Count words in extracted text.

Parameters:

text (string, required): Text to count
options (object, optional):
- minWordLength (number): Minimum characters per word (default: 3)
- excludeNumbers (boolean): Don't count numbers as words
- countByPage (boolean): Return word count per page

Returns:

wordCount (number): Total word count
charCount (number): Total character count
pageCounts (array): Word count per page
averageWordsPerPage (number): Average words per page
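
The counting options above can be illustrated with a short sketch. `countWordsSketch` is a hypothetical stand-in, not the skill's `countWords`. It assumes that words are whitespace-separated tokens and that pages are separated by a form feed character (`\f`); the document does not say how the skill marks page breaks.

```javascript
// Hypothetical re-implementation for illustration only.
function countWordsSketch(text, options = {}) {
  const { minWordLength = 3, excludeNumbers = false, countByPage = false } = options;

  // Counts the tokens in a chunk of text that pass the configured filters.
  const countIn = (chunk) =>
    chunk.split(/\s+/).filter((token) => {
      if (token.length < Math.max(1, minWordLength)) return false;
      if (excludeNumbers && /^\d+([.,]\d+)*$/.test(token)) return false;
      return true;
    }).length;

  const result = { wordCount: countIn(text), charCount: text.length };
  if (countByPage) {
    // Assumption: pages are delimited by form feed characters.
    result.pageCounts = text.split('\f').map(countIn);
    result.averageWordsPerPage = result.wordCount / result.pageCounts.length;
  }
  return result;
}
```

Note that with the documented default of `minWordLength: 3`, short words such as "is" or "or" are not counted.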

`detectLanguage`

Detect the language of extracted text.

Parameters:

text (string, required): Text to analyze
minConfidence (number): Minimum confidence for detection

Returns:

language (string): Detected language code
languageName (string): Full language name
confidence (number): Confidence score (0-100)
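
As a rough illustration of the return shape, the sketch below scores text against tiny stopword lists. This is not how the skill detects language (the document does not say); `detectLanguageSketch`, the word lists, and the confidence formula are all assumptions made for the example.

```javascript
// Illustrative stopword lists; a real detector would use far richer data.
const STOPWORDS = {
  eng: ['the', 'and', 'of', 'to', 'is', 'in'],
  spa: ['el', 'los', 'que', 'y', 'es', 'una'],
  fra: ['le', 'les', 'et', 'est', 'des', 'une'],
  deu: ['der', 'die', 'und', 'das', 'ist', 'nicht'],
};
const LANGUAGE_NAMES = { eng: 'English', spa: 'Spanish', fra: 'French', deu: 'German' };

function detectLanguageSketch(text, minConfidence = 50) {
  // Split on anything that is not a Unicode letter.
  const words = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);

  const hits = {};
  let totalHits = 0;
  for (const [code, list] of Object.entries(STOPWORDS)) {
    hits[code] = words.filter((word) => list.includes(word)).length;
    totalHits += hits[code];
  }
  if (totalHits === 0) return { language: null, languageName: null, confidence: 0 };

  // Confidence here is the winning language's share of all stopword hits.
  const [best] = Object.entries(hits).sort((a, b) => b[1] - a[1]);
  const confidence = Math.round((best[1] / totalHits) * 100);
  if (confidence < minConfidence) return { language: null, languageName: null, confidence };
  return { language: best[0], languageName: LANGUAGE_NAMES[best[0]], confidence };
}
```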

Use Cases

Document Digitization

Convert paper documents to digital text
Process invoices and receipts
Digitize contracts and agreements
Archive physical documents

Content Analysis

Extract text for analysis tools
Prepare content for LLM processing
Clean up scanned documents
Parse PDF-based reports

Data Extraction

Extract data from PDF reports
Parse tables from PDFs
Pull structured data
Automate document workflows

Text Processing

Prepare content for translation
Clean up OCR output
Extract specific sections
Search within PDF content

Performance

Text-Based PDFs

Speed: ~100ms for 10-page PDF
Accuracy: 100% (exact text)
Memory: ~10MB for typical document

OCR Processing

Speed: ~1-3s per page (high quality)
Accuracy: 85-95% (depends on scan quality)
Memory: ~50-100MB peak during OCR
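
The figures above can be turned into a rough time estimate. This sketch assumes processing time scales linearly with page count, which the document does not claim; `estimateDurationMs` is a hypothetical helper.

```javascript
// Per-page figures derived from the estimates quoted above.
const TEXT_MS_PER_PAGE = 10;          // ~100 ms for a 10-page PDF
const OCR_MS_PER_PAGE = [1000, 3000]; // ~1-3 s per page at high quality

// Returns a { min, max } range in milliseconds, assuming linear scaling.
function estimateDurationMs(pageCount, usesOcr) {
  if (usesOcr) {
    return {
      min: pageCount * OCR_MS_PER_PAGE[0],
      max: pageCount * OCR_MS_PER_PAGE[1],
    };
  }
  const ms = pageCount * TEXT_MS_PER_PAGE;
  return { min: ms, max: ms };
}
```

Under these assumptions a 50-page scanned document would take roughly 50 to 150 seconds with OCR, against about half a second for a text-based PDF of the same length.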

Technical Details

PDF Parsing

Uses the PDF.js library
Extracts text layer directly (no OCR needed)
Preserves document structure
Handles password-protected PDFs

OCR Engine

Tesseract.js under the hood
Tesseract supports 100+ languages; this skill documents four language codes ('eng', 'spa', 'fra', 'deu')
Adjustable quality/speed tradeoff
Confidence scoring for accuracy

Dependencies

No external dependencies to install
PDF.js included in skill
Tesseract.js bundled
Otherwise uses Node.js built-in modules only

Error Handling

Invalid PDF

Clear error message
Suggest fix (check file format)
Skip to next file in batch

OCR Failure

Report confidence score
Suggest rescan at higher quality
Fall back to basic extraction
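
One way to picture the interplay between text-layer extraction, OCR, and confidence reporting is the sketch below. It is an assumption-laden illustration, not the skill's logic: `extractWithFallback`, `readTextLayer`, and `runOcr` are hypothetical names, and the default threshold of 60 is arbitrary.

```javascript
// Sketch only. `readTextLayer(path)` is assumed to resolve to a string and
// `runOcr(path)` to an object shaped like { text, confidence }.
async function extractWithFallback(pdfPath, { readTextLayer, runOcr, minConfidence = 60 }) {
  // Prefer the embedded text layer: it is faster and exact.
  const embedded = (await readTextLayer(pdfPath)).trim();
  if (embedded.length > 0) {
    return { text: embedded, method: 'text' };
  }

  // No usable text layer, so treat the file as scanned and run OCR.
  const ocr = await runOcr(pdfPath);
  if (ocr.confidence < minConfidence) {
    return {
      text: '',
      method: 'ocr',
      confidence: ocr.confidence,
      warning: 'Low OCR confidence; consider rescanning at higher quality',
    };
  }
  return { text: ocr.text, method: 'ocr', confidence: ocr.confidence };
}
```

Passing the two readers in as arguments keeps the decision logic testable without real PDF files.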

Memory Issues

Stream processing for large files
Progress reporting
Graceful degradation

Configuration

Edit `config.json`:

{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
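
If only a few settings need changing, a partial override can be merged over the values shown above. This is a sketch under assumptions: `mergeConfig` is a hypothetical helper, and the document does not say whether the skill itself accepts partial configuration files.

```javascript
// Defaults copied from the config.json shown above.
const DEFAULT_CONFIG = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium', languages: ['eng', 'spa', 'fra', 'deu'] },
  output: { defaultFormat: 'text', preserveFormatting: true, includeMetadata: true },
  batch: { maxConcurrent: 3, timeoutSeconds: 30 },
};

const isPlainObject = (value) =>
  value !== null && typeof value === 'object' && !Array.isArray(value);

// Recursively overlays `overrides` on `defaults` without mutating either.
// Arrays and primitive values are replaced wholesale.
function mergeConfig(defaults, overrides) {
  const merged = { ...defaults };
  for (const [key, value] of Object.entries(overrides)) {
    merged[key] = isPlainObject(value) && isPlainObject(defaults[key])
      ? mergeConfig(defaults[key], value)
      : value;
  }
  return merged;
}

// Example: raise OCR quality and process one file at a time.
const userConfig = JSON.parse('{"ocr": {"quality": "high"}, "batch": {"maxConcurrent": 1}}');
const effectiveConfig = mergeConfig(DEFAULT_CONFIG, userConfig);
```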

Examples

Extract from Invoice

const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."

Extract from Scanned Contract

const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."

Batch Process Documents

const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);

Troubleshooting

OCR Not Working

Check if PDF is truly scanned (not text-based)
Try different quality settings (low/medium/high)
Ensure language matches document
Check image quality of scan

Extraction Returns Empty

PDF may be image-only
OCR failed with low confidence
Try different language setting

Slow Processing

Large PDF takes longer
Reduce quality for speed
Process in smaller batches
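
Splitting a long file list into smaller batches can be done with a small helper. `chunk` below is a hypothetical utility, not something the skill provides; each resulting group could then be passed to a batch call in turn.

```javascript
// Splits `items` into consecutive groups of at most `size` elements.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Example: five files in groups of two.
const groups = chunk(['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf'], 2);
```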

Tips

Best Results

Use text-based PDFs when possible (faster, 100% accurate)
High-quality scans for OCR (300 DPI+)
Clean background before scanning
Use correct language setting

Performance Optimization

Batch processing for multiple files
Disable OCR for text-based PDFs
Lower OCR quality for speed when acceptable

Roadmap

[ ] PDF/A support
[ ] Advanced OCR pre-processing
[ ] Table extraction from OCR
[ ] Handwriting OCR
[ ] PDF form field extraction
[ ] Batch language detection
[ ] Confidence scoring visualization

License

MIT

Extract text from PDFs. Fast, accurate, zero dependencies. 🔮


Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

2 stars

testing

Updated Apr 27, 2026

$ install --global

skillsauth

npx skillsauth add pr-e/openclaw-master-skills pdf-text-extractor

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Clean · Trivy · Container and dependency vulnerability scanner · 95%
Clean · Semgrep · Static code analysis for vulnerabilities · 95%
Clean · mcp-scan (Snyk) · Model Context Protocol security validation · 95%
Skipped · Snyk (dep) · Open source security scanning · 50%
Skipped · Socket.dev · Supply chain security analysis · 50%
Skipped · VirusTotal · Multi-engine malware detection · 50%
Skipped · CrowdStrike · Advanced threat intelligence · 50%
Skipped · OSV-Scanner · Open Source Vulnerability database check · 50%
Skipped · OWASP Dep-Check · 50%

Last scanned: Apr 27, 2026, 8:56 AM · 54.5s · 6 files scanned

SKILL.md

name: pdf-text-extractor
description: Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
version: 1.0.0
author: Vernox
license: MIT
tags: ["pdf", "ocr", "text", "extraction", "document", "digitization"]
category: tools


Related Skills

pr-e/youtube-watcher

development

Verified · Trusted · Community

Fetch and read transcripts from YouTube videos. Use when you need to summarize a video, answer questions about its content, or extract information from it.

2 · SKILL.md · Updated May 9, 2026

pr-e/youtube-transcript

devops

Verified · Trusted · Community

Fetch and summarize YouTube video transcripts. Use when asked to summarize, transcribe, or extract content from YouTube videos. Handles transcript fetching via residential IP proxy to bypass YouTube's cloud IP blocks.

2 · SKILL.md · Updated May 9, 2026

pr-e/skills/youtube-auto-captions

content-media

Verified · Trusted · Community

# youtube-auto-captions - YouTube Auto Captions ## Description Automatically generates captions for YouTube videos, with multi-language translation and timeline alignment. Improves video accessibility and SEO. ## Pricing - **Pay per use**: ¥9 per use - Up to 60 minutes per video - Supports 50+ languages ## Usage ```bash # Generate captions /youtube-auto-captions --video <video_id> --lang zh # Translate captions /youtube-auto-captions --video <video_id> --translate en,ja,ko # Batch processing /youtube-auto-captions --playlist <playlist_id> --lang zh # Export captions /youtube-auto-captions --video <video_id> --export srt ``` ## Skill directory `~/.openclaw/workspace/skills/youtube-auto-captions/` ## Author Zhang sir #

2 · SKILL.md · Updated May 9, 2026

pr-e/youtube

development

Verified · Trusted · Community

YouTube Data API integration with managed OAuth. Search videos, manage playlists, access channel data, and interact with comments. Use this skill when users want to interact with YouTube. For other third party apps, use the api-gateway skill (https://clawhub.ai/byungkyu/api-gateway).

2 · SKILL.md · Updated May 9, 2026


Install manually

Choose your platform

# Clone the repo
git clone https://github.com/pr-e/openclaw-master-skills.git

# Copy into Claude Code skills folder (global)
cp -r openclaw-master-skills/skills/pdf-text-extractor ~/.claude/skills/


Repository

pr-e/openclaw-master-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
