plugins/lobbi-document-intelligence/skills/pdf-extractor/SKILL.md
Design AI extraction configurations for structured and semi-structured insurance and mortgage PDFs. Use when configuring a document AI model to extract data fields from applications, policy documents, EOBs, pay stubs, tax returns, or bank statements.
Design the extraction configuration for AI-based data extraction from insurance and mortgage PDFs. Covers document type identification, field definition, confidence thresholds, model selection, pre-processing requirements, and structured output schema.
Before extracting fields, the system must identify what type of document is being processed. Route to the correct extraction model based on document type.
Document classification first: Run a classifier (see form-classifier skill) before the extraction model. The classifier returns a document type label that selects the correct field extraction configuration.
Document types and extraction priority:

| Document Type | Extraction Priority | Typical Use |
|--------------|---------------------|------------|
| ACORD 80 (personal auto application) | High | Insurance new business |
| ACORD 125 (commercial applicant) | High | Commercial lines |
| Policy declarations page | High | Policy verification |
| Loss run | High | Underwriting |
| Pay stub | High | Mortgage income verification |
| W-2 | High | Mortgage income verification |
| 1040 tax return | High | Mortgage income + self-employed |
| 1003 Uniform Residential Loan Application | High | Mortgage processing |
| Bank statement | High | Mortgage asset verification |
| Explanation of Benefits (EOB) | Medium | Claims / medical |
| Certificate of insurance | Medium | Proof of coverage |
| MVR (Motor Vehicle Record) | Medium | Insurance underwriting |
| Inspection report | Medium | Property insurance |

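
The classify-then-route step described above can be sketched as a lookup from the classifier's label to an extraction configuration. This is a minimal sketch: the label strings, the model names, and the `ExtractionConfig` and `select_config` names are all illustrative assumptions, not part of the form-classifier skill's actual output contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractionConfig:
    model_name: str  # hypothetical model identifier
    priority: str    # "High" or "Medium", as in the document type table


# Hypothetical registry keyed by classifier label (labels are assumptions).
EXTRACTION_CONFIGS = {
    "pay_stub": ExtractionConfig("custom-pay-stub", "High"),
    "w2": ExtractionConfig("prebuilt-w2", "High"),
    "bank_statement": ExtractionConfig("custom-bank-statement", "High"),
    "eob": ExtractionConfig("custom-eob", "Medium"),
}


def select_config(document_type: str) -> Optional[ExtractionConfig]:
    """Return the config for a classifier label, or None if the type is unrouted."""
    return EXTRACTION_CONFIGS.get(document_type)
```

Returning `None` for an unknown label leaves the caller free to send unrecognized documents to manual handling rather than guessing a model.
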
Pay stub fields:

| Field Name | Data Type | Required | Validation Rule | Extraction Note |
|-----------|-----------|----------|----------------|----------------|
| employer_name | string | Required | Non-empty | Top of stub; sometimes in logo area |
| employee_name | string | Required | Non-empty | Matches borrower name on 1003 |
| employee_ssn_last4 | string | Optional | 4 digits | Often masked; extract if visible |
| pay_period_start | date | Required | Valid date | Format varies (MM/DD/YYYY, etc.) |
| pay_period_end | date | Required | Valid date | Must be more recent than prior stub |
| pay_date | date | Required | Valid date | Used to determine how current the stub is |
| pay_frequency | enum | Required | Weekly/Bi-Weekly/Semi-Monthly/Monthly | Calculate annual equivalent |
| gross_pay_this_period | currency | Required | Positive number | Current period gross before deductions |
| ytd_gross_pay | currency | Required | Positive number | Year-to-date gross |
| federal_tax_withheld | currency | Optional | Positive number | Cross-check against W-2 |
| net_pay | currency | Optional | Positive number | Informational only |
| hourly_rate | currency | Optional | If hourly employee | Used to calculate annual income |
| hours_worked | number | Optional | Positive number | If hourly employee |

Annual income calculation:
Bi-weekly: ytd_gross / (pay_period_end_week_of_year / 2) × 26
Semi-monthly: ytd_gross / (pay_period_number) × 24
Monthly: gross_pay_this_period × 12
Hourly: hourly_rate × hours_per_week × 52 (use 2-year average if variable)
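
The formulas above can be sketched in Python as follows. This is a minimal sketch under assumptions: the function names are illustrative, and `periods_elapsed` is taken as an input (for bi-weekly pay, the week of the year of the pay period end divided by 2; for semi-monthly pay, the pay period number), so deriving it from dates is left to the caller.

```python
# Pay periods per year for the YTD-based formulas.
PERIODS_PER_YEAR = {"Bi-Weekly": 26, "Semi-Monthly": 24}


def annualize_from_ytd(ytd_gross: float, periods_elapsed: float, pay_frequency: str) -> float:
    """Annualize YTD gross: average pay per elapsed period times periods per year."""
    if periods_elapsed <= 0:
        raise ValueError("periods_elapsed must be positive")
    return ytd_gross / periods_elapsed * PERIODS_PER_YEAR[pay_frequency]


def annualize_monthly(gross_pay_this_period: float) -> float:
    """Monthly pay: current period gross times 12."""
    return gross_pay_this_period * 12


def annualize_hourly(hourly_rate: float, hours_per_week: float) -> float:
    """Hourly pay: rate times weekly hours times 52 weeks."""
    return hourly_rate * hours_per_week * 52


# Example: $10,000 YTD after 5 bi-weekly periods annualizes to $52,000.
print(annualize_from_ytd(10000, 5, "Bi-Weekly"))
```

For variable hours, the 2-year average mentioned above would be computed separately and passed in as `hours_per_week`.
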
W-2 fields:

| Field Name | Box | Data Type | Required | Validation |
|-----------|-----|-----------|----------|-----------|
| employee_name | Employee name box | string | Required | Match to application |
| employer_name | Employer name box | string | Required | Match to pay stubs |
| employer_ein | b | string | Required | XX-XXXXXXX format |
| wages_tips_other | 1 | currency | Required | Primary income figure |
| federal_income_tax | 2 | currency | Optional | Cross-check with 1040 |
| social_security_wages | 3 | currency | Optional | May differ from Box 1 |
| medicare_wages | 5 | currency | Optional | May differ from Box 1 |
| state | 15 | string | Optional | State of employment |
| state_wages | 16 | currency | Optional | State income |
| tax_year | Top of form | year | Required | Year should fall within the prior 2 tax years |

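
Two of the W-2 validation rules lend themselves to a short sketch: the EIN format check and the tax-year recency check. The function names are illustrative, and reading "prior 2 years" as "one of the two calendar years before the current year" is an assumption.

```python
import re

# EIN format from the table: two digits, a hyphen, seven digits (XX-XXXXXXX).
EIN_PATTERN = re.compile(r"\d{2}-\d{7}")


def is_valid_ein(value: str) -> bool:
    """True if the whole string matches the XX-XXXXXXX format."""
    return EIN_PATTERN.fullmatch(value) is not None


def is_recent_tax_year(tax_year: int, current_year: int, max_age: int = 2) -> bool:
    """True if tax_year is one of the max_age years before current_year (assumed reading)."""
    return 1 <= current_year - tax_year <= max_age
```

A failed check would typically flag the field for review rather than discard the extracted value.
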
Form 1040 fields:

| Field Name | Line | Data Type | Required | Note |
|-----------|------|-----------|----------|------|
| tax_year | Top of form | year | Required | |
| filing_status | Filing status checkbox | enum | Required | Single/MFJ/MFS/HOH/QW |
| total_income | 9 | currency | Required | Total income before adjustments |
| agi | 11 | currency | Required | Adjusted Gross Income |
| wages_salaries | 1a | currency | Required for W-2 employees | |
| business_income_loss | Schedule C | currency | Required for self-employed | From Schedule C |
| schedule_c_gross_revenue | Schedule C line 1 | currency | Self-employed | |
| schedule_c_net_profit | Schedule C line 31 | currency | Self-employed | After expenses |
| rental_income | Schedule E | currency | If applicable | |
| k1_income | Schedule E Part II | currency | Partnership/S-Corp | |
| depreciation_added_back | Schedule C + E | currency | Self-employed | Non-cash expense added back |
| depletion_added_back | Schedule C + E | currency | Self-employed | Non-cash expense added back |

Bank statement fields:

| Field Name | Data Type | Required | Validation |
|-----------|-----------|----------|-----------|
| account_holder_name | string | Required | Match to borrower |
| institution_name | string | Required | Non-empty |
| account_number_last4 | string | Optional | 4 digits (masked) |
| account_type | enum | Required | Checking/Savings/Money Market |
| statement_period_start | date | Required | Valid date |
| statement_period_end | date | Required | Valid date; should be within 60 days |
| beginning_balance | currency | Required | |
| ending_balance | currency | Required | Used for asset verification |
| total_deposits | currency | Required | Identifies large/unusual deposits |
| large_deposits | list | Required | Deposits > $[threshold]; itemized |
| nsf_count | integer | Optional | Count of NSF/returned items |

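
The `large_deposits` field can be derived from extracted transactions once a threshold is chosen. The sketch below assumes a hypothetical transaction shape (a dict with `date`, `description`, and `amount` keys) and leaves the threshold as a caller-supplied parameter, since the source does not fix a value.

```python
from typing import Dict, List


def itemize_large_deposits(deposits: List[Dict], threshold: float) -> List[Dict]:
    """Return deposits whose amount is strictly greater than the threshold."""
    return [d for d in deposits if d["amount"] > threshold]


# Hypothetical extracted deposits for illustration only.
sample = [
    {"date": "2024-01-03", "description": "Payroll", "amount": 2100.00},
    {"date": "2024-01-10", "description": "Transfer in", "amount": 9500.00},
]
large = itemize_large_deposits(sample, threshold=5000.00)
```
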
Policy declarations page fields:

| Field Name | Data Type | Required | Validation |
|-----------|-----------|----------|-----------|
| insured_name | string | Required | Match to client record |
| policy_number | string | Required | Format varies by carrier |
| carrier_name | string | Required | |
| lob | enum | Required | Auto/Home/Commercial/GL/etc. |
| effective_date | date | Required | Valid date |
| expiration_date | date | Required | After effective date |
| premium_annual | currency | Required | |
| liability_limit | currency | Required for auto/GL | |
| deductible | currency | Required | |
| property_address | string | Required for property | Match to risk address |
| vehicle_info | object | Required for auto | Year/Make/Model/VIN |


| Confidence Level | Threshold | Handling |
|-----------------|-----------|---------|
| High confidence | ≥ 0.90 | Auto-accept; proceed without human review |
| Medium confidence | 0.70 to < 0.90 | Flag for human verification; highlight field in review UI |
| Low confidence | < 0.70 | Route to human review queue; display extracted value as suggestion, not fact |
| Not found | N/A | Mark field as missing; trigger missing-field condition |

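
The confidence bands above can be sketched as a single function. The status strings `auto_accepted` and `flagged_for_review` follow the output example in this document; `human_review_queue` and `missing` are assumed names. The thresholds are parameters so that critical fields can be given stricter values.

```python
from typing import Optional


def field_status(confidence: Optional[float], high: float = 0.90, low: float = 0.70) -> str:
    """Map a field confidence score to a handling status.

    None means the field was not found on the document.
    """
    if confidence is None:
        return "missing"             # assumed status name
    if confidence >= high:
        return "auto_accepted"
    if confidence >= low:
        return "flagged_for_review"
    return "human_review_queue"      # assumed status name
```

For example, a critical field might call `field_status(score, high=0.95)` so that a score of 0.93 is flagged rather than auto-accepted.
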
Field-specific thresholds:
Critical fields (wrong value has significant downstream impact) should have higher thresholds:
Confidence aggregation:

| Document Type | Recommended Model | Rationale |
|-------------|-----------------|-----------|
| Machine-printed structured forms (W-2, 1099) | Azure Document Intelligence (Form Recognizer), prebuilt W-2/1099 model | Pre-built models for standard IRS forms; high accuracy |
| Semi-structured machine print (pay stubs, bank statements, dec pages) | Azure Document Intelligence custom trained model OR AWS Textract with custom adapter | Requires training on carrier/issuer-specific layouts |
| Handwritten fields (ACORD applications, older inspection reports) | Azure Document Intelligence read model for handwriting | Handles mixed print/handwriting; lower accuracy than machine print |
| Tables (bank statement transactions, loss run schedules) | AWS Textract Tables API OR Azure DI table extraction | Preserves row/column structure; critical for transaction lists |
| Complex multi-page documents (1040 with schedules) | Azure Document Intelligence custom model with schedule awareness | Multi-page layout with dynamic presence of schedules |
| Low-quality scans (high noise, skew, faded) | Pre-process, then OCR (Tesseract or Azure DI read) | Pre-processing pipeline required before model |

Model selection criteria:
Before sending to extraction model:

| Pre-Processing Step | When Required | Tool / Action |
|--------------------|--------------|------|
| DPI check | If source is a scanned document | Reject if < 150 DPI; warn if < 200 DPI; optimal 300+ DPI |
| De-skew (deskew) | If page is rotated or tilted | OpenCV deskew, or Azure DI handles internally |
| Contrast enhancement | If page is faded or low contrast | Adaptive histogram equalization |
| De-noise | If scanned with heavy grain | Gaussian blur or median filter |
| Color normalization | Color scans | Convert to grayscale or enhance; improves OCR accuracy |
| Page splitting | Multi-document packets | Detect and split at page boundaries between documents |
| Page rotation | If individual pages are upside-down | Auto-detect and rotate using text direction |
| Watermark removal | If watermarks obscure content | Detect and suppress watermark layer |

Quality gate: Any document that fails the minimum quality thresholds (DPI < 150, or confidence after pre-processing below the floor) is routed to the manual entry queue with a specific quality failure message.
All extraction results are output in a consistent JSON format for downstream system consumption.
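
The DPI part of the quality gate can be sketched as below, using the 150 and 200 DPI limits from the pre-processing table. The function name, the decision labels, and the message wording are illustrative assumptions; the confidence floor is not modeled because the source does not give its value.

```python
from typing import Tuple

REJECT_BELOW_DPI = 150  # below this, route to the manual entry queue
WARN_BELOW_DPI = 200    # below this, proceed but record a warning


def dpi_quality_gate(dpi: float) -> Tuple[str, str]:
    """Return (decision, message) for a scanned document's resolution."""
    if dpi < REJECT_BELOW_DPI:
        return ("reject", f"Scan resolution {dpi:g} DPI is below the {REJECT_BELOW_DPI} DPI minimum")
    if dpi < WARN_BELOW_DPI:
        return ("warn", f"Scan resolution {dpi:g} DPI is below the recommended {WARN_BELOW_DPI} DPI")
    return ("pass", "")
```

Returning the message alongside the decision lets the manual entry queue show the specific failure reason.
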
```json
{
  "extraction_id": "uuid",
  "document_id": "uuid",
  "document_type": "pay_stub",
  "extraction_timestamp": "2024-01-15T14:32:00Z",
  "model_name": "azure-di-custom-pay-stub-v2",
  "model_version": "2.1.0",
  "document_confidence": 0.91,
  "page_count": 2,
  "fields": {
    "employer_name": {
      "value": "Acme Corporation",
      "confidence": 0.97,
      "bounding_box": {"page": 1, "x": 120, "y": 45, "width": 200, "height": 20},
      "status": "auto_accepted"
    },
    "gross_pay_this_period": {
      "value": 4250.00,
      "value_type": "currency",
      "confidence": 0.89,
      "bounding_box": {"page": 1, "x": 450, "y": 310, "width": 80, "height": 18},
      "status": "flagged_for_review"
    }
  },
  "missing_fields": ["employee_ssn_last4"],
  "validation_results": {
    "pay_period_end_vs_today": "within_60_days",
    "ytd_gross_ge_period_gross": "pass",
    "cross_field_consistency": "pass"
  },
  "routing": "auto_process",
  "review_reasons": ["gross_pay_confidence_below_threshold"]
}
```
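
A downstream consumer of this output might start by collecting the fields that need attention. The sketch below is one possible reading of the structure shown above; the `review_summary` name is hypothetical, and treating every status other than `auto_accepted` as needing review is an assumption.

```python
import json
from typing import Dict, List


def review_summary(result: Dict) -> Dict[str, List[str]]:
    """List fields not auto-accepted and fields reported missing."""
    flagged = sorted(
        name
        for name, field in result.get("fields", {}).items()
        if field.get("status") != "auto_accepted"
    )
    return {"flagged": flagged, "missing": list(result.get("missing_fields", []))}


# Trimmed-down version of the extraction result, for illustration.
raw = """
{
  "fields": {
    "employer_name": {"value": "Acme Corporation", "confidence": 0.97, "status": "auto_accepted"},
    "gross_pay_this_period": {"value": 4250.00, "confidence": 0.89, "status": "flagged_for_review"}
  },
  "missing_fields": ["employee_ssn_last4"]
}
"""
summary = review_summary(json.loads(raw))
```
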
Deliver two artifacts:
1. Extraction Configuration Specification: for each document type in scope, a field definition table (name, type, required, validation, extraction note), confidence thresholds, model selection, and pre-processing requirements.
2. Output Schema Documentation: a JSON schema for extraction results with field definitions, status enum values, routing logic, and validation rule definitions.