xu-xiang/regex-vs-llm-structured-text

Name: regex-vs-llm-structured-text
Author: xu-xiang

docs/zh-CN/skills/regex-vs-llm-structured-text/SKILL.md

npx skillsauth add xu-xiang/everything-claude-code-zh regex-vs-llm-structured-text

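Regex vs LLM for Structured Text Parsing

A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The core insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.

When to Use

Parsing structured text with repeating patterns (questions, forms, tables)
Deciding between regex and an LLM for text extraction
Building hybrid pipelines that combine both approaches
Optimizing cost/accuracy trade-offs in text processing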

Decision Framework

Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│   ├── Regex handles 95%+ → Done, no LLM needed
│   └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
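
The tree above can be read as a small function. This is a minimal sketch, not part of the original skill: the function name, the returned labels, and the idea of passing both rates as fractions are assumptions made for illustration. Only the 90% and 95% cut-offs come from the tree.

```python
from typing import Optional

# Cut-offs taken from the decision tree; everything else is illustrative.
PATTERN_THRESHOLD = 0.90  # share of the text that follows a repeating pattern
REGEX_THRESHOLD = 0.95    # share of items the regex extracts correctly


def choose_strategy(pattern_share: float, regex_success: Optional[float] = None) -> str:
    """Return a hypothetical strategy label for the given measurements."""
    if pattern_share <= PATTERN_THRESHOLD:
        # Free-form, highly variable text: go straight to the LLM.
        return "llm_only"
    if regex_success is None:
        # Format is consistent but the regex has not been measured yet.
        return "start_with_regex"
    if regex_success >= REGEX_THRESHOLD:
        # Regex handles 95%+ of items: no LLM needed.
        return "regex_only"
    # Regex handles less than 95%: add an LLM for edge cases only.
    return "regex_plus_llm"
```

For example, `choose_strategy(0.97, 0.98)` yields `"regex_only"`, while `choose_strategy(0.97, 0.80)` yields `"regex_plus_llm"`.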

Architecture Pattern

Source Text
    │
    ▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
    │
    ▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
    │
    ▼
[Confidence Scorer] ─── Flags low-confidence extractions
    │
    ├── High confidence (≥0.95) → Direct output
    │
    └── Low confidence (<0.95) → [LLM Validator] → Output
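
The skill does not show code for the Text Cleaner stage, so here is one possible shape for it. This is a sketch under stated assumptions: the noise patterns (a `[Page N]` marker and runs of whitespace) and the helper names `clean_text` and `clean_item` are hypothetical. `ParsedItem` is repeated here only to keep the block self-contained. The cleaner returns a new instance instead of modifying the one it receives.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


# Hypothetical noise patterns; adapt them to the artifacts in your documents.
_PAGE_MARKER = re.compile(r"\s*\[?Page \d+\]?\s*", re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def clean_text(value: str) -> str:
    """Drop page markers, then collapse whitespace runs to single spaces."""
    without_markers = _PAGE_MARKER.sub(" ", value)
    return _WHITESPACE.sub(" ", without_markers).strip()


def clean_item(item: ParsedItem) -> ParsedItem:
    """Return a cleaned copy; the frozen input item is left untouched."""
    return replace(
        item,
        text=clean_text(item.text),
        choices=tuple(clean_text(choice) for choice in item.choices),
    )
```

Because `ParsedItem` is frozen, `dataclasses.replace` is the natural way to produce the cleaned copy.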

Implementation

1. Regex Parser (Handles the Majority)

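import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedItem:
    """One extracted question; frozen so later steps return new instances."""
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0

def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
    pattern = re.compile(
        # "^" anchors the item number at a line start (this is what
        # re.MULTILINE is for); the lazy ".+?" captures the question text.
        r"^(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        # One or more choice lines such as "A. ...".
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        # re.DOTALL lets "." cross line breaks, so question text may wrap.
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        # Split the captured choice block into one string per choice.
        choices = tuple(
            c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items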
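
To make the expected input format concrete, here is a small sample run. It is only a sketch: the sample questions are invented, and the pattern is a line-anchored copy of the three-part pattern from the parser (item line, choice lines, `Answer:` line). It returns plain tuples so the block stays short.

```python
import re

# Invented sample in the "N. question / A.-D. choices / Answer: X" layout.
SAMPLE = (
    "1. What is the capital of France?\n"
    "A. Berlin\n"
    "B. Paris\n"
    "C. Rome\n"
    "D. Madrid\n"
    "Answer: B\n"
    "\n"
    "2. Which planet is known as the Red Planet?\n"
    "A. Venus\n"
    "B. Mars\n"
    "C. Jupiter\n"
    "D. Saturn\n"
    "Answer: B\n"
)

PATTERN = re.compile(
    r"^(?P<id>\d+)\.\s*(?P<text>.+?)\n"
    r"(?P<choices>(?:[A-D]\..+?\n)+)"
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE | re.DOTALL,
)


def extract(content: str) -> list[tuple]:
    """Return (id, text, choices, answer) for every item the pattern finds."""
    rows = []
    for match in PATTERN.finditer(content):
        choices = re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        rows.append((
            match.group("id"),
            match.group("text").strip(),
            choices,
            match.group("answer"),
        ))
    return rows


for row in extract(SAMPLE):
    print(row)
# ('1', 'What is the capital of France?', ['Berlin', 'Paris', 'Rome', 'Madrid'], 'B')
# ('2', 'Which planet is known as the Red Planet?', ['Venus', 'Mars', 'Jupiter', 'Saturn'], 'B')
```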

2. Confidence Scoring

Flag items that may need LLM review:

@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]

def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0

    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3

    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5

    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2

    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )

def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]

3. LLM Validator (Edge Cases Only)

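import json
from dataclasses import replace

def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use LLM to fix low-confidence extractions."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                "Return corrected JSON with keys 'text', 'choices', and "
                "'answer' if needed, or 'CORRECT' if accurate."
            ),
        }],
    )
    # Assumes an Anthropic-style response whose first content block is text.
    reply = response.content[0].text.strip()
    if reply == "CORRECT":
        return item
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return item  # Unusable reply: keep the regex extraction as-is.
    # The JSON keys are an assumption matching the prompt above.
    return replace(
        item,
        text=data.get("text", item.text),
        choices=tuple(data.get("choices", item.choices)),
        answer=data.get("answer", item.answer),
    )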

4. Hybrid Pipeline

def process_document(
    content: str,
    *,
    llm_client=None,
    confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
    """Full pipeline: regex -> confidence check -> LLM for edge cases."""
    # Step 1: Regex extraction (handles 95-98%)
    items = parse_structured_text(content)

    # Step 2: Confidence scoring
    low_confidence = identify_low_confidence(items, confidence_threshold)

    if not low_confidence or llm_client is None:
        return items

    # Step 3: LLM validation (only for flagged items)
    low_conf_ids = {f.item_id for f in low_confidence}
    result = []
    for item in items:
        if item.id in low_conf_ids:
            result.append(validate_with_llm(item, content, llm_client))
        else:
            result.append(item)

    return result

Real-World Metrics

From a production quiz-parsing pipeline (410 items):

| Metric | Value |
|--------|-------|
| Regex success rate | 98.0% |
| Low-confidence items | 8 (2.0%) |
| LLM calls needed | ~5 |
| Cost savings vs. all-LLM | ~95% |
| Test coverage | 93% |

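Best Practices

Start with regex: even an imperfect regex gives you a baseline to improve on
Use confidence scoring to programmatically identify what needs LLM help
Use the cheapest LLM for validation (Haiku-class models are sufficient)
Never mutate parsed items: return new instances from cleaning/validation steps
TDD works well for parsers: write tests for known patterns first, then edge cases
Log metrics (regex success rate, LLM call count) to track pipeline health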
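
The last practice, logging metrics, could look roughly like this. It is a sketch, not the author's code: the `PipelineMetrics` class, the logger name, and the log line format are all hypothetical, and "regex success" is assumed to mean items that were not flagged as low confidence.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("parser_pipeline")  # hypothetical logger name


@dataclass(frozen=True)
class PipelineMetrics:
    total_items: int
    low_confidence_items: int
    llm_calls: int

    @property
    def regex_success_rate(self) -> float:
        """Share of items that did not need review (assumed definition)."""
        if self.total_items == 0:
            return 0.0
        return (self.total_items - self.low_confidence_items) / self.total_items

    @property
    def llm_call_rate(self) -> float:
        """LLM calls per parsed item."""
        if self.total_items == 0:
            return 0.0
        return self.llm_calls / self.total_items


def log_metrics(metrics: PipelineMetrics) -> str:
    """Emit one summary line per run and return it for inspection."""
    line = (
        f"items={metrics.total_items} "
        f"regex_success={metrics.regex_success_rate:.1%} "
        f"llm_calls={metrics.llm_calls} "
        f"llm_call_rate={metrics.llm_call_rate:.1%}"
    )
    logger.info(line)
    return line
```

With counts like those in the metrics table (410 items, 8 low-confidence, and taking "~5" LLM calls as exactly 5), `log_metrics(PipelineMetrics(410, 8, 5))` produces `items=410 regex_success=98.0% llm_calls=5 llm_call_rate=1.2%`.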

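Anti-Patterns to Avoid

Sending all text to an LLM when regex handles 95%+ of cases (expensive and slow)
Using regex for free-form, highly variable text (an LLM is the better fit here)
Skipping confidence scoring and hoping regex "just works"
Mutating parsed objects during cleaning/validation steps
Not testing edge cases (malformed input, missing fields, encoding issues)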
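
As an illustration of the last point, edge-case tests for a parser might start out like the sketch below. The parser is a simplified, line-bounded stand-in written for this example (an assumption, not the skill's own pattern), and the test names and inputs are invented. The functions are plain pytest-style tests.

```python
import re

# Simplified stand-in parser: every field must stay on a single line.
ITEM = re.compile(
    r"^(?P<id>\d+)\.\s*(?P<text>[^\n]+)\n"
    r"(?P<choices>(?:[A-D]\.[^\n]+\n)+)"
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE,
)


def parse_items(content: str) -> list[dict]:
    """Return one dict of named groups per item found."""
    return [match.groupdict() for match in ITEM.finditer(content)]


def test_well_formed_item():
    items = parse_items("1. Pick one?\nA. yes\nB. no\nAnswer: A\n")
    assert [item["id"] for item in items] == ["1"]
    assert items[0]["answer"] == "A"


def test_empty_input():
    assert parse_items("") == []


def test_missing_answer_line_is_not_extracted():
    # Malformed input: the item has choices but no "Answer:" line.
    assert parse_items("1. Pick one?\nA. yes\nB. no\n") == []


def test_non_ascii_text_survives():
    items = parse_items("1. Café ou thé ?\nA. Café\nB. Thé\nAnswer: B\n")
    assert items[0]["text"] == "Café ou thé ?"
```

Cases like the missing answer line are worth pinning down early, because a parser that silently drops an item produces nothing for confidence scoring to flag.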

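Use Cases

Quiz/exam question parsing
Form data extraction
Invoice/receipt processing
Document structure parsing (headings, sections, tables)
Any structured text with repeating patterns where cost matters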

Decision framework for choosing between regex and an LLM when parsing structured text. Start with regex; add an LLM only for low-confidence edge cases.

410 stars

data-ai

Updated Apr 17, 2026

npx skillsauth add xu-xiang/everything-claude-code-zh regex-vs-llm-structured-text

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage shown |
|---------|-------------|--------|------------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 8, 2026, 2:36 AM (4.8s, 1 file scanned)

SKILL.md

name: regex-vs-llm-structured-text
description: Decision framework for choosing between regex and an LLM when parsing structured text. Start with regex; add an LLM only for low-confidence edge cases.
origin: ECC

Related Skills

xu-xiang/visa-doc-translate

documentation

Verified, Trusted, Community

Translates visa application documents (images) into English and creates a bilingual PDF containing the original text and the translation.

411 | SKILL.md | Updated Apr 18, 2026

xu-xiang/verification-loop

development

Verified, Trusted, Community

A comprehensive verification system for Claude Code sessions.

410 | SKILL.md | Updated Apr 17, 2026

xu-xiang/tdd-workflow

tools

Verified, Trusted, Community

Use this skill when writing new features, fixing bugs, or refactoring code. Enforces test-driven development (TDD), including unit, integration, and E2E tests, with coverage of at least 80%.

410 | SKILL.md | Updated Apr 17, 2026

xu-xiang/swiftui-patterns

tools

Verified, Trusted, Community

SwiftUI architecture patterns: state management with @Observable, view composition, navigation, performance optimization, and modern iOS/macOS UI best practices.

410 | SKILL.md | Updated Apr 17, 2026

Install manually

# Clone the repo
git clone https://github.com/xu-xiang/everything-claude-code-zh.git

# Copy into Claude Code skills folder (global)
cp -r everything-claude-code-zh/docs/zh-CN/skills/regex-vs-llm-structured-text ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

xu-xiang/everything-claude-code-zh

410 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT