Apply handwriting OCR to digitize historical and archival documents
A skill for applying handwriting text recognition (HTR) to digitize historical documents, archival manuscripts, and handwritten research notes. Covers HTR platforms, image preprocessing, model training, post-correction, and integration into digital humanities research workflows.
Printed Text OCR:
- Characters are standardized and uniform
- Well-solved problem (>99% accuracy on clean scans)
- Tools: Tesseract, ABBYY FineReader, Adobe Acrobat
Handwriting Text Recognition (HTR):
- Characters vary by writer, mood, pen, era
- Much harder -- typically 85-95% character accuracy
- Requires training on specific handwriting styles
- Tools: Transkribus, Kraken, HTR-Flor, Google Cloud Vision
Challenges specific to historical documents:
- Faded ink, bleed-through, stains, tears
- Archaic letterforms and abbreviations
- Multiple hands in one document
- Non-standard orthography
- Mixed languages and scripts
Transkribus is a widely used platform for historical HTR.
Pricing note: Transkribus uses a credit-based pricing model. A limited free tier is available, but processing large volumes of pages requires purchasing credits.
Workflow:
1. Upload document images
2. Automatic layout analysis (detect text regions and baselines)
3. Manual correction of layout (if needed)
4. Apply a pre-trained HTR model (or train your own)
5. Review and correct transcription
6. Export as plain text, PAGE XML, TEI, DOCX, or PDF
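Once a PAGE XML export exists, the transcribed lines can be read back with the Python standard library. The sketch below is illustrative only: it assumes line-level text sits in `TextLine/TextEquiv/Unicode`, as in the PAGE schema, and it matches on local tag names so it does not depend on one schema version. `extract_lines` is a hypothetical helper, not part of Transkribus.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(page_xml: str) -> list:
    """Collect line-level transcriptions from a PAGE XML string, in document order."""
    root = ET.fromstring(page_xml)
    lines = []
    for element in root.iter():
        if _local(element.tag) != "TextLine":
            continue
        # Only direct TextEquiv children hold the line text;
        # word-level TextEquiv elements are nested under Word and are skipped.
        for child in element:
            if _local(child.tag) == "TextEquiv":
                for unicode_el in child:
                    if _local(unicode_el.tag) == "Unicode":
                        lines.append(unicode_el.text or "")
                break
    return lines
```

Joining the result with newlines gives a plain-text transcription that can be passed on to post-correction.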
Pre-trained models:
- Noscemus GM (general model for Latin scripts)
- English Writing M1 (18th-19th century English)
- German Kurrent models
- Dutch, French, Italian, Spanish models available
Training a custom model:
- Requires ~15,000-25,000 words of ground truth (manually transcribed)
- Can start with a pre-trained base model and fine-tune
- Training takes 1-8 hours depending on dataset size
| Tool | Type | Strengths |
|------|------|-----------|
| Transkribus | Cloud platform | Best for historical documents, active community |
| Kraken | Open source (Python) | Flexible, scriptable, custom training |
| eScriptorium | Open source (web) | Based on Kraken, collaborative interface |
| Google Cloud Vision | API | Good for modern handwriting, many languages |
| Azure AI Vision | API | Competitive with Google for modern text |
| HTR-Flor | Open source | Research-focused, TensorFlow-based |
```python
from PIL import Image, ImageFilter, ImageEnhance


def preprocess_document_image(image_path: str,
                              output_path: str) -> dict:
    """
    Preprocess a document scan for optimal HTR performance.

    Args:
        image_path: Path to the input scan
        output_path: Path to save the preprocessed image
    """
    img = Image.open(image_path)

    # Convert to grayscale
    img = img.convert("L")

    # Enhance contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(1.5)

    # Remove noise
    img = img.filter(ImageFilter.MedianFilter(size=3))

    # Binarize (convert to black and white)
    threshold = 128
    img = img.point(lambda x: 255 if x > threshold else 0, "1")

    img.save(output_path)

    return {
        "original": image_path,
        "processed": output_path,
        "steps_applied": [
            "Grayscale conversion",
            "Contrast enhancement (1.5x)",
            "Median filter (noise removal)",
            "Binarization (threshold=128)"
        ],
        "additional_steps_if_needed": [
            "Deskewing (correct rotation)",
            "Dewarping (correct page curvature)",
            "Bleed-through removal",
            "Background normalization"
        ]
    }
```
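A fixed threshold of 128 can fail on faded ink or darkened paper. One common alternative, which is not part of the function above, is Otsu's method: it picks the threshold that maximizes the between-class variance of the grayscale histogram. The sketch below is a minimal pure-Python version. `otsu_threshold` is a hypothetical helper that expects a 256-bin histogram, such as the list returned by Pillow's `Image.histogram()` on an "L"-mode image.

```python
def otsu_threshold(histogram: list) -> int:
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    total = sum(histogram)
    weighted_sum_all = sum(level * count for level, count in enumerate(histogram))

    weight_bg = 0          # number of pixels at or below the candidate threshold
    weighted_sum_bg = 0.0  # sum of gray levels for those pixels
    best_threshold = 0
    best_variance = -1.0

    for level, count in enumerate(histogram):
        weight_bg += count
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class

        weighted_sum_bg += level * count
        mean_bg = weighted_sum_bg / weight_bg
        mean_fg = (weighted_sum_all - weighted_sum_bg) / weight_fg

        # Between-class variance, up to a constant factor of 1 / total**2
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level

    return best_threshold
```

The returned value could replace the hard-coded `threshold = 128`, for example `threshold = otsu_threshold(img.histogram())` after the grayscale conversion. Pages with uneven lighting may still need local (adaptive) thresholding.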
- Resolution: 300-400 DPI for most documents; 600 DPI for fine handwriting or damaged originals
- Color: Grayscale is usually sufficient; use color for illuminated manuscripts
- Format: TIFF (lossless) for archival; PNG for working copies
- Lighting: Even, diffused light; avoid shadows and glare
- Flatness: Use a book cradle or V-shaped scanner for bound volumes
- Calibration: Include a color/grayscale chart for batch consistency
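When scans arrive without reliable DPI metadata, the effective resolution can be estimated from the pixel width and the measured width of the original. This is a small arithmetic sketch. The function names are hypothetical, and the 300 DPI default simply mirrors the lower bound given in the resolution guideline.

```python
def effective_dpi(pixel_width: int, physical_width_in: float) -> float:
    """Effective resolution: pixels across the page divided by its width in inches."""
    if physical_width_in <= 0:
        raise ValueError("physical_width_in must be positive")
    return pixel_width / physical_width_in


def meets_resolution_guideline(pixel_width: int,
                               physical_width_in: float,
                               min_dpi: float = 300.0) -> bool:
    """True if the scan reaches at least min_dpi across its width."""
    return effective_dpi(pixel_width, physical_width_in) >= min_dpi
```

For example, a page 8 inches wide scanned at 2400 pixels across works out to 300 DPI, while the same page at 1200 pixels is only 150 DPI and should be rescanned.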
```python
def post_correction_workflow(raw_transcription: str,
                             dictionary: set,
                             confidence_threshold: float = 0.8) -> dict:
    """
    Post-correction strategy for HTR output.

    Args:
        raw_transcription: Raw OCR/HTR text output
        dictionary: Set of valid lowercase words for the document's language/period
        confidence_threshold: Below this, flag for manual review. Currently
            unused: plain-text input carries no per-word confidence scores.
    """
    words = raw_transcription.split()
    flagged = []
    corrected = []

    for word in words:
        # Strip surrounding punctuation before the dictionary lookup
        clean = word.strip(".,;:!?()[]")
        # Punctuation-only tokens have nothing to look up, so accept them
        if not clean or clean.lower() in dictionary:
            corrected.append(word)
        else:
            flagged.append({
                "word": word,
                "position": len(corrected),
                "suggestion": "Manual review needed"
            })
            # Keep the original word in place; correction is left to the reviewer
            corrected.append(word)

    return {
        "total_words": len(words),
        "flagged_words": len(flagged),
        # Share of words found in the dictionary, a rough proxy for accuracy
        "estimated_accuracy": 1 - len(flagged) / max(len(words), 1),
        "flagged": flagged[:20],  # first 20 flagged words only
        "correction_strategies": [
            "Dictionary-based spell checking (period-appropriate dictionary)",
            "N-gram language model for context-aware correction",
            "Crowdsourcing (Zooniverse, FromThePage)",
            "Double-keying (two independent transcribers, compare)",
            "AI-assisted correction with human verification"
        ]
    }
```
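The function above only flags unknown words. As one possible take on the dictionary-based strategy, the standard library's `difflib.get_close_matches` can propose candidates for each flagged word. This is a sketch under assumptions: `suggest_corrections` is a hypothetical helper, the cutoff of 0.8 is an arbitrary starting point, and suggestions still need human verification because historical spellings are often valid variants rather than errors.

```python
import difflib


def suggest_corrections(word: str,
                        dictionary: set,
                        max_suggestions: int = 3,
                        cutoff: float = 0.8) -> list:
    """Return up to max_suggestions dictionary words similar to the given word."""
    clean = word.strip(".,;:!?()[]").lower()
    if not clean:
        return []
    # sorted() gives get_close_matches a stable candidate order
    return difflib.get_close_matches(
        clean, sorted(dictionary), n=max_suggestions, cutoff=cutoff
    )
```

Each entry in the `flagged` list could be passed through this helper to replace the fixed "Manual review needed" string with concrete candidates.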
1. Transcribe documents using HTR
2. Correct and validate transcriptions
3. Encode in TEI-XML for digital editions
4. Apply NLP for named entity recognition, topic modeling
5. Link entities to knowledge bases (Wikidata, VIAF)
6. Publish as a searchable digital archive
Tools for TEI encoding:
- oXygen XML Editor (standard for digital humanities)
- TEI Publisher (web-based publishing platform)
- FromThePage (collaborative transcription with TEI export)
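For step 3, a corrected transcription can be wrapped in a minimal TEI document before it goes into one of these tools. The sketch below uses only the standard library and makes several assumptions: `lines_to_tei` is a hypothetical helper, the header text is placeholder wording, and each transcribed line is marked with a `<lb/>` element inside a single paragraph. A real edition would need a much richer header and structural markup.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def lines_to_tei(title: str, lines: list) -> str:
    """Wrap transcribed lines in a minimal TEI document and return it as a string."""
    # Serialize TEI as the default namespace instead of an 'ns0:' prefix
    ET.register_namespace("", TEI_NS)

    def q(tag: str) -> str:
        return f"{{{TEI_NS}}}{tag}"

    root = ET.Element(q("TEI"))

    # Minimal header: titleStmt, publicationStmt, and sourceDesc inside fileDesc
    header = ET.SubElement(root, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Unpublished working transcription"
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = "Transcribed from a handwritten original"

    # Body: one paragraph, with a line-beginning marker before each line
    text = ET.SubElement(root, q("text"))
    body = ET.SubElement(text, q("body"))
    paragraph = ET.SubElement(body, q("p"))
    for line in lines:
        line_break = ET.SubElement(paragraph, q("lb"))
        line_break.tail = line  # ElementTree escapes &, <, > on output

    return ET.tostring(root, encoding="unicode")
```

The resulting string can be saved as an `.xml` file and opened in an XML editor for further encoding.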
Report Character Error Rate (CER) and Word Error Rate (WER) on a held-out test set. CER below 5% is generally considered production-quality for historical documents. Always compare against a manually created ground truth. Report accuracy separately for different document types, hands, or time periods if your corpus is heterogeneous.
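Both metrics reduce to an edit distance divided by the length of the reference. The sketch below shows one minimal way to compute them in pure Python, assuming plain Levenshtein distance (insertions, deletions, and substitutions all cost 1) and whitespace tokenization for WER. The function names are illustrative, and any text normalization (case, punctuation, abbreviations) should be decided and reported separately.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by the number of reference words."""
    reference_words = reference.split()
    return edit_distance(reference_words, hypothesis.split()) / max(len(reference_words), 1)
```

To report results per hand or period, compute these on each subset of the test set rather than averaging per-page scores, since short pages would otherwise be over-weighted.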