.claude/skills/research/SKILL.md
Run autonomous genealogy research cycles: assess the GEDCOM tree, find the highest-value gaps, look up records via archive skills, apply corrections, and document findings. Works outward from the root individual, prioritizing closer generations first. Use this skill whenever the user wants autonomous research done on the tree — "do some research", "fill gaps", "explore edges", "work on the tree for a while", "/research", or any open-ended request to improve the family tree without a specific person or line in mind. Also use when the user sets up a /loop for recurring research. Unlike /harden (which focuses on ONE specific line), this skill picks the best targets across ALL lines and maximizes impact per cycle. Supports an optional budget argument to keep running cycles until a usage ceiling is hit. E.g. "/research 40% session" or "/research 5% weekly". Without a budget, runs a single cycle.
npx skillsauth add rdeknijf/ai-genealogy-kit research
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
An autonomous research cycle that assesses the tree state, identifies the highest-value work, performs archive lookups, applies corrections, and documents everything. Each cycle is self-contained and produces measurable progress.
This skill accepts an optional budget argument. If provided, it runs multiple cycles until the budget ceiling is reached. If omitted, it runs a single cycle.
Budget formats:
- /research: single cycle
- /research 40% session: keep going until 40% more of the 5h block is used
- /research 5% weekly: keep going until 5% more of the 7d block is used
- /research 40% session 5% weekly: stop on whichever ceiling hits first

Parse patterns: N% session, session N%, Ns, sN for session budget; N% weekly, weekly N%, Nw, wN for weekly budget.
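The parse patterns above could be handled with a small token-based parser. This is a minimal sketch, not the skill's own implementation: the function name `parse_budget` and the returned dictionary shape are assumptions made for illustration, and it expects the number and keyword to be separated by whitespace.

```python
import re

# Hypothetical helper names; only the accepted patterns come from the skill text.
PERCENT = re.compile(r"^(\d+(?:\.\d+)?)%$")
# Compact forms: "40s" / "s40" for session, "5w" / "w5" for weekly.
COMPACT = re.compile(r"^(?:(\d+(?:\.\d+)?)([sw])|([sw])(\d+(?:\.\d+)?))$")
LETTER_TO_KIND = {"s": "session", "w": "weekly"}
KEYWORDS = ("session", "weekly")


def parse_budget(arg: str) -> dict:
    """Return e.g. {"session": 40.0, "weekly": 5.0}; an empty dict means a single cycle."""
    tokens = arg.lower().split()
    budget = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        compact = COMPACT.match(tok)
        if compact:
            number = compact.group(1) or compact.group(4)
            letter = compact.group(2) or compact.group(3)
            budget[LETTER_TO_KIND[letter]] = float(number)
            i += 1
        elif PERCENT.match(tok) and nxt in KEYWORDS:
            # "N% session" / "N% weekly"
            budget[nxt] = float(PERCENT.match(tok).group(1))
            i += 2
        elif tok in KEYWORDS and PERCENT.match(nxt):
            # "session N%" / "weekly N%"
            budget[tok] = float(PERCENT.match(nxt).group(1))
            i += 2
        else:
            i += 1  # skip anything unrecognised
    return budget
```

Walking the tokens pairwise, rather than running one regex per keyword over the whole string, avoids misreading `session 40% weekly 5%` as a 40% weekly budget.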
Before starting, check project configuration for:
- .claude/skills/ for data source skills (e.g., wiewaswie, openarchieven, familysearch)

If a budget was specified:
systemd-inhibit --what=sleep:idle --who="Claude Code" --why="research loop" --mode=block sleep infinity &
INHIBIT_PID=$!
If systemd-inhibit isn't available, warn the user that the machine
might sleep mid-research.
cat ~/.cache/ccstatusline/usage.json
The file contains sessionUsage and weeklyUsage as percentages (0-100).
Calculate ceilings:
sessionCeiling = current sessionUsage + requested session budget
weeklyCeiling = current weeklyUsage + requested weekly budget
Cap at 100. Report the plan before starting:
Budget research starting:
Session: 15.0% now, ceiling 55.0% (budget: 40%)
Weekly: 38.0% now, ceiling 43.0% (budget: 5%)
Stopping on whichever is hit first.
If the cache file doesn't exist or can't be read, tell the user and stop.
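The ceiling arithmetic can be sketched in a few lines of Python. The field names `sessionUsage` and `weeklyUsage` and the cache path come from the text above; the function names and the shape of the `budget` dictionary are assumptions for illustration.

```python
import json
from pathlib import Path

USAGE_PATH = Path.home() / ".cache" / "ccstatusline" / "usage.json"
USAGE_KEYS = {"session": "sessionUsage", "weekly": "weeklyUsage"}


def load_usage(path: Path = USAGE_PATH) -> dict:
    """Read the usage cache; raises OSError or ValueError if it is missing or malformed."""
    return json.loads(path.read_text())


def compute_ceilings(usage: dict, budget: dict) -> dict:
    """Add each requested budget to the current usage, capped at 100."""
    return {
        kind: min(100.0, float(usage[USAGE_KEYS[kind]]) + amount)
        for kind, amount in budget.items()
    }
```

With the numbers from the example report, a session usage of 15.0 and a 40% budget give a ceiling of 55.0, and a weekly usage of 38.0 with a 5% budget gives 43.0.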
Each cycle has 4 phases. The whole cycle should take ~15 minutes.
Parse the GEDCOM and trace the root individual's ancestors to identify gaps. For each direct ancestor, check:
Privacy filter — skip recent people automatically:
Before selecting targets, filter out anyone whose records fall within archive privacy periods. Unless the user specifically asks to research a recent person, skip them silently:
These are approximate Dutch civil registry thresholds. Don't waste lookup cycles on records that won't be publicly available online.
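A privacy filter of this kind could look like the sketch below. The thresholds here (100 years for births, 75 for marriages, 50 for deaths) are assumed values reflecting the usual Dutch public-access periods; substitute whatever thresholds the project actually uses. The function and constant names are hypothetical.

```python
from datetime import date
from typing import Optional

# Assumed thresholds in years; replace with the project's own values.
PRIVACY_YEARS = {"birth": 100, "marriage": 75, "death": 50}


def is_restricted(event_type: str, event_year: int, today: Optional[date] = None) -> bool:
    """True when a record of this type and year is likely still closed to the public."""
    current_year = (today or date.today()).year
    return current_year - event_year < PRIVACY_YEARS[event_type]
```

Candidates for which `is_restricted` returns True would be dropped from the target list before any lookup is queued.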
Priority order for which generation to work on:
Within a generation, prioritize:
Use scripts/gedcom_query.py for all GEDCOM queries:
python scripts/gedcom_query.py ids # highest IDs + counts
python scripts/gedcom_query.py gaps I0100 I0200 ... # gap analysis
python scripts/gedcom_query.py person I0100 # full record
python scripts/gedcom_query.py search "Kemmann" # find by surname
python scripts/gedcom_query.py validate # integrity check
NEVER use re.DOTALL regex over the full GEDCOM content. Patterns like
re.search('0 @I\d+@ INDI(.*?)(?=\n0 @)', content, re.DOTALL) cause
catastrophic backtracking on large files (100% CPU for 45+ minutes).
Always use gedcom_query.py or line-by-line parsing instead.
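When gedcom_query.py does not cover a query, line-by-line parsing can be sketched as follows. This is an illustrative example rather than the script's actual code; `iter_records` and `find_individual` are hypothetical names. It reads one line at a time and never runs a regex across the whole file.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, body_lines) for each level-0 record, reading one line at a time."""
    header = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("0 "):
            if header is not None:
                yield header, body
            header, body = line, []
        elif header is not None:
            body.append(line)
    if header is not None:
        yield header, body


def find_individual(lines: Iterable[str], indi_id: str) -> Optional[List[str]]:
    """Return the body lines of one INDI record, or None if the ID is absent."""
    target = f"0 @{indi_id}@ INDI"
    for header, body in iter_records(lines):
        if header == target:
            return body
    return None
```

An open file handle can be passed directly as `lines`, so the whole GEDCOM is never held in memory as a single string.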
Launch a background agent for archive lookups using the appropriate data source skill. The agent should do lookups sequentially when using Playwright-based skills (single browser session constraint).
Batch efficiently: Group lookups by archive/region so the agent doesn't need to switch between different search interfaces.
Agent prompt template:
You are a genealogy research agent. Perform archive lookups.
Read the skill at `.claude/skills/<data-source>/SKILL.md` first.
Do these lookups SEQUENTIALLY:
## Lookup N: [Person name] [event type]
- What we know: [birth/death date, place, spouse, parents]
- Search: [archive] — surname "[X]", first name "[Y]",
document type "[birth/marriage/death]",
year_from [N], year_to [N], place "[Z]"
- Extract: [date, place, age, parents, spouse, akte, archive ref]
Return ALL findings in structured format with full archive references.
While the agent runs, do non-browser work in parallel:
For each record found by the agent:
Check source ID space — find the highest existing source ID:
grep -oP 'S\d+' <gedcom_path> | sort -t'S' -k2 -n -u | tail -5
Start new sources ABOVE the highest number. Source ID collisions cause data integrity issues and are hard to debug — previous sessions may have used IDs that look available but aren't.
Edit the GEDCOM event — add 2 SOUR @SXXXXXX@ with inline
3 DATA / 4 TEXT containing the archive reference.
Add source records before 0 TRLR:
0 @SXXXXXX@ SOUR
1 TITL [Document type] [Place] [Year]
1 AUTH [Archive name]
1 PUBL [Archive reference, register, akte number]
Fix dates/ages if the record contradicts the GEDCOM (common issues: registration date recorded instead of actual event date, approximate age ranges instead of exact age from records).
Add new persons if marriage records reveal parents not yet in the GEDCOM. Create INDI + FAM records and link via FAMC.
CRITICAL: Check INDI and FAM ID space first — just like source IDs, you must verify the IDs you assign don't already exist:
grep -oP 'I\d+' <gedcom_path> | sort -t'I' -k2 -n -u | tail -5
grep -oP 'F\d+' <gedcom_path> | sort -t'F' -k2 -n -u | tail -5
Start new records ABOVE the highest number. INDI/FAM ID collisions are worse than source collisions — they silently cause GEDCOM parsers to link the wrong person as a parent, spouse, or child. A previous session created records I600007-I600010 and F600004-F600005 that collided with existing records, corrupting family links for 4 people.
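The same ID-space check can be done in Python for all three prefixes at once. The sketch below is an assumption-laden illustration, with hypothetical function names: like the grep commands, it counts every `@I…@`, `@F…@`, and `@S…@` pointer in the file, including references, so an ID that is referenced but not yet defined is also treated as taken.

```python
import re
from typing import Dict, Iterable

# Matches pointers such as @I600010@, @F600005@, @S0042@ anywhere in a line.
XREF = re.compile(r"@([IFS])(\d+)@")


def highest_ids(lines: Iterable[str]) -> Dict[str, str]:
    """Map each prefix to the digit string of its highest ID seen in the file."""
    highest: Dict[str, str] = {}
    for line in lines:
        for prefix, digits in XREF.findall(line):
            if prefix not in highest or int(digits) > int(highest[prefix]):
                highest[prefix] = digits
    return highest


def next_id(highest: Dict[str, str], prefix: str) -> str:
    """Return the first ID above the highest one, keeping its zero padding."""
    digits = highest.get(prefix, "0")
    return f"{prefix}{str(int(digits) + 1).zfill(len(digits))}"
```

Calling `highest_ids` again after each batch of additions keeps the counters in step with what has just been written.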
Add findings to the project's findings file with:
After all edits, run a quick integrity check:
python scripts/gedcom_query.py validate
After each cycle completes, if running in budget mode:
Check budget — read ~/.cache/ccstatusline/usage.json and compare
against ceilings. If either tracked metric is at or above its ceiling,
stop (don't start a new cycle).
Report status — briefly log where usage stands:
Cycle N complete (session: 23.4% / 55.0%, weekly: 39.1% / 43.0%)
Loop — start the next cycle from Phase 1.
Don't cut a cycle short even if usage might exceed the ceiling mid-cycle — the cache updates lag by ~3 minutes anyway, and partially applied research is worse than finishing cleanly.
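The between-cycle check and status line can be sketched as below, assuming the usage file has already been parsed into a dictionary and the ceilings were computed at startup. The function names are hypothetical.

```python
from typing import Dict, List

USAGE_KEYS = {"session": "sessionUsage", "weekly": "weeklyUsage"}


def ceilings_reached(usage: Dict[str, float], ceilings: Dict[str, float]) -> List[str]:
    """Return the tracked metrics that are at or above their ceiling."""
    return [kind for kind, ceiling in ceilings.items() if usage[USAGE_KEYS[kind]] >= ceiling]


def status_line(cycle: int, usage: Dict[str, float], ceilings: Dict[str, float]) -> str:
    """Format the per-cycle status report."""
    parts = [
        f"{kind}: {usage[USAGE_KEYS[kind]]:.1f}% / {ceiling:.1f}%"
        for kind, ceiling in ceilings.items()
    ]
    return f"Cycle {cycle} complete ({', '.join(parts)})"
```

The check runs only after a cycle has finished, so a non-empty result from `ceilings_reached` means "do not start another cycle", never "abort the current one".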
When a ceiling is reached (or both):
kill $INHIBIT_PID 2>/dev/null
End each cycle (or the full budget run) with a brief summary:
.claude/skills/ (include name, URL, and why it would help)
Always check ALL existing IDs (INDI, FAM, and SOUR) before adding new records. Previous sessions may have used IDs in the range you're about to use. A collision means two different records share the same ID:
To check, use the validation script in the Validate section, which detects all three types of duplicate IDs.
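For a quick independent check, duplicate record definitions can be detected with a single pass over the file. This is a sketch of the idea only and makes no claim about how the validation script works internally; `duplicate_ids` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Iterable, List

# Level-0 record definitions, e.g. "0 @I600007@ INDI".
RECORD_DEF = re.compile(r"^0 @([IFS]\d+)@ (INDI|FAM|SOUR)\b")


def duplicate_ids(lines: Iterable[str]) -> List[str]:
    """Return every INDI, FAM, or SOUR ID that is defined more than once."""
    counts: Counter = Counter()
    for line in lines:
        match = RECORD_DEF.match(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(xref for xref, seen in counts.items() if seen > 1)
```

An empty list means no two records share an ID; anything else should be resolved before further edits.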
Many civil registries record events 1-3 days after they happen. The GEDCOM should store the actual event date (birth, death), not the registration date. Marriage records are typically registered on the day of the ceremony.
GEDCOM allows multiple instances of the same event tag (e.g., two
1 BIRT blocks). One may have the date/place while another has the
source citation. Check for this pattern before concluding a source is
missing.
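A check for this pattern might collect every block for a given event tag before looking for a citation. The sketch below assumes it is given the body lines of a single INDI record; the function names are hypothetical.

```python
from typing import Iterable, List


def event_blocks(record_lines: Iterable[str], tag: str) -> List[List[str]]:
    """Return one list of sub-lines per level-1 occurrence of tag (e.g. "BIRT")."""
    blocks: List[List[str]] = []
    current = None
    for line in record_lines:
        if line.startswith("1 "):
            parts = line.split()
            if len(parts) > 1 and parts[1] == tag:
                current = []
                blocks.append(current)
            else:
                current = None  # a different level-1 tag ends the block
        elif current is not None:
            current.append(line)
    return blocks


def event_has_source(record_lines: Iterable[str], tag: str) -> bool:
    """True if any block for this event carries a level-2 SOUR citation."""
    return any(
        line.startswith("2 SOUR")
        for block in event_blocks(record_lines, tag)
        for line in block
    )
```

Only when `event_has_source` is False across all blocks is the event a genuine sourcing gap.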
Recent death records may not be publicly indexed. The restriction period varies by archive and country (commonly 50-100 years for deaths, 75-115 years for births). Don't waste lookups on records that can't be found online.
Very common names (e.g., "Jan Jansen", "John Smith") return hundreds of results. Always narrow searches by year range, place, and verify identity by checking parent names in the record detail.
Official records often use the municipality name rather than the village name. Both are correct — include both when possible (e.g., "Silvolde, Wisch" rather than just "Wisch").
Marriage records noting child legitimization ("wettiging", etc.) mean the couple had children before marrying. Document this — it affects family chronology and may explain birth records under the mother's maiden name.
The ~/.cache/ccstatusline/usage.json file is updated by an external
process roughly every 180 seconds. If it can't be read between cycles,
log a warning but continue for one more cycle. If it fails twice in a
row, stop and tell the user. Other Claude sessions consume the same
budget — if usage jumps unexpectedly, that's fine, just stop at the
ceiling.
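The "tolerate one failure, stop on the second" rule can be sketched with a small stateful reader. The class name and the choice to signal a tolerated failure with `None` are assumptions for illustration.

```python
import json
from pathlib import Path

MAX_CONSECUTIVE_FAILURES = 2


class UsageReader:
    """Reads the usage cache and tracks consecutive read failures."""

    def __init__(self, path):
        self.path = Path(path)
        self.failures = 0

    def read(self):
        """Return the usage dict, or None after one tolerated failure.

        Raises RuntimeError on the second failure in a row.
        """
        try:
            usage = json.loads(self.path.read_text())
        except (OSError, ValueError) as exc:
            self.failures += 1
            if self.failures >= MAX_CONSECUTIVE_FAILURES:
                raise RuntimeError("usage cache unreadable twice in a row") from exc
            return None
        self.failures = 0  # a successful read resets the counter
        return usage
```

A `None` result corresponds to "log a warning and run one more cycle"; the `RuntimeError` corresponds to "stop and tell the user".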