.claude/skills/batch-scrape/SKILL.md
Run scrape workflows for one or more ecosystems in dev mode using local plugin code
Run documentation scraping workflows for one or more ecosystems using local plugin code (dev mode).
Parse $ARGUMENTS to determine mode and ecosystems:
- Ecosystems: claude, cursor, duende, google, openai
- Multiple ecosystems: space-separated (e.g. claude google)
- All ecosystems: all or no arguments
- Mode: sequential (default) or headless

Examples:
- /batch-scrape claude -- scrape claude ecosystem only
- /batch-scrape cursor duende -- scrape cursor + duende sequentially
- /batch-scrape all -- sequential, all ecosystems
- /batch-scrape headless -- headless, all ecosystems
- /batch-scrape headless claude google -- headless, just claude + google

| Ecosystem | Dev Env Var | Plugin Path | Docs Skill |
| --- | --- | --- | --- |
| claude | OFFICIAL_DOCS_DEV_ROOT | plugins/claude-ecosystem/skills/docs-management/ | claude-ecosystem:docs-ops scrape |
| cursor | CURSOR_DOCS_DEV_ROOT | plugins/cursor-ecosystem/skills/cursor-docs/ | cursor-ecosystem:docs-ops scrape |
| duende | DUENDE_DOCS_DEV_ROOT | plugins/duende-ecosystem/skills/duende-docs/ | duende-ecosystem:docs-ops scrape |
| google | GEMINI_DOCS_DEV_ROOT | plugins/google-ecosystem/skills/gemini-cli-docs/ | google-ecosystem:docs-ops scrape |
| openai | CODEX_DOCS_DEV_ROOT | plugins/openai-ecosystem/skills/codex-cli-docs/ | openai-ecosystem:docs-ops scrape |
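The argument rules above can be sketched in a few lines of Python. This is only an illustration, not code from the plugin: the function name `parse_arguments` is hypothetical, and rejecting unknown tokens (rather than ignoring them) is an assumption.

```python
# Hypothetical sketch of the $ARGUMENTS rules; not part of the plugin.
ECOSYSTEMS = ("claude", "cursor", "duende", "google", "openai")
MODES = ("sequential", "headless")


def parse_arguments(arguments: str) -> tuple[str, list[str]]:
    """Return (mode, ecosystems) for a /batch-scrape argument string."""
    mode = "sequential"  # default mode when none is given
    selected: list[str] = []
    for token in arguments.split():
        if token in MODES:
            mode = token
        elif token == "all":
            selected = list(ECOSYSTEMS)
        elif token in ECOSYSTEMS:
            if token not in selected:
                selected.append(token)
        else:
            # Assumption: unknown tokens are an error rather than ignored.
            raise ValueError(f"unknown argument: {token}")
    # Naming no ecosystem means "all ecosystems".
    return mode, selected or list(ECOSYSTEMS)
```

For example, `parse_arguments("headless claude google")` yields headless mode with just the claude and google ecosystems, matching the last example above.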
CRITICAL: Environment variables set mid-session do NOT persist across Claude's Bash tool calls. Each Bash command runs in a fresh shell. You must set the env var in the SAME command that runs the script.
IMPORTANT: Claude's Bash tool uses Git Bash (MINGW64) on Windows, not PowerShell. Use Bash inline prefix syntax for all commands executed by Claude Code.
Bash syntax (use this in Claude Code):
<ENV_VAR>="<repo-root>/<plugin-path>" python <script-path>
PowerShell syntax (for native PowerShell terminal only):
$env:<ENV_VAR> = "<repo-root>/<plugin-path>"; python <script-path>
This overrides the installed plugin path, redirecting all operations to the local development copy.
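Because the variable has to travel with the command, it can help to generate the full Bash line in one place. The sketch below is a hypothetical helper (the name `bash_dev_command` is not from the plugin) that builds the inline-prefix form from the ecosystem table. It uses `shlex.quote`, which only adds quotes when the value needs them, so the output may be unquoted where the examples in this document use double quotes.

```python
import shlex

# Env var and plugin path per ecosystem, copied from the table above.
DEV_ROOTS = {
    "claude": ("OFFICIAL_DOCS_DEV_ROOT", "plugins/claude-ecosystem/skills/docs-management"),
    "cursor": ("CURSOR_DOCS_DEV_ROOT", "plugins/cursor-ecosystem/skills/cursor-docs"),
    "duende": ("DUENDE_DOCS_DEV_ROOT", "plugins/duende-ecosystem/skills/duende-docs"),
    "google": ("GEMINI_DOCS_DEV_ROOT", "plugins/google-ecosystem/skills/gemini-cli-docs"),
    "openai": ("CODEX_DOCS_DEV_ROOT", "plugins/openai-ecosystem/skills/codex-cli-docs"),
}


def bash_dev_command(ecosystem: str, repo_root: str, script_relpath: str, *script_args: str) -> str:
    """Build one Bash line that sets the dev env var and runs the script together."""
    env_var, plugin_path = DEV_ROOTS[ecosystem]
    plugin_root = f"{repo_root}/{plugin_path}"
    parts = [
        f"{env_var}={shlex.quote(plugin_root)}",  # inline prefix: applies to this command only
        "python",
        shlex.quote(f"{plugin_root}/{script_relpath}"),
    ]
    parts.extend(shlex.quote(arg) for arg in script_args)
    return " ".join(parts)
```

The returned string is meant to be passed to a single Bash tool call, so the variable and the script share one shell.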
Run each ecosystem's scrape workflow in order within a single session. Use /compact between each to manage context window size.
For each ecosystem in the selected list:
- Run /compact to reclaim context before the next ecosystem
- Run git log --oneline -10 after all runs to verify commits

Run each ecosystem in a separate headless Claude session. These can run in parallel across terminal windows.
For each ecosystem in the selected list, output the corresponding claude -p command:
claude -p "Run /batch-scrape <ecosystem> following all steps including audit/fix/commit." \
--allowedTools "Read,Edit,Write,Bash,Skill,Glob,Grep"
Then instruct the user to run them in separate terminal windows.
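As a rough sketch, the per-ecosystem commands could be generated like this. The helper name `headless_commands` is hypothetical; the prompt text and the `--allowedTools` value are copied from the command above, written on one line instead of using a backslash continuation.

```python
# Tool list copied from the claude -p command shown above.
ALLOWED_TOOLS = "Read,Edit,Write,Bash,Skill,Glob,Grep"


def headless_commands(ecosystems: list[str]) -> list[str]:
    """Return one claude -p command line per selected ecosystem."""
    commands = []
    for ecosystem in ecosystems:
        prompt = (
            f"Run /batch-scrape {ecosystem} following all steps "
            "including audit/fix/commit."
        )
        commands.append(f'claude -p "{prompt}" --allowedTools "{ALLOWED_TOOLS}"')
    return commands
```

Each returned line is intended for its own terminal window.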
- Each claude -p invocation gets its own context window -- no cross-contamination
- Verify commits afterwards with git log --oneline -20
- Use --resume to continue an interrupted session

Scrape Claude Code / Anthropic documentation using local plugin code at plugins/claude-ecosystem/skills/docs-management/.
OFFICIAL_DOCS_DEV_ROOT="<repo-root>/plugins/claude-ecosystem/skills/docs-management" python <repo-root>/plugins/claude-ecosystem/skills/docs-management/scripts/core/scrape_all_sources.py --parallel --skip-existing
STOP AND VERIFY: Check the first lines of output for [DEV MODE]. If you see [PROD MODE], the environment variable was not set correctly. Do NOT proceed -- troubleshoot the env var first.
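The mode check can be expressed as a small function over captured script output. This is a sketch under assumptions: the name `check_mode` is hypothetical, and looking at the first 20 lines is an arbitrary reading of "the first lines of output".

```python
def check_mode(output: str, head_lines: int = 20) -> str:
    """Classify script output as 'dev', 'prod', or 'unknown' from its first lines."""
    head = output.splitlines()[:head_lines]
    # PROD wins: any [PROD MODE] marker means the env var did not take effect.
    if any("[PROD MODE]" in line for line in head):
        return "prod"
    if any("[DEV MODE]" in line for line in head):
        return "dev"
    return "unknown"
```

Only a result of `"dev"` should allow the workflow to continue; both `"prod"` and `"unknown"` call for checking the env var first.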
OFFICIAL_DOCS_DEV_ROOT="<repo-root>/plugins/claude-ecosystem/skills/docs-management" python <repo-root>/plugins/claude-ecosystem/skills/docs-management/scripts/management/refresh_index.py
STOP AND VERIFY: Check for [DEV MODE] in output. If [PROD MODE] appears, stop and troubleshoot.
Clean up Anthropic articles that have aged out based on published_at dates. The threshold is read automatically from max_age_days in references/sources.json.
OFFICIAL_DOCS_DEV_ROOT="<repo-root>/plugins/claude-ecosystem/skills/docs-management" python <repo-root>/plugins/claude-ecosystem/skills/docs-management/scripts/maintenance/cleanup_old_anthropic_docs.py --execute
STOP AND VERIFY: Check for [DEV MODE] in output. Review the cleanup summary -- it should report either "No old documents found" or list the specific files removed.
Note: The --execute flag is required to actually delete files (default is dry-run). The age threshold is read from sources.json automatically.
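To make the age-out rule concrete, here is a minimal sketch of the selection step. It is not the logic of cleanup_old_anthropic_docs.py, whose internals are not shown here. It assumes `published_at` is an ISO `YYYY-MM-DD` string and that a document ages out when it is strictly older than `max_age_days`. The function name and the input mapping are hypothetical.

```python
from datetime import date, timedelta


def aged_out(published: dict[str, str], max_age_days: int, today: date) -> list[str]:
    """Return paths whose published_at date is older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        path
        for path, published_at in published.items()
        # Assumption: published_at is an ISO date such as "2025-01-15".
        if date.fromisoformat(published_at) < cutoff
    )
```

A dry run would only print this list; passing `--execute` to the real script is what deletes files.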
Run git status to see what files changed in the local repo.
IMPORTANT: Only analyze and report on changes within plugins/claude-ecosystem/. Ignore changes in other plugins -- those are from separate scrape runs.
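One way to apply this scoping is to filter machine-readable status output by path prefix. The sketch below assumes the output comes from `git status --porcelain` (two status characters, a space, then the path) rather than plain `git status`. The helper name is hypothetical, and quoted paths with special characters are not handled.

```python
def changes_in_plugin(porcelain: str, plugin_prefix: str) -> list[str]:
    """Keep only changed paths under plugin_prefix from porcelain status output."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        path = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        if path.startswith(plugin_prefix):
            paths.append(path)
    return paths
```

Calling it with `plugins/claude-ecosystem/` drops changes left behind by other ecosystems' scrape runs.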
Analyze scraped content changes to detect potential issues from external source changes:
# Check for significant content changes in scraped docs
git diff --stat plugins/claude-ecosystem/skills/docs-management/canonical/
# Review specific changes for potential issues:
# - Broken formatting from upstream changes
# - Missing sections or content
# - New content that may need metadata updates
# - Encoding issues or unexpected characters
git diff plugins/claude-ecosystem/skills/docs-management/canonical/ | head -200
What to look for:
If issues are found, investigate the source URLs and determine if adjustments are needed to sources.json or scraping scripts.
After reviewing the git diff, perform a structural analysis to detect potential filtering gaps. This analysis should be future-proof (not relying on brittle text patterns).
Analysis Steps:
Source Type Correlation: Group modified files by source path prefix:
- anthropic-com/research/ - Research articles
- anthropic-com/news/ - News articles
- anthropic-com/engineering/ - Engineering blog
- code-claude-com/ - Claude Code docs
- docs-claude-com/ - API docs

Red flag: If 5+ files from the same source all changed, this likely indicates a structural issue with that source's filtering configuration (not genuine content updates).
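The grouping and the 5+ threshold can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes the source prefixes appear as directories directly under canonical/.

```python
from collections import Counter

# Source path prefixes from the list above.
SOURCE_PREFIXES = (
    "anthropic-com/research/",
    "anthropic-com/news/",
    "anthropic-com/engineering/",
    "code-claude-com/",
    "docs-claude-com/",
)
# Assumption: source directories sit directly under canonical/.
CANONICAL = "plugins/claude-ecosystem/skills/docs-management/canonical/"


def flag_sources(changed_paths: list[str], threshold: int = 5) -> dict[str, int]:
    """Return sources with at least `threshold` modified files."""
    counts: Counter[str] = Counter()
    for path in changed_paths:
        rel = path[len(CANONICAL):] if path.startswith(CANONICAL) else path
        for prefix in SOURCE_PREFIXES:
            if rel.startswith(prefix):
                counts[prefix] += 1
                break
    return {prefix: n for prefix, n in counts.items() if n >= threshold}
```

The input would typically be the path list from `git diff --name-only`; any source in the result deserves a look at its filter configuration.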
Change Location Analysis: For each modified file in a source group:
- content_hash) changes with <10 content lines changed

Red flag: Multiple files with changes concentrated at the end = likely "Related content" or footer sections not being filtered.
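A rough way to test whether a file's changes sit at the end is to read the hunk headers of its unified diff. The sketch below is an assumption-laden heuristic, not the plugin's method: the function name is hypothetical, "the end" is taken to mean the last 20% of the new file, and hunk start lines include a few lines of leading context.

```python
import re

# Matches unified-diff hunk headers such as "@@ -90,3 +90,6 @@".
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@", re.MULTILINE)


def changes_near_end(diff_text: str, total_lines: int, tail_fraction: float = 0.2) -> bool:
    """True when every hunk of one file's diff starts in the file's tail."""
    starts = [int(match.group(1)) for match in HUNK_RE.finditer(diff_text)]
    if not starts:
        return False  # no hunks, nothing to classify
    tail_start = total_lines * (1 - tail_fraction)
    return all(start >= tail_start for start in starts)
```

If this returns true for several files from the same source, the red flag above applies.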
Cross-Reference with Scraper Logs: During scraping, the ContentFilter logs messages like:
Filtered N sections from URL: reasons=[...], headings=[...]
Red flag: If files show as git-modified but scraper logs show sections_removed: 0 for that source type = filter configuration may be missing for that source.
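The log cross-reference can be sketched with a regular expression built from the message shape shown above. This assumes N is an integer and the URL contains no whitespace. The function name is hypothetical, and the bracketed values in any sample log line are illustrative rather than real scraper output.

```python
import re

# Pattern derived from: Filtered N sections from URL: reasons=[...], headings=[...]
FILTER_LOG_RE = re.compile(
    r"Filtered (\d+) sections from (\S+): reasons=\[(.*?)\], headings=\[(.*?)\]"
)


def filtered_counts(log_text: str) -> dict[str, int]:
    """Map each URL to the number of sections the filter removed from it."""
    counts: dict[str, int] = {}
    for match in FILTER_LOG_RE.finditer(log_text):
        url = match.group(2)
        counts[url] = counts.get(url, 0) + int(match.group(1))
    return counts
```

A modified file whose URL is missing from this mapping is a candidate for the red flag above.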
Potential Improvements Output:
If issues are detected, include a "Potential Improvements" section with actionable suggestions:
- "Add news_blog_stop_sections to source X in content_filtering.yaml"

Reference: Filter configuration is in plugins/claude-ecosystem/skills/docs-management/config/content_filtering.yaml
Summarize:
- plugins/claude-ecosystem/ changes)

After the final report, if there are changed files ready for commit:
Audit inline: Check all modified canonical files for:
Fix issues found: Apply fixes directly -- do NOT write findings to a separate file
Lint markdown: Run markdownlint on modified files:
npx markdownlint-cli2 --fix "plugins/claude-ecosystem/skills/docs-management/canonical/**/*.md"
Commit: Use the melodic-software:git-commit skill. Suggested format:
- feat(claude-ecosystem): re-scrape docs with [summary of changes]
- fix(claude-ecosystem): fix encoding/formatting issues in scraped docs

STOP AND CONFIRM: Present the commit plan to the user before executing
Scrape Cursor documentation using local plugin code at plugins/cursor-ecosystem/skills/cursor-docs/.
CURSOR_DOCS_DEV_ROOT="<repo-root>/plugins/cursor-ecosystem/skills/cursor-docs" python <repo-root>/plugins/cursor-ecosystem/skills/cursor-docs/scripts/core/scrape_docs.py --llms-txt "https://cursor.com/llms.txt" --skip-existing
STOP AND VERIFY: Check the first lines of output for [DEV MODE]. If you see [PROD MODE], the environment variable was not set correctly. Do NOT proceed -- troubleshoot the env var first.
Run the full index refresh pipeline which rebuilds the index AND extracts keywords/metadata:
CURSOR_DOCS_DEV_ROOT="<repo-root>/plugins/cursor-ecosystem/skills/cursor-docs" python <repo-root>/plugins/cursor-ecosystem/skills/cursor-docs/scripts/management/refresh_index.py
STOP AND VERIFY: Check for [DEV MODE] in output. If [PROD MODE] appears, stop and troubleshoot.
IMPORTANT: Use refresh_index.py (not rebuild_index.py). The refresh script runs the full pipeline:
Using only rebuild_index.py will strip metadata (keywords, subsections, tags, descriptions) from the index.
Verify index integrity:
CURSOR_DOCS_DEV_ROOT="<repo-root>/plugins/cursor-ecosystem/skills/cursor-docs" python <repo-root>/plugins/cursor-ecosystem/skills/cursor-docs/scripts/maintenance/validate_index.py
Run git status to see what files changed in the local repo.
IMPORTANT: Only analyze and report on changes within plugins/cursor-ecosystem/. Ignore changes in other plugins -- those are from separate scrape runs.
Analyze scraped content changes to detect potential issues from external source changes:
# Check for significant content changes in scraped docs
git diff --stat plugins/cursor-ecosystem/skills/cursor-docs/canonical/
# Review specific changes for potential issues
git diff plugins/cursor-ecosystem/skills/cursor-docs/canonical/ | head -200
What to look for:
If issues are found, investigate the source URLs and determine if adjustments are needed to sources.json or scraping scripts.
After scraping, verify tag distribution is reasonable. Cursor docs use tags (not categories):
# Count documents by tag
CURSOR_DOCS_DEV_ROOT="<repo-root>/plugins/cursor-ecosystem/skills/cursor-docs" python -c "
import yaml
# Load the index and tally how many documents carry each tag
with open('<repo-root>/plugins/cursor-ecosystem/skills/cursor-docs/canonical/index.yaml', 'r', encoding='utf-8') as f:
    index = yaml.safe_load(f)
tags_count = {}
for doc_id, meta in index.items():
    tags = meta.get('tags', [])
    for tag in tags:
        tags_count[tag] = tags_count.get(tag, 0) + 1
# Print tags from most to least frequent
for tag, count in sorted(tags_count.items(), key=lambda x: -x[1]):
    print(f'{tag}: {count}')
"
Expected tags:
| Tag | Expected Range |
| --- | --- |
| cursor | 100-110 (all docs) |
| agent | 15-25 |
| cli | 15-25 |
| configuration | 10-20 |
| inline-edit | 10-15 |
| examples | 8-15 |
| enterprise | 5-15 |
| context | 5-10 |
| reference | 4-10 |
| mcp | 3-8 |
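The expected ranges in the table above can be checked mechanically against the tag counts. The following is a minimal sketch with a hypothetical function name. It takes a dict shaped like `tags_count` from the counting snippet and treats a missing tag as a count of zero.

```python
# Expected ranges copied from the table above (inclusive bounds).
EXPECTED_RANGES = {
    "cursor": (100, 110),
    "agent": (15, 25),
    "cli": (15, 25),
    "configuration": (10, 20),
    "inline-edit": (10, 15),
    "examples": (8, 15),
    "enterprise": (5, 15),
    "context": (5, 10),
    "reference": (4, 10),
    "mcp": (3, 8),
}


def out_of_range(tags_count: dict[str, int]) -> dict[str, int]:
    """Return tags whose document count falls outside the expected range."""
    problems = {}
    for tag, (low, high) in EXPECTED_RANGES.items():
        count = tags_count.get(tag, 0)  # a missing tag counts as zero
        if not low <= count <= high:
            problems[tag] = count
    return problems
```

An empty result means the distribution looks reasonable; anything else is worth investigating before committing.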
Red flags:
- Missing cursor tag on any document (all should have it)

Summarize:
- plugins/cursor-ecosystem/ changes)

After the final report, if there are changed files ready for commit:
Audit inline: Check all modified canonical files for:
Fix issues found: Apply fixes directly -- do NOT write findings to a separate file
Lint markdown: Run markdownlint on modified files:
npx markdownlint-cli2 --fix "plugins/cursor-ecosystem/skills/cursor-docs/canonical/**/*.md"
Commit: Use the melodic-software:git-commit skill. Suggested format:
- feat(cursor-ecosystem): re-scrape docs with [summary of changes]
- fix(cursor-ecosystem): fix encoding/formatting issues in scraped docs

STOP AND CONFIRM: Present the commit plan to the user before executing
Scrape Duende IdentityServer documentation using local plugin code at plugins/duende-ecosystem/skills/duende-docs/.
DUENDE_DOCS_DEV_ROOT="<repo-root>/plugins/duende-ecosystem/skills/duende-docs" python <repo-root>/plugins/duende-ecosystem/skills/duende-docs/scripts/core/scrape_docs.py
STOP AND VERIFY: Check the first lines of output for [DEV MODE]. If you see [PROD MODE], the environment variable was not set correctly. Do NOT proceed -- troubleshoot the env var first.
Rebuild the index from the freshly scraped files:
DUENDE_DOCS_DEV_ROOT="<repo-root>/plugins/duende-ecosystem/skills/duende-docs" python <repo-root>/plugins/duende-ecosystem/skills/duende-docs/scripts/management/rebuild_index.py
STOP AND VERIFY: Check for [DEV MODE] in output. If [PROD MODE] appears, stop and troubleshoot.
Verify index integrity:
DUENDE_DOCS_DEV_ROOT="<repo-root>/plugins/duende-ecosystem/skills/duende-docs" python <repo-root>/plugins/duende-ecosystem/skills/duende-docs/scripts/maintenance/validate_index.py
Run git status to see what files changed in the local repo.
IMPORTANT: Only analyze and report on changes within plugins/duende-ecosystem/. Ignore changes in other plugins -- those are from separate scrape runs.
Analyze scraped content changes to detect potential issues from external source changes:
# Check for significant content changes in scraped docs
git diff --stat plugins/duende-ecosystem/skills/duende-docs/canonical/
# Review specific changes for potential issues
git diff plugins/duende-ecosystem/skills/duende-docs/canonical/ | head -200
What to look for:
If issues are found, investigate the source URLs and determine if adjustments are needed to sources.json or scraping scripts.
After scraping, verify category distribution is reasonable:
# Count documents per category
DUENDE_DOCS_DEV_ROOT="<repo-root>/plugins/duende-ecosystem/skills/duende-docs" python <repo-root>/plugins/duende-ecosystem/skills/duende-docs/scripts/management/manage_index.py count
Expected categories:
| Category | Expected Range |
| --- | --- |
| identityserver | 50-80 |
| bff | 20-40 |
| accesstokenmanagement | 3-10 |
| identitymodel | 1-10 |
| identitymodel-oidcclient | 3-10 |
| introspection | 2-10 |
| general | 1-5 |
| uncategorized | 100-200 |
Red flags:
Summarize:
- plugins/duende-ecosystem/ changes)

After the final report, if there are changed files ready for commit:
Audit inline: Check all modified canonical files for:
Fix issues found: Apply fixes directly -- do NOT write findings to a separate file
Lint markdown: Run markdownlint on modified files:
npx markdownlint-cli2 --fix "plugins/duende-ecosystem/skills/duende-docs/canonical/**/*.md"
Commit: Use the melodic-software:git-commit skill. Suggested format:
- feat(duende-ecosystem): re-scrape docs with [summary of changes]
- fix(duende-ecosystem): fix encoding/formatting issues in scraped docs

STOP AND CONFIRM: Present the commit plan to the user before executing
Scrape Gemini CLI documentation using local plugin code at plugins/google-ecosystem/skills/gemini-cli-docs/.
GEMINI_DOCS_DEV_ROOT="<repo-root>/plugins/google-ecosystem/skills/gemini-cli-docs" python <repo-root>/plugins/google-ecosystem/skills/gemini-cli-docs/scripts/core/scrape_all_sources.py --parallel --skip-existing
STOP AND VERIFY: Check the first lines of output for [DEV MODE]. If you see [PROD MODE], the environment variable was not set correctly. Do NOT proceed -- troubleshoot the env var first.
GEMINI_DOCS_DEV_ROOT="<repo-root>/plugins/google-ecosystem/skills/gemini-cli-docs" python <repo-root>/plugins/google-ecosystem/skills/gemini-cli-docs/scripts/management/refresh_index.py
STOP AND VERIFY: Check for [DEV MODE] in output. If [PROD MODE] appears, stop and troubleshoot.
Run git status to see what files changed in the local repo.
IMPORTANT: Only analyze and report on changes within plugins/google-ecosystem/. Ignore changes in other plugins -- those are from separate scrape runs.
Analyze scraped content changes to detect potential issues from external source changes:
# Check for significant content changes in scraped docs
git diff --stat plugins/google-ecosystem/skills/gemini-cli-docs/canonical/
# Review specific changes for potential issues
git diff plugins/google-ecosystem/skills/gemini-cli-docs/canonical/ | head -200
What to look for:
If issues are found, investigate the source URLs and determine if adjustments are needed to sources.json or scraping scripts.
After reviewing the git diff, perform a structural analysis to detect potential filtering gaps.
Analysis Steps:
Source Analysis: Since gemini-cli-docs uses a single source (geminicli.com llms.txt), check if:
Change Location Analysis: For each modified file:
- content_hash) changes with <10 content lines changed

Red flag: Multiple files with changes concentrated at the end = likely footer sections not being filtered.
Cross-Reference with Scraper Logs: During scraping, check for logged messages about:
Potential Improvements Output:
If issues are detected, include a "Potential Improvements" section with actionable suggestions:
- filtering.yaml"

Reference: Filter configuration is in plugins/google-ecosystem/skills/gemini-cli-docs/config/filtering.yaml
Summarize:
- plugins/google-ecosystem/ changes)

After the final report, if there are changed files ready for commit:
Audit inline: Check all modified canonical files for:
Fix issues found: Apply fixes directly -- do NOT write findings to a separate file
Lint markdown: Run markdownlint on modified files:
npx markdownlint-cli2 --fix "plugins/google-ecosystem/skills/gemini-cli-docs/canonical/**/*.md"
Commit: Use the melodic-software:git-commit skill. Suggested format:
- feat(google-ecosystem): re-scrape docs with [summary of changes]
- fix(google-ecosystem): fix encoding/formatting issues in scraped docs

STOP AND CONFIRM: Present the commit plan to the user before executing
Scrape OpenAI Codex CLI documentation using local plugin code at plugins/openai-ecosystem/skills/codex-cli-docs/.
CODEX_DOCS_DEV_ROOT="<repo-root>/plugins/openai-ecosystem/skills/codex-cli-docs" python <repo-root>/plugins/openai-ecosystem/skills/codex-cli-docs/scripts/core/scrape_docs.py --parallel
STOP AND VERIFY: Check the first lines of output for [DEV MODE]. If you see [PROD MODE], the environment variable was not set correctly. Do NOT proceed -- troubleshoot the env var first.
CODEX_DOCS_DEV_ROOT="<repo-root>/plugins/openai-ecosystem/skills/codex-cli-docs" python <repo-root>/plugins/openai-ecosystem/skills/codex-cli-docs/scripts/management/refresh_index.py
STOP AND VERIFY: Check for [DEV MODE] in output. If [PROD MODE] appears, stop and troubleshoot.
Run git status to see what files changed in the local repo.
IMPORTANT: Only analyze and report on changes within plugins/openai-ecosystem/. Ignore changes in other plugins -- those are from separate scrape runs.
Analyze scraped content changes to detect potential issues from external source changes:
# Check for significant content changes in scraped docs
git diff --stat plugins/openai-ecosystem/skills/codex-cli-docs/canonical/
# Review specific changes for potential issues
git diff plugins/openai-ecosystem/skills/codex-cli-docs/canonical/ | head -200
What to look for:
If issues are found, investigate the source URLs and determine if adjustments are needed to sources.json or scraping scripts.
After reviewing the git diff, perform a structural analysis to detect potential filtering gaps.
Analysis Steps:
Source Analysis: Check if:
Change Location Analysis: For each modified file:
- content_hash) changes with <10 content lines changed

Red flag: Multiple files with changes concentrated at the end = likely footer sections not being filtered.
Cross-Reference with Scraper Logs: During scraping, check for logged messages about:
Potential Improvements Output:
If issues are detected, include a "Potential Improvements" section with actionable suggestions:
- filtering.yaml"

Reference: Filter configuration is in plugins/openai-ecosystem/skills/codex-cli-docs/config/filtering.yaml
Summarize:
- plugins/openai-ecosystem/ changes)

After the final report, if there are changed files ready for commit:
Audit inline: Check all modified canonical files for:
Fix issues found: Apply fixes directly -- do NOT write findings to a separate file
Lint markdown: Run markdownlint on modified files:
npx markdownlint-cli2 --fix "plugins/openai-ecosystem/skills/codex-cli-docs/canonical/**/*.md"
Commit: Use the melodic-software:git-commit skill. Suggested format:
- feat(openai-ecosystem): re-scrape docs with [summary of changes]
- fix(openai-ecosystem): fix encoding/formatting issues in scraped docs

STOP AND CONFIRM: Present the commit plan to the user before executing
- Do NOT use PowerShell syntax ($env:...) in Claude Code -- use Bash inline prefix instead
- Do NOT use /[ecosystem]:docs-ops scrape commands (uses installed plugin)
- Do NOT proceed if [PROD MODE] appears -- stop and fix the env var
- Do NOT use rebuild_index.py alone for cursor -- use refresh_index.py to preserve metadata