skills/capabilities/site-content-catalog/SKILL.md
Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.
npx skillsauth add gooseworks-ai/goose-skills site-content-catalog
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.
```bash
# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"

# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20

# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json
```
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | — | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |
The script attempts multiple methods to find all pages on a site, in order:
- Sitemap: https://[domain]/sitemap.xml
- Sitemap variants: /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml
- robots.txt for Sitemap: directives
- Feeds: /feed, /rss, /atom.xml, /blog/feed, etc.
- Index pages: /blog, /resources, /insights, /news, /articles
- Pagination (/blog/page/2, ?page=2, etc.)
- site:[domain] to estimate total indexed pages
- site:[domain]/blog to find blog content
- site:[domain] intitle: to discover page title patterns
- Apify fallback: onescales/sitemap-url-extractor

For each discovered URL, classify by:
Classify based on URL patterns and page titles:
| Type | URL Patterns | Examples |
|------|-------------|----------|
| blog-post | /blog/, /posts/, /articles/ | How-to guides, opinion pieces |
| case-study | /case-study/, /customers/, /success-stories/ | Customer stories |
| comparison | /vs/, /compare/, /alternative/ | X vs Y pages |
| landing-page | /solutions/, /use-cases/, /for-/ | Product marketing pages |
| docs | /docs/, /help/, /documentation/, /api/ | Technical documentation |
| changelog | /changelog/, /releases/, /whats-new/ | Product updates |
| pricing | /pricing/ | Pricing page |
| about | /about/, /team/, /careers/ | Company pages |
| legal | /privacy/, /terms/, /security/ | Legal/compliance |
| resource | /resources/, /guides/, /ebooks/, /webinars/ | Gated/downloadable content |
| glossary | /glossary/, /dictionary/, /terms/ | SEO glossary pages |
| integration | /integrations/, /apps/, /marketplace/ | Integration pages |
| other | — | Anything else |
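
The table above can be read as an ordered rule list. The following is a minimal sketch of that idea, not the logic of `scripts/catalog_content.py` itself. The function name `classify_url`, the first-match-wins ordering, and substring matching on the path are all assumptions. Because `/terms/` appears under both `legal` and `glossary`, this sketch resolves it to whichever row comes first, and it treats `/for-/` as the prefix `/for-`.

```python
from urllib.parse import urlparse

# Patterns copied from the table, in table order. First match wins (assumption).
TYPE_PATTERNS = [
    ("blog-post", ["/blog/", "/posts/", "/articles/"]),
    ("case-study", ["/case-study/", "/customers/", "/success-stories/"]),
    ("comparison", ["/vs/", "/compare/", "/alternative/"]),
    ("landing-page", ["/solutions/", "/use-cases/", "/for-"]),  # "/for-/" read as a prefix
    ("docs", ["/docs/", "/help/", "/documentation/", "/api/"]),
    ("changelog", ["/changelog/", "/releases/", "/whats-new/"]),
    ("pricing", ["/pricing/"]),
    ("about", ["/about/", "/team/", "/careers/"]),
    ("legal", ["/privacy/", "/terms/", "/security/"]),
    ("resource", ["/resources/", "/guides/", "/ebooks/", "/webinars/"]),
    ("glossary", ["/glossary/", "/dictionary/", "/terms/"]),
    ("integration", ["/integrations/", "/apps/", "/marketplace/"]),
]


def classify_url(url):
    """Return the content type for a URL based on its path, or "other"."""
    path = urlparse(url).path.lower()
    # Normalize so that "/pricing" and "/pricing/" both match "/pricing/".
    if not path.endswith("/"):
        path += "/"
    for content_type, patterns in TYPE_PATTERNS:
        if any(pattern in path for pattern in patterns):
            return content_type
    return "other"


print(classify_url("https://example.com/blog/reduce-aws-costs"))
```

A URL-only pass like this cannot use page titles, which the skill also considers, so it is best treated as a first cut.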
Group by extracting topic signals from URL slugs and titles:
From the dated content (primarily blog posts):
If --deep-analyze N is specified, fetch the top N pages (prioritizing blog posts) and extract:
```json
{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
```
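
The `summary` object can in principle be derived from the `pages` list. Here is a rough sketch of that step, assuming the field names shown in the JSON above. The helper `summarize` is hypothetical. The monthly average is computed here as dated pages divided by the inclusive number of calendar months spanned, which may differ from how the script defines it. The `trend` field is omitted because its definition is not given.

```python
from collections import Counter
from datetime import date


def summarize(pages):
    """Build by_type, by_topic, and a simple cadence from a list of page dicts."""
    by_type = Counter(p["type"] for p in pages)
    by_topic = Counter(p["topic_cluster"] for p in pages if p.get("topic_cluster"))

    # Only pages that carry an ISO date (YYYY-MM-DD) count toward cadence.
    dates = sorted(date.fromisoformat(p["date"]) for p in pages if p.get("date"))
    cadence = {}
    if dates:
        first, last = dates[0], dates[-1]
        # Inclusive month span (assumed definition), e.g. Nov through Feb is 4.
        months = (last.year - first.year) * 12 + (last.month - first.month) + 1
        cadence = {
            "posts_per_month_avg": round(len(dates) / months, 1),
            "most_recent": last.isoformat(),
        }

    return {
        "by_type": dict(by_type),
        "by_topic": dict(by_topic),
        "publishing_cadence": cadence,
    }


example_pages = [
    {"type": "blog-post", "date": "2025-11-15", "topic_cluster": "Cloud Cost Optimization"},
    {"type": "blog-post", "date": "2026-02-20", "topic_cluster": "FinOps"},
    {"type": "case-study"},
]
print(summarize(example_pages))
```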
# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347
## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...
## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...
## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20
## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
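
The percentages in the "Content by Type" table are consistent with each count divided by `total_pages` (89 / 347 ≈ 25.6%). A small sketch of rendering that table from the JSON summary follows. The helper name `render_type_table` is hypothetical, and it prints the raw type keys rather than the display labels ("Blog Posts") used in the sample report.

```python
def render_type_table(by_type, total_pages):
    """Render a markdown table of type counts with share of total pages."""
    lines = ["| Type | Count | % |", "|------|-------|---|"]
    # Largest categories first, matching the sample report's ordering.
    for content_type, count in sorted(by_type.items(), key=lambda kv: kv[1], reverse=True):
        pct = 100 * count / total_pages if total_pages else 0.0
        lines.append(f"| {content_type} | {count} | {pct:.1f}% |")
    return "\n".join(lines)


print(render_type_table({"blog-post": 89, "landing-page": 23}, 347))
```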
- requests library (pip install requests)
- APIFY_API_TOKEN env var (only for Apify fallback mode)