skills/capabilities/site-content-catalog/SKILL.md
Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.
npx skillsauth add athina-ai/goose-skills site-content-catalog
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.
# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"
# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20
# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | — | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |
The script attempts multiple methods to find all pages on a site, in order:
- https://[domain]/sitemap.xml
- /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml
- robots.txt for Sitemap: directives
- /feed, /rss, /atom.xml, /blog/feed, etc.
- /blog, /resources, /insights, /news, /articles
- pagination (/blog/page/2, ?page=2, etc.)
- site:[domain] to estimate total indexed pages
- site:[domain]/blog to find blog content
- site:[domain] intitle: to discover page title patterns
- onescales/sitemap-url-extractor

For each discovered URL, classify by:
Classify based on URL patterns and page titles:
| Type | URL Patterns | Examples |
|------|-------------|----------|
| blog-post | /blog/, /posts/, /articles/ | How-to guides, opinion pieces |
| case-study | /case-study/, /customers/, /success-stories/ | Customer stories |
| comparison | /vs/, /compare/, /alternative/ | X vs Y pages |
| landing-page | /solutions/, /use-cases/, /for-/ | Product marketing pages |
| docs | /docs/, /help/, /documentation/, /api/ | Technical documentation |
| changelog | /changelog/, /releases/, /whats-new/ | Product updates |
| pricing | /pricing/ | Pricing page |
| about | /about/, /team/, /careers/ | Company pages |
| legal | /privacy/, /terms/, /security/ | Legal/compliance |
| resource | /resources/, /guides/, /ebooks/, /webinars/ | Gated/downloadable content |
| glossary | /glossary/, /dictionary/, /terms/ | SEO glossary pages |
| integration | /integrations/, /apps/, /marketplace/ | Integration pages |
| other | — | Anything else |
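A minimal sketch of how the URL-pattern table above could be applied; it is not the script's actual implementation. `TYPE_RULES` and `classify_url` are hypothetical names. The sketch matches on the URL path only (page titles are ignored), reads `/for-/` as the prefix `/for-`, and lets the first matching row win, so `/terms/` resolves to legal rather than glossary.

```python
from urllib.parse import urlparse

# Ordered (type, URL fragments) rules copied from the table above.
# Assumption: the first matching rule wins, which matters because
# "/terms/" is listed under both legal and glossary.
TYPE_RULES = [
    ("blog-post", ("/blog/", "/posts/", "/articles/")),
    ("case-study", ("/case-study/", "/customers/", "/success-stories/")),
    ("comparison", ("/vs/", "/compare/", "/alternative/")),
    ("landing-page", ("/solutions/", "/use-cases/", "/for-")),
    ("docs", ("/docs/", "/help/", "/documentation/", "/api/")),
    ("changelog", ("/changelog/", "/releases/", "/whats-new/")),
    ("pricing", ("/pricing/",)),
    ("about", ("/about/", "/team/", "/careers/")),
    ("legal", ("/privacy/", "/terms/", "/security/")),
    ("resource", ("/resources/", "/guides/", "/ebooks/", "/webinars/")),
    ("glossary", ("/glossary/", "/dictionary/", "/terms/")),
    ("integration", ("/integrations/", "/apps/", "/marketplace/")),
]


def classify_url(url: str) -> str:
    """Return the content type for a URL, or "other" if no rule matches."""
    path = urlparse(url).path.lower()
    # Add a trailing slash so "/pricing" matches the "/pricing/" pattern.
    if not path.endswith("/"):
        path += "/"
    for content_type, fragments in TYPE_RULES:
        if any(fragment in path for fragment in fragments):
            return content_type
    return "other"


if __name__ == "__main__":
    print(classify_url("https://example.com/blog/reduce-aws-costs"))
```

Keeping the rules in an ordered list rather than a dict makes the tie-breaking explicit; reordering the rows changes which type an ambiguous path receives.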
Group pages into topic clusters by extracting topic signals from URL slugs and titles.
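One way to pull topic signals out of URL slugs is sketched below, assuming a simple token-frequency heuristic. The stopword list, the `min_pages` threshold, and the function names are all hypothetical; titles are not used here, and the script's real clustering may work differently.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stopword list; a real run would likely need a longer one.
STOPWORDS = {"a", "an", "and", "by", "for", "how", "in", "of", "the", "to", "your", "with"}


def slug_tokens(url: str) -> list[str]:
    """Split the last path segment of a URL into lowercase keyword tokens."""
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    tokens = re.split(r"[^a-z0-9]+", slug.lower())
    return [t for t in tokens if t and t not in STOPWORDS]


def topic_signals(urls: list[str], min_pages: int = 2) -> dict[str, int]:
    """Count how many pages mention each token; keep tokens seen on min_pages or more."""
    counts = Counter()
    for url in urls:
        counts.update(set(slug_tokens(url)))  # count each token once per page
    return {token: n for token, n in counts.items() if n >= min_pages}


if __name__ == "__main__":
    sample = [
        "https://example.com/blog/reduce-aws-costs",
        "https://example.com/blog/aws-cost-allocation-tags",
        "https://example.com/blog/how-to-hire-engineers",
    ]
    print(topic_signals(sample))
```

Note that this sketch does no stemming, so `cost` and `costs` count as different tokens.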
From the dated content (primarily blog posts), compute the publishing cadence: average posts per month, trend, and most recent publish date.
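The cadence fields that appear in the JSON output (`posts_per_month_avg`, `trend`, `most_recent`) could be derived roughly as follows. This is a sketch under stated assumptions: the average is taken over every calendar month between the first and last post, and the trend compares the earlier half of that span with the later half. `publishing_cadence` is a hypothetical name, and the script may define trend differently.

```python
from collections import Counter
from datetime import date


def publishing_cadence(dates: list[str]) -> dict:
    """Summarize ISO-formatted publish dates (YYYY-MM-DD) into cadence stats."""
    if not dates:
        raise ValueError("at least one publish date is required")
    parsed = sorted(date.fromisoformat(d) for d in dates)
    first, last = parsed[0], parsed[-1]

    # Calendar months covered, inclusive of the first and last month.
    months = (last.year - first.year) * 12 + (last.month - first.month) + 1
    per_month = Counter((d.year, d.month) for d in parsed)

    # Build a month-by-month series that includes months with zero posts.
    series = []
    year, month = first.year, first.month
    for _ in range(months):
        series.append(per_month.get((year, month), 0))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

    # Assumed trend rule: compare average posts in the two halves of the span.
    half = len(series) // 2
    trend = "flat"
    if half:
        earlier = sum(series[:half]) / half
        recent = sum(series[half:]) / (len(series) - half)
        if recent > earlier:
            trend = "increasing"
        elif recent < earlier:
            trend = "decreasing"

    return {
        "posts_per_month_avg": round(len(parsed) / months, 1),
        "trend": trend,
        "most_recent": last.isoformat(),
    }
```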
If --deep-analyze N is specified, fetch the top N pages (prioritizing blog posts) and extract word count, target keyword, funnel stage, content depth, and whether the page has images and a CTA.
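As an illustration of the deep-read step, the sketch below derives two of the `deep_analysis` fields shown in the output (`word_count` and `has_images`) from already-fetched HTML, using only the standard library. `PageStats` and `analyze_html` are hypothetical names. Fields such as `funnel_stage`, `target_keyword`, and `has_cta` need judgment or heuristics the source does not specify, so they are left out.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count visible words and note whether the page contains any <img> tag."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.has_images = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            self.has_images = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; count whitespace-separated words elsewhere.
        if not self._skip_depth:
            self.words += len(data.split())


def analyze_html(html: str) -> dict:
    """Return a partial deep_analysis record for one page's HTML."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return {"word_count": parser.words, "has_images": parser.has_images}
```

The word count here covers all visible text, including navigation and footers; isolating the article body would need site-specific selectors.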
{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
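The `by_type` and `by_topic` counts in `summary` follow directly from the `pages` array. A minimal sketch, assuming each page record carries the `type` and `topic_cluster` keys shown above (`summarize` is a hypothetical helper, not part of the script):

```python
from collections import Counter


def summarize(pages: list[dict]) -> dict:
    """Tally pages by content type and by topic cluster."""
    by_type = Counter(page.get("type", "other") for page in pages)
    # Pages without a topic cluster are left out of the topic tally.
    by_topic = Counter(
        page["topic_cluster"] for page in pages if page.get("topic_cluster")
    )
    return {"by_type": dict(by_type), "by_topic": dict(by_topic)}
```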
# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347
## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...
## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...
## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20
## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
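The "Content by Type" table in the report above can be reproduced from the JSON summary. The sketch below is one possible rendering, with `render_type_table` as a hypothetical name. It prints the raw type keys (`blog-post`) rather than the display labels used in the sample report (`Blog Posts`), since the source does not say how labels are derived.

```python
def render_type_table(by_type: dict[str, int], total_pages: int) -> str:
    """Render type counts as a Markdown table, largest count first."""
    lines = ["| Type | Count | % |", "|------|-------|---|"]
    for content_type, count in sorted(by_type.items(), key=lambda item: -item[1]):
        # Percentages are relative to all cataloged pages, not only the listed types.
        share = count / total_pages * 100 if total_pages else 0.0
        lines.append(f"| {content_type} | {count} | {share:.1f}% |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_type_table({"blog-post": 89, "landing-page": 23}, 347))
```

With the sample numbers, 89 of 347 pages gives 25.6% and 23 of 347 gives 6.6%, matching the report.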
- requests library (pip install requests)
- APIFY_API_TOKEN env var (only for Apify fallback mode)