
gooseworks-ai/site-content-catalog

Name: site-content-catalog
Author: gooseworks-ai

skills/capabilities/site-content-catalog/SKILL.md

npx skillsauth add gooseworks-ai/goose-skills site-content-catalog


Site Content Catalog

Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.

Quick Start

# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"

# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20

# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json

Inputs

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | - | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |
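
The parameters above map naturally onto a command-line parser. The following is a minimal sketch of how such a parser could look with `argparse`; it is not the actual source of `scripts/catalog_content.py`. The helper `str_to_bool` and the way `--include-non-blog` takes a `true`/`false` value are assumptions made for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse 'true'/'false' style values (hypothetical helper, not from the skill)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the Inputs table (defaults taken from the table)."""
    parser = argparse.ArgumentParser(description="Catalog a site's content")
    parser.add_argument("--domain", required=True,
                        help='Domain to catalog, e.g. "example.com"')
    parser.add_argument("--deep-analyze", type=int, default=0, metavar="N",
                        help="Number of top pages to deep-read")
    parser.add_argument("--output", default=None,
                        help="Path for JSON output; stdout when omitted")
    # Assumption: the flag takes an explicit true/false value.
    parser.add_argument("--include-non-blog", type=str_to_bool, default=True,
                        help="Also catalog landing pages, docs, etc.")
    return parser


args = build_parser().parse_args(["--domain", "example.com", "--deep-analyze", "20"])
print(args.domain, args.deep_analyze, args.output, args.include_non_blog)
```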

Cost

Sitemap/RSS crawling: Free (direct HTTP requests)
Apify sitemap extractor (fallback): ~$0.50 per site
Deep analysis: Free (WebFetch on individual pages)

Process

Phase 1: Discover All Pages

The script attempts multiple methods to find all pages on a site, in order:

A) Sitemap.xml

Fetch https://[domain]/sitemap.xml
If it's a sitemap index, recursively fetch all child sitemaps
Common alternate locations: /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml
Check robots.txt for Sitemap: directives
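
The sitemap steps above can be sketched as follows, assuming the standard sitemaps.org XML namespace. The function names and the injectable `fetch` callable are hypothetical and chosen so the example runs without network access; the real script may be organized differently.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs named in 'Sitemap:' directives of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def collect_urls(sitemap_url: str, fetch: Callable[[str], str],
                 seen: Optional[Set[str]] = None) -> List[str]:
    """Return page URLs, recursing into child sitemaps of a sitemap index."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemap indexes that reference each other
        return []
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        pages: List[str] = []
        for child in locs:
            pages.extend(collect_urls(child, fetch, seen))
        return pages
    return locs


# In-memory stand-in for HTTP responses (made-up data for the demo).
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/post-sitemap.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/blog/a</loc></url>"
        "<url><loc>https://example.com/blog/b</loc></url>"
        "</urlset>"
    ),
}

urls = collect_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
robots = sitemaps_from_robots("User-agent: *\nSitemap: https://example.com/sitemap.xml\n")
print(urls, robots)
```

Against a live site, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, since `requests` is already listed as a dependency.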

B) RSS/Atom Feeds

Check /feed, /rss, /atom.xml, /blog/feed, etc.
Extract posts with titles, dates, and URLs
RSS typically only surfaces recent content (last 10-50 posts)
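
Extracting titles, dates, and URLs from a feed might look like the sketch below. It handles plain RSS 2.0 `<item>` elements only; Atom feeds use different element names and a namespace and are not covered here. The function name `parse_rss` is hypothetical.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Dict, List, Optional


def parse_rss(feed_xml: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per RSS 2.0 item with title, url, and ISO date."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            # RSS dates follow RFC 822; convert to YYYY-MM-DD for the catalog.
            "date": parsedate_to_datetime(pub).date().isoformat() if pub else None,
        })
    return posts


SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item>
    <title>How to Reduce Your AWS Bill by 40%</title>
    <link>https://example.com/blog/reduce-aws-costs</link>
    <pubDate>Sat, 15 Nov 2025 09:00:00 +0000</pubDate>
  </item>
</channel></rss>"""

posts = parse_rss(SAMPLE_FEED)
print(posts)
```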

C) Blog Index Crawl

Fetch /blog, /resources, /insights, /news, /articles
Extract links from the page
Follow pagination if present (/blog/page/2, ?page=2, etc.)
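
A blog index crawl along these lines could be sketched with the standard-library HTML parser. The class and function names are hypothetical, and the pagination check only recognizes the two patterns named above (`/blog/page/2` and `?page=2`).

```python
import re
from html.parser import HTMLParser
from typing import List
from urllib.parse import parse_qs, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def same_site_links(html: str, base_url: str) -> List[str]:
    """Return de-duplicated links that stay on the base URL's host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, ordered = set(), []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            ordered.append(link)
    return ordered


def is_pagination_link(url: str) -> bool:
    """True for /page/N paths or URLs carrying a ?page= query parameter."""
    parsed = urlparse(url)
    return bool(re.search(r"/page/\d+/?$", parsed.path)) or "page" in parse_qs(parsed.query)


SAMPLE_HTML = """
<a href="/blog/reduce-aws-costs">Post</a>
<a href="/blog/page/2">Next</a>
<a href="https://other.example.org/x">External</a>
<a href="/blog/reduce-aws-costs">Post again</a>
"""

links = same_site_links(SAMPLE_HTML, "https://example.com/blog")
next_pages = [u for u in links if is_pagination_link(u)]
print(links, next_pages)
```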

D) Site: Search (fallback)

WebSearch: site:[domain] to estimate total indexed pages
WebSearch: site:[domain]/blog to find blog content
WebSearch: site:[domain] intitle: to discover page title patterns

E) Apify Sitemap Extractor (fallback for JS-heavy sites)

Actor: onescales/sitemap-url-extractor
Use when sitemap.xml is missing and the site is JS-rendered

Phase 2: Classify Each Page

For each discovered URL, classify by:

Content Type

Classify based on URL patterns and page titles:

| Type | URL Patterns | Examples |
|------|-------------|----------|
| blog-post | /blog/, /posts/, /articles/ | How-to guides, opinion pieces |
| case-study | /case-study/, /customers/, /success-stories/ | Customer stories |
| comparison | /vs/, /compare/, /alternative/ | X vs Y pages |
| landing-page | /solutions/, /use-cases/, /for-/ | Product marketing pages |
| docs | /docs/, /help/, /documentation/, /api/ | Technical documentation |
| changelog | /changelog/, /releases/, /whats-new/ | Product updates |
| pricing | /pricing/ | Pricing page |
| about | /about/, /team/, /careers/ | Company pages |
| legal | /privacy/, /terms/, /security/ | Legal/compliance |
| resource | /resources/, /guides/, /ebooks/, /webinars/ | Gated/downloadable content |
| glossary | /glossary/, /dictionary/, /terms/ | SEO glossary pages |
| integration | /integrations/, /apps/, /marketplace/ | Integration pages |
| other | - | Anything else |
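
One way to apply the table is an ordered first-match lookup on the URL path, sketched below. This covers URL patterns only (not page titles), and two choices are assumptions: `/for-/` is treated as the prefix `/for-`, and because `/terms/` appears under both legal and glossary, the earlier row (legal) wins.

```python
from urllib.parse import urlparse

# Rules in table order; the first matching rule wins.
RULES = [
    ("blog-post", ("/blog/", "/posts/", "/articles/")),
    ("case-study", ("/case-study/", "/customers/", "/success-stories/")),
    ("comparison", ("/vs/", "/compare/", "/alternative/")),
    ("landing-page", ("/solutions/", "/use-cases/", "/for-")),  # assumption: prefix match
    ("docs", ("/docs/", "/help/", "/documentation/", "/api/")),
    ("changelog", ("/changelog/", "/releases/", "/whats-new/")),
    ("pricing", ("/pricing/",)),
    ("about", ("/about/", "/team/", "/careers/")),
    ("legal", ("/privacy/", "/terms/", "/security/")),
    ("resource", ("/resources/", "/guides/", "/ebooks/", "/webinars/")),
    ("glossary", ("/glossary/", "/dictionary/", "/terms/")),
    ("integration", ("/integrations/", "/apps/", "/marketplace/")),
]


def classify_url(url: str) -> str:
    """Return the content type whose pattern first appears in the URL path."""
    path = urlparse(url).path.lower()
    if not path.endswith("/"):
        path += "/"  # so "/pricing" matches the "/pricing/" pattern
    for content_type, patterns in RULES:
        if any(pattern in path for pattern in patterns):
            return content_type
    return "other"


for sample in ("https://example.com/blog/reduce-aws-costs",
               "https://example.com/pricing",
               "https://example.com/terms",
               "https://example.com/"):
    print(sample, "->", classify_url(sample))
```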

Topic Cluster

Group by extracting topic signals from URL slugs and titles:

Extract keywords from URL path segments
Group similar keywords into clusters (e.g., "aws-cost", "cloud-spending", "finops" → "Cloud Cost Management")
Use simple keyword co-occurrence for clustering
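
The text does not specify the clustering algorithm beyond "simple keyword co-occurrence". The sketch below is one possible reading: count how often two slug keywords appear in the same URL, then merge keywords that co-occur at least `min_count` times using union-find. The stopword list, threshold, and function names are all assumptions, and turning a keyword group into a readable label such as "Cloud Cost Management" is left out.

```python
import re
from collections import Counter
from itertools import combinations
from typing import List
from urllib.parse import urlparse

# Hypothetical stopword list; "blog" is dropped because it is a path prefix.
STOPWORDS = {"a", "an", "and", "by", "for", "how", "in", "of", "the", "to", "your", "blog"}


def slug_keywords(url: str) -> List[str]:
    """Return the sorted, de-duplicated keywords found in the URL path."""
    words = re.findall(r"[a-z0-9]+", urlparse(url).path.lower())
    return sorted({w for w in words if w not in STOPWORDS and not w.isdigit()})


def cooccurrence(urls: List[str]) -> Counter:
    """Count how many URLs contain each unordered keyword pair."""
    pairs: Counter = Counter()
    for url in urls:
        pairs.update(combinations(slug_keywords(url), 2))
    return pairs


def cluster_keywords(urls: List[str], min_count: int = 2) -> List[List[str]]:
    """Group keywords connected by pairs seen at least min_count times."""
    parent = {}

    def find(word: str) -> str:
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for (a, b), count in cooccurrence(urls).items():
        if count >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return sorted(sorted(group) for group in groups.values())


SAMPLE_URLS = [
    "https://example.com/blog/reduce-aws-costs",
    "https://example.com/blog/aws-costs-explained",
    "https://example.com/blog/finops-team-guide",
    "https://example.com/blog/finops-team-structure",
]
clusters = cluster_keywords(SAMPLE_URLS)
print(clusters)
```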

Phase 3: Analyze Publishing Patterns

From the dated content (primarily blog posts):

Total content pieces by type
Publishing frequency: Posts per month over last 12 months
Trend: Increasing, decreasing, or stable output
Recency: Date of most recent publish
Author diversity: Unique authors (if extractable from RSS)
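
The cadence metrics above can be derived from a list of publish dates. The source does not say how the trend is determined, so the sketch below makes an assumption: it compares the average of the later half of the monthly series against the earlier half, with a 10% tolerance band for "stable". Function names are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import List


def monthly_series(dates: List[date], end: date, months: int = 12) -> List[int]:
    """Return posts per month for the `months` months ending at `end`, oldest first."""
    counts = Counter((d.year, d.month) for d in dates)
    series = []
    year, month = end.year, end.month
    for _ in range(months):
        series.append(counts[(year, month)])  # Counter gives 0 for empty months
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return series[::-1]


def trend(series: List[int], tolerance: float = 0.1) -> str:
    """Label output as increasing, decreasing, or stable (assumed heuristic)."""
    half = len(series) // 2
    if half == 0:
        return "stable"
    earlier = sum(series[:half]) / half
    later = sum(series[-half:]) / half
    if later > earlier * (1 + tolerance):
        return "increasing"
    if later < earlier * (1 - tolerance):
        return "decreasing"
    return "stable"


publish_dates = [date(2026, 2, 20), date(2026, 2, 3), date(2026, 1, 10), date(2025, 12, 1)]
series = monthly_series(publish_dates, end=date(2026, 2, 25), months=3)
print(series, trend(series), max(publish_dates).isoformat())
```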

Phase 4: Deep Analysis (Optional)

If --deep-analyze N is specified, fetch the top N pages (prioritizing blog posts) and extract:

Word count (approximate)
Target keyword (inferred from title + H1 + URL)
Funnel stage: TOFU (awareness), MOFU (consideration), BOFU (decision)
Content depth: Shallow (<500 words), Medium (500-1500), Deep (1500+)
Has images/video: Boolean
Has CTA: Boolean (detected by common CTA patterns)
Internal links count
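
Two of these metrics, word count and content depth, are simple enough to sketch with the standard library. The thresholds come from the list above; since 1500 falls in both the Medium and Deep ranges as written, this sketch assumes exactly 1500 words counts as deep. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def word_count(html: str) -> int:
    """Approximate word count of the visible text in an HTML document."""
    extractor = TextExtractor()
    extractor.feed(html)
    return len(re.findall(r"\w+", " ".join(extractor.parts)))


def content_depth(words: int) -> str:
    """Bucket a word count; assumption: exactly 1500 words counts as deep."""
    if words < 500:
        return "shallow"
    if words < 1500:
        return "medium"
    return "deep"


SAMPLE_PAGE = ("<h1>Reduce AWS costs</h1><script>var x = 1;</script>"
               "<p>Rightsize idle instances first.</p>")
print(word_count(SAMPLE_PAGE), content_depth(2100))
```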

Phase 5: Output

JSON Output (default)

{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
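
The `summary` counts in this structure can be derived directly from the `pages` list. The following sketch shows one way to do that and to honor the stdout-or-file behavior of the `output` parameter; `build_summary` and `write_catalog` are hypothetical names, and the sample pages are made up.

```python
import json
import sys
from collections import Counter
from typing import Dict, List, Optional


def build_summary(pages: List[Dict]) -> Dict[str, Dict[str, int]]:
    """Count pages per content type and per topic cluster."""
    return {
        "by_type": dict(Counter(page["type"] for page in pages)),
        "by_topic": dict(Counter(page["topic_cluster"] for page in pages)),
    }


def write_catalog(catalog: Dict, output_path: Optional[str] = None) -> None:
    """Write the catalog as JSON to a file, or to stdout when no path is given."""
    text = json.dumps(catalog, indent=2)
    if output_path is None:
        sys.stdout.write(text + "\n")
    else:
        with open(output_path, "w", encoding="utf-8") as handle:
            handle.write(text + "\n")


pages = [
    {"url": "https://example.com/blog/reduce-aws-costs", "type": "blog-post",
     "topic_cluster": "Cloud Cost Optimization"},
    {"url": "https://example.com/blog/finops-basics", "type": "blog-post",
     "topic_cluster": "FinOps"},
    {"url": "https://example.com/pricing", "type": "pricing",
     "topic_cluster": "FinOps"},
]
catalog = {
    "domain": "example.com",
    "total_pages": len(pages),
    "pages": pages,
    "summary": build_summary(pages),
}
write_catalog(catalog)
```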

Markdown Summary (also generated)

# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347

## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...

## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...

## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20

## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
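
The percentage column in the "Content by Type" table is each count divided by the total page count. A minimal sketch of rendering that table from the `by_type` counts follows; `type_table` is a hypothetical name, and it prints the raw type keys rather than the display labels used in the sample above.

```python
from typing import Dict


def type_table(by_type: Dict[str, int], total_pages: int) -> str:
    """Render a Markdown table of counts and shares, largest count first."""
    lines = ["| Type | Count | % |", "|------|-------|---|"]
    for content_type, count in sorted(by_type.items(), key=lambda item: -item[1]):
        share = count / total_pages
        lines.append(f"| {content_type} | {count} | {share:.1%} |")
    return "\n".join(lines)


table = type_table({"landing-page": 23, "blog-post": 89}, total_pages=347)
print(table)
```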

Tips

Sitemap.xml is the best source. Most well-maintained sites have one, so a missing sitemap is itself a negative SEO signal.
RSS only shows recent content. If you need the full catalog, sitemap is essential. RSS is supplementary.
Deep analysis is optional but valuable. Use it when feeding into brand-voice-extractor or when you need funnel stage mapping.
JS-rendered sites may need the Apify fallback. Signs: sitemap.xml returns HTML, or blog page returns mostly JavaScript.
Combine with seo-domain-analyzer to overlay traffic data on the content inventory — see which content actually performs.
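
One of the signs mentioned above, a sitemap URL that returns HTML instead of XML, can be checked cheaply before parsing. The sketch below is an assumed heuristic, not the skill's own detection logic; `looks_like_html` is a hypothetical name.

```python
def looks_like_html(body: str, content_type: str = "") -> bool:
    """Guess whether a response is an HTML page rather than a sitemap XML file."""
    if "text/html" in content_type.lower():
        return True
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


html_response = "<!DOCTYPE html><html><body><div id='root'></div></body></html>"
xml_response = ('<?xml version="1.0"?>'
                '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>')
print(looks_like_html(html_response), looks_like_html(xml_response, "application/xml"))
```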

Dependencies

Python 3.8+
requests library (pip install requests)
APIFY_API_TOKEN env var (only for Apify fallback mode)


Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.

455 stars

development

Updated Apr 21, 2026


npx skillsauth add gooseworks-ai/goose-skills site-content-catalog

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | % |
|--------|---------|-------------|---|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | - | 50% |

Last scanned: Apr 30, 2026, 9:31 AM (9.8s, 1 file scanned)

SKILL.md

name: site-content-catalog
description: >
tags: [content, seo]


Related Skills

gooseworks-ai/goose-graphics-create-style

development (Verified, Trusted, Community)

End-to-end skill that turns a single reference image into a fully-installed, example-rendered style preset for the goose-graphics composite. Analyzes the image, writes the slim style spec, registers it in styles/index.json, generates all 7 format examples using the standard brief, renders PNGs via Playwright, and updates examples/manifest.json. Invoke with /goose-graphics-create-style.

600, SKILL.md, Updated Apr 28, 2026

gooseworks-ai/yc-batch-evaluator

development (Verified, Trusted, Community)

Evaluate YC batch companies for investment: scrapes the YC directory, researches each company and its founders (work history, LinkedIn, website), assesses founder-company fit, and exports to Google Sheets with priority rankings. Use when asked to evaluate YC companies, research a YC batch, screen startups, or do due diligence on YC companies.

600, SKILL.md, Updated Apr 28, 2026

gooseworks-ai/website-screenshot-notte

tools (Verified, Trusted, Community)

Take screenshots of any website using Notte browser automation. Use when asked to screenshot, capture, or snap a webpage.

600, SKILL.md, Updated Apr 28, 2026

gooseworks-ai/web-search

development (Verified, Trusted, Community)

Search the web, platforms, and datasets. Use when asked to search, find, look up, research, or discover information from the web, YouTube, Amazon, eBay, news, academic sources, or any online platform.

600, SKILL.md, Updated Apr 28, 2026


Install manually


# Clone the repo
git clone https://github.com/gooseworks-ai/goose-skills.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r goose-skills/skills/capabilities/site-content-catalog ~/.claude/skills/


Repository: gooseworks-ai/goose-skills (455 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT