athina-ai/site-content-catalog

Name: site-content-catalog
Author: athina-ai

skills/capabilities/site-content-catalog/SKILL.md

npx skillsauth add athina-ai/goose-skills site-content-catalog

Site Content Catalog

Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.

Quick Start

# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"

# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20

# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json

Inputs

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | - | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |
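
As a rough illustration of how these inputs map onto command-line flags, here is a minimal argparse sketch. It is not the code of scripts/catalog_content.py; the function name `build_parser` and the choice to pass `--include-non-blog` as a literal true/false string are assumptions made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Inputs table (not the script's real code)."""
    parser = argparse.ArgumentParser(description="Catalog a site's content")
    parser.add_argument("--domain", required=True,
                        help='Domain to catalog, e.g. "example.com"')
    parser.add_argument("--deep-analyze", type=int, default=0, metavar="N",
                        help="Number of top pages to deep-read")
    parser.add_argument("--output", default=None,
                        help="Path for JSON output (stdout when omitted)")
    # Assumption: the boolean is passed as the string "true" or "false".
    parser.add_argument("--include-non-blog", default="true",
                        choices=["true", "false"])
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(["--domain", "example.com", "--deep-analyze", "20"])
```

argparse converts the dashes in flag names to underscores, so the values are read as `args.deep_analyze` and `args.include_non_blog`.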

Cost

Sitemap/RSS crawling: Free (direct HTTP requests)
Apify sitemap extractor (fallback): ~$0.50 per site
Deep analysis: Free (WebFetch on individual pages)

Process

Phase 1: Discover All Pages

The script attempts multiple methods to find all pages on a site, in order:

A) Sitemap.xml

Fetch https://[domain]/sitemap.xml
If it's a sitemap index, recursively fetch all child sitemaps
Common alternate locations: /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml
Check robots.txt for Sitemap: directives
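
The sitemap steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the script's implementation: the function names are hypothetical, and `fetch` stands for any callable that takes a URL and returns the response body as text.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, locs): kind is 'index' or 'urlset', locs the <loc> values."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == SITEMAP_NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs


def collect_urls(start_url, fetch, max_depth=3):
    """Walk a sitemap index recursively; max_depth guards against loops."""
    kind, locs = parse_sitemap(fetch(start_url))
    if kind == "urlset":
        return locs
    if max_depth == 0:
        return []
    urls = []
    for child in locs:
        urls.extend(collect_urls(child, fetch, max_depth - 1))
    return urls


def sitemaps_from_robots(robots_txt):
    """Extract the values of 'Sitemap:' directives from a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

With the requests library listed under Dependencies, `fetch` could be as small as `lambda url: requests.get(url, timeout=10).text`.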

B) RSS/Atom Feeds

Check /feed, /rss, /atom.xml, /blog/feed, etc.
Extract posts with titles, dates, and URLs
RSS typically only surfaces recent content (last 10-50 posts)
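
For the feed step, a minimal sketch of pulling titles, dates, and URLs out of an RSS 2.0 body might look like this. The function name `parse_rss` is hypothetical, and Atom feeds (which use namespaced `<entry>` elements) are deliberately left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss(xml_text):
    """Return a list of {title, url, date} dicts from an RSS 2.0 feed body."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        try:
            # RSS 2.0 dates use the RFC 822 format; keep only the calendar date.
            iso_date = parsedate_to_datetime(pub).date().isoformat() if pub else None
        except (TypeError, ValueError):
            iso_date = None  # unparseable date: keep the post, drop the date
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "date": iso_date,
        })
    return posts
```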

C) Blog Index Crawl

Fetch /blog, /resources, /insights, /news, /articles
Extract links from the page
Follow pagination if present (/blog/page/2, ?page=2, etc.)
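
A rough sketch of the link-extraction and pagination steps, using only the standard library. The names `LinkCollector`, `PAGINATION`, and `split_links` are hypothetical, and the pagination regex covers only the two patterns named above (/blog/page/2 and ?page=2). Navigation and footer links on the same host would still need filtering in practice.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Assumption: only /page/N and ?page=N style pagination is recognized.
PAGINATION = re.compile(r"(/page/\d+/?$)|([?&]page=\d+)")


def split_links(base_url, html):
    """Return (content_links, pagination_links) for same-host links on an index page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    # dict.fromkeys removes duplicates while keeping first-seen order.
    same_host = [u for u in dict.fromkeys(collector.links)
                 if urlparse(u).netloc == host]
    pages = [u for u in same_host if PAGINATION.search(u)]
    posts = [u for u in same_host if not PAGINATION.search(u)]
    return posts, pages
```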

D) Site: Search (fallback)

WebSearch: site:[domain] to estimate total indexed pages
WebSearch: site:[domain]/blog to find blog content
WebSearch: site:[domain] intitle: to discover page title patterns

E) Apify Sitemap Extractor (fallback for JS-heavy sites)

Actor: onescales/sitemap-url-extractor
Use when sitemap.xml is missing and the site is JS-rendered

Phase 2: Classify Each Page

For each discovered URL, classify by:

Content Type

Classify based on URL patterns and page titles:

| Type | URL Patterns | Examples |
|------|-------------|----------|
| blog-post | /blog/, /posts/, /articles/ | How-to guides, opinion pieces |
| case-study | /case-study/, /customers/, /success-stories/ | Customer stories |
| comparison | /vs/, /compare/, /alternative/ | X vs Y pages |
| landing-page | /solutions/, /use-cases/, /for-/ | Product marketing pages |
| docs | /docs/, /help/, /documentation/, /api/ | Technical documentation |
| changelog | /changelog/, /releases/, /whats-new/ | Product updates |
| pricing | /pricing/ | Pricing page |
| about | /about/, /team/, /careers/ | Company pages |
| legal | /privacy/, /terms/, /security/ | Legal/compliance |
| resource | /resources/, /guides/, /ebooks/, /webinars/ | Gated/downloadable content |
| glossary | /glossary/, /dictionary/, /terms/ | SEO glossary pages |
| integration | /integrations/, /apps/, /marketplace/ | Integration pages |
| other | - | Anything else |
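
One way the URL-pattern half of this classification could work is a first-match-wins rule list, sketched below. This is an assumption about the approach, not the script's code: titles are ignored, `/for-/` is read as the prefix `/for-` (as in /for-startups/), and because `/terms/` appears under both legal and glossary, table order makes it resolve to legal here.

```python
from urllib.parse import urlparse

# Rules copied from the table in order; the first matching row wins.
RULES = [
    ("blog-post", ["/blog/", "/posts/", "/articles/"]),
    ("case-study", ["/case-study/", "/customers/", "/success-stories/"]),
    ("comparison", ["/vs/", "/compare/", "/alternative/"]),
    ("landing-page", ["/solutions/", "/use-cases/", "/for-"]),  # "/for-" prefix: assumption
    ("docs", ["/docs/", "/help/", "/documentation/", "/api/"]),
    ("changelog", ["/changelog/", "/releases/", "/whats-new/"]),
    ("pricing", ["/pricing/"]),
    ("about", ["/about/", "/team/", "/careers/"]),
    ("legal", ["/privacy/", "/terms/", "/security/"]),
    ("resource", ["/resources/", "/guides/", "/ebooks/", "/webinars/"]),
    ("glossary", ["/glossary/", "/dictionary/", "/terms/"]),
    ("integration", ["/integrations/", "/apps/", "/marketplace/"]),
]


def classify(url):
    """Return the content type for a URL based on its path alone."""
    path = urlparse(url).path.lower()
    if not path.endswith("/"):
        path += "/"  # so "/pricing" also matches the "/pricing/" pattern
    for content_type, patterns in RULES:
        if any(pattern in path for pattern in patterns):
            return content_type
    return "other"
```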

Topic Cluster

Group by extracting topic signals from URL slugs and titles:

Extract keywords from URL path segments
Group similar keywords into clusters (e.g., "aws-cost", "cloud-spending", "finops" → "Cloud Cost Management")
Use simple keyword co-occurrence for clustering
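
"Simple keyword co-occurrence" can be read in several ways. The sketch below takes one reading: keywords that appear in the same slug are linked, and URLs whose keywords are connected (directly or transitively) land in the same group. The stopword and section-word lists and all function names are hypothetical, and naming a cluster (for example "Cloud Cost Management") is left out because it needs human or model judgment.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical skip lists; a real run would need larger ones.
STOPWORDS = {"how", "to", "the", "a", "an", "and", "for", "of", "your", "in", "by", "with"}
SECTION_WORDS = {"blog", "posts", "articles", "resources"}


def slug_keywords(url):
    """Lowercase keyword tokens taken from the URL path segments."""
    tokens = re.split(r"[^a-z0-9]+", urlparse(url).path.lower())
    return [t for t in tokens if t and t not in STOPWORDS and t not in SECTION_WORDS]


def cluster_by_cooccurrence(urls):
    """Group URLs whose slug keywords are linked by co-occurrence (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    keywords = {u: slug_keywords(u) for u in urls}
    for words in keywords.values():
        for w in words[1:]:
            parent[find(w)] = find(words[0])  # link keywords sharing a slug

    clusters = defaultdict(list)
    for u, words in keywords.items():
        clusters[find(words[0]) if words else None].append(u)
    return list(clusters.values())
```

Transitive linking tends to merge too much on large sites, so a production version would likely weight links by frequency or drop very common keywords first.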

Phase 3: Analyze Publishing Patterns

From the dated content (primarily blog posts):

Total content pieces by type
Publishing frequency: Posts per month over last 12 months
Trend: Increasing, decreasing, or stable output
Recency: Date of most recent publish
Author diversity: Unique authors (if extractable from RSS)
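
The frequency and trend metrics could be computed along these lines. This is a sketch under assumptions: dates are ISO strings (YYYY-MM-DD), the trend compares the newer half of the window with the older half, and the 10% tolerance is an arbitrary value chosen for the example, not one taken from the skill.

```python
from collections import Counter
from datetime import date


def monthly_counts(dates, today, months=12):
    """Posts per calendar month for the `months` months ending at `today`, oldest first."""
    counts = Counter(d[:7] for d in dates)  # "YYYY-MM" prefix of each ISO date
    keys = []
    year, month = today.year, today.month
    for _ in range(months):
        keys.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return [counts.get(k, 0) for k in reversed(keys)]


def trend(series, tolerance=0.1):
    """Label a monthly series as increasing, decreasing, or stable."""
    if len(series) < 2:
        return "stable"
    half = len(series) // 2
    older = sum(series[:half]) / half
    newer = sum(series[half:]) / (len(series) - half)
    if newer > older * (1 + tolerance):
        return "increasing"
    if newer < older * (1 - tolerance):
        return "decreasing"
    return "stable"
```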

Phase 4: Deep Analysis (Optional)

If --deep-analyze N is specified, fetch the top N pages (prioritizing blog posts) and extract:

Word count (approximate)
Target keyword (inferred from title + H1 + URL)
Funnel stage: TOFU (awareness), MOFU (consideration), BOFU (decision)
Content depth: Shallow (<500 words), Medium (500-1500), Deep (1500+)
Has images/video: Boolean
Has CTA: Boolean (detected by common CTA patterns)
Internal links count
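
A few of these fields can be approximated from raw HTML with the standard library alone, as sketched below. The names are hypothetical, the CTA regex is a small made-up sample of "common CTA patterns", and because the depth ranges above overlap at 1500 words, the sketch assumes exactly 1500 counts as deep.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.has_media = False
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("img", "video"):
            self.has_media = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def content_depth(word_count):
    if word_count < 500:
        return "shallow"
    if word_count < 1500:  # assumption: exactly 1500 words counts as deep
        return "medium"
    return "deep"


# Hypothetical CTA phrases; a real list would be longer.
CTA_PATTERN = re.compile(
    r"\b(sign up|get started|book a demo|start (your )?free trial|contact sales)\b",
    re.IGNORECASE,
)


def analyze(html):
    """Return approximate word count, depth bucket, media flag, and CTA flag."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    words = len(text.split())
    return {
        "word_count": words,
        "content_depth": content_depth(words),
        "has_images": parser.has_media,
        "has_cta": bool(CTA_PATTERN.search(text)),
    }
```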

Phase 5: Output

JSON Output (default)

{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
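
To show how the summary block relates to the pages array, here is a minimal sketch that derives by_type, by_topic, and most_recent from a list of page records shaped like the example above. The function name `build_summary` is hypothetical, and the cadence average and trend are omitted.

```python
import json
from collections import Counter


def build_summary(pages):
    """Aggregate page records into a partial 'summary' object."""
    by_type = Counter(p["type"] for p in pages)
    by_topic = Counter(p["topic_cluster"] for p in pages if p.get("topic_cluster"))
    # ISO dates sort correctly as plain strings.
    dated = sorted(p["date"] for p in pages if p.get("date"))
    return {
        "by_type": dict(by_type.most_common()),
        "by_topic": dict(by_topic.most_common()),
        "publishing_cadence": {"most_recent": dated[-1] if dated else None},
    }


pages = [
    {"type": "blog-post", "topic_cluster": "Cloud Cost Optimization", "date": "2025-11-15"},
    {"type": "blog-post", "topic_cluster": "FinOps", "date": "2026-02-20"},
    {"type": "pricing", "topic_cluster": None, "date": None},
]
summary_json = json.dumps(build_summary(pages), indent=2)
```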

Markdown Summary (also generated)

# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347

## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...

## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...

## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20

## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
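
The "Content by Type" table in this summary is a straightforward rendering of the by_type counts. A small sketch, assuming the percentage is each count divided by total_pages; the function name `type_table` is hypothetical, and it prints the raw type keys rather than the display labels ("Blog Posts") used in the sample.

```python
def type_table(by_type, total_pages):
    """Render by_type counts as a markdown table, largest count first."""
    lines = ["| Type | Count | % |", "|------|-------|---|"]
    for name, count in sorted(by_type.items(), key=lambda kv: -kv[1]):
        # The .1% format multiplies by 100 and keeps one decimal place.
        lines.append(f"| {name} | {count} | {count / total_pages:.1%} |")
    return "\n".join(lines)


table = type_table({"landing-page": 23, "blog-post": 89}, total_pages=347)
```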

Tips

Sitemap.xml is the best source. Most well-maintained sites have one; a missing sitemap is itself a negative SEO signal.
RSS only shows recent content. If you need the full catalog, the sitemap is essential; treat RSS as supplementary.
Deep analysis is optional but valuable. Use it when feeding into brand-voice-extractor or when you need funnel stage mapping.
JS-rendered sites may need the Apify fallback. Signs: sitemap.xml returns HTML, or blog page returns mostly JavaScript.
Combine with seo-domain-analyzer to overlay traffic data on the content inventory — see which content actually performs.
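
The two signs of a JS-rendered site mentioned above can be checked with simple heuristics. The sketch below is illustrative only: the function names are hypothetical, and the 0.5 threshold for "mostly JavaScript" is an arbitrary assumption.

```python
import re

SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def looks_like_html(body):
    """True when a supposed sitemap.xml response is actually an HTML page."""
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def script_ratio(html):
    """Fraction of the document's characters that sit inside <script> elements."""
    if not html:
        return 0.0
    script_chars = sum(len(m.group(0)) for m in SCRIPT_RE.finditer(html))
    return script_chars / len(html)


def needs_js_fallback(sitemap_body, blog_html, threshold=0.5):
    """Suggest the Apify fallback when either sign is present."""
    return looks_like_html(sitemap_body) or script_ratio(blog_html) > threshold
```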

Dependencies

Python 3.8+
requests library (pip install requests)
APIFY_API_TOKEN env var (only for Apify fallback mode)

Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.

497 stars

development

Updated Apr 24, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add athina-ai/goose-skills site-content-catalog

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 24, 2026, 9:06 AM (9.9s, 3 files scanned)

SKILL.md

name: site-content-catalog
description: >
tags: [content, seo]

Install manually

# Clone the repo
git clone https://github.com/athina-ai/goose-skills.git

# Copy into Claude Code skills folder (global)
cp -r goose-skills/skills/capabilities/site-content-catalog ~/.claude/skills/

Repository: athina-ai/goose-skills (497 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT