Site Auditor

Quick Reference

| Category | Trigger | Complexity | Source |
|----------|---------|------------|--------|
| audit | "auditing a website", "finding content gaps", "site quality audit", "content inventory analysis" | High | 8 projects |

Perform a comprehensive site audit by fetching and cross-referencing three independent data sources — the query index JSON, the XML sitemap, and live navigation links. The audit identifies 11 categories of content gaps, metadata issues, and quality problems, producing a structured analysis report that downstream skills (pagespeed-audit, report-hub-generator) can consume.

When to Use

User wants to understand the overall health of a website's content
User needs to find pages that exist in one system but not another (index vs. sitemap mismatches)
User is looking for stale content that has not been updated in over 12 months
User needs a content inventory with categorization by path prefix
User wants to detect missing metadata (descriptions, OG images) before a launch or redesign
User is migrating or restructuring a site and needs a baseline audit
A downstream skill (pagespeed-audit, report-hub-generator) requests a site audit as input

Instructions

Step 1: Collect Data from Three Sources

Every audit begins by fetching three independent data sources from the target origin. All three are required for a complete audit; if one fails, proceed with the remaining sources and note the gap in the summary.

Source 1: Query Index JSON

Fetch {origin}/query-index.json. This is the primary content database, typically generated by a CMS or static site generator.

Expected response shape:

{
  "total": 847,
  "offset": 0,
  "limit": 512,
  "data": [
    {
      "path": "/blog/2024/site-performance-tips",
      "title": "10 Tips for Better Site Performance",
      "description": "Learn how to optimize your website for Core Web Vitals and user experience.",
      "image": "/media_1a2b3c.png",
      "lastModified": "1709251200"
    }
  ]
}

Key fields: path (required), title, description, image, lastModified (Unix timestamp in seconds, as a string).

Handle pagination: if total > offset + limit, fetch subsequent pages with ?offset={next} until all entries are collected. Concatenate all data[] arrays into a single list.
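The pagination rule above could be sketched as follows. This is a minimal sketch under stated assumptions, not the skill's actual implementation: `fetchPage` is a hypothetical caller-supplied function that takes an offset and resolves to one parsed response, and the stop-on-empty-page guard is an added safety measure that the text does not mention.

```javascript
// Returns the offset of the next page, or null when everything is collected.
function nextOffset(page) {
  const { total, offset, limit, data } = page
  // Assumption: an empty page means there is nothing more to fetch,
  // which prevents an endless loop if `total` is wrong.
  if (!Array.isArray(data) || data.length === 0) return null
  return total > offset + limit ? offset + limit : null
}

// fetchPage is hypothetical: (offset) => Promise<{ total, offset, limit, data }>
async function collectQueryIndex(fetchPage) {
  const entries = []
  let offset = 0
  while (offset !== null) {
    const page = await fetchPage(offset)
    entries.push(...(page.data ?? []))   // concatenate all data[] arrays
    offset = nextOffset(page)
  }
  return entries
}
```

In practice `fetchPage` would request {origin}/query-index.json?offset={offset} and parse the JSON body; keeping it injectable makes the loop testable without a network.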

Source 2: Sitemap XML

Fetch {origin}/sitemap.xml. Parse all <loc> tags to extract URLs.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/2024/site-performance-tips</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>

Handle sitemap indexes: if the response contains <sitemapindex>, recursively fetch each <sitemap><loc> entry and merge all URLs.

Convert absolute URLs to relative paths by stripping the origin. Apply path normalization (see Step 2).
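One way to sketch the sitemap handling is shown below. It assumes simple, well-formed sitemaps and uses a regular expression instead of a real XML parser, so CDATA sections and unusual formatting are not handled. `fetchText` is a hypothetical caller-supplied function that resolves a URL to its response body.

```javascript
// Extracts the text of every <loc> element (regex-based; an XML parser is more robust).
function extractLocs(xml) {
  const locs = []
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi
  let match
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1].replace(/&amp;/g, '&'))   // unescape the most common XML entity
  }
  return locs
}

function isSitemapIndex(xml) {
  return /<sitemapindex[\s>]/i.test(xml)
}

// fetchText is hypothetical: (url) => Promise<string>
async function collectSitemapUrls(url, fetchText, seen = new Set()) {
  if (seen.has(url)) return []   // guard against sitemap indexes that reference each other
  seen.add(url)
  const xml = await fetchText(url)
  const locs = extractLocs(xml)
  if (!isSitemapIndex(xml)) return locs
  const merged = []
  for (const child of locs) {
    merged.push(...(await collectSitemapUrls(child, fetchText, seen)))
  }
  return merged
}
```

The returned URLs are still absolute; they would go through the Step 2 normalization before any comparison.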

Source 3: Navigation Links

Fetch the homepage ({origin}/) and parse all <a href> attributes from navigation elements. Target these selectors in priority order:

nav a[href] — semantic navigation elements
header a[href] — header links (often includes main nav)
footer a[href] — footer links (often includes sitemap-style navigation)
.nav a[href], .navigation a[href] — class-based navigation

Filter to internal links only (same origin or relative paths). Remove duplicates and apply path normalization.
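The internal-link filter might look like the sketch below. It assumes the href strings have already been pulled out of the selectors above (HTML parsing is out of scope here), and the function name `internalPaths` is illustrative only.

```javascript
// hrefs: raw href strings extracted from the navigation selectors.
function internalPaths(hrefs, origin) {
  const base = new URL(origin)
  const paths = new Set()   // a Set removes duplicates
  for (const href of hrefs) {
    let url
    try {
      url = new URL(href, base)   // resolves relative hrefs against the origin
    } catch {
      continue                    // skip malformed hrefs
    }
    // External sites, mailto:, tel:, and javascript: links all have a different origin
    if (url.origin !== base.origin) continue
    paths.add(url.pathname)
  }
  return [...paths]
}
```

The resulting paths still need the Step 2 normalization (lowercasing, trailing slashes, and so on).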

Step 2: Normalize All Paths

Before any comparison, normalize every path from every source using these rules:

Remove trailing slashes: /about/ becomes /about
Remove hash fragments: /page#section becomes /page
Remove query parameters: /page?ref=nav becomes /page
Lowercase the path: /About-Us becomes /about-us
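Decode URL-encoded characters and collapse repeated slashes: /%2Fblog becomes /blog
Ensure leading slash: about becomes /about

function normalizePath(rawPath) {
  try {
    // Absolute URLs: keep only the pathname (drops the origin)
    let path = /^https?:\/\//i.test(rawPath)
      ? new URL(rawPath).pathname
      : rawPath

    // Strip fragment and query before decoding, so an encoded %23 or %3F
    // is not mistaken for a delimiter
    path = path.split('#')[0].split('?')[0]

    path = decodeURIComponent(path)
      .toLowerCase()
      .replace(/\/{2,}/g, '/')   // collapse repeated slashes, e.g. from a decoded %2F
      .replace(/\/+$/, '')       // remove trailing slashes

    if (!path.startsWith('/')) path = '/' + path
    return path
  } catch {
    return null  // Discard unparseable paths (malformed URL or encoding, non-string input)
  }
}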

Discard any path that normalizes to null. Log discarded paths for debugging but do not include them in gap analysis.

Step 3: Categorize Paths by Prefix

Group all collected paths by their top-level prefix to enable section-level analysis. The category key is the path without its final (page) segment, truncated to at most two segments:

/developer/guides/getting-started → /developer/guides
/blog/2024/performance-tips       → /blog/2024
/products/analytics               → /products
/about                            → /about

If a path has only one segment, use that segment as the category. Track counts per category per source (index, sitemap, nav) to spot sections that are underrepresented in any source.
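A possible implementation of the grouping is sketched below. It reads the four examples above as "drop the final page segment, then keep at most two segments", which is an interpretation rather than a confirmed rule. The function names are illustrative.

```javascript
function categoryKey(path) {
  const segments = path.split('/').filter(Boolean)
  // Zero or one segment: the segment itself is the category ('/' for the homepage)
  if (segments.length <= 1) return '/' + (segments[0] ?? '')
  // Drop the final (page) segment, then keep at most two leading segments
  return '/' + segments.slice(0, -1).slice(0, 2).join('/')
}

// pathsBySource is expected to have exactly the keys index, sitemap, and nav.
function countByCategory(pathsBySource) {
  const counts = {}
  for (const [source, paths] of Object.entries(pathsBySource)) {
    for (const path of paths) {
      const key = categoryKey(path)
      if (!counts[key]) counts[key] = { index: 0, sitemap: 0, nav: 0 }
      counts[key][source] += 1
    }
  }
  return counts
}
```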

Step 4: Run Gap Analysis Across All 11 Categories

Compare the three normalized path sets against each other and against metadata quality criteria. For each gap found, record the path, the gap category, and any relevant metadata.

Category 1: In Index but Not Sitemap

Pages present in query-index.json but missing from sitemap.xml. These pages exist in the CMS but are not advertised to search engines, which can then only discover them by following links.

Severity: High — direct SEO impact.

Category 2: In Sitemap but Not Index

URLs in sitemap.xml with no corresponding entry in query-index.json. These may be orphaned pages, redirects, or pages excluded from the content index.

Severity: Medium — may indicate stale sitemap entries or content pipeline issues.
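Categories 1 and 2 are the two directions of a set difference between the index and the sitemap. A minimal sketch, assuming both lists already hold normalized paths (the sample paths are made up for illustration):

```javascript
// Paths in `left` that are absent from `right`.
function difference(left, right) {
  const rightSet = new Set(right)
  return [...new Set(left)].filter((path) => !rightSet.has(path))
}

// Illustrative data only
const indexPaths = ['/about', '/blog/draft-post', '/contact']
const sitemapPaths = ['/about', '/contact', '/old-landing-page']

const inIndexNotSitemap = difference(indexPaths, sitemapPaths)   // Category 1
const inSitemapNotIndex = difference(sitemapPaths, indexPaths)   // Category 2
```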

Category 3: Stale Content

Pages in the index where lastModified is older than 12 months from the current date. Convert the Unix timestamp string to a date and compare:

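function isStale(lastModifiedStr) {
  const seconds = parseInt(lastModifiedStr, 10)
  // "0", missing, or unparseable timestamps mean unknown freshness, not stale
  if (!Number.isFinite(seconds) || seconds <= 0) return false
  const lastModified = new Date(seconds * 1000)
  const twelveMonthsAgo = new Date()
  twelveMonthsAgo.setMonth(twelveMonthsAgo.getMonth() - 12)
  return lastModified < twelveMonthsAgo
}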

Severity: Low to Medium — depends on content type (blog posts age naturally, product pages should stay current).

Category 4: Missing Descriptions

Index entries where description is empty, null, undefined, or shorter than 20 characters (too short to be meaningful).

Severity: Medium — affects search snippets and social sharing.
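The description check could be written as below. Trimming whitespace before measuring length is an assumption; the text only specifies the 20-character threshold.

```javascript
const MIN_DESCRIPTION_LENGTH = 20

function isMissingDescription(entry) {
  const description = entry.description
  if (typeof description !== 'string') return true   // null, undefined, or wrong type
  // Assumption: surrounding whitespace does not count toward the length
  return description.trim().length < MIN_DESCRIPTION_LENGTH
}
```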

Category 5: Missing OG Images

Index entries where image is empty, null, or points to a default/placeholder image. Detect placeholders by checking the image path for common patterns (placeholder, default, no-image), or by flagging images smaller than 200x200 when dimensions are available.

Severity: Medium — affects social media card rendering and visual search results.
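A sketch of the image check follows. The `width` and `height` fields are hypothetical: the index shape shown in Step 1 does not include dimensions, so that branch only applies if the caller has obtained them some other way. Treating an image as too small when either side is under 200 pixels is also an assumption.

```javascript
// Pattern list taken from the text; extend it for site-specific default images.
const PLACEHOLDER_PATTERN = /placeholder|default|no-image/i

function isMissingOgImage(entry) {
  const image = typeof entry.image === 'string' ? entry.image.trim() : ''
  if (image === '') return true
  if (PLACEHOLDER_PATTERN.test(image)) return true
  // Hypothetical optional fields, not part of the query index shape
  const { width, height } = entry
  if (Number.isFinite(width) && Number.isFinite(height)) {
    return width < 200 || height < 200
  }
  return false
}
```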

Category 6: Duplicate Titles

Two or more index entries sharing the exact same title after trimming whitespace. Group duplicates together and flag each group.

Severity: Medium — confuses search engines and users navigating search results.
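Grouping by trimmed title can be sketched as below, producing the same `{ title, paths }` shape used for duplicateTitles in analysis.json. Skipping entries with no title is an assumption, made so that untitled pages are not all reported as duplicates of each other.

```javascript
function findDuplicateTitles(entries) {
  const pathsByTitle = new Map()
  for (const { path, title } of entries) {
    const key = String(title ?? '').trim()
    if (key === '') continue   // assumption: untitled pages are handled elsewhere
    if (!pathsByTitle.has(key)) pathsByTitle.set(key, [])
    pathsByTitle.get(key).push(path)
  }
  return [...pathsByTitle]
    .filter(([, paths]) => paths.length > 1)
    .map(([title, paths]) => ({ title, paths }))
}
```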

Category 7: Broken Navigation Links

Paths found in navigation (Source 3) that do not appear in either the index or the sitemap. These are links visible to users that may lead to 404 pages.

Severity: High — directly impacts user experience.

Category 8: Old Branding References

Index entries where title or description contains outdated brand terms. Build a configurable list of old brand terms to scan for (e.g., old company names, retired product names, deprecated terminology).

Common patterns to check:

Previous company names after a rebrand
Retired product line names
Old taglines or slogans
Deprecated feature names

Severity: Medium — brand consistency issue.
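A simple scan against a configurable term list might look like this. Case-insensitive substring matching and reporting only the first matching term per page are assumptions; the text does not specify either.

```javascript
// oldTerms: the configurable list of outdated brand terms.
function findOldBranding(entries, oldTerms) {
  const terms = oldTerms.filter(Boolean)   // ignore empty strings, which would match everything
  const hits = []
  for (const entry of entries) {
    const text = `${entry.title ?? ''} ${entry.description ?? ''}`.toLowerCase()
    const match = terms.find((term) => text.includes(term.toLowerCase()))
    if (match !== undefined) hits.push({ path: entry.path, match })
  }
  return hits
}
```

The returned objects follow the `{ path, match }` shape of the oldBranding list in analysis.json.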

Category 9: Pages in Navigation but Not Index

Related to Category 7 but distinct from it: links in the nav that appear in the sitemap but NOT in the index. (Category 7 covers nav links found in neither source.) These pages are navigable and crawlable but missing from the content database.

Severity: Medium — content pipeline gap.
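The two navigation categories can be computed in one pass, as in the sketch below. Under the definitions as written, a nav path lands in at most one of the two lists. `navNotInIndex` is a hypothetical key name; the sample analysis.json only shows brokenNavLinks.

```javascript
function classifyNavPaths(navPaths, indexPaths, sitemapPaths) {
  const inIndex = new Set(indexPaths)
  const inSitemap = new Set(sitemapPaths)
  const brokenNavLinks = []   // Category 7: in neither the index nor the sitemap
  const navNotInIndex = []    // Category 9: in the sitemap, missing from the index
  for (const path of new Set(navPaths)) {
    if (inIndex.has(path)) continue
    if (inSitemap.has(path)) navNotInIndex.push(path)
    else brokenNavLinks.push(path)
  }
  return { brokenNavLinks, navNotInIndex }
}
```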

Category 10: Deprecated Pages

Index entries whose path contains signals of deprecation: /deprecated/, /archive/, /legacy/, /old/, /v1/, /v2/ (when v3+ exists). Also flag pages with titles containing "deprecated", "archived", "legacy", or "end of life".

Severity: Low — informational, but should be reviewed for removal from active navigation.
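The deprecation signals could be checked as sketched below. Matching whole path segments (so a trailing /legacy also counts) and detecting "v3+ exists" by scanning every index path for a /vN/ segment are assumptions about how the rule is meant to be applied.

```javascript
const DEPRECATED_SEGMENTS = new Set(['deprecated', 'archive', 'legacy', 'old'])
const DEPRECATED_TITLE = /deprecated|archived|legacy|end of life/i

// Highest vN path segment seen anywhere in the given paths, or 0 if none.
function highestVersion(paths) {
  let highest = 0
  for (const path of paths) {
    for (const segment of path.split('/')) {
      const match = /^v(\d+)$/.exec(segment)
      if (match) highest = Math.max(highest, Number(match[1]))
    }
  }
  return highest
}

// maxVersion: result of highestVersion() over all index paths.
function isDeprecated(entry, maxVersion) {
  const segments = entry.path.split('/').filter(Boolean)
  if (segments.some((s) => DEPRECATED_SEGMENTS.has(s))) return true
  // v1 and v2 only count as deprecated when a v3 or later exists
  if (maxVersion >= 3 && segments.some((s) => s === 'v1' || s === 'v2')) return true
  return DEPRECATED_TITLE.test(entry.title ?? '')
}
```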

Category 11: Lab/Experimental Pages

Index entries whose path contains /lab/, /labs/, /experimental/, /beta/, /preview/, /sandbox/. These may be internal-only pages accidentally exposed in the index or sitemap.

Severity: Low to Medium — potential unintended public exposure.

Step 5: Perform Freshness Analysis

Beyond the binary stale/fresh check in Category 3, build a freshness distribution:

{
  "freshnessDistribution": {
    "last30days": 42,
    "last90days": 128,
    "last6months": 234,
    "last12months": 387,
    "older": 460
  },
  "oldestPage": { "path": "/about", "lastModified": "2019-03-15" },
  "newestPage": { "path": "/blog/2024/latest", "lastModified": "2024-12-01" },
  "averageAgeDays": 342
}

This helps the user understand the overall content velocity and identify sections that are systematically neglected.
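The distribution might be computed as in the sketch below. The sample figures suggest the "last N" buckets are cumulative (387 + 460 equals the sample total of 847), so the sketch counts a recent page in every bucket it qualifies for. Approximating 6 months as 182 days and 12 months as 365 days is an assumption, as is skipping entries with a missing or "0" timestamp.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000
// Assumption: months approximated in days
const BUCKETS = [
  ['last30days', 30],
  ['last90days', 90],
  ['last6months', 182],
  ['last12months', 365],
]

function freshnessDistribution(entries, now = Date.now()) {
  const distribution = { last30days: 0, last90days: 0, last6months: 0, last12months: 0, older: 0 }
  let totalAgeDays = 0
  let dated = 0
  for (const entry of entries) {
    const seconds = parseInt(entry.lastModified, 10)
    if (!Number.isFinite(seconds) || seconds <= 0) continue   // unknown freshness
    const ageDays = (now - seconds * 1000) / DAY_MS
    totalAgeDays += ageDays
    dated += 1
    if (ageDays > 365) distribution.older += 1
    // Cumulative: a page modified 10 days ago counts in every "last N" bucket
    for (const [name, days] of BUCKETS) {
      if (ageDays <= days) distribution[name] += 1
    }
  }
  const averageAgeDays = dated === 0 ? null : Math.round(totalAgeDays / dated)
  return { freshnessDistribution: distribution, averageAgeDays }
}
```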

Step 6: Generate the Output

Write all output to data/audit/ relative to the project root.

File 1: `data/audit/query-index.json`

The raw query index data as fetched (all pages concatenated if paginated). This serves as a cache for downstream skills.

File 2: `data/audit/analysis.json`

The structured analysis report:

{
  "meta": {
    "origin": "https://example.com",
    "auditDate": "2024-12-15T10:30:00Z",
    "sources": {
      "queryIndex": { "status": "ok", "count": 847 },
      "sitemap": { "status": "ok", "count": 912 },
      "navigation": { "status": "ok", "count": 64 }
    }
  },
  "summary": {
    "totalUniquePages": 934,
    "gapCount": 127,
    "criticalGaps": 23,
    "categories": {
      "/blog": { "index": 340, "sitemap": 355, "nav": 5 },
      "/developer": { "index": 210, "sitemap": 198, "nav": 18 }
    }
  },
  "gaps": {
    "inIndexNotSitemap": ["/blog/draft-post", "/internal/test-page"],
    "inSitemapNotIndex": ["/old-landing-page", "/event/2022-conference"],
    "brokenNavLinks": ["/products/retired-product"]
  },
  "metadata": {
    "missingDescriptions": ["/about", "/contact"],
    "missingOgImages": ["/blog/2023/quick-update"],
    "duplicateTitles": [
      { "title": "Home", "paths": ["/", "/home", "/index"] }
    ]
  },
  "quality": {
    "stalePages": [{ "path": "/blog/2022/old-post", "lastModified": "2022-01-15" }],
    "oldBranding": [{ "path": "/about", "match": "OldCo Inc." }],
    "deprecatedPages": ["/legacy/v1-api"],
    "labPages": ["/labs/experimental-feature"]
  },
  "freshness": {
    "distribution": { "last30days": 42, "last90days": 128 },
    "averageAgeDays": 342
  },
  "allEntries": [
    {
      "path": "/blog/2024/site-performance-tips",
      "title": "10 Tips for Better Site Performance",
      "inIndex": true,
      "inSitemap": true,
      "inNav": false,
      "gaps": [],
      "category": "/blog/2024"
    }
  ]
}

The allEntries array is a denormalized view of every unique path across all three sources, with boolean flags for each source and a list of applicable gap categories. This is the primary input for report-hub-generator.
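Building the denormalized view could look like the following sketch. It only derives the source-membership gaps; metadata and quality gaps would be appended to `gaps` in the same way. `categoryOf` is a hypothetical caller-supplied function that maps a path to its Step 3 category key, and all inputs are assumed to be normalized already.

```javascript
function buildAllEntries(indexEntries, sitemapPaths, navPaths, categoryOf) {
  const titles = new Map(indexEntries.map((entry) => [entry.path, entry.title ?? null]))
  const inSitemap = new Set(sitemapPaths)
  const inNav = new Set(navPaths)
  // Union of every path seen in any of the three sources
  const allPaths = new Set([...titles.keys(), ...inSitemap, ...inNav])

  return [...allPaths].sort().map((path) => {
    const flags = {
      inIndex: titles.has(path),
      inSitemap: inSitemap.has(path),
      inNav: inNav.has(path),
    }
    const gaps = []
    if (flags.inIndex && !flags.inSitemap) gaps.push('inIndexNotSitemap')
    if (flags.inSitemap && !flags.inIndex) gaps.push('inSitemapNotIndex')
    if (flags.inNav && !flags.inIndex && !flags.inSitemap) gaps.push('brokenNavLinks')
    return { path, title: titles.get(path) ?? null, ...flags, gaps, category: categoryOf(path) }
  })
}
```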

Step 7: RUM Integration (Optional)

If Real User Monitoring data is available (see pagespeed-audit skill for RUM bundle format), cross-reference audit findings with traffic data:

Stale pages WITH traffic: High priority to refresh — users are actively visiting outdated content
Missing OG images WITH social referral traffic: Critical — broken social cards are actively hurting click-through rates
Pages in index but NOT sitemap WITH organic search traffic: Surprising. These pages are getting search traffic despite being absent from the sitemap (crawlers are probably reaching them through internal links)
Broken nav links WITH clicks: Critical — users are actively encountering dead links

Add a rumEnriched boolean to the meta object and include traffic-weighted severity scores when RUM data is available.

Step 8: Present Results to the User

Summarize findings in a clear, prioritized format:

Critical issues (broken nav links, high-traffic stale pages): Fix immediately
SEO issues (index/sitemap mismatches, missing descriptions): Fix within a sprint
Quality issues (old branding, duplicate titles): Fix during content review cycles
Informational (deprecated pages, lab pages): Review and decide

Always tell the user the exact counts and provide the top 5 examples for each category. Point them to data/audit/analysis.json for the complete dataset.
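The counts-plus-examples summary can be derived mechanically from the gap lists. In this sketch, "top 5" simply means the first five items in whatever order the caller supplies; ranking them (for instance by traffic) is left to the caller and is not specified by the text.

```javascript
// gapLists: an object mapping a gap category name to its list of findings.
function summarizeGaps(gapLists, maxExamples = 5) {
  const summary = {}
  for (const [category, items] of Object.entries(gapLists)) {
    summary[category] = { count: items.length, examples: items.slice(0, maxExamples) }
  }
  return summary
}
```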

Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| Query index returns 404 | Site does not use a query index | Skip this source; audit with sitemap + nav only. Note reduced coverage in summary. |
| Sitemap returns a sitemap index | Nested sitemaps | Recursively fetch all child sitemaps and merge URLs |
| Navigation returns thousands of links | Mega-menu or footer with every page linked | Filter to unique paths and apply a max of 500 nav links |
| lastModified is "0" or missing | CMS does not track modification dates | Exclude from freshness analysis; flag as "unknown freshness" |
| All titles are the same | Site uses a default title template | Flag as duplicate titles but note the pattern (likely a CMS configuration issue, not a content issue) |
| Audit takes too long | Large site with 10K+ pages | Process in batches of 500; write intermediate results to disk |

Cross-References

pagespeed-audit: Provides RUM data for traffic-weighted severity scoring and Core Web Vitals per page
accessibility-auditor: Complements content audit with accessibility compliance checks
report-hub-generator: Consumes analysis.json to produce formatted audit reports (PDF, HTML, dashboard)
brand-extractor: Provides old branding terms list for Category 8 detection
