skills/site-auditor/SKILL.md
Audit websites by cross-referencing query indexes, sitemaps, and navigation to identify content gaps, stale pages, missing metadata, and quality issues. Use when "auditing a website", "finding content gaps", "site quality audit", or "content inventory analysis".
npx skillsauth add paolomoz/skills site-auditor

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
| Category | Trigger | Complexity | Source |
|----------|---------|------------|--------|
| audit | "auditing a website", "finding content gaps", "site quality audit", "content inventory analysis" | High | 8 projects |
Perform a comprehensive site audit by fetching and cross-referencing three independent data sources — the query index JSON, the XML sitemap, and live navigation links. The audit identifies 11 categories of content gaps, metadata issues, and quality problems, producing a structured analysis report that downstream skills (pagespeed-audit, report-hub-generator) can consume.
Every audit begins by fetching three independent data sources from the target origin. All three are required for a complete audit; if one fails, proceed with the remaining sources and note the gap in the summary.
Fetch {origin}/query-index.json. This is the primary content database, typically generated by a CMS or static site generator.
Expected response shape:
{
"total": 847,
"offset": 0,
"limit": 512,
"data": [
{
"path": "/blog/2024/site-performance-tips",
"title": "10 Tips for Better Site Performance",
"description": "Learn how to optimize your website for Core Web Vitals and user experience.",
"image": "/media_1a2b3c.png",
"lastModified": "1709251200"
}
]
}
Key fields: path (required), title, description, image, lastModified (Unix timestamp in seconds, as a string).
Handle pagination: if total > offset + limit, fetch subsequent pages with ?offset={next} until all entries are collected. Concatenate all data[] arrays into a single list.
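The pagination rule above can be sketched as follows. This is a minimal sketch, not a definitive implementation: `fetchAllIndexEntries` and the injected `fetchJson` helper are hypothetical names, and it assumes each response reports its own `total`, `offset`, and `limit` as in the sample shape.

```javascript
// Collect every entry from a paginated query index.
// `fetchJson` is an injected async function (url -> parsed JSON), which keeps
// the sketch testable without network access. Both names are hypothetical.
async function fetchAllIndexEntries(origin, fetchJson) {
  const entries = []
  let offset = 0
  while (true) {
    const page = await fetchJson(`${origin}/query-index.json?offset=${offset}`)
    const data = Array.isArray(page.data) ? page.data : []
    entries.push(...data)
    const next = page.offset + page.limit
    // Stop on the last page, or on an empty page to avoid looping forever
    if (data.length === 0 || !(page.total > next)) break
    offset = next
  }
  return entries
}
```

With a runtime that provides a global `fetch`, the helper could be passed as `(url) => fetch(url).then((r) => r.json())`.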
Fetch {origin}/sitemap.xml. Parse all <loc> tags to extract URLs.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://example.com/blog/2024/site-performance-tips</loc></url>
<url><loc>https://example.com/about</loc></url>
</urlset>
Handle sitemap indexes: if the response contains <sitemapindex>, recursively fetch each <sitemap><loc> entry and merge all URLs.
Convert absolute URLs to relative paths by stripping the origin. Apply path normalization (see Step 2).
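The sitemap handling above can be sketched as follows. The function names and the injected `fetchText` helper are hypothetical, and the regex-based `<loc>` extraction is a simplification: it does not handle CDATA sections or XML entities such as `&amp;`, so a real XML parser may be preferable.

```javascript
// Extract every <loc> value from a sitemap or sitemap index document.
function extractLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1])
}

// Recursively collect page URLs, following sitemap indexes.
// `fetchText` is an injected async function (url -> response body as text).
async function collectSitemapUrls(url, fetchText, seen = new Set()) {
  if (seen.has(url)) return [] // guard against indexes that reference each other
  seen.add(url)
  const xml = await fetchText(url)
  const locs = extractLocs(xml)
  if (!xml.includes('<sitemapindex')) return locs
  const urls = []
  for (const child of locs) {
    urls.push(...(await collectSitemapUrls(child, fetchText, seen)))
  }
  return urls
}

// Strip the origin so sitemap URLs can be compared with index paths.
function toRelativePath(absoluteUrl) {
  const { pathname, search, hash } = new URL(absoluteUrl)
  return pathname + search + hash
}
```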
Fetch the homepage ({origin}/) and parse all <a href> attributes from navigation elements. Target these selectors in priority order:
1. nav a[href]: semantic navigation elements
2. header a[href]: header links (often includes main nav)
3. footer a[href]: footer links (often includes sitemap-style navigation)
4. .nav a[href], .navigation a[href]: class-based navigation

Filter to internal links only (same origin or relative paths). Remove duplicates and apply path normalization.
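The internal-link filter can be sketched as follows, assuming the href values have already been extracted from the matched elements. `filterInternalLinks` is a hypothetical name; path normalization is left to Step 2.

```javascript
// Keep only same-origin links and return them as de-duplicated relative paths.
function filterInternalLinks(hrefs, origin) {
  const base = new URL(origin)
  const paths = new Set()
  for (const href of hrefs) {
    let url
    try {
      url = new URL(href, base) // resolves relative hrefs against the origin
    } catch {
      continue // skip malformed hrefs
    }
    // Drops external links as well as mailto:, tel:, and javascript: URLs
    if (url.origin !== base.origin) continue
    paths.add(url.pathname + url.search + url.hash)
  }
  return [...paths]
}
```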
Before any comparison, normalize every path from every source using these rules:
- /about/ becomes /about
- /page#section becomes /page
- /page?ref=nav becomes /page
- /About-Us becomes /about-us
- %2Fblog becomes /blog
- about becomes /about

function normalizePath(rawPath) {
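  try {
    // Absolute URLs: keep only the pathname (drops origin, query, and fragment)
    let path = /^https?:\/\//i.test(rawPath)
      ? new URL(rawPath).pathname
      : rawPath
    // Cut the fragment and query before decoding, so an encoded %23 or %3F
    // in the path is not mistaken for a delimiter
    path = decodeURIComponent(path.split('#')[0].split('?')[0])
      .toLowerCase()
      .replace(/\/+$/, '')
    if (!path.startsWith('/')) path = '/' + path
    return path || '/'
  } catch {
    return null // Discard unparseable paths
  }
}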
Discard any path that normalizes to null. Log discarded paths for debugging but do not include them in gap analysis.
Group all collected paths by their top-level prefix to enable section-level analysis. The category key is the path's directory prefix, truncated to the first two segments; the final segment (the page itself) is not part of the key:
/developer/guides/getting-started → /developer/guides
/blog/2024/performance-tips → /blog/2024
/products/analytics → /products
/about → /about
If a path has only one segment, use that segment as the category. Track counts per category per source (index, sitemap, nav) to spot sections that are underrepresented in any source.
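A minimal sketch of the grouping, under one stated assumption: because `/products/analytics` maps to `/products` in the examples above, the final segment is treated as the page itself and only the preceding segments (at most two) form the key. `categoryKey` and `countByCategory` are hypothetical names.

```javascript
// Derive the category key for a normalized path.
function categoryKey(path) {
  const segments = path.split('/').filter(Boolean)
  if (segments.length === 0) return '/' // homepage
  if (segments.length === 1) return '/' + segments[0]
  // Drop the page segment, then keep at most two leading segments
  return '/' + segments.slice(0, -1).slice(0, 2).join('/')
}

// Count paths per category for each source: { index: [...], sitemap: [...], nav: [...] }.
function countByCategory(sources) {
  const counts = {}
  for (const [source, paths] of Object.entries(sources)) {
    for (const path of paths) {
      const key = categoryKey(path)
      if (!counts[key]) counts[key] = { index: 0, sitemap: 0, nav: 0 }
      counts[key][source] += 1
    }
  }
  return counts
}
```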
Compare the three normalized path sets against each other and against metadata quality criteria. For each gap found, record the path, the gap category, and any relevant metadata.
Pages present in query-index.json but missing from sitemap.xml. These pages exist in the CMS but are invisible to search engines.
Severity: High — direct SEO impact.
URLs in sitemap.xml with no corresponding entry in query-index.json. These may be orphaned pages, redirects, or pages excluded from the content index.
Severity: Medium — may indicate stale sitemap entries or content pipeline issues.
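These first two categories are plain set differences over the normalized path lists. A minimal sketch follows; `findCoverageGaps` is a hypothetical name, and the result keys mirror the `gaps` object in analysis.json.

```javascript
// Paths present in `a` but absent from `b`, de-duplicated.
function difference(a, b) {
  const other = new Set(b)
  return [...new Set(a)].filter((path) => !other.has(path))
}

// Both inputs are arrays of normalized paths.
function findCoverageGaps(indexPaths, sitemapPaths) {
  return {
    inIndexNotSitemap: difference(indexPaths, sitemapPaths),
    inSitemapNotIndex: difference(sitemapPaths, indexPaths),
  }
}
```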
Pages in the index where lastModified is older than 12 months from the current date. Convert the Unix timestamp string to a date and compare:
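function isStale(lastModifiedStr) {
  const seconds = Number.parseInt(lastModifiedStr, 10)
  // "0" or a missing value means unknown freshness, not stale
  if (!(seconds > 0)) return false
  const lastModified = new Date(seconds * 1000) // Unix seconds to milliseconds
  const twelveMonthsAgo = new Date()
  twelveMonthsAgo.setMonth(twelveMonthsAgo.getMonth() - 12)
  return lastModified < twelveMonthsAgo
}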
Severity: Low to Medium — depends on content type (blog posts age naturally, product pages should stay current).
Index entries where description is empty, null, undefined, or shorter than 20 characters (too short to be meaningful).
Severity: Medium — affects search snippets and social sharing.
Index entries where image is empty, null, or points to a default/placeholder image. Detect placeholders by checking for common patterns: placeholder, default, no-image, or images smaller than 200x200 if dimensions are available.
Severity: Medium — affects social media card rendering and visual search results.
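The description and image checks can be sketched as follows. The function names are hypothetical, trimming whitespace before measuring length is an assumption, and the 200x200 dimension check is omitted because the index sample carries no image dimensions.

```javascript
// Placeholder patterns listed above, matched case-insensitively.
const PLACEHOLDER_PATTERN = /placeholder|default|no-image/i

// True when the description is absent or shorter than 20 characters.
function hasMissingDescription(entry) {
  const description = typeof entry.description === 'string' ? entry.description.trim() : ''
  return description.length < 20
}

// True when the image is absent or looks like a placeholder.
function hasMissingImage(entry) {
  const image = typeof entry.image === 'string' ? entry.image.trim() : ''
  return image === '' || PLACEHOLDER_PATTERN.test(image)
}
```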
Two or more index entries sharing the exact same title after trimming whitespace. Group duplicates together and flag each group.
Severity: Medium — confuses search engines and users navigating search results.
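Duplicate detection amounts to grouping paths by trimmed title. In this sketch, `findDuplicateTitles` is a hypothetical name, and skipping entries with an empty title is an assumption; the output matches the `duplicateTitles` shape in analysis.json.

```javascript
// Group index entries by exact title (after trimming) and keep groups of two or more.
function findDuplicateTitles(entries) {
  const byTitle = new Map()
  for (const { path, title } of entries) {
    if (typeof title !== 'string') continue
    const key = title.trim()
    if (key === '') continue // assumption: empty titles are not treated as duplicates
    if (!byTitle.has(key)) byTitle.set(key, [])
    byTitle.get(key).push(path)
  }
  return [...byTitle]
    .filter(([, paths]) => paths.length > 1)
    .map(([title, paths]) => ({ title, paths }))
}
```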
Paths found in navigation (Source 3) that do not appear in either the index or the sitemap. These are links visible to users that may lead to 404 pages.
Severity: High — directly impacts user experience.
Index entries where title or description contains outdated brand terms. Build a configurable list of old brand terms to scan for (e.g., old company names, retired product names, deprecated terminology).
Common patterns to check:
Severity: Medium — brand consistency issue.
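A minimal sketch of the brand-term scan, assuming the configurable list is a plain array of strings. `findOldBranding` is a hypothetical name, case-insensitive matching is an assumption, and the result shape follows the `oldBranding` example in analysis.json.

```javascript
// Report the first old brand term found in each entry's title or description.
function findOldBranding(entries, oldTerms) {
  const hits = []
  for (const entry of entries) {
    const text = `${entry.title ?? ''} ${entry.description ?? ''}`.toLowerCase()
    const match = oldTerms.find((term) => text.includes(term.toLowerCase()))
    if (match) hits.push({ path: entry.path, match })
  }
  return hits
}
```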
Related to Category 7: navigation links that appear in the sitemap but not in the index. Unlike Category 7 links, these pages are navigable and crawlable, but they are missing from the content database.
Severity: Medium — content pipeline gap.
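Both navigation categories can be computed in one pass: a nav link missing from the index is either absent from the sitemap too (a broken nav link) or present in the sitemap only. In this sketch, `classifyNavLinks` and the `navNotInIndex` key are hypothetical names; `brokenNavLinks` follows analysis.json.

```javascript
// Split navigation paths that are missing from the index into the two nav categories.
function classifyNavLinks(navPaths, indexPaths, sitemapPaths) {
  const index = new Set(indexPaths)
  const sitemap = new Set(sitemapPaths)
  const brokenNavLinks = []
  const navNotInIndex = []
  for (const path of new Set(navPaths)) {
    if (index.has(path)) continue // known to the content database
    if (sitemap.has(path)) navNotInIndex.push(path)
    else brokenNavLinks.push(path)
  }
  return { brokenNavLinks, navNotInIndex }
}
```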
Index entries whose path contains signals of deprecation: /deprecated/, /archive/, /legacy/, /old/, /v1/, /v2/ (when v3+ exists). Also flag pages with titles containing "deprecated", "archived", "legacy", or "end of life".
Severity: Low — informational, but should be reviewed for removal from active navigation.
Index entries whose path contains /lab/, /labs/, /experimental/, /beta/, /preview/, /sandbox/. These may be internal-only pages accidentally exposed in the index or sitemap.
Severity: Low to Medium — potential unintended public exposure.
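The deprecation and lab signals can be sketched as path-segment checks. Two assumptions are made here: a signal also counts when it is the last segment (normalized paths have no trailing slash), and "when v3+ exists" is evaluated across the whole index rather than per section. All names are hypothetical.

```javascript
const DEPRECATED_SEGMENTS = ['deprecated', 'archive', 'legacy', 'old']
const DEPRECATED_TITLE = /deprecated|archived|legacy|end of life/i
const LAB_SEGMENTS = ['lab', 'labs', 'experimental', 'beta', 'preview', 'sandbox']

const segmentsOf = (path) => path.split('/').filter(Boolean)

// Version numbers that appear as whole segments, e.g. /docs/v2/intro -> [2].
const versionsIn = (path) =>
  segmentsOf(path)
    .map((segment) => /^v(\d+)$/.exec(segment))
    .filter(Boolean)
    .map((match) => Number(match[1]))

function isLabPage(path) {
  return segmentsOf(path).some((segment) => LAB_SEGMENTS.includes(segment))
}

function findDeprecated(entries) {
  const maxVersion = Math.max(0, ...entries.flatMap((entry) => versionsIn(entry.path)))
  return entries
    .filter((entry) => {
      if (segmentsOf(entry.path).some((s) => DEPRECATED_SEGMENTS.includes(s))) return true
      if (DEPRECATED_TITLE.test(entry.title ?? '')) return true
      // /v1/ and /v2/ count only when a v3 or later exists in the index
      return maxVersion >= 3 && versionsIn(entry.path).some((v) => v <= 2)
    })
    .map((entry) => entry.path)
}
```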
Beyond the binary stale/fresh check in Category 3, build a freshness distribution:
{
"freshnessDistribution": {
"last30days": 42,
"last90days": 128,
"last6months": 234,
"last12months": 387,
"older": 460
},
"oldestPage": { "path": "/about", "lastModified": "2019-03-15" },
"newestPage": { "path": "/blog/2024/latest", "lastModified": "2024-12-01" },
"averageAgeDays": 342
}
This helps the user understand the overall content velocity and identify sections that are systematically neglected.
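A sketch of how the distribution could be computed. It assumes the buckets are cumulative apart from `older` (in the sample, `last12months` plus `older` equals the index total of 847), approximates six months as 182 days and twelve months as 365 days, and skips entries with a missing or `"0"` timestamp. `freshnessDistribution` and its `now` parameter are hypothetical.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000

// Build cumulative freshness buckets from index entries.
function freshnessDistribution(entries, now = new Date()) {
  const dist = { last30days: 0, last90days: 0, last6months: 0, last12months: 0, older: 0 }
  let totalAgeDays = 0
  let dated = 0
  for (const entry of entries) {
    const seconds = Number.parseInt(entry.lastModified, 10)
    if (!(seconds > 0)) continue // "0" or missing: unknown freshness, excluded
    const ageDays = (now.getTime() - seconds * 1000) / DAY_MS
    totalAgeDays += ageDays
    dated += 1
    if (ageDays > 365) {
      dist.older += 1
      continue
    }
    // Cumulative buckets: a 10-day-old page counts in all four
    dist.last12months += 1
    if (ageDays <= 182) dist.last6months += 1
    if (ageDays <= 90) dist.last90days += 1
    if (ageDays <= 30) dist.last30days += 1
  }
  return {
    freshnessDistribution: dist,
    averageAgeDays: dated ? Math.round(totalAgeDays / dated) : null,
  }
}
```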
Write all output to data/audit/ relative to the project root.
data/audit/query-index.json holds the raw query index data as fetched (all pages concatenated if paginated). This serves as a cache for downstream skills.

data/audit/analysis.json contains the structured analysis report:
{
"meta": {
"origin": "https://example.com",
"auditDate": "2024-12-15T10:30:00Z",
"sources": {
"queryIndex": { "status": "ok", "count": 847 },
"sitemap": { "status": "ok", "count": 912 },
"navigation": { "status": "ok", "count": 64 }
}
},
"summary": {
"totalUniquePages": 934,
"gapCount": 127,
"criticalGaps": 23,
"categories": {
"/blog": { "index": 340, "sitemap": 355, "nav": 5 },
"/developer": { "index": 210, "sitemap": 198, "nav": 18 }
}
},
"gaps": {
"inIndexNotSitemap": ["/blog/draft-post", "/internal/test-page"],
"inSitemapNotIndex": ["/old-landing-page", "/event/2022-conference"],
"brokenNavLinks": ["/products/retired-product"]
},
"metadata": {
"missingDescriptions": ["/about", "/contact"],
"missingOgImages": ["/blog/2023/quick-update"],
"duplicateTitles": [
{ "title": "Home", "paths": ["/", "/home", "/index"] }
]
},
"quality": {
"stalePages": [{ "path": "/blog/2022/old-post", "lastModified": "2022-01-15" }],
"oldBranding": [{ "path": "/about", "match": "OldCo Inc." }],
"deprecatedPages": ["/legacy/v1-api"],
"labPages": ["/labs/experimental-feature"]
},
"freshness": {
"distribution": { "last30days": 42, "last90days": 128 },
"averageAgeDays": 342
},
"allEntries": [
{
"path": "/blog/2024/site-performance-tips",
"title": "10 Tips for Better Site Performance",
"inIndex": true,
"inSitemap": true,
"inNav": false,
"gaps": [],
"category": "/blog/2024"
}
]
}
The allEntries array is a denormalized view of every unique path across all three sources, with boolean flags for each source and a list of applicable gap categories. This is the primary input for report-hub-generator.
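A minimal sketch of how `allEntries` could be assembled from the three normalized sources. `buildAllEntries` is a hypothetical name; for brevity it records only the three coverage gaps and omits the `category` field and the metadata and quality categories.

```javascript
// Build one record per unique path with source flags and coverage gaps.
function buildAllEntries(indexEntries, sitemapPaths, navPaths) {
  const titles = new Map(indexEntries.map((entry) => [entry.path, entry.title]))
  const sitemap = new Set(sitemapPaths)
  const nav = new Set(navPaths)
  const allPaths = new Set([...titles.keys(), ...sitemap, ...nav])
  return [...allPaths].sort().map((path) => {
    const inIndex = titles.has(path)
    const inSitemap = sitemap.has(path)
    const inNav = nav.has(path)
    const gaps = []
    if (inIndex && !inSitemap) gaps.push('inIndexNotSitemap')
    if (inSitemap && !inIndex) gaps.push('inSitemapNotIndex')
    if (inNav && !inIndex && !inSitemap) gaps.push('brokenNavLinks')
    // Paths that exist only in the sitemap or nav have no title
    return { path, title: titles.get(path) ?? null, inIndex, inSitemap, inNav, gaps }
  })
}
```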
If Real User Monitoring data is available (see pagespeed-audit skill for RUM bundle format), cross-reference audit findings with traffic data:
Add a rumEnriched boolean to the meta object and include traffic-weighted severity scores when RUM data is available.
Summarize findings in a clear, prioritized format:
Always tell the user the exact counts and provide the top 5 examples for each category. Point them to data/audit/analysis.json for the complete dataset.
| Problem | Cause | Fix |
|---------|-------|-----|
| Query index returns 404 | Site does not use a query index | Skip this source; audit with sitemap + nav only. Note reduced coverage in summary. |
| Sitemap returns a sitemap index | Nested sitemaps | Recursively fetch all child sitemaps and merge URLs |
| Navigation returns thousands of links | Mega-menu or footer with every page linked | Filter to unique paths and apply a max of 500 nav links |
| lastModified is "0" or missing | CMS does not track modification dates | Exclude from freshness analysis; flag as "unknown freshness" |
| All titles are the same | Site uses a default title template | Flag as duplicate titles but note the pattern (likely a CMS configuration issue, not a content issue) |
| Audit takes too long | Large site with 10K+ pages | Process in batches of 500; write intermediate results to disk |