distributions/claude/skills/document-audit-extraction/SKILL.md
Audit document collections to extract features, identify patterns, assess quality, and build structured inventories. Covers metadata extraction, content classification, gap analysis, and coverage mapping. Triggers on document audit, content inventory, or document feature extraction requests.
npx skillsauth add a-organvm/a-i--skills document-audit-extractionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Systematically inventory, evaluate, and extract structured data from document collections.
Phase 1: Inventory → What exists?
Phase 2: Classify → What type is each document?
Phase 3: Evaluate → What's the quality?
Phase 4: Extract → What structured data can we pull?
from pathlib import Path
from dataclasses import dataclass
@dataclass
class DocumentEntry:
path: str
name: str
extension: str
size_bytes: int
modified: str
word_count: int
has_frontmatter: bool
def inventory_documents(root: str, patterns: list[str] = ["*.md", "*.txt", "*.yaml"]) -> list[DocumentEntry]:
entries = []
for pattern in patterns:
for path in Path(root).rglob(pattern):
content = path.read_text(errors="ignore")
entries.append(DocumentEntry(
path=str(path.relative_to(root)),
name=path.stem,
extension=path.suffix,
size_bytes=path.stat().st_size,
modified=path.stat().st_mtime,
word_count=len(content.split()),
has_frontmatter=content.startswith("---"),
))
return entries
## Document Inventory
| Path | Type | Words | Frontmatter | Modified |
|------|------|-------|-------------|----------|
| skills/dev/testing/SKILL.md | skill | 1,245 | Yes | 2026-03-20 |
| docs/CHANGELOG.md | changelog | 890 | No | 2026-03-19 |
| README.md | readme | 450 | No | 2026-03-18 |
**Total:** 142 documents | **With frontmatter:** 105 | **Total words:** 185,000
| Type | Signal | Example |
|------|--------|---------|
| Skill | YAML frontmatter with name:, in skills/ | SKILL.md |
| Configuration | YAML/JSON schema | seed.yaml, registry.json |
| Guide | Tutorial structure, step-by-step | getting-started.md |
| Reference | API docs, schema docs | api-spec.md |
| Decision | ADR format, options + decision | adr-001.md |
| Changelog | Date-ordered entries | CHANGELOG.md |
| Policy | Rules, constraints | CONTRIBUTING.md |
def classify_document(path: str, content: str) -> str:
if "skills/" in path and content.startswith("---"):
return "skill"
if path.endswith("seed.yaml") or path.endswith("registry.json"):
return "configuration"
if "## Step" in content or "### Step" in content:
return "guide"
if "## Decision" in content or "## Alternatives" in content:
return "decision"
if "## [" in content and any(d in content for d in ["Added", "Fixed", "Changed"]):
return "changelog"
return "general"
| Criterion | Weight | Score (0-3) | Notes | |-----------|--------|-------------|-------| | Completeness | 30% | | All required sections present? | | Accuracy | 25% | | Information correct and current? | | Clarity | 20% | | Understandable without prior context? | | Structure | 15% | | Logical organization, headings, formatting? | | Maintenance | 10% | | Updated date, versioned, no stale links? |
def quality_check(path: str, content: str) -> dict:
checks = {
"has_title": content.startswith("#") or content.startswith("---"),
"has_sections": content.count("\n##") >= 2,
"reasonable_length": 100 < len(content.split()) < 10000,
"no_todo_left": "TODO" not in content and "FIXME" not in content,
"no_broken_links": not re.search(r'\[.*?\]\(\s*\)', content),
"has_code_examples": "```" in content,
"frontmatter_complete": check_frontmatter_fields(content),
}
score = sum(checks.values()) / len(checks)
return {"checks": checks, "score": round(score, 2)}
import yaml
import re
def extract_frontmatter(content: str) -> dict | None:
match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL)
if match:
return yaml.safe_load(match.group(1))
return None
def extract_features(content: str) -> dict:
return {
"headings": re.findall(r'^#+\s+(.+)$', content, re.MULTILINE),
"code_blocks": len(re.findall(r'```', content)) // 2,
"links": re.findall(r'\[([^\]]+)\]\(([^)]+)\)', content),
"images": re.findall(r'!\[([^\]]*)\]\(([^)]+)\)', content),
"tables": content.count("\n|"),
"todos": re.findall(r'- \[ \]\s+(.+)', content),
}
def build_reference_graph(documents: list[dict]) -> dict:
graph = {}
for doc in documents:
links = extract_features(doc["content"])["links"]
graph[doc["path"]] = {
"outgoing": [link[1] for link in links if not link[1].startswith("http")],
"incoming": [],
}
# Build incoming links
for source, data in graph.items():
for target in data["outgoing"]:
if target in graph:
graph[target]["incoming"].append(source)
return graph
def gap_analysis(inventory: list[dict], expected: dict) -> dict:
existing = {doc["path"] for doc in inventory}
gaps = {
"missing_required": [p for p in expected.get("required", []) if p not in existing],
"missing_recommended": [p for p in expected.get("recommended", []) if p not in existing],
"orphaned": [p for p in existing if p not in expected.get("all_known", existing)],
"empty_files": [doc["path"] for doc in inventory if doc["word_count"] < 10],
}
return gaps
development
Create algorithmic and generative art using mathematical patterns, noise functions, particle systems, and procedural generation. Covers flow fields, L-systems, fractals, and creative coding foundations. Triggers on generative art, algorithmic art, creative coding, procedural generation, or mathematical visualization requests.
development
Audits web applications and architectures for compliance with GDPR, CCPA, and other privacy regulations, focusing on consent, data minimization, and user rights.
development
Optimize Google Cloud Platform resource allocation and manage cloud credits efficiently. Use when planning GCP deployments, analyzing cloud spend, maximizing value from expiring credits, right-sizing instances, or designing cost-effective architectures. Triggers on GCP cost optimization, credit management, resource allocation planning, or cloud budget concerns.
testing
Designs engaging gameplay loops, economies, and progression systems, balancing challenge and reward for interactive experiences.