.claude/skills/codebase-exploration/SKILL.md
# Codebase Exploration

## Overview
7-phase progressive exploration protocol for analyzing unfamiliar codebases while keeping token usage under 34K. Based on research into SWE-bench top performers, LocAgent, and OpenHands. The key insight is **search-first, read-selectively, write-findings-immediately**.

**Core principle:** Never read a file to discover what's in it. Use search to locate, then read to confirm.
## When to Invoke

```javascript
Skill({ skill: 'codebase-exploration' });
```
Invoke when:
Token budget: 34K total. Hard stop at 60K — invoke context-compressor immediately.
Input validated against schemas/input.schema.json before execution.
Output contract defined in schemas/output.schema.json.
Pre-execution hook: hooks/pre-execute.cjs
Post-execution hook: hooks/post-execute.cjs
Estimate token budget BEFORE diving in. This prevents context overflow.
Actions:
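```bash
# Count files, excluding noise directories.
# -prune skips whole directories by name, so files such as "builder.ts" or
# directories such as ".github" are not filtered out by a loose substring match.
find . -type d \( -name node_modules -o -name .git -o -name __pycache__ -o -name dist -o -name build -o -name .venv \) -prune -o -type f -print | wc -l

# Estimate token budget of relevant subtree
# ~4 chars per token, so: file_count * avg_file_size_bytes / 4 = token_estimate
# Or use: pnpm search:tokens . (if analyzing our own repo)
```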
Decision gate:
- More than 500 files OR more than 100K tokens: recommend multi-agent decomposition via planner
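The scope check above can also be sketched in Node. This is a minimal illustration, not part of the skill itself: the function names `scanTree`, `estimateTokens`, and `needsDecomposition` are hypothetical, and the 4-bytes-per-token ratio is the rough heuristic quoted in the comments above.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Directories treated as noise, mirroring the find command above.
const NOISE_DIRS = new Set(['node_modules', '.git', '__pycache__', 'dist', 'build', '.venv']);

// Walk the tree iteratively and total up file count and size in bytes.
function scanTree(root) {
  let fileCount = 0;
  let totalBytes = 0;
  const stack = [root];
  while (stack.length > 0) {
    const dir = stack.pop();
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        if (!NOISE_DIRS.has(entry.name)) stack.push(full);
      } else if (entry.isFile()) {
        fileCount += 1;
        totalBytes += fs.statSync(full).size;
      }
    }
  }
  return { fileCount, totalBytes };
}

// Rough heuristic from the text: about 4 characters (bytes) per token.
function estimateTokens(totalBytes) {
  return Math.ceil(totalBytes / 4);
}

// Decision gate: more than 500 files OR more than 100K tokens.
function needsDecomposition(fileCount, tokenEstimate) {
  return fileCount > 500 || tokenEstimate > 100_000;
}
```

A caller would run `scanTree('.')`, pass `totalBytes` to `estimateTokens`, and feed both numbers to `needsDecomposition` before reading any file content.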
Multi-agent decomposition protocol (when >100K tokens):
`.claude/context/tmp/exploration-<timestamp>/chunk-1-findings.md`, `chunk-2-findings.md`, etc.

Write scope assessment to report file immediately. Do not hold in context.
Goal: Build a mental map without reading any file content.
```bash
# Directory tree, depth 3, exclude noise (prune by directory name, not substring)
find . -maxdepth 3 -type d \( -name node_modules -o -name .git -o -name __pycache__ -o -name dist -o -name build -o -name .venv \) -prune -o -type d -print | sort

# File count per top-level directory (quote "$d" so names with spaces work)
for d in */; do echo "$d: $(find "$d" -type f 2>/dev/null | wc -l)"; done

# Identify language stack from file extensions (files without an extension are skipped)
find . -type d \( -name node_modules -o -name .git \) -prune -o -type f -print | sed -n 's/.*\.\([^./]*\)$/\1/p' | sort | uniq -c | sort -rn | head -15
```
Output: Write directory tree + language stack to report file. Note top-level directories and likely purpose.
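If the file list is already in hand (for example from a Glob call), the language-stack step can be reproduced without a shell pipeline. The sketch below is an assumption-laden illustration: `languageStack` is a hypothetical helper name, and it counts extensions the same way the `sed | sort | uniq -c` pipeline does, skipping files that have no extension.

```javascript
const path = require('node:path');

// Count file extensions and return [extension, count] pairs, most common first.
function languageStack(filePaths, top = 15) {
  const counts = new Map();
  for (const p of filePaths) {
    const ext = path.extname(p).toLowerCase();
    if (ext === '') continue; // skip files without an extension, e.g. Makefile
    counts.set(ext, (counts.get(ext) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // ties broken alphabetically
    .slice(0, top);
}
```

The `top` default of 15 mirrors the `head -15` in the shell version.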
Goal: Extract function/class signatures without reading bodies.
```bash
# Read README (first 100 lines only; use offset/limit)
# Read({ file_path: "README.md", limit: 100 })

# Read package manifest
# Read({ file_path: "package.json" }) or pyproject.toml, Cargo.toml, go.mod

# Extract function/class signatures via ripgrep
# (-n prints the matching lines with line numbers; -l would list file names only)
rg "^export |^class |^function |^def |^func |^type |^interface |^const " \
  --type-add 'src:*.{ts,js,py,go,rs,java}' -t src \
  --no-heading -n | head -30

# Map import/dependency graph (lists files containing imports, a starting point
# for finding the most-imported modules)
rg "^import |^from |require\(" -l | head -30
```
Write repo map to report file immediately — do not hold raw output in context.
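For text that is already loaded, the two repo-map steps can be sketched as pure functions. Both names (`extractSignatures`, `mostImported`) are hypothetical. The signature regex copies the keywords from the ripgrep pattern above; the import regex is an assumption that only covers quoted JavaScript/TypeScript specifiers (`from '...'` and `require('...')`), not Python or Go imports.

```javascript
// Same line-start keywords as the ripgrep signature pattern.
const SIGNATURE_RE = /^(export |class |function |def |func |type |interface |const )/;

// Return { line, text } for every line that looks like a top-level declaration.
function extractSignatures(source) {
  const out = [];
  source.split(/\r?\n/).forEach((text, index) => {
    if (SIGNATURE_RE.test(text)) {
      out.push({ line: index + 1, text: text.trimEnd() });
    }
  });
  return out;
}

// Assumption: quoted JS/TS module specifiers only.
const IMPORT_RE = /(?:from\s+|require\(\s*)['"]([^'"]+)['"]/g;

// Count how many files import each module; returns [module, count], most imported first.
function mostImported(sourcesByFile) {
  const counts = new Map();
  for (const source of Object.values(sourcesByFile)) {
    const seen = new Set(); // count each target once per file
    for (const match of source.matchAll(IMPORT_RE)) seen.add(match[1]);
    for (const target of seen) counts.set(target, (counts.get(target) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
}
```

Only the compact output (line numbers, signature text, import counts) needs to go into the report, which keeps raw file content out of context.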
Include in report:
Goal: Use search tools to find specific patterns WITHOUT reading full files.
```bash
# Find entry points from manifest
# (read manifest → extract main/bin/scripts → note filenames)

# Find API routes / endpoints
rg "app\.(get|post|put|delete|use)\(" -l   # Express
rg "@app\.(route|get|post)" -l             # Flask/FastAPI
rg "router\.(get|post|put)" -l             # Other routers

# Find test patterns
rg "describe\(|it\(|test\(|def test_" -l | head -20

# Find configuration patterns
rg "process\.env\.|os\.environ\." -l | head -20

# Find database/storage patterns
rg "db\.(query|find|select)|mongoose\.|prisma\." -l | head -20
```
Write search findings to report. Move to Phase 4 only for files identified by search.
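The searches above effectively tag each file with the roles it plays. A small sketch of that tagging, using the same regular expressions, is shown here; the category names and the `classify` helper are hypothetical, and the patterns inherit the looseness of the originals (for example, `it\(` also matches `submit(`).

```javascript
// Regexes adapted from the ripgrep searches; category names are illustrative.
const CATEGORY_PATTERNS = {
  routes: /app\.(get|post|put|delete|use)\(|@app\.(route|get|post)|router\.(get|post|put)/,
  tests: /describe\(|it\(|test\(|def test_/,
  config: /process\.env\.|os\.environ\./,
  storage: /db\.(query|find|select)|mongoose\.|prisma\./,
};

// Return the names of every category whose pattern appears in the content.
function classify(content) {
  return Object.keys(CATEGORY_PATTERNS).filter((name) => CATEGORY_PATTERNS[name].test(content));
}
```

Files that come back with no category are reasonable candidates to skip in the targeted-read phase.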
Rules:
- Read with offset/limit; never read entire large files

Pattern:
```bash
# Step 1: Find the exact line number of what you need
rg -n "function processOrder" src/core/engine.ts

# Step 2: Read only that section
# Read({ file_path: "src/core/engine.ts", offset: 142, limit: 50 })

# Step 3: Write finding to report file immediately
# Write 2-3 sentence summary before reading next file
```
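The locate-then-window pattern can be expressed as two small helpers. This is a sketch under stated assumptions: `findLine` and `readWindow` are hypothetical names, `offset` is treated as a 1-based starting line, the `before` padding of 5 lines is an arbitrary choice, and the 200-line cap is taken from the read limit mentioned elsewhere in this document.

```javascript
// Find the 1-based line number of the first line containing `needle`, or -1.
function findLine(source, needle) {
  const lines = source.split(/\r?\n/);
  const index = lines.findIndex((line) => line.includes(needle));
  return index === -1 ? -1 : index + 1;
}

// Build the { offset, limit } pair for a windowed read that starts a few
// lines before the match and never exceeds maxLimit lines.
function readWindow(matchLine, { before = 5, limit = 50, maxLimit = 200 } = {}) {
  const offset = Math.max(1, matchLine - before);
  return { offset, limit: Math.min(limit, maxLimit) };
}
```

For a match at line 142, `readWindow(142)` gives a 50-line window beginning at line 137, so the signature and a little leading context are both captured.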
If context exceeds 60K tokens after this phase:
```javascript
Skill({ skill: 'context-compressor' });
```
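A budget check between phases might look like the sketch below. The two thresholds come from this document (34K total budget, 60K hard stop); the `budgetStatus` name and its three return values are hypothetical, and treating exactly 60K as already at the hard stop is one interpretation of invoking the compressor at the boundary.

```javascript
const TARGET_BUDGET = 34_000; // total token budget named in the overview
const HARD_STOP = 60_000; // threshold at which context-compressor is invoked

// Classify current token usage against the two thresholds.
function budgetStatus(tokensUsed) {
  if (tokensUsed >= HARD_STOP) return 'compress';
  if (tokensUsed > TARGET_BUDGET) return 'over-budget';
  return 'ok';
}
```

A `'compress'` result is the signal to call the `context-compressor` skill before any further reads.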
Goal: Understand component connections without reading more files.
```bash
# Find all callers of a key function
rg "processOrder\(" -l

# Find all implementations of a pattern
rg "implements OrderProcessor" -l

# Trace data flow
rg "db\.(query|find|select|get)" -l

# Find circular dependencies (who imports whom)
# Note: ripgrep alternation is a bare |; an escaped \| matches a literal pipe.
rg "require\('\./auth'\)|from '\./auth'" -l
```
If LSP tools are available (TypeScript/JavaScript projects):
```javascript
// lsp_goToDefinition — find where a symbol is defined
// lsp_findReferences — find all usages
// lsp_incomingCalls — who calls this function?
// lsp_outgoingCalls — what does this function call?
// Note: LSP requires a running language server; fall back to ripgrep if empty results
```
Write dependency/call flow diagram (text-based) to report.
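Once the who-imports-whom searches have been collected into an adjacency map, circular dependencies can be detected mechanically. The sketch below uses a standard depth-first search with an in-progress marker; `findCycle` and the graph shape (`{ module: [modules it imports] }`) are assumptions for illustration, and the recursion is not hardened against very deep graphs.

```javascript
// graph: { moduleName: [names of modules it imports] }
// Returns one cycle as a path such as ['a', 'b', 'a'], or null if none exists.
function findCycle(graph) {
  const state = new Map(); // 1 = on the current DFS path, 2 = fully explored
  const stack = [];

  function visit(node) {
    state.set(node, 1);
    stack.push(node);
    for (const dep of graph[node] || []) {
      if (state.get(dep) === 1) {
        // dep is already on the current path: close the loop and report it
        return stack.slice(stack.indexOf(dep)).concat(dep);
      }
      if (!state.has(dep)) {
        const found = visit(dep);
        if (found) return found;
      }
    }
    stack.pop();
    state.set(node, 2);
    return null;
  }

  for (const node of Object.keys(graph)) {
    if (!state.has(node)) {
      const found = visit(node);
      if (found) return found;
    }
  }
  return null;
}
```

The returned path can be pasted directly into the text-based dependency diagram in the report.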
Goal: Free context by writing all findings to report file.
```bash
# Write comprehensive report to .claude/context/tmp/ or .claude/context/reports/
# Include: project type, architecture, key features (file:line refs), dependencies, test coverage
```
Report structure:
```markdown
# Codebase Analysis: [Project Name]

## Project Profile

- Type: [library/API/CLI/SaaS/etc]
- Language: [primary language + version]
- Framework: [frameworks detected]
- Entry points: [file:line refs]

## Architecture

[Text diagram of key components and connections]

## Key Findings

1. [Finding with file:line reference]
2. [Finding with file:line reference]

## Dependencies

- Production: [key deps]
- Dev: [key deps]

## Test Coverage Assessment

- Test files: [count, location]
- Framework: [jest/pytest/etc]

## Unknowns / Follow-up Questions

- [What couldn't be determined without deeper analysis]
```
Return to caller: file path + 5-bullet summary (max 500 chars). Do NOT inline the full report.
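The return contract (a file path plus a 5-bullet summary of at most 500 characters) is easy to enforce in code. In this sketch `buildSummary` is a hypothetical helper, and it assumes the 500-character limit applies to the bullet text only, not to the report path.

```javascript
// Validate and assemble the value handed back to the caller.
function buildSummary(reportPath, bullets) {
  if (bullets.length !== 5) {
    throw new Error(`expected exactly 5 bullets, got ${bullets.length}`);
  }
  const summary = bullets.map((bullet) => `- ${bullet}`).join('\n');
  if (summary.length > 500) {
    throw new Error(`summary is ${summary.length} chars; limit is 500`);
  }
  return { reportPath, summary };
}
```

Failing loudly on an oversized summary keeps the full report from leaking back into the caller's context.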
| Need | Tool | Why |
| ------------------ | -------------------------- | ------------------------------------- |
| File discovery | Glob | Fast pattern matching |
| Content search | Grep (ripgrep) | Regex across whole repo |
| Targeted read | Read with offset/limit | Windowed, token-efficient |
| Structure patterns | ast-grep (if installed) | Language-aware |
| Definitions | LSP (if server available) | Compiler-level accuracy |
| External context | WebFetch + docs URL | Find docs, articles about the project |
| Compression | context-compressor | When >60K tokens |
Tools NOT available for external repos:
- `pnpm search:code`: requires index build (internal repo only)
- `pnpm search:structure`: internal repo only

- Use `offset/limit` for all reads (max 200 lines)
- Use `rg -n` to find exact line numbers first
- `context-compressor`
- Use `offset/limit` on Read for any file over 200 lines; full-file reads of large files consume 5-15K tokens in a single operation.
- Invoke `context-compressor` when context exceeds 60K tokens, at the boundary rather than after.

Before starting:
```bash
node .claude/lib/memory/memory-search.cjs "codebase exploration external repo analysis"
```
Read .claude/context/memory/learnings.md
Read .claude/context/memory/decisions.md
After completing:
- `.claude/context/memory/learnings.md`
- `.claude/context/memory/issues.md`
- `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.