Reflect: Ingest — Global Knowledge Indexer

The single command that makes the knowledge base comprehensive. Sweeps ALL memory sources across all tools and projects, archives originals, generates entity sidecars, and dual-indexes into GraphRAG + QMD.

Philosophy: Every piece of knowledge from every tool should be searchable in one place. If it exists in a memory file somewhere, it should be in the global index.

When to Use

Periodically (weekly recommended) to keep the knowledge base current
Before deleting projects or cleaning up ~/.claude/projects/
After heavy multi-project work across different tools
When reflect-status shows low sidecar coverage or stale sources
After running reflect:consolidate (to index what it produced)

What Gets Indexed

reflect:ingest sweeps ALL of these sources:

| # | Source | Path | Provider | |---|--------|------|----------| | 1 | Claude auto-memories (individual files) | ~/.claude/projects/*/memory/*.md | claude.py | | 2 | Codex memories | ~/.codex/memories/*.md | codex.py | | 3 | Codex global AGENTS.md | ~/.codex/AGENTS.md | codex.py | | 4 | Copilot AGENTS.md | ~/.copilot/AGENTS.md | copilot.py | | 5 | Gemini global config | ~/.gemini/GEMINI.md | gemini.py | | 6 | Gemini project configs | <project>/GEMINI.md | gemini.py | | 7 | Project AGENTS.md files | <repo>/AGENTS.md | codex.py / copilot.py | | 8 | Consolidated project memory | <repo>/.agents/MEMORY.md | Glob scan | | 9 | Unindexed knowledge notes | docs/solutions/**/*.md without sidecars | Glob scan | | 10 | Episode files | ~/.reflect/episodes/ep-*.md | Glob scan | | 11 | Skill-embedded learnings | ~/.claude/skills/*/references/*.md | Glob scan |

Output Destinations

Everything flows to:

~/.learnings/
├── documents/
│   ├── learnings/          Knowledge notes (.md + .entities.yaml)
│   │                       Indexed into GraphRAG + QMD
│   ├── memories/           Archived originals (by project)
│   │   ├── shotclubhouse/
│   │   ├── ai-coder-rules/
│   │   └── ...
│   └── episodes/           Session episode notes
├── nano_graphrag_cache/    GraphRAG index (nodes, edges, communities)
├── cli/learnings           CLI for add/search/reindex
└── .memory-ingest-log.yaml Tracks what's been ingested (prevents reprocessing)

~/.cache/qmd/
├── index.sqlite            QMD search index
└── (collections: learnings, obsidian, blog, writing)

Pipeline

Step 1: Discover ALL sources

# Use multi-tool provider discovery
python3 {{HOME_TOOL_DIR}}/skills/reflect/scripts/memory_discovery.py discover --json

# Also scan non-provider sources
# Episode files:
find ~/.reflect/episodes -name "ep-*.md" 2>/dev/null

# Unindexed knowledge notes (missing sidecars):
find docs/solutions -name "*.md" -not -name "*.entities.yaml" 2>/dev/null | while read f; do
  [ ! -f "${f%.md}.entities.yaml" ] && echo "$f"
done

# .agents/MEMORY.md in current repo:
[ -f .agents/MEMORY.md ] && echo ".agents/MEMORY.md"

Step 2: Read and Classify Each Memory File

For each discovered file:

Read the file fully
Extract frontmatter fields: name, description, type
Classify by memory type → learning route:

| Memory Type | Learning Type | Scope | Confidence | |-------------|---------------|-------|------------| | feedback | correction / pattern | universal (if general) or project | HIGH (user stated) | | user | preference | universal | HIGH | | project | decision / pattern | project | MEDIUM | | reference | reference | project or domain | MEDIUM | | Episode | session summary | project | MEDIUM | | Knowledge note | (already structured) | from frontmatter | from frontmatter | | AGENTS.md section | preference / process | universal | HIGH |

Extract project name from directory path:

-Users-stevengonsalvez-d-git-shotclubhouse → shotclubhouse

Step 3: Dedup Against Existing Knowledge Base

Check each entry against what's already indexed:

LEARNINGS_CLI="$HOME/.learnings/cli/learnings"

# Check QMD first (fastest)
qmd query --collection learnings --json "{key_insight}" 2>/dev/null

# Check GraphRAG
"$LEARNINGS_CLI" search "{key terms}" 2>/dev/null

# Check ingest log (prevents reprocessing)
grep -q "{content_hash}" "$HOME/.learnings/.memory-ingest-log.yaml" 2>/dev/null

Dedup rules:

Same content_hash in ingest log → SKIP (already processed)
High-similarity match in QMD (>85%) → SKIP (duplicate)
Same topic but different detail level → KEEP (more detail supersedes)
Same error + same fix → SKIP
Different angle on same topic → KEEP both, note relationship

Step 4: Convert to Knowledge Learnings

For each non-duplicate memory, generate:

A. Learning note (~/.learnings/documents/learnings/{id}.md):

Use the template at assets/learning_template.md for the note structure. Fields preserved: id, scope, confidence, learning_type, source_episodes, superseded_by (for revisions), provenance (source_tool/path/hash), plus Problem / Solution / Anti-Pattern / Context sections.

---
title: "{descriptive title}"
category: "{auto-detected category}"
type: LEARNING
scope: "{universal|project|domain}"
confidence: {0.0-1.0}
key_insight: "{the one-line takeaway}"
tags: ["{project}", "{tech}", "{topic}"]
provenance:
  source_tool: "{claude|codex|copilot|gemini}"
  source_path: "{original file path}"
  content_hash: "{sha256 prefix}"
  ingested_at: "{ISO timestamp}"
---

## Problem
{context from the memory — what situation triggered this}

## Solution
{the actionable guidance from the memory body}

## Context
{project name, source path, related files}

B. Entity sidecar (~/.learnings/documents/learnings/{id}.entities.yaml):

Required schema — every field below is mandatory except strength (defaults to 5). The learnings add --entities CLI enforces this via entity_store.Entity and will raise KeyError if any required key is missing.

document_id: lrn-{slug}-{hash6}
extracted_at: "{ISO-8601 timestamp}"
entities:
  - name: "{canonical lowercase name}"
    type: technology | error | pattern | function | concept | tool
    description: "{brief one-line description}"      # REQUIRED
relationships:
  - source: "{entity A name}"                        # NOT 'from'
    target: "{entity B name}"                        # NOT 'to'
    type: caused_by | solves | requires | relates_to
    description: "{how they relate}"                 # REQUIRED
    strength: 1-10                                   # optional, default 5

Entity type hints by memory kind:

Tools, frameworks, libraries → technology
Patterns, anti-patterns → pattern
Errors, gotchas → error
Functions, APIs → function
Concepts → concept

Relationship hints by memory kind:

feedback: solves or relates_to
project: requires or relates_to
reference: relates_to to external resources

Validate before writing:

python3 {{HOME_TOOL_DIR}}/skills/reflect/scripts/validate_sidecar.py "$SIDECAR_PATH"

See references/knowledge_format.md for the full entity/relationship type reference.

Step 5: Present Summary for Approval

## Ingest Plan

### Sources Scanned
| Provider | Files Found | New | Duplicate | Stale |
|----------|------------|-----|-----------|-------|
| Claude   | 45         | 12  | 30        | 3     |
| Codex    | 8          | 3   | 5         | 0     |
| Copilot  | 2          | 0   | 2         | 0     |
| Gemini   | 4          | 2   | 2         | 0     |
| Episodes | 9          | 2   | 7         | 0     |
| docs/    | 3          | 3   | 0         | 0     |

### New Learnings to Index
| # | Source | Project | Title | Type | Category |
|---|--------|---------|-------|------|----------|
| 1 | claude | shotclubhouse | "RLS bypass with SECURITY DEFINER" | pattern | database |
| 2 | codex  | ai-coder-rules | "Codex memories dir structure" | reference | tooling |
| 3 | episode | — | "Sprint bug blitz session" | session | debugging |
...

**New learnings**: 22
**Skipped (duplicates)**: 46
**Sidecars to generate**: 3 (for existing unindexed notes)

Proceed? [Y/n/select by number]

NEVER auto-index without user approval.

Step 6: Archive Originals and Index

For each approved entry:

A. Archive original (preserves raw content in git):

LEARNINGS_HOME="${LEARNINGS_HOME:-$HOME/.learnings}"
PROJECT_NAME="$(decode_project "$PROJECT_SLUG")"
ARCHIVE_DIR="$LEARNINGS_HOME/documents/memories/$PROJECT_NAME"
mkdir -p "$ARCHIVE_DIR"

# Copy with archive metadata
{
  echo "<!-- archived: $(date -Iseconds) -->"
  echo "<!-- source: $ORIGINAL_PATH -->"
  echo ""
  cat "$ORIGINAL_PATH"
} > "$ARCHIVE_DIR/$(basename "$ORIGINAL_PATH")"

B. Write learning note + entity sidecar

C. Index into GraphRAG + QMD:

LEARNINGS_CLI="$HOME/.learnings/cli/learnings"
"$LEARNINGS_CLI" add "$NOTE_PATH" --entities "$SIDECAR_PATH"

D. Update QMD:

qmd update --collection learnings 2>/dev/null
qmd embed 2>/dev/null

Step 7: Mark as Ingested

Append to tracking file so already-ingested memories aren't reprocessed:

INGEST_LOG="$HOME/.learnings/.memory-ingest-log.yaml"
# Append entry:
# - file: {original path}
#   content_hash: {sha256}
#   ingested_at: {ISO timestamp}
#   learning_id: {generated id}
#   source_tool: {provider name}

Step 8: Git Commit Archive + Learnings

cd "$LEARNINGS_HOME"
git add documents/memories/ documents/learnings/ .memory-ingest-log.yaml
git commit -m "reflect:ingest — $COUNT learnings from $PROVIDER_COUNT providers

Archived: $ARCHIVED original memory files
Indexed: $INDEXED into GraphRAG + QMD
Sources: $SOURCES"

Step 9: Update Metrics

python3 {{HOME_TOOL_DIR}}/skills/reflect/scripts/metrics_updater.py \
    --knowledge-notes $COUNT --sidecars $COUNT

Step 10: Report

## Ingest Complete

- **Sources scanned**: 4 providers + episodes + docs
- **New learnings indexed**: 22 (with entity sidecars)
- **Duplicates skipped**: 46
- **Sidecars backfilled**: 3
- **Archives written**: 22 → ~/.learnings/documents/memories/
- **GraphRAG**: {node_count} nodes, {edge_count} edges
- **QMD**: {file_count} files, {vector_count} vectors

Difference from reflect:consolidate

| | reflect:consolidate | reflect:ingest | |---|---|---| | Scope | Current project only | ALL projects, ALL tools | | Purpose | Tidy worktree orphans → .agents/MEMORY.md | Build comprehensive global knowledge base | | Output | .agents/MEMORY.md (project, git-tracked) | ~/.learnings/documents/ (global GraphRAG + QMD) | | Cleanup | Deletes orphaned worktree dirs | Marks as ingested (preserves originals as archives) | | When | After deleting worktrees | Periodically, or before project cleanup |

Typical workflow: Run reflect:consolidate first (tidies project), then reflect:ingest (indexes everything globally).

Safety

NEVER auto-index without user approval
NEVER delete original memory files (archive them instead)
Always check ingest log to prevent reprocessing
Show full summary before any writes
Preserve provenance metadata on every learning

Reflect: Ingest — Global Knowledge Indexer

Philosophy: Every piece of knowledge from every tool should be searchable in one place. If it exists in a memory file somewhere, it should be in the global index.

When to Use

Periodically (weekly recommended) to keep the knowledge base current
Before deleting projects or cleaning up ~/.claude/projects/
After heavy multi-project work across different tools
When reflect-status shows low sidecar coverage or stale sources
After running reflect:consolidate (to index what it produced)

What Gets Indexed

reflect:ingest sweeps ALL of these sources:

Output Destinations

Everything flows to:

~/.learnings/
├── documents/
│   ├── learnings/          Knowledge notes (.md + .entities.yaml)
│   │                       Indexed into GraphRAG + QMD
│   ├── memories/           Archived originals (by project)
│   │   ├── shotclubhouse/
│   │   ├── ai-coder-rules/
│   │   └── ...
│   └── episodes/           Session episode notes
├── nano_graphrag_cache/    GraphRAG index (nodes, edges, communities)
├── cli/learnings           CLI for add/search/reindex
└── .memory-ingest-log.yaml Tracks what's been ingested (prevents reprocessing)

~/.cache/qmd/
├── index.sqlite            QMD search index
└── (collections: learnings, obsidian, blog, writing)

Pipeline

Step 1: Discover ALL sources

# Use multi-tool provider discovery
python3 {{HOME_TOOL_DIR}}/skills/reflect/scripts/memory_discovery.py discover --json

# Also scan non-provider sources
# Episode files:
find ~/.reflect/episodes -name "ep-*.md" 2>/dev/null

# Unindexed knowledge notes (missing sidecars):
find docs/solutions -name "*.md" -not -name "*.entities.yaml" 2>/dev/null | while read f; do
  [ ! -f "${f%.md}.entities.yaml" ] && echo "$f"
done

# .agents/MEMORY.md in current repo:
[ -f .agents/MEMORY.md ] && echo ".agents/MEMORY.md"

Step 2: Read and Classify Each Memory File

For each discovered file:

Read the file fully
Extract frontmatter fields: name, description, type
Classify by memory type → learning route:

Extract project name from directory path:

-Users-stevengonsalvez-d-git-shotclubhouse → shotclubhouse

Step 3: Dedup Against Existing Knowledge Base

Check each entry against what's already indexed:

LEARNINGS_CLI="$HOME/.learnings/cli/learnings"

# Check QMD first (fastest)
qmd query --collection learnings --json "{key_insight}" 2>/dev/null

# Check GraphRAG
"$LEARNINGS_CLI" search "{key terms}" 2>/dev/null

# Check ingest log (prevents reprocessing)
grep -q "{content_hash}" "$HOME/.learnings/.memory-ingest-log.yaml" 2>/dev/null

Dedup rules:

Same content_hash in ingest log → SKIP (already processed)
High-similarity match in QMD (>85%) → SKIP (duplicate)
Same topic but different detail level → KEEP (more detail supersedes)
Same error + same fix → SKIP
Different angle on same topic → KEEP both, note relationship

Step 4: Convert to Knowledge Learnings

For each non-duplicate memory, generate:

A. Learning note (~/.learnings/documents/learnings/{id}.md):

---
title: "{descriptive title}"
category: "{auto-detected category}"
type: LEARNING
scope: "{universal|project|domain}"
confidence: {0.0-1.0}
key_insight: "{the one-line takeaway}"
tags: ["{project}", "{tech}", "{topic}"]
provenance:
  source_tool: "{claude|codex|copilot|gemini}"
  source_path: "{original file path}"
  content_hash: "{sha256 prefix}"
  ingested_at: "{ISO timestamp}"
---

## Problem
{context from the memory — what situation triggered this}

## Solution
{the actionable guidance from the memory body}

## Context
{project name, source path, related files}

B. Entity sidecar (~/.learnings/documents/learnings/{id}.entities.yaml):

document_id: lrn-{slug}-{hash6}
extracted_at: "{ISO-8601 timestamp}"
entities:
  - name: "{canonical lowercase name}"
    type: technology | error | pattern | function | concept | tool
    description: "{brief one-line description}"      # REQUIRED
relationships:
  - source: "{entity A name}"                        # NOT 'from'
    target: "{entity B name}"                        # NOT 'to'
    type: caused_by | solves | requires | relates_to
    description: "{how they relate}"                 # REQUIRED
    strength: 1-10                                   # optional, default 5

Entity type hints by memory kind:

Tools, frameworks, libraries → technology
Patterns, anti-patterns → pattern
Errors, gotchas → error
Functions, APIs → function
Concepts → concept

Relationship hints by memory kind:

feedback: solves or relates_to
project: requires or relates_to
reference: relates_to to external resources

Validate before writing:

python3 {{HOME_TOOL_DIR}}/skills/reflect/scripts/validate_sidecar.py "$SIDECAR_PATH"

See references/knowledge_format.md for the full entity/relationship type reference.

Step 5: Present Summary for Approval

## Ingest Plan

### Sources Scanned
| Provider | Files Found | New | Duplicate | Stale |
|----------|------------|-----|-----------|-------|
| Claude   | 45         | 12  | 30        | 3     |
| Codex    | 8          | 3   | 5         | 0     |
| Copilot  | 2          | 0   | 2         | 0     |
| Gemini   | 4          | 2   | 2         | 0     |
| Episodes | 9          | 2   | 7         | 0     |
| docs/    | 3          | 3   | 0         | 0     |

### New Learnings to Index
| # | Source | Project | Title | Type | Category |
|---|--------|---------|-------|------|----------|
| 1 | claude | shotclubhouse | "RLS bypass with SECURITY DEFINER" | pattern | database |
| 2 | codex  | ai-coder-rules | "Codex memories dir structure" | reference | tooling |
| 3 | episode | — | "Sprint bug blitz session" | session | debugging |
...

**New learnings**: 22
**Skipped (duplicates)**: 46
**Sidecars to generate**: 3 (for existing unindexed notes)

Proceed? [Y/n/select by number]

NEVER auto-index without user approval.

Step 6: Archive Originals and Index

For each approved entry:

A. Archive original (preserves raw content in git):

LEARNINGS_HOME="${LEARNINGS_HOME:-$HOME/.learnings}"
PROJECT_NAME="$(decode_project "$PROJECT_SLUG")"
ARCHIVE_DIR="$LEARNINGS_HOME/documents/memories/$PROJECT_NAME"
mkdir -p "$ARCHIVE_DIR"

# Copy with archive metadata
{
  echo "<!-- archived: $(date -Iseconds) -->"
  echo "<!-- source: $ORIGINAL_PATH -->"
  echo ""
  cat "$ORIGINAL_PATH"
} > "$ARCHIVE_DIR/$(basename "$ORIGINAL_PATH")"

B. Write learning note + entity sidecar

C. Index into GraphRAG + QMD:

LEARNINGS_CLI="$HOME/.learnings/cli/learnings"
"$LEARNINGS_CLI" add "$NOTE_PATH" --entities "$SIDECAR_PATH"

D. Update QMD:

qmd update --collection learnings 2>/dev/null
qmd embed 2>/dev/null

Step 7: Mark as Ingested

Append to tracking file so already-ingested memories aren't reprocessed:

INGEST_LOG="$HOME/.learnings/.memory-ingest-log.yaml"
# Append entry:
# - file: {original path}
#   content_hash: {sha256}
#   ingested_at: {ISO timestamp}
#   learning_id: {generated id}
#   source_tool: {provider name}

Step 8: Git Commit Archive + Learnings

cd "$LEARNINGS_HOME"
git add documents/memories/ documents/learnings/ .memory-ingest-log.yaml
git commit -m "reflect:ingest — $COUNT learnings from $PROVIDER_COUNT providers

Archived: $ARCHIVED original memory files
Indexed: $INDEXED into GraphRAG + QMD
Sources: $SOURCES"

Step 9: Update Metrics

python3 {{HOME_TOOL_DIR}}/skills/reflect/scripts/metrics_updater.py \
    --knowledge-notes $COUNT --sidecars $COUNT

Step 10: Report

## Ingest Complete

- **Sources scanned**: 4 providers + episodes + docs
- **New learnings indexed**: 22 (with entity sidecars)
- **Duplicates skipped**: 46
- **Sidecars backfilled**: 3
- **Archives written**: 22 → ~/.learnings/documents/memories/
- **GraphRAG**: {node_count} nodes, {edge_count} edges
- **QMD**: {file_count} files, {vector_count} vectors

Difference from reflect:consolidate

Typical workflow: Run reflect:consolidate first (tidies project), then reflect:ingest (indexes everything globally).

Safety

NEVER auto-index without user approval
NEVER delete original memory files (archive them instead)
Always check ingest log to prevent reprocessing
Show full summary before any writes
Preserve provenance metadata on every learning

Adoption

stevengonsalvez/reflect:ingest

$ install --global

Security Scan Results

SKILL.md

Reflect: Ingest — Global Knowledge Indexer

When to Use

What Gets Indexed

Output Destinations

Pipeline

Step 1: Discover ALL sources

Step 2: Read and Classify Each Memory File

Step 3: Dedup Against Existing Knowledge Base

Step 4: Convert to Knowledge Learnings

Step 5: Present Summary for Approval

Step 6: Archive Originals and Index

Step 7: Mark as Ingested

Step 8: Git Commit Archive + Learnings

Step 9: Update Metrics

Step 10: Report

Difference from reflect:consolidate

Safety

Related Skills

stevengonsalvez/reflect:cost

stevengonsalvez/ainb-fleet:standup

stevengonsalvez/ainb-fleet:sequence

stevengonsalvez/ainb-fleet:needs

stevengonsalvez/reflect:ingest

$ install --global

Security Scan Results

SKILL.md

Reflect: Ingest — Global Knowledge Indexer

When to Use

What Gets Indexed

Output Destinations

Pipeline

Step 1: Discover ALL sources

Step 2: Read and Classify Each Memory File

Step 3: Dedup Against Existing Knowledge Base

Step 4: Convert to Knowledge Learnings

Step 5: Present Summary for Approval

Step 6: Archive Originals and Index

Step 7: Mark as Ingested

Step 8: Git Commit Archive + Learnings

Step 9: Update Metrics

Step 10: Report

Difference from reflect:consolidate

Safety

Related Skills

stevengonsalvez/reflect:cost

stevengonsalvez/ainb-fleet:standup

stevengonsalvez/ainb-fleet:sequence

stevengonsalvez/ainb-fleet:needs