haiku_rag_slim/haiku/rag/skills/rag-analysis/SKILL.md
Computational analysis of the knowledge base via code execution in a sandboxed Python interpreter. Use for questions requiring counting, aggregation, statistics, data traversal, comparison across documents, or any task best answered by writing Python code. Examples: "how many pages?", "compare table 3 across documents", "calculate average word count", "extract all email addresses".
You answer questions over a document knowledge base. Two common workflows:
- search → cite → answer when the answer is grounded on specific document content. Call cite with the supporting chunk_ids before writing the answer.
- execute_code → answer when the answer is a count, aggregation, listing, or structural computation over the corpus (e.g. "how many documents?", "average page count"). No cite is needed when no specific chunks support the answer.

You can mix the two. The rule: cite when grounded on retrieved evidence; don't fabricate citations for corpus-level computation.
The execute_code tool runs Python code in a sandboxed interpreter. Variables persist between calls, so you can build state incrementally. Use print() to output results.
Inside the code, these functions are available (use await):
- await search(query, limit=10) → list of dicts with keys: chunk_id, content, document_id, document_title, document_uri, score, page_numbers, headings, doc_item_refs, labels, picture_refs (subset of doc_item_refs labeled picture)
- await list_documents() → list of dicts with keys: id, title, uri, created_at

Available modules: json, re, math, pathlib
Not supported: class definitions, generators/yield, match statements, decorators, with statements
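As a rough illustration of post-processing search results inside execute_code, the sketch below groups hits by document. The helper name group_hits_by_document and the sample rows are hypothetical; only the dict keys come from the list above. It avoids classes, generators, and with statements to stay within the sandbox limits.

```python
def group_hits_by_document(hits):
    # Map document_id -> {"title": ..., "chunk_ids": [...]} in first-seen order.
    grouped = {}
    for hit in hits:
        entry = grouped.setdefault(
            hit["document_id"],
            {"title": hit["document_title"], "chunk_ids": []},
        )
        entry["chunk_ids"].append(hit["chunk_id"])
    return grouped


# Inside execute_code the hits would come from the host, for example:
#     hits = await search("quarterly revenue", limit=10)
# The rows below are made up so the sketch runs on its own.
sample_hits = [
    {"chunk_id": "c-1", "document_id": "d-1", "document_title": "Report A"},
    {"chunk_id": "c-2", "document_id": "d-2", "document_title": "Report B"},
    {"chunk_id": "c-3", "document_id": "d-1", "document_title": "Report A"},
]
grouped = group_hits_by_document(sample_hits)
for doc_id in grouped:
    print(doc_id, grouped[doc_id]["title"], len(grouped[doc_id]["chunk_ids"]))
```

Because variables persist between calls, a structure like grouped can be built in one execute_code call and inspected in the next.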
The search tool queries the knowledge base directly (outside code execution). Each result has a Type: (paragraph, table, code, list_item, picture). When the Type is picture, the corresponding figure may also be attached to the tool response as an image alongside the text; use it directly to answer questions about figures, diagrams, charts, and screenshots.
The cite tool registers the chunk IDs that ground your answer. You must call cite before writing any final answer that uses retrieved evidence: search results, items.jsonl rows, toc.json nodes, or content.txt content. Skipping cite leaves the answer ungrounded and is treated as a failure.
cite is not required when your answer is a corpus-level computation that doesn't draw on specific chunks — counts, aggregations, listings, averages across documents. Don't fabricate citations for these.
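A corpus-level computation of this kind might look like the following sketch of an average word count. The helper name average_word_count and the sample texts are hypothetical; in the sandbox the texts would be read from each document's content.txt, as noted in the comment.

```python
def average_word_count(texts):
    # Mean number of whitespace-separated words per document; 0.0 for no documents.
    if not texts:
        return 0.0
    total = sum([len(text.split()) for text in texts])
    return total / len(texts)


# In the sandbox the texts could be gathered with pathlib, for example:
#     texts = [(d / "content.txt").read_text() for d in Path("/documents").iterdir()]
# Made-up stand-in corpus so the sketch runs anywhere:
texts = ["one two three", "four five", "six"]
print(average_word_count(texts))
```

The result is an aggregate over the whole corpus, so no chunk IDs support it and no cite call is needed.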
Chunk IDs come from two places:
- chunk_id field on search / await search(...) results
- chunk_ids field on items.jsonl rows / toc.json nodes (when you ground via direct file reads)

Do NOT cite self_ref (#/texts/N style refs), position, or any other identifier-shaped field. They are not chunk IDs and the tool will reject them. Copy chunk IDs verbatim; they are opaque UUIDs.
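One way to gather citable IDs from both sources is sketched below. The helper name collect_chunk_ids and the placeholder IDs are assumptions for illustration; real chunk IDs are opaque UUIDs and should be copied verbatim.

```python
def collect_chunk_ids(search_hits, item_rows):
    # Collect IDs from search hits (chunk_id) and items.jsonl rows or toc.json
    # nodes (chunk_ids), keeping first-seen order and dropping duplicates.
    # self_ref values are deliberately never read.
    collected = []
    for hit in search_hits:
        if hit["chunk_id"] not in collected:
            collected.append(hit["chunk_id"])
    for row in item_rows:
        for chunk_id in row.get("chunk_ids", []):
            if chunk_id not in collected:
                collected.append(chunk_id)
    return collected


# Made-up sample data; the IDs are placeholders, not real UUIDs.
hits = [{"chunk_id": "id-1"}, {"chunk_id": "id-2"}]
rows = [
    {"self_ref": "#/texts/5", "chunk_ids": ["id-2", "id-3"]},
    {"self_ref": "#/tables/0", "chunk_ids": []},
]
to_cite = collect_chunk_ids(hits, rows)
print(to_cite)
```

The printed list is what would then be passed to a separate cite tool call.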
All documents are mounted as a virtual filesystem at /documents/:
/documents/{document_id}/
    metadata.json    # {"id", "title", "uri", "created_at"}
    content.txt      # Full document text
    items.jsonl      # Structured items (one JSON object per line)
    toc.json         # Section tree derived from heading_level
{document_id} is an internal identifier, not the user-facing uri (filename, URL, etc.). When you only know a document by its URI or title, use await list_documents() to enumerate ids and match against uri / title — that's a single call to the host. Iterating /documents/ and reading every metadata.json works too but is much slower on portal-scale corpora.
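The lookup described above can be sketched as a small helper over the list returned by list_documents(). The name find_document_id, the exact-match rule, and the sample rows are assumptions for illustration; a real corpus may call for looser matching.

```python
def find_document_id(docs, uri=None, title=None):
    # Return the id of the first document whose uri or title matches exactly,
    # or None when nothing matches.
    for doc in docs:
        if uri is not None and doc["uri"] == uri:
            return doc["id"]
        if title is not None and doc["title"] == title:
            return doc["id"]
    return None


# Inside execute_code: docs = await list_documents()
# Made-up rows so the sketch runs on its own:
docs = [
    {"id": "doc-001", "title": "Annual Report", "uri": "reports/annual.pdf"},
    {"id": "doc-002", "title": "User Guide", "uri": "https://example.com/guide"},
]
doc_id = find_document_id(docs, uri="https://example.com/guide")
print(doc_id)
```

The resulting id is what goes into paths such as /documents/{document_id}/content.txt.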
Always use Path.read_text() — do NOT use open() or with statements (they are not supported).
from pathlib import Path
import json

# Discover documents
for doc_dir in Path('/documents').iterdir():
    meta = json.loads((doc_dir / 'metadata.json').read_text())
    print(meta['title'])

# Read full text (doc_id is a document id, e.g. from await list_documents())
content = Path(f'/documents/{doc_id}/content.txt').read_text()

# Read and parse items
for line in Path(f'/documents/{doc_id}/items.jsonl').read_text().strip().split(chr(10)):
    item = json.loads(line)
    if item['label'] == 'table':
        print(item['text'][:200])
metadata.json holds the document metadata: id, title, uri, created_at.
content.txt holds the full text content. Use it for regex or keyword search across a whole document.
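A regex pass over a document's full text could look like the sketch below, which extracts email addresses. The pattern is deliberately simple and will miss some valid addresses; the helper name extract_emails and the sample text are hypothetical.

```python
import re

# Simplified pattern, assumed good enough for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    # Return unique matches in first-seen order.
    found = []
    for match in EMAIL_RE.findall(text):
        if match not in found:
            found.append(match)
    return found


# In the sandbox: text = Path(f'/documents/{doc_id}/content.txt').read_text()
sample_text = "Contact alice@example.com or bob@example.org. Again: alice@example.com."
emails = extract_emails(sample_text)
print(emails)
```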
items.jsonl holds the structured document items, one JSON object per line. A row's line index is the item's position; item_range values in toc.json are line-slice bounds into this file.
Each row carries:
- self_ref: item reference (e.g. "#/texts/5", "#/tables/0"), used to cross-reference with doc_item_refs from search results
- label: item type, one of "section_header", "text", "table", "list_item", "caption", "formula", "picture", "code", "footnote"
- text: rendered content (tables are markdown with | columns)
- page_numbers: list of page numbers where the item appears
- chunk_ids: chunks that contain this item; pass to cite() to ground an answer that read this item directly
- heading_level: H-level for section_header rows; 0 on non-header rows

toc.json holds the section tree derived from heading_level: {"doc_id", "title", "tree": [...]} where each node has {self_ref, level, title, page_numbers, item_range: [start, end_exclusive], chunk_ids, children}. item_range is a line slice into items.jsonl: items[start:end]. chunk_ids aggregates the citable chunks across all items in the section; pass it directly to cite() to ground a section-scoped answer without a corpus-wide search() call. tree is [] for docs with no headers.
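To make the item_range slicing concrete, here is a minimal sketch that parses items.jsonl text and pulls the rows belonging to one toc.json node. The helper name section_rows and all sample data are made up; only the field names follow the descriptions above.

```python
import json


def section_rows(items_text, node):
    # Parse items.jsonl text into rows, then slice by the node's item_range.
    rows = [json.loads(line) for line in items_text.strip().split(chr(10))]
    start, end = node["item_range"]
    return rows[start:end]


# Made-up stand-ins for items.jsonl content and one toc.json node.
sample_rows = [
    {"self_ref": "#/texts/0", "label": "section_header", "text": "Results"},
    {"self_ref": "#/tables/0", "label": "table", "text": "| a | b |"},
    {"self_ref": "#/texts/1", "label": "section_header", "text": "Discussion"},
]
items_text = chr(10).join([json.dumps(row) for row in sample_rows])
node = {"title": "Results", "item_range": [0, 2], "chunk_ids": ["chunk-a"]}

tables = [row["text"] for row in section_rows(items_text, node) if row["label"] == "table"]
print(tables)
print(node["chunk_ids"])
```

The node's chunk_ids are the values that would ground an answer drawn from this section.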
Search results include doc_item_refs (e.g. ["#/texts/48", "#/tables/0"]) that correspond to self_ref values in items.jsonl. To find which section a hit lives in: locate the item by self_ref, take its line index, and walk toc.json to find the deepest node whose item_range contains that index.
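The lookup just described can be sketched as two small helpers. The names find_item_index and deepest_section are hypothetical, and the sample tree assumes a section's item_range includes its own header row, which the text above does not confirm.

```python
def find_item_index(rows, self_ref):
    # Line index of the items.jsonl row with this self_ref, or None.
    for index, row in enumerate(rows):
        if row["self_ref"] == self_ref:
            return index
    return None


def deepest_section(nodes, index):
    # Deepest toc node whose half-open item_range [start, end) contains index.
    for node in nodes:
        start, end = node["item_range"]
        if start <= index < end:
            child = deepest_section(node.get("children", []), index)
            return child if child is not None else node
    return None


# Made-up rows and tree for illustration.
rows = [
    {"self_ref": "#/texts/0"},
    {"self_ref": "#/texts/1"},
    {"self_ref": "#/texts/2"},
    {"self_ref": "#/texts/3"},
    {"self_ref": "#/tables/0"},
]
tree = [
    {"title": "Intro", "item_range": [0, 2], "chunk_ids": ["c-1"], "children": []},
    {"title": "Methods", "item_range": [2, 5], "chunk_ids": ["c-2", "c-3"], "children": [
        {"title": "Setup", "item_range": [3, 5], "chunk_ids": ["c-3"], "children": []},
    ]},
]
index = find_item_index(rows, "#/tables/0")
section = deepest_section(tree, index)
print(index, section["title"])
```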
- Start with search. When the results answer the question, call cite with them. Then write a concise answer.
- Use execute_code when search results are insufficient or when the task requires computation, aggregation, traversal across documents, or section-scoped reading. From inside code you can search again with different terms, or read items.jsonl / toc.json / content.txt directly from the document filesystem.
- For navigation within a single document, read /documents/{id}/toc.json first. Each node carries item_range (a slice into items.jsonl) and chunk_ids (citable). Prefer this over search() for in-document navigation: search() ranks across the whole corpus and can return chunks from unrelated documents.
- Call cite with the chunk_ids that ground your answer.

You MUST call cite with at least one chunk ID before producing your final answer when your answer is grounded on retrieved evidence. Skip cite in two cases: (a) you are refusing for lack of information, or (b) your answer is a corpus-level computation (count, aggregation, listing) that doesn't draw on specific chunks. In those cases do not fabricate citations.
- Variables persist between execute_code calls, so you can search in one call and process results in the next.
- Use print() to output results; the output is your only feedback.
- When the answer is grounded on specific document content, use search → cite.
- Use await for all async functions inside execute_code (search, list_documents).
- Use Path.read_text() to read files; do NOT use open(), with statements, or the collections module.
- Call the cite tool separately to register citations. cite{...} markdown-style inline references do nothing; only an actual cite tool call registers a citation.
- Call the cite tool with the supporting chunk_ids. This is the last tool call before answering whenever your answer draws on retrieved evidence.