skills/insight-knowledge-harvest/SKILL.md
Build and expand an insight-ready raw-material layer by discovering page-level sources, deduplicating them with an internal pre-crawl link index, capturing raw Markdown, verifying metadata in place, and keeping ingest/register state aligned. Use for additive source harvesting, raw webpage capture, source registry maintenance, source/ingest tracking, source/raw downloads, and in-place verification rather than final synthesis.
Use this skill to build and incrementally expand a curated raw-material layer for future viewpoint and insight extraction.
This is a source-ingestion and verification workflow, not a final insight-writing workflow. Its job is to inspect the current KB as context, promote or discover candidate page-level sources, deduplicate them, capture accepted pages into raw Markdown, and keep ingest plus verification state aligned across the project.
The operating model is intentionally simple:
- source/raw/ stores one raw page capture per canonical source page or document, plus minimal identity, processing-state, and high-level classification metadata.
- source/ingest.md tracks minimal pipeline state for every candidate or accepted material.
- source/registers/ holds curator-facing notes such as rejection reasons, gap lists, deep-read queues, and the classification schema vocabulary.
- source/.harvest/ optionally stores hidden operational deduplication state and detailed classification metadata in SQLite for broader discovery runs.

In default additive mode, existing materials are treated as read-only context for topic promotion, deduplication, and gap detection. The skill normally adds new page-level captures, verifies newly downloaded files in place, and avoids rewriting prior raw materials unless the user explicitly asks for repair or re-verification.
Raw files must remain raw captures, not summaries. Curator judgment belongs in ingest, register notes, or the internal SQLite index, while the raw file keeps only source identity, processing-state metadata, and the high-level classification trio: material_kind, topic_domain, and credibility_tier.
At a glance, this skill is best when you need to:
- … source/ingest.md, source/registers/*.md, rejection logs, gap lists, deep-read queues, candidate-topic promotion notes, and any curator-written notes.
- … source/raw/; preserve the source page language there.
- source/raw/ contains raw page captures, not summaries, analysis notes, quality scores, detailed classification blocks, or site overviews.
- … ---.
- … material_kind, topic_domain, and credibility_tier.
- … source/verified/ record by default.
- … source/.harvest/link-index.sqlite3, source/registers/, or source/ingest.md, not in source/raw/.
- … source/.harvest/ by default; do not copy hidden operational metadata from that database into source/raw/, source/ingest.md, or downstream insight notes unless the user explicitly asks for an audit/export.
- Detailed classification fields beyond material_kind, topic_domain, and credibility_tier belong in the internal SQLite link index by default. This includes evidence_type, ingestion_priority, lifecycle_status, insight_potential, source_bias, and compliance_status.
- … source/registers/; do not rely only on the skill-local reference file.

When the user asks to continue, expand, update, or add to a collection, default to discovering and capturing new page-level source materials.
- … source/ content as read-only context for topic promotion, deduplication, next material ID selection, and gap detection.
- … source/ingest.md or the relevant register; keep moving on new-source capture unless the user redirects to cleanup.
- … source/raw/ as one Markdown file per canonical webpage or document.
- … source/raw/ files.
- … source/ingest.md as the source-ingestion status list.

source/
├── .harvest/ # internal SQLite link index and operational crawl state
├── ingest.md # minimal status list for every candidate/material
├── raw/ # one raw Markdown page capture per canonical URL
└── registers/ # collection-level candidate registers, classification schema, and gap/deep-read queues
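
The layout above can be created idempotently before a first run. The following is a minimal sketch, assuming the project root is passed in explicitly; the helper name `init_source_layout` is hypothetical and not part of the skill, and the empty `ingest.md` it creates is only a placeholder for the ingest list template.

```python
from pathlib import Path

# Subdirectory names are taken from the layout above.
SOURCE_DIRS = (".harvest", "raw", "registers")


def init_source_layout(project_root: Path) -> Path:
    """Create source/ and its subdirectories without touching existing files."""
    source = project_root / "source"
    for name in SOURCE_DIRS:
        # exist_ok keeps the call safe for additive runs on an existing project.
        (source / name).mkdir(parents=True, exist_ok=True)
    ingest = source / "ingest.md"
    if not ingest.exists():
        # Placeholder only; the real file should follow the ingest list template.
        ingest.touch()
    return source
```

Because every step checks for existing paths first, repeated calls leave prior raw captures and ingest rows untouched, which matches the default additive mode.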
Use the source material template for raw files in source/raw/.
Use the ingest list template for source/ingest.md.
Copy classification schema into source/registers/classification-schema.md when initializing the project root if it is missing, incomplete, or outdated.
The optional pre-crawl link index database defaults to source/.harvest/link-index.sqlite3. It is an internal deduplication and processing-state cache, not a reader-facing deliverable.
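
To illustrate what deduplicating by canonical URL can look like, here is a minimal sketch. It does not reproduce the logic of the bundled CLI: the normalization rules (lowercased host, dropped fragment, stripped `utm_` parameters, trimmed trailing slash) and the names `normalize_url`, `url_key`, and `dedupe` are assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking-parameter prefixes; a real index may strip a different set.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Return a comparison-friendly form of a page URL (hypothetical rules)."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped and the remaining query parameters are sorted.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(query)), "")
    )


def url_key(url: str) -> str:
    """Hash the normalized URL so it can serve as a compact dedup key."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def dedupe(urls, seen_keys=None):
    """Keep the first occurrence of each normalized URL not already seen."""
    seen = set(seen_keys or ())
    fresh = []
    for url in urls:
        key = url_key(url)
        if key not in seen:
            seen.add(key)
            fresh.append(normalize_url(url))
    return fresh
```

Passing keys of already ingested materials as `seen_keys` keeps an incremental run from re-adding pages that were downloaded, rejected, or verified earlier.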
When the user gives a broad collection direction, a vague theme, or no explicit candidate topics, do not treat discovery as starting from zero.
- … source/registers/, source/ingest.md, verified items in source/raw/, and any adjacent collection notes already present in the project root.
- … source-priority.md.

Before accepting or downloading new material candidates, use the bundled CLI to build or update an internal link index when the task involves more than a handful of URLs, a default-source-base expansion, or an incremental collection update.
CLI asset:
scripts/precrawl_link_index.py
Default database:
source/.harvest/link-index.sqlite3
Purpose:
- … source/ingest.md and source/raw/*.md metadata into the database so incremental runs avoid re-adding already downloaded, metadata-only, rejected, or verified materials.

Typical commands:
python scripts/precrawl_link_index.py --workspace . init
python scripts/precrawl_link_index.py --workspace . scan --collection-id <collection-id>
python scripts/precrawl_link_index.py --workspace . scan --collection-id <collection-id> --fetch-seeds
python scripts/precrawl_link_index.py --workspace . pending --limit 100
python scripts/precrawl_link_index.py --workspace . sync --collection-id <collection-id>
python scripts/precrawl_link_index.py --workspace . stats
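
When the commands above are driven from a script, the argument lists can be assembled in one place. This is a minimal sketch that only builds argument vectors and never executes them; it uses only the subcommands and flags shown above. The helper names `index_command` and `incremental_run` are hypothetical, the ordering in `incremental_run` is an assumption about a typical run, and the builder does not check which flags a given subcommand accepts.

```python
import sys

# Path of the bundled CLI, as listed under "CLI asset".
CLI = "scripts/precrawl_link_index.py"


def index_command(subcommand, *, workspace=".", collection_id=None,
                  fetch_seeds=False, limit=None):
    """Build an argv list for one link-index subcommand (does not run it)."""
    argv = [sys.executable, CLI, "--workspace", workspace, subcommand]
    if collection_id is not None:
        argv += ["--collection-id", collection_id]
    if fetch_seeds:
        argv.append("--fetch-seeds")
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


def incremental_run(collection_id):
    """Assumed order for an incremental update: init, scan, pending, sync, stats."""
    return [
        index_command("init"),
        index_command("scan", collection_id=collection_id),
        index_command("pending", limit=100),
        index_command("sync", collection_id=collection_id),
        index_command("stats"),
    ]
```

Each returned list can be passed to `subprocess.run(argv, check=True)` from the workspace root, so a failing step stops the sequence instead of leaving ingest and index state out of step.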
Important constraints:
- scan --fetch-seeds may fetch seed/index/feed pages once to enumerate links, but it must not recursively follow discovered candidate pages, save raw article bodies, or treat seed-page extraction as verification.
- source/ingest.md remains the minimal candidate/material status list, and source/raw/ remains the source of raw page content.
- … internal_metadata_json, observation history, URL hashes, referrer lists, stripped query parameters, or other operational fields in downstream analysis unless the user asks for a diagnostics export. Raw front matter may surface only the approved high-level classification trio.
- … sync command or otherwise update the database so material_id, raw_path, download_status, verification_status, and detailed classification fields stay aligned with source/ingest.md, source/raw/, and source/registers/.

Use only these fields by default unless the user explicitly requests more:
---
material_id:
collection_id:
title:
canonical_url:
source_name:
author_or_org:
publication_date:
retrieved_at:
language:
material_kind:
topic_domain:
credibility_tier:
ingest_status:
download_status:
verification_status:
raw_path:
reviewed_at:
---
Field intent:
- material_id, collection_id, title, canonical_url, source_name, author_or_org, publication_date, retrieved_at, and language identify the source.
- material_kind, topic_domain, and credibility_tier provide durable high-level classification for filtering raw files. Use controlled values from the classification schema; topic_domain may contain comma-separated controlled values when one material clearly spans multiple domains.
- ingest_status, download_status, verification_status, raw_path, and reviewed_at track pipeline state.

Do not put quality scores, detailed classification fields, bias notes, curation summaries, evidence pointers, or candidate insights in raw-file front matter. Store evidence_type, ingestion_priority, lifecycle_status, insight_potential, source_bias, and compliance_status in source/.harvest/link-index.sqlite3 by default, with human-readable summaries in registers only when useful.
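
The field list above lends itself to a mechanical check. The following is a minimal sketch, assuming the front matter is a flat block of `key: value` lines between two `---` delimiters; it avoids a YAML library on purpose, and the function names are hypothetical rather than part of the skill.

```python
# The approved raw-file fields, in the order given in the template above.
ALLOWED_FIELDS = (
    "material_id", "collection_id", "title", "canonical_url", "source_name",
    "author_or_org", "publication_date", "retrieved_at", "language",
    "material_kind", "topic_domain", "credibility_tier",
    "ingest_status", "download_status", "verification_status",
    "raw_path", "reviewed_at",
)


def front_matter_keys(text):
    """Return the top-level keys of the leading --- block in a raw file."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("raw file must start with a --- front matter block")
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Only unindented, non-comment "key: value" lines count as fields.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            keys.append(line.split(":", 1)[0].strip())
    raise ValueError("front matter block is not closed")


def check_front_matter(text):
    """Return (unexpected, missing) field names relative to ALLOWED_FIELDS."""
    keys = front_matter_keys(text)
    unexpected = [key for key in keys if key not in ALLOWED_FIELDS]
    missing = [key for key in ALLOWED_FIELDS if key not in keys]
    return unexpected, missing
```

A raw file that carries `source_bias` in its front matter would show up in `unexpected`, signalling that the field should move to the SQLite index or a register note.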
A material unit is one canonical source page or document that can be evaluated independently:
For documentation sites, multi-page standards, GitHub repositories, or portals, choose the exact page that matters. Examples: a specific README, methodology page, protocol page, risk page, release note, or paper page. Do not summarize the entire site into one raw file.
Only keep items that satisfy most of these tests:
- … retrieved_at; older sources require an explicit exception note.

If the user provides a collection topic, scope, or research direction but no candidate URLs, do not stop and ask only for links. Use source priority as the default base list for candidate discovery.
- … source-priority.md as discovery bases, not automatically as raw material units.
- … source/registers/<collection-id>.md or source/ingest.md notes, not in raw-file front matter.
- … source/, source/raw/, source/registers/, source/ingest.md, source/.harvest/link-index.sqlite3, and source/registers/classification-schema.md at the project root as needed. Ensure the source-side classification schema contains the full current controlled vocabulary from the reference file.
- … source-priority.md, then resolve selected items to page-level URLs before ingest. Prefer items published within the active time window.
- … --fetch-seeds, and use the database to suppress duplicate URLs before spending agent attention on quality decisions.
- … material_kind, topic_domain, and credibility_tier when enough evidence is available, and put detailed classification fields into the SQLite link index. Write curator-facing notes in Chinese by default.
- … source/raw/<material-id>-<slug>.md with minimal YAML front matter, the high-level classification trio, and raw page body Markdown.
- … source/.harvest/link-index.sqlite3, source/ingest.md, or source/registers/, not in raw front matter.
- … source/ingest.md or source/registers/.
- … material_id, raw_path, terminal status, and detailed classification state.

Use a best-effort retrieval chain for accepted public pages:
- … canonical_url as the original page-level URL even if a raw endpoint is used as the retrieval mechanism.

When scripts or browser tools are used, keep the output shape unchanged: one raw Markdown file per canonical material, minimal raw front matter with the high-level classification trio, and caveats in the SQLite index plus ingest/register notes.
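
Since every capture lands at source/raw/<material-id>-<slug>.md, a script that writes raw files needs a stable way to derive that path. The following is a minimal sketch; the slug rule (ASCII only, lowercase, hyphen-separated, 60 characters at most) and the material ID `M-0001` used in the usage note are assumptions, because the skill text does not specify either format.

```python
import re
import unicodedata
from pathlib import PurePosixPath


def slugify(title, max_length=60):
    """Reduce a page title to a lowercase ASCII slug (hypothetical rule)."""
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    # Titles with no ASCII letters or digits fall back to a fixed placeholder.
    return slug[:max_length].rstrip("-") or "untitled"


def raw_path(material_id, title):
    """Build the raw-file path for one canonical material."""
    return PurePosixPath("source/raw") / f"{material_id}-{slugify(title)}.md"
```

For example, `raw_path("M-0001", "Agent Memory: A Survey")` yields `source/raw/M-0001-agent-memory-a-survey.md`. Titles written entirely in a non-Latin script collapse to `untitled` under this rule, so a collection with many such pages would likely want a different slug strategy.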
This skill is for an internal experimental raw-material layer. For publicly reachable pages, an absent or unclear reuse license is not by itself a reason to skip raw capture. Prefer capturing the main page body and record caveats separately.
Do not use this policy to bypass authentication, paywalls, CAPTCHA gates, robots-style technical blocks, or other access controls. If only summaries, snippets, search results, or third-party reposts are available, keep the item metadata-only or add a follow-up task for manual review.
The downloader subagent is part of this skill. agents/source-material-downloader.agent.md is only an invocation wrapper and must follow this section when it is used for high-value candidate downloads.
Responsibilities:
- … source/raw/.
- … source/ingest.md, source/raw/, and the pre-crawl link index when present so a canonical URL is not downloaded twice.
- … material_kind, topic_domain, and credibility_tier when assigned by the parent workflow or obvious from the accepted candidate row.
- … source/ingest.md with pipeline state, paths, retrieval date, and any download caveat.
- … material_id, raw_path, download_status, verification_status, and any detailed classification fields when the database exists.

The downloader does not perform final verification, quality scoring, detailed classification, or insight writing. It may copy or lightly infer the high-level classification trio required by raw front matter.
The verifier subagent is part of this skill. agents/source-material-verifier.agent.md is only an invocation wrapper and must follow this section as the follow-up verification step.
Responsibilities:
- … source/raw/ is a raw page capture for one canonical URL.
- … verification_status, and reviewed_at.
- … source/ingest.md or source/registers/.
- … source/ingest.md with verification_status, reviewed_at, and next action.
- … rejected, watch, revisit, or needs_followup with reasons.

The verifier does not write final insight notes unless the user explicitly asks for insight extraction after verification.
Classification is split by storage boundary:
- … material_kind, topic_domain, and credibility_tier.
- … evidence_type, ingestion_priority, lifecycle_status, insight_potential, source_bias, and compliance_status.
- source/registers/classification-schema.md carries the controlled vocabulary and may include Chinese curator-facing summaries, but the SQLite index is the default machine-readable store for detailed classification state.

When classification is useful, use classification schema and ensure the full schema is copied into source/registers/classification-schema.md inside the project root. Keep only the high-level classification trio in source/raw/; keep detailed classifications in the SQLite index and optional human-readable register notes.
When adding classification to a collection:
- … python scripts/precrawl_link_index.py --workspace . sync --collection-id <collection-id> after classification edits so register/raw classification state is reflected in source/.harvest/link-index.sqlite3.

When scoring is useful, score each dimension from 0 to 5 in the register or ingest notes:
| Dimension | What to check |
|---|---|
| provenance | Author, institution, canonical URL, date, version. |
| evidence | Data, code, benchmark, incident, standard text, or implementation detail. |
| durability | Expected shelf life beyond short-term news. |
| relevance | Fit to the collection target and enterprise AI-agent KB. |
| viewpoint_potential | Clear claims, trade-offs, mental models, or design implications. |
| bias_transparency | Incentives and limitations can be described. |
| compliance_fit | Access, paywall, excerpting, and storage constraints are manageable; unspecified license on public pages is a caveat, not an automatic blocker for this internal experimental workflow. |
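
The seven dimensions above can be validated before a score row is written to a register. This is a minimal sketch, assuming integer scores and a pipe-delimited register row; the function names and the row layout are hypothetical, and no accept or reject threshold is implied.

```python
# Dimension names as listed in the scoring table.
DIMENSIONS = (
    "provenance", "evidence", "durability", "relevance",
    "viewpoint_potential", "bias_transparency", "compliance_fit",
)


def validate_scores(scores):
    """Check that every dimension is present and scored as an integer 0 to 5."""
    missing = [name for name in DIMENSIONS if name not in scores]
    unknown = [name for name in scores if name not in DIMENSIONS]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    for name in DIMENSIONS:
        value = scores[name]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 0 to 5, got {value!r}")
    return {name: scores[name] for name in DIMENSIONS}


def score_row(material_id, scores):
    """Format validated scores as one Markdown table row for a register."""
    ordered = validate_scores(scores)
    cells = [material_id, *(str(value) for value in ordered.values())]
    return "| " + " | ".join(cells) + " |"
```

Keeping the row in a register or ingest note, rather than in the raw file, follows the rule that quality scores stay out of raw front matter.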
Default decision rules:
- source/ingest.md with one row per canonical candidate page and minimal status fields.
- source/.harvest/link-index.sqlite3 when pre-crawl scanning is used, containing deduplication, processing-state, and detailed classification metadata that is not meant for downstream reading by default.
- source/registers/classification-schema.md containing the full current classification schema used by the collection.
- source/registers/<collection-id>.md with candidate source list and accept, reject, watch, revisit, or deep_read decisions.
- source/raw/<material-id>-<slug>.md raw Markdown page captures, one canonical URL per file, with minimal YAML front matter and the high-level classification trio updated in place after verification.