dist/codex/nlweb-protocol/skills/nlweb-data-loading/SKILL.md
Ingest site content into NLWeb's vector store using `db_load.py` — supports RSS/Atom feeds, Schema.org JSON-LD, sitemap-driven URL lists, and CSV. Covers chunking, embedding computation, site partitioning, batch sizing, delete-and-reload, and per-backend write_endpoint targeting. Use when bootstrapping a site's index, refreshing content, or migrating between retrieval backends.
Fetch live docs:
- The `db_load.py` reference.
- `AskAgent/python/data_loading/db_load.py` and `db_load_utils.py` in the live repo for exact CLI flags; flags have been added in recent releases.
- `AskAgent/python/data_loading/rss2schema.py` for how RSS items map to Schema.org Article objects.
- `preferred_provider` in `config_embedding.yaml` for the query side; a mismatch causes silent retrieval failure.
- `docs/setup-snowflake.md`, `docs/setup-cloudflare-autorag.md`, etc. for backend-specific ingest steps (some bypass `db_load.py`).

`db_load.py` is the canonical ingest pipeline. Given a source and a site name, it:

1. Converts the source into Schema.org records (`rss2schema.py` for feeds; passes JSON-LD through; maps CSV columns by convention).
2. Computes embeddings with the provider set in `config_embedding.yaml`.
3. Writes the records to the `write_endpoint` configured in `config_retrieval.yaml`.
4. Tags each record with the `site` value so retrieval can partition.

| Source | Detection | Notes |
|--------|-----------|-------|
| RSS / Atom feed | URL ending .rss, .xml, /feed, or detected from the response content-type | Mapped to Article Schema.org type |
| Schema.org JSON-LD | URL returns application/ld+json or HTML with embedded JSON-LD | Preserved as-is |
| Sitemap.xml | URL ending sitemap.xml | Crawled for child URLs |
| URL list file | --url-list path.txt flag | One URL per line; each fetched and parsed for JSON-LD |
| CSV | .csv extension | Column-to-Schema.org mapping by convention; see docs |
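The URL and extension rules in the table above can be sketched as a small classifier. This is only an illustration, not the loader's actual detection code: `guess_source_type` is a hypothetical name, content-type sniffing is left out, and checking `sitemap.xml` before the generic `.xml` feed rule is an assumption made here to keep the two rows from colliding.

```python
from urllib.parse import urlparse


def guess_source_type(source: str, url_list: bool = False) -> str:
    """Classify a source by URL path or file extension only (hypothetical helper)."""
    if url_list:
        return "url-list"  # the caller passed the file via --url-list
    path = urlparse(source).path.lower()
    if path.endswith("sitemap.xml"):
        return "sitemap"  # assumption: checked before the generic .xml rule
    if path.endswith(".csv"):
        return "csv"
    if path.endswith((".rss", ".xml")) or path.rstrip("/").endswith("/feed"):
        return "feed"
    # Anything else would need a fetch to look for JSON-LD or a feed content-type.
    return "json-ld"


if __name__ == "__main__":
    for src in ("https://example.com/feed.xml", "https://example.com/sitemap.xml", "products.csv"):
        print(src, "->", guess_source_type(src))
```

Verify the real precedence in `db_load.py` before relying on any of these rules.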
Every record carries a `site` field. Queries filter by `site=<name>` to scope retrieval. Choose site names carefully: they are user-visible in `/sites` and become part of the agent UX.
The most common ingest bug: data was loaded with embedding model A (dim 1536), but at query time `config_embedding.yaml` points to model B (dim 768). If the backend enforces dimension constraints, retrieval fails outright; otherwise it silently returns garbage because the stored and query vectors don't align. Always verify that the embedding provider hasn't changed between ingest and query.
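A pre-flight comparison along these lines can catch the mismatch before any query runs. It is a minimal sketch under stated assumptions: `EmbeddingInfo` and `embedding_mismatch` are hypothetical names, the provider and model strings are placeholders, and you would have to record the ingest-time values yourself, since nothing in this document says the loader stores them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmbeddingInfo:
    """Hypothetical record of which embedding setup produced a set of vectors."""
    provider: str
    model: str
    dim: int


def embedding_mismatch(ingest: EmbeddingInfo, query: EmbeddingInfo) -> List[str]:
    """Return human-readable problems; an empty list means the two setups agree."""
    problems = []
    if ingest.dim != query.dim:
        problems.append(f"dimension {ingest.dim} (ingest) != {query.dim} (query)")
    if (ingest.provider, ingest.model) != (query.provider, query.model):
        # Same dimension does not imply compatible vectors, so compare names too.
        problems.append(
            f"model {ingest.provider}/{ingest.model} (ingest) != "
            f"{query.provider}/{query.model} (query)"
        )
    return problems


if __name__ == "__main__":
    at_ingest = EmbeddingInfo("provider-x", "model-a", 1536)  # placeholder values
    at_query = EmbeddingInfo("provider-x", "model-b", 768)    # placeholder values
    for problem in embedding_mismatch(at_ingest, at_query):
        print("MISMATCH:", problem)
```

Comparing model names as well as dimensions matters because two models with the same dimension still produce vectors that are not comparable.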
`db_load.py` writes to one endpoint at a time: the `write_endpoint` in `config_retrieval.yaml`, unless overridden with `--database <endpoint-name>`. To get data into multiple backends, run `db_load` once per backend, changing the write endpoint each time.
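One way to script the repeated runs is to build one command per endpoint and execute them in sequence. The sketch below only constructs the argument lists; `load_commands` is a hypothetical helper, and the endpoint names are examples that must match entries in your own `config_retrieval.yaml`.

```python
import sys
from typing import List, Sequence


def load_commands(source: str, site: str, endpoints: Sequence[str]) -> List[List[str]]:
    """Build one db_load invocation per write endpoint (hypothetical helper)."""
    return [
        [sys.executable, "-m", "data_loading.db_load", "--database", endpoint, source, site]
        for endpoint in endpoints
    ]


if __name__ == "__main__":
    # Endpoint names are examples; use the ones defined in config_retrieval.yaml.
    for cmd in load_commands("source.xml", "my-site", ["qdrant_local", "azure_ai_search"]):
        print(" ".join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```

Running the commands sequentially keeps embedding-provider load predictable; each run recomputes embeddings, so expect the cost to scale with the number of backends.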
Sites can be wiped:
python -m data_loading.db_load --only-delete delete-site <site-name>
Without `--only-delete`, the loader upserts by URL: re-running on the same source updates existing records but leaves stale ones in place. For a full refresh, delete first, then load.
`--batch-size N` controls how many records are embedded and written per round-trip. The default is sane (~100). Increase it for large ingests if your embedding provider's rate limit allows.
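To make the batch-size trade-off concrete, here is a generic batching helper. It illustrates the idea of fixed-size round-trips and is not taken from `db_load.py`; the name `batches` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # 250 records at batch size 100 means three round-trips: 100, 100, 50.
    print([len(chunk) for chunk in batches(range(250), 100)])
```

A larger batch means fewer round-trips but a bigger request per call, which is where provider rate limits start to bite.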
data_loading/parallel_db_load.sh runs multiple loaders concurrently across sources. Use for cold-start across dozens of feeds. Watch rate limits on the embedding provider — Azure OpenAI has aggressive throttling.
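If you prefer to control concurrency from Python rather than the shell script, a bounded thread pool is one option. This sketch is not a port of `parallel_db_load.sh`: `load_all` is a hypothetical helper, and `load_one` is whatever callable you supply to ingest a single source (for example, a wrapper around a `db_load` subprocess).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Job = Tuple[str, str]  # (source, site)


def load_all(jobs: List[Job], load_one: Callable[[str, str], object], max_workers: int = 4) -> Dict[Job, object]:
    """Run load_one(source, site) for every job, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves job order; an exception in any job is re-raised here.
        results = list(pool.map(lambda job: load_one(*job), jobs))
    return dict(zip(jobs, results))


if __name__ == "__main__":
    feeds = [("https://example.com/a.xml", "site-a"), ("https://example.com/b.xml", "site-b")]
    # A stand-in worker that just reports what it would load.
    print(load_all(feeds, lambda source, site: f"would load {source} into {site}", max_workers=2))
```

Keeping `max_workers` small is the simplest guard against the embedding-provider throttling mentioned above.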
python -m data_loading.db_load https://example.com/feed.xml my-blog
Each item in the feed becomes a Schema.org Article with headline, description, url, datePublished populated from the RSS fields. Embeddings come from concatenating headline + description (verify exact field selection in rss2schema.py).
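The field mapping described above can be sketched with the standard library. This is an approximation for illustration, not the code in `rss2schema.py`: the function names are hypothetical, the RSS-to-Schema.org field pairing (`title` to `headline`, `link` to `url`, `pubDate` to `datePublished`) is an assumption, and the date is passed through unconverted.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>My Blog</title>
  <item>
    <title>Hello</title>
    <description>First post</description>
    <link>https://example.com/hello</link>
    <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
  </item>
</channel></rss>"""


def rss_items_to_articles(xml_text: str) -> List[Dict[str, str]]:
    """Turn each RSS <item> into a minimal Schema.org Article dict."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "url": item.findtext("link", default=""),
            # Left in RSS date format; a real pipeline may normalize to ISO 8601.
            "datePublished": item.findtext("pubDate", default=""),
        })
    return articles


def embedding_text(article: Dict[str, str]) -> str:
    """Concatenate headline and description as the text to embed (assumed selection)."""
    return f"{article['headline']} {article['description']}".strip()


if __name__ == "__main__":
    for article in rss_items_to_articles(SAMPLE_RSS):
        print(article["url"], "->", embedding_text(article))
```

Atom feeds use namespaced elements and different tag names, so they would need a separate branch.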
For sites that already serve JSON-LD (Recipe, Product, Event, Article, Movie, etc.), point db_load at a sitemap or URL list:
python -m data_loading.db_load --url-list urls.txt my-recipes
Each URL is fetched; the embedded JSON-LD is extracted and indexed verbatim. This is the highest-fidelity ingest path — the agent gets the full schema_object back at query time.
python -m data_loading.db_load products.csv my-store
CSV columns must follow the Schema.org property naming convention (or the column-mapping rules in db_load_utils.py — verify). For products, columns like name, description, url, image, offers.price, offers.priceCurrency are common.
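To show what a dotted column such as `offers.price` could mean, the sketch below nests dotted names into sub-objects. Treat the nesting rule as an assumption: the real mapping lives in `db_load_utils.py` and may differ, and `row_to_schema_object` is a hypothetical name.

```python
import csv
import io
from typing import Dict

SAMPLE_CSV = (
    "name,description,url,offers.price,offers.priceCurrency\n"
    "Mug,Ceramic mug,https://example.com/mug,12.50,USD\n"
)


def row_to_schema_object(row: Dict[str, str], schema_type: str = "Product") -> dict:
    """Build a Schema.org-style dict, nesting 'a.b' columns as {'a': {'b': value}}."""
    obj: dict = {"@type": schema_type}
    for column, value in row.items():
        if not value:
            continue  # skip empty cells rather than emitting blank properties
        *parents, leaf = column.split(".")
        target = obj
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return obj


if __name__ == "__main__":
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
        print(row_to_schema_object(row))
```

Values stay as strings here; whether prices should be coerced to numbers is left to the real column-mapping rules.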
python -m data_loading.db_load --database azure_ai_search source.xml my-site
Useful for parallel ingest across backends, or for promoting a dev qdrant_local index to prod Azure AI Search.
# Daily — incremental upsert (existing records updated, new added, stale left)
python -m data_loading.db_load https://example.com/feed.xml my-blog
# Weekly — full refresh
python -m data_loading.db_load --only-delete delete-site my-blog
python -m data_loading.db_load https://example.com/feed.xml my-blog
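The two schedules above differ only in whether a delete precedes the load, which a scheduler script can express as a single switch. A minimal sketch, assuming the flags shown in this document are still current; `refresh_commands` is a hypothetical helper and it only builds the commands.

```python
import sys
from typing import List


def refresh_commands(source: str, site: str, full: bool = False) -> List[List[str]]:
    """Return the db_load invocations for an incremental or full refresh, in order."""
    base = [sys.executable, "-m", "data_loading.db_load"]
    commands = []
    if full:
        # Wipe the site first so stale records do not survive the reload.
        commands.append(base + ["--only-delete", "delete-site", site])
    commands.append(base + [source, site])
    return commands


if __name__ == "__main__":
    for cmd in refresh_commands("https://example.com/feed.xml", "my-blog", full=True):
        print(" ".join(cmd))
        # To actually run it: subprocess.run(cmd, check=True), stopping on the first failure
```

Running with `check=True` matters for the full refresh: if the delete fails, you want to stop and investigate rather than upsert on top of stale records.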
After ingest:
- `curl http://localhost:8000/sites`: your site should appear.
- `curl 'http://localhost:8000/ask?query=test&site=my-blog&streaming=false&mode=list'`: should return non-empty results.
- `schema_object` field: confirm it has the Schema.org properties you expect.

Some retrieval backends bypass `db_load.py` entirely; see `docs/setup-cloudflare-autorag.md`.

- `db_load` hangs on embedding: your embedding provider is rate-limiting. Reduce `--batch-size` or switch provider.
- `/ask`: check the `sites:` allowlist in `config_nlweb.yaml`, and check that `write_endpoint` and the enabled read endpoints actually overlap.
- `schema_object` is sparse: RSS doesn't carry rich Schema.org metadata. Either accept it or move to JSON-LD ingest.
- Set `config_embedding.yaml` to match what was ingested.

Always cross-check flags against the live `db_load.py`; argument names drift from release to release.