
orcaqubits/nlweb-data-loading

Name: nlweb-data-loading
Author: orcaqubits

dist/codex/nlweb-protocol/skills/nlweb-data-loading/SKILL.md

npx skillsauth add orcaqubits/agentic-commerce-claude-plugins nlweb-data-loading


NLWeb Data Loading

Before writing code

Fetch live docs:

Fetch https://github.com/nlweb-ai/NLWeb/blob/main/docs/tools-database-load.md for the canonical db_load.py reference.
Inspect AskAgent/python/data_loading/db_load.py and db_load_utils.py in the live repo for exact CLI flags; new flags have been added in recent releases.
Check AskAgent/python/data_loading/rss2schema.py for how RSS items map to Schema.org Article objects.
Confirm the embedding provider used at ingest matches preferred_provider in config_embedding.yaml for the query side; a mismatch causes silent retrieval failure.
For partner backends, check docs/setup-snowflake.md, docs/setup-cloudflare-autorag.md, etc. for backend-specific ingest steps (some bypass db_load.py).

Conceptual Architecture

What db_load Does

db_load.py is the canonical ingest pipeline. Given a source and a site name, it:

Fetches the source (RSS feed, JSON-LD URL, sitemap-derived URL list, CSV).
Normalizes each item to a Schema.org JSON object (uses rss2schema.py for feeds; passes JSON-LD through; maps CSV columns by convention).
Chunks long text fields (description, body) if needed.
Computes embeddings via the configured embedding provider in config_embedding.yaml.
Writes to the write_endpoint configured in config_retrieval.yaml.
Tags every record with the site value so retrieval can partition.
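The stages above can be sketched end to end. This is a minimal illustration, not the code in db_load.py: `chunk_text`, `build_records`, and `fake_embed` are hypothetical names, the record layout is an assumption, and the toy embedding stands in for whatever provider config_embedding.yaml selects.

```python
from typing import Callable, Dict, List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Naively split text into pieces of at most max_chars characters."""
    if not text:
        return [""]
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_records(items: List[Dict], site: str,
                  embed: Callable[[str], List[float]],
                  max_chars: int = 500) -> List[Dict]:
    """Turn normalized Schema.org items into site-tagged records with embeddings."""
    records = []
    for item in items:
        title = item.get("headline") or item.get("name")
        text = " ".join(filter(None, [title, item.get("description")]))
        for idx, chunk in enumerate(chunk_text(text, max_chars)):
            records.append({
                "url": item.get("url"),
                "site": site,              # partition key used at query time
                "chunk": idx,
                "schema_object": item,     # original object kept alongside the vector
                "embedding": embed(chunk),
            })
    return records


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding provider: a 3-dimensional toy vector."""
    return [float(len(text)), float(text.count(" ")), 1.0]
```

A real loader would then hand `records` to the configured write endpoint in batches.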

Supported Source Types

| Source | Detection | Notes |
|--------|-----------|-------|
| RSS / Atom feed | URL ending .rss, .xml, /feed, or content-type | Mapped to Article Schema.org type |
| Schema.org JSON-LD | URL returns application/ld+json or HTML with embedded JSON-LD | Preserved as-is |
| Sitemap.xml | URL ending sitemap.xml | Crawled for child URLs |
| URL list file | --url-list path.txt flag | One URL per line; each fetched and parsed for JSON-LD |
| CSV | .csv extension | Column-to-Schema.org mapping by convention; see docs |
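The suffix-based part of this detection can be approximated as follows. `guess_source_type` is a hypothetical helper, not a function in db_load.py; it ignores content-type sniffing and the --url-list flag, and it treats anything unrecognized as a page to scan for JSON-LD, which is an assumption.

```python
from urllib.parse import urlparse


def guess_source_type(source: str) -> str:
    """Classify a source by its path suffix only (no network access)."""
    path = urlparse(source).path.lower() or source.lower()
    if path.endswith("sitemap.xml"):       # check before the generic .xml rule
        return "sitemap"
    if path.endswith(".csv"):
        return "csv"
    if path.endswith((".rss", ".xml")) or path.rstrip("/").endswith("/feed"):
        return "feed"
    return "json-ld"
```

The ordering matters: sitemap.xml also ends in .xml, so it has to be tested first.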

Site Partitioning

Every record carries a site field. Queries filter by site=<name> to scope retrieval. Choose site names carefully — they're user-visible in /sites and become part of the agent UX. Conventions:

Lowercase, no spaces; use hyphens or underscores as separators
One site per logical content domain (not per RSS feed; aggregate related feeds under one site)
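A small check can enforce the naming convention before an ingest runs. The pattern below is one possible reading of the convention (lowercase letters and digits, joined by single hyphens or underscores); `is_valid_site_name` is a hypothetical helper and NLWeb itself may accept other names.

```python
import re

# Lowercase alphanumeric segments separated by single "-" or "_" characters.
SITE_NAME = re.compile(r"[a-z0-9]+(?:[-_][a-z0-9]+)*")


def is_valid_site_name(name: str) -> bool:
    """Return True when name follows the lowercase, no-spaces convention."""
    return SITE_NAME.fullmatch(name) is not None
```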

Embedding Dimension Trap

The most common ingest bug: data was loaded with embedding model A (dim 1536), but at query time config_embedding.yaml points to model B (dim 768). Retrieval silently returns garbage because vector dimensions don't align — or fails entirely if the backend enforces dimension constraints. Always verify the embedding provider hasn't changed between ingest and query.
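A guard like the one below makes the mismatch loud instead of silent. It is a sketch under assumptions: `assert_same_dimension` is a hypothetical helper, and you would need to obtain `stored_dim` from your backend's index metadata yourself.

```python
from typing import Sequence


def assert_same_dimension(stored_dim: int, query_vector: Sequence[float]) -> None:
    """Raise ValueError when the query vector cannot be compared with stored vectors."""
    if len(query_vector) != stored_dim:
        raise ValueError(
            f"embedding dimension mismatch: index holds {stored_dim}-dim vectors, "
            f"query produced {len(query_vector)}-dim; "
            "re-ingest or fix config_embedding.yaml"
        )
```

Running such a check once at startup, with a single test embedding, is cheaper than debugging irrelevant results later.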

Write Endpoint Selection

db_load.py writes to one endpoint at a time — the write_endpoint in config_retrieval.yaml, or override with --database <endpoint-name>. If you need data in multiple backends, run db_load multiple times changing the write endpoint each time.

Delete and Reload

Sites can be wiped:

python -m data_loading.db_load --only-delete delete-site <site-name>

Without --only-delete, the loader does upsert by URL — re-running on the same source updates existing records but leaves stale ones. For full refresh, delete first, then load.
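The difference between the two modes can be seen in a toy model where the index is a dict keyed by URL. This only illustrates the semantics described above; `upsert_by_url` and `full_refresh` are hypothetical names, not loader functions.

```python
from typing import Dict, List


def upsert_by_url(index: Dict[str, dict], records: List[dict]) -> Dict[str, dict]:
    """Insert or overwrite records by URL; records absent from the source stay."""
    for record in records:
        index[record["url"]] = record
    return index


def full_refresh(index: Dict[str, dict], site: str,
                 records: List[dict]) -> Dict[str, dict]:
    """Drop every record for the site first, then load the current source."""
    kept = {url: rec for url, rec in index.items() if rec["site"] != site}
    return upsert_by_url(kept, records)
```

If the source drops an item, `upsert_by_url` keeps the old record while `full_refresh` removes it.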

Batch Sizing

--batch-size N controls how many records are embedded and written per round-trip. The default is reasonable (around 100). Increase it for large ingests if your embedding provider's rate limit allows.
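Conceptually, batching groups records into fixed-size lists so that each list costs one embedding call and one write. The generator below is a generic sketch of that idea, not the loader's own implementation; `batched` is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With 250 records and a batch size of 100, this yields batches of 100, 100, and 50.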

Parallel Loading

data_loading/parallel_db_load.sh runs multiple loaders concurrently across sources. Use for cold-start across dozens of feeds. Watch rate limits on the embedding provider — Azure OpenAI has aggressive throttling.

Implementation Guidance

Loading an RSS Feed

python -m data_loading.db_load https://example.com/feed.xml my-blog

Each item in the feed becomes a Schema.org Article with headline, description, url, datePublished populated from the RSS fields. Embeddings come from concatenating headline + description (verify exact field selection in rss2schema.py).
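The mapping can be pictured with the standard library alone. This is a sketch of the field mapping described above, not a copy of rss2schema.py: the function names are hypothetical, only RSS 2.0 `<item>` elements are handled, and the exact embedding text should still be confirmed in the live repo.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Dict, List


def rss_items_to_articles(rss_xml: str) -> List[Dict]:
    """Map each RSS <item> to a minimal Schema.org Article dict."""
    root = ET.fromstring(rss_xml)
    articles = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        articles.append({
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            # RSS dates are RFC 822 style; convert to ISO 8601 when present.
            "datePublished": parsedate_to_datetime(pub).isoformat() if pub else None,
        })
    return articles


def embedding_text(article: Dict) -> str:
    """Concatenate headline and description as the text to embed."""
    return f"{article['headline']} {article['description']}".strip()
```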

Loading Schema.org JSON-LD

For sites that already serve JSON-LD (Recipe, Product, Event, Article, Movie, etc.), point db_load at a sitemap or URL list:

python -m data_loading.db_load --url-list urls.txt my-recipes

Each URL is fetched; the embedded JSON-LD is extracted and indexed verbatim. This is the highest-fidelity ingest path — the agent gets the full schema_object back at query time.
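Extraction of embedded JSON-LD can be illustrated with Python's built-in HTML parser. `JsonLdExtractor` and `extract_json_ld` are hypothetical names for this sketch; the loader's actual parsing code may differ, and fetching the page is left out.

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonLdExtractor(HTMLParser):
    """Collect objects from <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.objects: List[dict] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks rather than abort the page
            self.objects.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_json_ld(html: str) -> List[dict]:
    """Return every JSON-LD object embedded in the HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.objects
```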

Loading CSV

python -m data_loading.db_load products.csv my-store

CSV columns must follow the Schema.org property naming convention (or the column-mapping rules in db_load_utils.py — verify). For products, columns like name, description, url, image, offers.price, offers.priceCurrency are common.
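One plausible reading of the column convention is that a dotted header such as offers.price becomes a nested property. The sketch below shows that interpretation; `row_to_schema_object` is a hypothetical helper, the dotted-path rule is an assumption, and the real rules live in db_load_utils.py.

```python
import csv
import io
from typing import Dict


def row_to_schema_object(row: Dict[str, str], schema_type: str) -> Dict:
    """Build a nested Schema.org-style dict from a flat CSV row."""
    obj: Dict = {"@type": schema_type}
    for column, value in row.items():
        if value in (None, ""):
            continue  # leave empty cells out of the object
        *parents, leaf = column.split(".")
        target = obj
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return obj


sample = io.StringIO(
    "name,url,offers.price,offers.priceCurrency\n"
    "Mug,https://example.com/mug,12.50,USD\n"
)
products = [row_to_schema_object(row, "Product") for row in csv.DictReader(sample)]
```

Values stay as strings here; converting prices to numbers would be a separate decision.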

Overriding the Write Endpoint

python -m data_loading.db_load --database azure_ai_search source.xml my-site

Useful for parallel ingest across backends, or for promoting a dev qdrant_local index to prod Azure AI Search.

Incremental Refresh Pattern

# Daily — incremental upsert (existing records updated, new added, stale left)
python -m data_loading.db_load https://example.com/feed.xml my-blog

# Weekly — full refresh
python -m data_loading.db_load --only-delete delete-site my-blog
python -m data_loading.db_load https://example.com/feed.xml my-blog

Verifying a Load

After ingest:

curl http://localhost:8000/sites — your site should appear
curl 'http://localhost:8000/ask?query=test&site=my-blog&streaming=false&mode=list' — should return non-empty results
Inspect a result's schema_object field — confirm it has the Schema.org properties you expect
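These checks can be scripted. The helpers below only build the /ask URL shown above and compare one schema_object against the properties you expect; both names are hypothetical, and the shape of the /ask response is deliberately not assumed, so fetching and unpacking results is left to the caller.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlencode


def ask_url(base: str, query: str, site: str) -> str:
    """Build a non-streaming list-mode /ask URL for a site."""
    params = {"query": query, "site": site, "streaming": "false", "mode": "list"}
    return f"{base.rstrip('/')}/ask?{urlencode(params)}"


def missing_properties(schema_object: Dict, expected: Iterable[str]) -> List[str]:
    """Return the expected Schema.org properties that are absent or empty."""
    return [key for key in expected if not schema_object.get(key)]
```

An empty list from `missing_properties` means the record carries every property you asked for.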

Backend-Specific Ingest

Some retrieval backends bypass db_load.py entirely:

Cloudflare AutoRAG — ingest is managed by Cloudflare; you upload to R2 and AutoRAG indexes for you. See docs/setup-cloudflare-autorag.md.
Snowflake Cortex Search — data lives in Snowflake tables; Cortex Search indexes are created via SQL. NLWeb just queries.
Shopify MCP — no ingest; NLWeb proxies to Shopify's MCP endpoint live.
Bing Web Search — no ingest; live web search.

Common Failures

db_load hangs on embedding — your embedding provider is rate-limiting. Reduce --batch-size or switch provider.
Records load but never appear in /ask — check sites: allowlist in config_nlweb.yaml; check that write_endpoint and the enabled read endpoints actually overlap.
Loaded RSS but schema_object is sparse — RSS doesn't carry rich Schema.org metadata. Either accept it or move to JSON-LD ingest.
Embedding dim mismatch — re-ingest with the correct provider, or change config_embedding.yaml to match what was ingested.

Always cross-check flags against the live db_load.py — argument names drift release to release.


Ingest site content into NLWeb's vector store using `db_load.py` — supports RSS/Atom feeds, Schema.org JSON-LD, sitemap-driven URL lists, and CSV. Covers chunking, embedding computation, site partitioning, batch sizing, delete-and-reload, and per-backend write_endpoint targeting. Use when bootstrapping a site's index, refreshing content, or migrating between retrieval backends.

27 stars

development

Updated May 14, 2026


Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 14, 2026, 5:57 AM (281.2s, 1 file scanned)




Install manually


# Clone the repo
git clone https://github.com/orcaqubits/agentic-commerce-claude-plugins.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r agentic-commerce-claude-plugins/dist/codex/nlweb-protocol/skills/nlweb-data-loading ~/.claude/skills/


Repository

orcaqubits/agentic-commerce-claude-plugins


Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT