
paolomoz/incremental-processor

Name: incremental-processor
Author: paolomoz

skills/incremental-processor/SKILL.md

npx skillsauth add paolomoz/skills incremental-processor


Incremental Processor

Quick Reference

| Category | Trigger | Complexity | Source |
|----------|---------|------------|--------|
| patterns | "incremental processing", "change detection", "batch processing", "resumable pipelines" | Medium | 3 projects |

Process large datasets efficiently by tracking what has been processed, detecting changes via content hashing, and skipping unchanged items. Combines a StateManager for checkpoint tracking, an AnalysisIndex for hash-based change detection, and disciplined API pacing to build pipelines that resume without losing progress or repeating work.

When to Use

Processing a large backlog where reprocessing everything each run is too slow or expensive
Calling external APIs (LLMs, analytics) where each call costs money and redundant calls waste budget
Building ETL pipelines that run on a schedule and should only process new or changed data
Ingesting from APIs with rate limits where you need pacing and crash-safe resumption

Instructions

Step 1: StateManager for Checkpoints

Tracks per-source high-watermark checkpoints so you know where to resume. Uses atomic writes (write to .tmp then rename) to prevent corruption on crash.

import json
from pathlib import Path
from datetime import datetime

class StateManager:
    # Single JSON file holding one checkpoint entry per source ("channel").
    STATE_FILE = Path('.state') / 'processor_state.json'

    def __init__(self):
        self._state = {"channels": {}, "analysis": {}}
        if self.STATE_FILE.exists():
            try:
                self._state = json.loads(self.STATE_FILE.read_text())
            except json.JSONDecodeError:
                # Corrupt state: set the file aside and start from the empty default.
                self.STATE_FILE.replace(self.STATE_FILE.with_suffix('.corrupt'))

    def _save(self):
        self.STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
        tmp = self.STATE_FILE.with_suffix('.tmp')
        tmp.write_text(json.dumps(self._state, indent=2, default=str))
        # replace() swaps the file in one step and, unlike rename(), also
        # overwrites an existing target on Windows.
        tmp.replace(self.STATE_FILE)

    def get_last_ts(self, channel: str) -> str | None:
        return self._state["channels"].get(channel, {}).get("last_ts")

    def set_last_ts(self, channel: str, ts: str, count: int, threads: int):
        self._state["channels"][channel] = {
            "last_ts": ts, "message_count": count,
            "thread_count": threads, "updated_at": str(datetime.now())
        }
        self._save()

Key rules:

Save after every checkpoint update, not batched at the end. On crash, you resume from the last completed item.
Handle corrupt state on load: rename the broken file and start fresh rather than crashing.
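
The two rules above can be sketched as standalone helpers. This is a minimal sketch, not part of the skill: the names atomic_write_json and load_json_or_reset are hypothetical, and the fsync call is an added durability step that the StateManager above does not perform.

```python
import json
import os
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON so a reader sees the old file or the new one, never a partial write."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, default=str)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before swapping files
    os.replace(tmp, path)  # atomic on the same filesystem


def load_json_or_reset(path: Path, default: dict) -> dict:
    """Load JSON; if the file is corrupt, set it aside as *.corrupt and return default."""
    if not path.exists():
        return default
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        os.replace(path, path.with_suffix(".corrupt"))
        return default
```

Keeping the corrupt file instead of deleting it leaves something to inspect after a crash, at the cost of reprocessing from the start.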

Step 2: AnalysisIndex for Change Detection

Records a content hash per processed item. On subsequent runs, compares current hashes against recorded hashes to skip unchanged items.

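import hashlib
import json
from datetime import datetime
from pathlib import Path

class AnalysisIndex:
    INDEX_FILE = Path('.state') / 'analysis_index.json'

    def __init__(self):
        self._index = {}
        if self.INDEX_FILE.exists():
            try:
                self._index = json.loads(self.INDEX_FILE.read_text())
            except json.JSONDecodeError:
                # Same recovery as StateManager: set the corrupt file aside and start empty.
                self.INDEX_FILE.replace(self.INDEX_FILE.with_suffix('.corrupt'))

    def _save(self):
        self.INDEX_FILE.parent.mkdir(parents=True, exist_ok=True)
        tmp = self.INDEX_FILE.with_suffix('.tmp')
        tmp.write_text(json.dumps(self._index, indent=2))
        tmp.replace(self.INDEX_FILE)  # atomic swap; also overwrites an existing file on Windows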

    @staticmethod
    def compute_hash(content: str) -> str:
        return hashlib.sha256(content.encode('utf-8')).hexdigest()[:16]

    def is_stale(self, source: str, item_id: str, current_hash: str) -> bool:
        entry = self._index.get(f"{source}:{item_id}")
        if entry is None: return True   # New item
        return entry.get("raw_hash") != current_hash  # Changed

    def record(self, source: str, item_id: str, raw_hash: str, result_id: str):
        self._index[f"{source}:{item_id}"] = {
            "raw_hash": raw_hash, "result_id": result_id,
            "processed_at": str(datetime.now())
        }
        self._save()

Hash design:

SHA-256 truncated to 16 hex chars (64 bits): collision probability ~1 in 4B at 100K items.
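Hash the raw content string, not serialized objects. Serialized key order can differ between runs or producers, which changes the hash even when the data is the same. If you must hash an object, serialize it with sorted keys first.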
Strip volatile metadata (timestamps, view counts) before hashing.
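
For items that have no single raw content string, the hash rules above might look like the sketch below. The function names and the VOLATILE_FIELDS set are hypothetical, only top-level fields are stripped, and collision_probability uses the standard birthday approximation to reproduce the roughly 1 in 4B figure.

```python
import hashlib
import json

# Hypothetical field names; replace with whatever changes without the content changing.
VOLATILE_FIELDS = {"timestamp", "updated_at", "view_count"}


def stable_hash(item: dict, hex_chars: int = 16) -> str:
    """Hash an item after dropping volatile top-level fields and fixing key order."""
    stable = {k: v for k, v in item.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:hex_chars]


def collision_probability(n_items: int, hex_chars: int = 16) -> float:
    """Birthday approximation: p is about n^2 / (2 * 2^bits), with 4 bits per hex char."""
    bits = 4 * hex_chars
    return n_items ** 2 / (2 * 2 ** bits)
```

With 16 hex chars, collision_probability(100_000) is about 2.7e-10, or roughly 1 in 3.7 billion.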

Step 3: Processing Loop with Retry and Pacing

Ties StateManager and AnalysisIndex together. For each item: compute hash, check staleness, process if needed, record result.

import time

class ThrottlingError(Exception):
    """Rate-limit error; replace with the exception your API client actually raises."""

class IncrementalProcessor:
    def __init__(self, api_client, state: StateManager, index: AnalysisIndex):
        self.api = api_client
        self.state = state
        self.index = index
        self.api_delay = 2.0  # seconds between API calls

    def process_source(self, source_id: str, items: list[dict]):
        # Assumes items are ordered oldest to newest by "timestamp".
        processed, skipped = 0, 0
        last_ts = None  # timestamp of the last item fully handled
        for item in items:
            content_hash = AnalysisIndex.compute_hash(item["content"])
            if not self.index.is_stale(source_id, item["id"], content_hash):
                skipped += 1
                last_ts = item["timestamp"]
                continue
            try:
                result = self._call_with_retry(item["content"])
            except ThrottlingError:
                break  # Still throttled after retries: stop and resume next run
            self.index.record(source_id, item["id"], content_hash, result["id"])
            processed += 1
            last_ts = item["timestamp"]
            time.sleep(self.api_delay)
        if processed > 0:
            # Checkpoint the last handled item, not items[-1], so a throttled run
            # does not skip the items it never reached.
            self.state.set_last_ts(source_id, last_ts, processed, skipped)
        return {"processed": processed, "skipped": skipped}

    def _call_with_retry(self, content: str, max_retries: int = 3) -> dict:
        for attempt in range(max_retries):
            try:
                return self.api.analyze(content)
            except ThrottlingError:
                if attempt == max_retries - 1: raise
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, then 2s with max_retries=3

Pacing rules:

Default delay: 2.0s between calls. A conservative starting point; tune it to the API's documented rate limit.
Exponential backoff on throttling: 2^attempt seconds. Never fixed-delay retries.
Break on persistent throttling: save state and stop. Next run resumes from checkpoint.
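
The backoff rule can be isolated into a small helper that is easy to test. This is a sketch under assumptions: ThrottlingError stands in for whatever exception your client raises, and the sleep parameter is a hypothetical hook so tests can record delays instead of waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottlingError(Exception):
    """Stand-in for whatever rate-limit exception your API client raises."""


def backoff_delays(max_retries: int) -> list[int]:
    """Delays slept between attempts: 2**attempt for every attempt except the last."""
    return [2 ** attempt for attempt in range(max_retries - 1)]


def call_with_retry(fn: Callable[[], T], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, backing off exponentially on ThrottlingError; re-raise after the last attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_retries - 1:
                raise
            sleep(2 ** attempt)
    raise ValueError("max_retries must be at least 1")
```

With max_retries=3 there are three attempts and two sleeps, 1s and 2s; the third failure is re-raised so the caller can stop and save state.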

Step 4: Pipeline Runner

All state lives in .state/ (add to .gitignore; mount as Docker volume for persistence). For multiple sources, namespace files: slack_state.json, github_index.json, etc.

def run_pipeline(api_client, sources: list[str]):
    state = StateManager()
    index = AnalysisIndex()
    processor = IncrementalProcessor(api_client, state, index)
    for source_id in sources:
        last_ts = state.get_last_ts(source_id)
        items = api_client.fetch_items(source_id, since=last_ts)
        if not items:
            continue
        result = processor.process_source(source_id, items)
        print(f"[{source_id}] Processed: {result['processed']}, Skipped: {result['skipped']}")
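
Per-source namespacing of the state files could be sketched as below. The helper name state_paths is hypothetical, and StateManager and AnalysisIndex as written use fixed class-level paths, so they would need to accept a path argument to make use of it.

```python
from pathlib import Path

STATE_DIR = Path(".state")


def state_paths(source: str, state_dir: Path = STATE_DIR) -> dict[str, Path]:
    """Return per-source checkpoint and index paths, e.g. .state/slack_state.json."""
    # Replace characters that are unsafe in file names (such as "/") with "_".
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in source)
    return {
        "state": state_dir / f"{safe}_state.json",
        "index": state_dir / f"{safe}_index.json",
    }
```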

To force reprocessing after analysis logic changes, clear the index: index._index = {}; index._save(). Without clearing, partial failures resolve naturally -- recorded hashes let the next run skip completed items.
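
When only one source needs reprocessing, clearing the whole index is more than required. A minimal sketch of a narrower reset, assuming the "{source}:{item_id}" key format used by AnalysisIndex; the function name invalidate_source is hypothetical and it operates on the plain index dict.

```python
def invalidate_source(index: dict, source: str) -> int:
    """Remove every '{source}:{item_id}' entry in place; return how many were removed."""
    # Note: a source ID that itself contains ":" could share a prefix with another source.
    prefix = f"{source}:"
    stale_keys = [key for key in index if key.startswith(prefix)]
    for key in stale_keys:
        del index[key]
    return len(stale_keys)
```

After calling it, persist the index the same way record() does so the removal survives a restart.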

Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| Everything reprocesses every run | Non-deterministic content hashing | Normalize before hashing: sort keys, strip whitespace, remove volatile fields |
| State file empty or {} | Crash during write | Use atomic writes: write to .tmp, then os.replace |
| Rate limit despite pacing | Multiple instances running | Use a lock file in .state/ to prevent concurrent runs |
| Index grows indefinitely | Deleted items never cleaned | Periodically prune entries whose source items no longer exist |
| Hash collisions skip items | Truncated hash too short | Increase hash from 16 to 32 hex chars for datasets over 1M items |
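
The lock-file and pruning fixes from the table might be sketched as follows. Both names are hypothetical, the lock path is an assumption, and a lock left behind by a killed process has to be removed by hand in this simple version.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def single_instance_lock(lock_path: Path = Path(".state") / "pipeline.lock"):
    """Hold an exclusive lock file for the duration of the with-block.

    Raises FileExistsError if another run already holds the lock.
    """
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    # O_CREAT | O_EXCL makes creation fail if the file already exists.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        lock_path.unlink()


def prune_index(index: dict, live_keys: set[str]) -> int:
    """Drop entries whose 'source:item_id' key is not in live_keys; return the count removed."""
    dead = [key for key in index if key not in live_keys]
    for key in dead:
        del index[key]
    return len(dead)
```

Wrapping the pipeline run in the lock context means a second instance fails fast with FileExistsError instead of doubling the request rate.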

Cross-References

data-intelligence-pipeline -- Full pipeline architecture using incremental processing as a core pattern
multi-provider-fallback -- API client abstraction used in _call_with_retry for provider switching


Process large datasets incrementally using hash-based change detection, state persistence, and API pacing to avoid redundant work. Use when building "incremental processing", "change detection", "batch processing", or "resumable pipelines".

2 stars

development

Updated Apr 25, 2026

npx skillsauth add paolomoz/skills incremental-processor

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 25, 2026, 7:28 PM | 126.2s | 1 file scanned




Install manually


# Clone the repo
git clone https://github.com/paolomoz/skills.git

# Copy into Claude Code skills folder (global)
cp -r skills/skills/incremental-processor ~/.claude/skills/


Repository

paolomoz/skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT