organvm-iv-taxis/data-ingestion-pipeline

Name: data-ingestion-pipeline
Author: organvm-iv-taxis

.build/direct/example/data-ingestion-pipeline/SKILL.md

npx skillsauth add organvm-iv-taxis/a-i--skills data-ingestion-pipeline

Data Ingestion Pipeline

Extract, validate, and load data from diverse sources into target systems.

Pipeline Architecture

Sources → Extract → Validate → Transform → Stage → Load → Verify
  │          │          │          │          │       │        │
  │          │          │          │          │       │        └─ Row counts match
  │          │          │          │          │       └─ Write to target
  │          │          │          │          └─ Staging table/file
  │          │          │          └─ Normalize, enrich, deduplicate
  │          │          └─ Schema validation, business rules
  │          └─ Pull from source
  └─ APIs, files, databases, streams

Source Extraction

File-Based Sources

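from pathlib import Path
import csv
import json

import yaml  # third-party: PyYAML


def _read_csv(path: Path) -> list[dict]:
    # newline="" lets the csv module handle quoted line breaks itself;
    # the context manager closes the file handle once the rows are read.
    with path.open(newline="") as fh:
        return list(csv.DictReader(fh))


class FileExtractor:
    # Map each supported file suffix to a function that parses the file.
    PARSERS = {
        ".json": lambda p: json.loads(p.read_text()),
        ".yaml": lambda p: yaml.safe_load(p.read_text()),
        ".yml": lambda p: yaml.safe_load(p.read_text()),
        ".csv": _read_csv,
    }

    def extract(self, path: Path) -> list[dict]:
        # Lower-case the suffix so "DATA.JSON" and "data.json" are treated alike.
        parser = self.PARSERS.get(path.suffix.lower())
        if not parser:
            raise ValueError(f"Unsupported format: {path.suffix}")
        data = parser(path)
        # Wrap a single top-level object so callers always receive a list.
        return data if isinstance(data, list) else [data]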

API Extraction with Pagination

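import httpx

async def extract_paginated(base_url: str, params: dict | None = None) -> list[dict]:
    # Avoid a mutable default argument; copy so the caller's dict is untouched.
    query = dict(params or {})
    all_records: list[dict] = []
    page = 1
    async with httpx.AsyncClient() as client:
        while True:
            response = await client.get(base_url, params={**query, "page": page, "per_page": 100})
            response.raise_for_status()
            data = response.json()
            # Accept either a bare JSON array or an object that wraps the page
            # under "items" or "results". An object with neither key ends the
            # loop instead of being appended key by key forever.
            if isinstance(data, dict):
                records = data.get("items") or data.get("results") or []
            else:
                records = data
            if not records:
                break  # an empty page marks the end of the collection
            all_records.extend(records)
            page += 1
    return all_records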

Database Extraction

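import asyncpg

async def extract_from_db(dsn: str, query: str, batch_size: int = 1000):
    conn = await asyncpg.connect(dsn)
    try:
        # asyncpg cursors can only be used inside a transaction.
        async with conn.transaction():
            # Iteration yields one Record at a time; prefetch controls how many
            # rows are fetched from the server per round trip.
            async for record in conn.cursor(query, prefetch=batch_size):
                yield dict(record)
    finally:
        await conn.close()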
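
`extract_from_db` yields one row at a time, so a loader that writes in bulk has to regroup the stream. The sketch below shows one way to do that. The names `batched_async`, `fake_rows`, and `collect_batch_sizes` are hypothetical and not part of the skill; `fake_rows` stands in for the database so the example runs on its own.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def batched_async(rows: AsyncIterator[T], size: int) -> AsyncIterator[list[T]]:
    """Group an async stream of rows into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: list[T] = []
    async for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


async def fake_rows(n: int) -> AsyncIterator[dict]:
    # Hypothetical stand-in for extract_from_db, so no database is needed.
    for i in range(n):
        yield {"id": i}


async def collect_batch_sizes(n: int, size: int) -> list[int]:
    return [len(b) async for b in batched_async(fake_rows(n), size)]


sizes = asyncio.run(collect_batch_sizes(5, 2))  # sizes == [2, 2, 1]
```

Because the helper only consumes an async iterator, the same wrapper should work around any row source, not just asyncpg.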

Validation

Schema Validation

from dataclasses import dataclass

@dataclass
class ValidationResult:
    valid: list[dict]
    invalid: list[tuple[dict, str]]  # (record, error_message)

def validate_records(records: list[dict], schema: dict) -> ValidationResult:
    result = ValidationResult(valid=[], invalid=[])
    required_fields = schema.get("required", [])

    for record in records:
        errors = []
        for field in required_fields:
            if field not in record or record[field] is None:
                errors.append(f"Missing required field: {field}")

        for field, rules in schema.get("fields", {}).items():
            if field in record and record[field] is not None:
                value = record[field]
                if "type" in rules and not isinstance(value, rules["type"]):
                    errors.append(f"{field}: expected {rules['type'].__name__}")
                if "max_length" in rules and len(str(value)) > rules["max_length"]:
                    errors.append(f"{field}: exceeds max length {rules['max_length']}")

        if errors:
            result.invalid.append((record, "; ".join(errors)))
        else:
            result.valid.append(record)

    return result
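
The schema that `validate_records` reads is a plain dict with a `required` list and a `fields` mapping, where each rule may carry a `type` and a `max_length`. One caveat with the `type` rule: in Python `bool` is a subclass of `int`, so a field declared as `int` also accepts `True` and `False`. The sketch below shows an example schema and a stricter check. The field names and the helper `matches_type` are illustrative assumptions, not part of the skill.

```python
# Example schema in the shape validate_records expects (field names are made up).
schema = {
    "required": ["id", "name"],
    "fields": {
        "id": {"type": int},
        "name": {"type": str, "max_length": 50},
    },
}


def matches_type(value: object, expected: type) -> bool:
    """isinstance check that does not let a bool pass as an int."""
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)
```

Swapping `isinstance(value, rules["type"])` for a helper like this would reject `{"id": True}` while leaving every other rule unchanged.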

Business Rule Validation

def apply_business_rules(records: list[dict]) -> ValidationResult:
    # Allowed values are defined once, outside the per-record loop.
    valid_organs = {"I", "II", "III", "IV", "V", "VI", "VII", "META"}
    valid_statuses = {"LOCAL", "CANDIDATE", "PUBLIC_PROCESS", "GRADUATED", "ARCHIVED"}

    result = ValidationResult(valid=[], invalid=[])
    for record in records:
        errors = []

        # Example: organ must be one of the known values
        if record.get("organ") not in valid_organs:
            errors.append(f"Invalid organ: {record.get('organ')}")

        # Example: status must be a known promotion state. This checks
        # membership only; it does not validate transitions between states.
        if record.get("status") not in valid_statuses:
            errors.append(f"Invalid status: {record.get('status')}")

        if errors:
            result.invalid.append((record, "; ".join(errors)))
        else:
            result.valid.append(record)
    return result

Deduplication

def deduplicate(records: list[dict], key_fields: list[str]) -> list[dict]:
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
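
`deduplicate` builds its key directly from the raw field values, so two practical problems can surface: a list or dict value makes the key unhashable and raises `TypeError`, and strings that differ only in case or surrounding whitespace count as distinct. A minimal sketch of a key builder that handles both follows. The name `dedup_key` is hypothetical, and treating case and outer whitespace as insignificant is an assumption that will not suit every dataset.

```python
import json


def dedup_key(record: dict, key_fields: list[str]) -> tuple:
    """Build a hashable, normalized key from the chosen fields."""
    parts = []
    for field in key_fields:
        value = record.get(field)
        if isinstance(value, str):
            # Assumption: values differing only by case or outer whitespace match.
            value = value.strip().lower()
        elif isinstance(value, (list, dict)):
            # Lists and dicts are unhashable; serialize them deterministically.
            value = json.dumps(value, sort_keys=True, default=str)
        parts.append(value)
    return tuple(parts)
```

Inside `deduplicate`, the line that builds `key` could call a helper like this instead of using `record.get(f)` directly.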

Merge Strategy

from enum import Enum

class MergeStrategy(str, Enum):
    KEEP_FIRST = "keep_first"
    KEEP_LATEST = "keep_latest"
    MERGE_FIELDS = "merge_fields"

def merge_duplicates(records: list[dict], key_fields: list[str], strategy: MergeStrategy) -> list[dict]:
    groups: dict[tuple, list[dict]] = {}
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        groups.setdefault(key, []).append(record)

    merged = []
    for group in groups.values():
        if strategy == MergeStrategy.KEEP_FIRST:
            merged.append(group[0])
        elif strategy == MergeStrategy.KEEP_LATEST:
            # "Latest" means last in input order; sort by time upstream if needed.
            merged.append(group[-1])
        elif strategy == MergeStrategy.MERGE_FIELDS:
            # Later records overwrite earlier ones, except where the later value is None.
            result = {}
            for record in group:
                for k, v in record.items():
                    if v is not None:
                        result[k] = v
            merged.append(result)
    return merged
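
`KEEP_LATEST` takes the last record in input order, which is the newest one only if the input already arrives sorted by time. When records carry their own timestamp, the choice can be made explicit. The sketch below assumes each record has an ISO 8601 string in a field called `updated_at`; both the field name and the function `keep_latest_by` are hypothetical.

```python
from datetime import datetime


def keep_latest_by(
    records: list[dict], key_fields: list[str], ts_field: str = "updated_at"
) -> list[dict]:
    """Keep, per key, the record with the greatest timestamp in `ts_field`."""
    latest: dict[tuple, dict] = {}
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        current = latest.get(key)
        # Parse instead of comparing strings so differing precisions sort correctly.
        if current is None or (
            datetime.fromisoformat(record[ts_field])
            > datetime.fromisoformat(current[ts_field])
        ):
            latest[key] = record
    return list(latest.values())
```

Timestamps should be consistently naive or consistently timezone-aware; Python raises `TypeError` when the two kinds are compared.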

Staging Pattern

import json  # used by stage() and promote() below
from datetime import datetime
from pathlib import Path

class StagingArea:
    def __init__(self, base_dir: str):
        self.base = Path(base_dir)

    def stage(self, batch_id: str, records: list[dict]) -> Path:
        stage_dir = self.base / batch_id
        stage_dir.mkdir(parents=True, exist_ok=True)

        data_path = stage_dir / "data.json"
        meta_path = stage_dir / "metadata.json"

        data_path.write_text(json.dumps(records, indent=2, default=str))
        meta_path.write_text(json.dumps({
            "batch_id": batch_id,
            "record_count": len(records),
            "staged_at": datetime.now().isoformat(),
            "status": "staged",
        }))
        return stage_dir

    def promote(self, batch_id: str) -> list[dict]:
        stage_dir = self.base / batch_id
        data = json.loads((stage_dir / "data.json").read_text())
        meta = json.loads((stage_dir / "metadata.json").read_text())
        meta["status"] = "promoted"
        meta["promoted_at"] = datetime.now().isoformat()
        (stage_dir / "metadata.json").write_text(json.dumps(meta, indent=2))
        return data
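
`stage` writes `data.json` in place, so a crash in the middle of the write could leave a truncated file that a later `promote` would try to parse. A common way to avoid this is to write to a temporary file in the same directory and rename it over the target. The sketch below illustrates the idea; `write_json_atomic` is a hypothetical helper, not part of the skill.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: object) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2, default=str)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        # os.replace overwrites the destination; the rename is atomic on POSIX.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

`stage` and `promote` could call a helper like this wherever they currently use `write_text`.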

Pipeline Orchestration

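# `log` was not defined in the original snippet. structlog (third-party) is
# assumed here because run() passes log fields as keyword arguments.
import structlog

log = structlog.get_logger()


class IngestionPipeline: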
    def __init__(self, extractor, validator, transformer, loader):
        self.extractor = extractor
        self.validator = validator
        self.transformer = transformer
        self.loader = loader

    async def run(self, source: str) -> dict:
        # Extract
        raw = await self.extractor.extract(source)

        # Validate
        validation = self.validator.validate(raw)
        if validation.invalid:
            log.warning("validation_failures", count=len(validation.invalid))

        # Transform
        transformed = self.transformer.transform(validation.valid)

        # Deduplicate
        unique = deduplicate(transformed, key_fields=["id"])

        # Load
        loaded = await self.loader.load(unique)

        return {
            "extracted": len(raw),
            "valid": len(validation.valid),
            "invalid": len(validation.invalid),
            "loaded": loaded,
        }
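
The architecture diagram ends with a Verify step (row counts match), which `run` does not perform. A minimal sketch of such a check follows. It assumes the loader returns the number of rows written, as the `loaded` value suggests, and that the transform step is one-to-one so counts can be compared. The function `verify_counts` is hypothetical.

```python
def verify_counts(extracted: int, valid: int, invalid: int, unique: int, loaded: int) -> list[str]:
    """Return a list of discrepancies; an empty list means the counts reconcile."""
    problems = []
    if valid + invalid != extracted:
        problems.append(
            f"valid ({valid}) + invalid ({invalid}) != extracted ({extracted})"
        )
    if unique > valid:
        problems.append(f"unique ({unique}) exceeds valid ({valid})")
    if loaded != unique:
        problems.append(f"loaded ({loaded}) != unique ({unique})")
    return problems
```

`run` could call this before returning and either raise or include the discrepancy list in its summary dict.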

Anti-Patterns

No validation gate — Always validate before loading; corrupt data is worse than missing data
Loading directly from source — Stage first; staging enables inspection and rollback
No deduplication — Sources often contain duplicates; handle at ingestion
Silent data loss — Log and report every skipped/invalid record
Monolithic pipeline — Break into composable stages for testing and reuse
No idempotency — Pipeline re-runs should produce the same result
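
The idempotency point in the list above can be illustrated with a keyed upsert: if every record is written under a stable primary key, loading the same batch twice leaves the target unchanged. The sketch below uses the standard-library `sqlite3` module. The table `items`, its columns, and the function `load_idempotent` are hypothetical, chosen only for illustration.

```python
import sqlite3


def load_idempotent(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Insert or overwrite rows keyed by id, so re-running a batch changes nothing."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)")
    with conn:  # commits on success, rolls back if an error is raised
        conn.executemany(
            "INSERT OR REPLACE INTO items (id, name) VALUES (:id, :name)",
            records,
        )
    # Report the table's row count so callers can compare runs.
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Running the same batch twice against one connection should return the same count both times, while a changed record with an existing `id` replaces the stored row rather than adding a new one.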

Build data ingestion pipelines for batch and streaming data from multiple sources. Covers extraction strategies, format normalization, deduplication, validation gates, and staging patterns. Triggers on data ingestion, ETL pipeline, or data import architecture requests.

6 stars

development

Updated Apr 17, 2026

npx skillsauth add organvm-iv-taxis/a-i--skills data-ingestion-pipeline

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 30, 2026, 3:18 AM; 5.0s; 1 file scanned

SKILL.md

name: data-ingestion-pipeline
description: Build data ingestion pipelines for batch and streaming data from multiple sources. Covers extraction strategies, format normalization, deduplication, validation gates, and staging patterns. Triggers on data ingestion, ETL pipeline, or data import architecture requests.
license: MIT
complexity: intermediate
time_to_learn: 30min
governance_phases: [build]
organ_affinity: [meta]
triggers: [user-asks-about-data-ingestion, context:etl, context:data-import, context:data-pipeline]
complements: [data-pipeline-architect, data-backup-patterns, configuration-management]

Related Skills

organvm-iv-taxis/shell-history-hygiene

development

Verified, Trusted, Community

Dry-run audit + targeted cleanup for shell command history. Currently wraps atuin (stats today, prune, dedup with dated preview artifacts); extensible to zsh/bash/mcfly backends. Always previews before applying — apply commands are echoed for the human to run, never auto-executed. Triggers on "/shell-history-hygiene", "audit atuin", "audit shell history", "clean shell history", "atuin prune", "atuin dedup", "shell history hygiene", "history cleanup". Replaces ad-hoc one-liners (e.g. `... | tee cmd > file.txt` which wrote two files, swallowed dedup output, and left a junk `cmd` file).

12, SKILL.md, updated Jul 20, 2026

organvm-iv-taxis/setup-cowork

tools

Verified, Trusted, Community

Guided Cowork setup — install role-matched plugins, connect your tools, try a skill.

12, SKILL.md, updated Jul 20, 2026

organvm-iv-taxis/session-lifecycle-patterns

development

Verified, Trusted, Community

Manage AI agent session lifecycles with structured phases (FRAME, SHAPE, BUILD, PROVE), context preservation across sessions, handoff protocols, and session metadata tracking. Triggers on session management, agent lifecycle, or multi-session workflow requests.

12, SKILL.md, updated Jul 20, 2026

organvm-iv-taxis/session-governance-audit

tools

Verified, Trusted, Community

Parse a session transcript into a structured Session Governance Index — an annotated bibliography of every file modified and commit made, internal-energy accounting (tool uses, estimated tokens), shipped-vs-tasked atom tally, and classification of missing items as Gaps or Vacuums. Triggers on "visibility-schema-substrate-sweep", "session cascade audit", "session governance audit", or any request to summarize what a session actually produced versus what it was asked to produce.

12, SKILL.md, updated Jul 20, 2026

Install manually

# Clone the repo
git clone https://github.com/organvm-iv-taxis/a-i--skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r a-i--skills/.build/direct/example/data-ingestion-pipeline ~/.claude/skills/

Repository: organvm-iv-taxis/a-i--skills (6 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT