engineering/database-engineering/skills/schema-design/SKILL.md
This skill should be used when the user asks about "database schema", "data modeling", "table design", "column types", "primary key", "foreign key", "indexes", "normalization", "denormalization", "database relationships", "one-to-many", "many-to-many", "entity relationship", "ER diagram", "database constraints", "NOT NULL", "unique constraint", "CHECK constraint", "soft delete", "deleted_at", "audit log", "audit table", "JSONB", "JSON column", "enum type", "PostgreSQL schema", "MySQL schema", "Prisma schema", "SQLAlchemy models", "ActiveRecord migration". Also trigger for "how should I store this data", "how do I model this relationship", "is this database design good", or "how to structure the database for".
npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library schema-designInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Production-proven patterns for relational database schema design in PostgreSQL/MySQL.
NOT NULL, CHECK, and UNIQUE constraint documents a business rule in the database itself-- Tables: plural, snake_case
CREATE TABLE orders -- ✓
CREATE TABLE order_items -- ✓ (not orderItems, not order_item)
CREATE TABLE OrderTable -- ✗
-- Primary key: always 'id', always the first column
-- Use UUID for: distributed systems, public-facing IDs (prevents enumeration)
-- Use BIGSERIAL for: purely internal tables with very high insert rates
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT NOT NULL, -- 'email' not 'user_email' or 'emailAddress'
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Foreign keys: {singular_referenced_table}_id
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES users(id), -- not user_id (ambiguous), not users_id
...
);
-- Boolean columns: is_, has_, can_ prefix
is_active BOOLEAN NOT NULL DEFAULT true
has_subscription BOOLEAN NOT NULL DEFAULT false
can_edit BOOLEAN NOT NULL DEFAULT false
-- Timestamp columns: always include these
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
deleted_at TIMESTAMPTZ -- Nullable = present; not null = soft-deleted
-- IDs and references
UUID -- Public-facing IDs (opaque, non-enumerable)
BIGINT GENERATED ALWAYS AS IDENTITY -- Internal surrogate keys (compact, sequential)
TEXT -- User-provided string IDs (Stripe charges, etc.)
-- Text
TEXT -- Variable-length text (PostgreSQL: TEXT = VARCHAR(∞), equally fast)
VARCHAR(n) -- When there IS a meaningful max length (e.g., zip codes, country codes)
CHAR(n) -- Almost never — fixed-width strings, pads with spaces
-- Numbers
INTEGER -- Counts, small integers (−2B to 2B)
BIGINT -- Large counts, snowflake IDs
NUMERIC(10, 2) -- Monetary values (exact decimal — NEVER float for money)
-- Or store in cents (integer) and divide when displaying: 4999 → $49.99
-- Floats (only for scientific/measurement values where approximation is acceptable)
DOUBLE PRECISION -- 64-bit float
REAL -- 32-bit float
-- Date and time
TIMESTAMPTZ -- Always: UTC with timezone info (PostgreSQL)
DATE -- Date without time (birthdate, invoice date)
TIME -- Time of day without date (business hours)
INTERVAL -- Duration (3 days, 2 hours)
-- NEVER: Unix epoch as INTEGER (unreadable, error-prone, no DB functions)
-- Booleans
BOOLEAN -- true/false — use for binary state only
-- Don't use TINYINT(1) in PostgreSQL — that's a MySQL pattern
-- JSON
JSONB -- PostgreSQL: binary JSON, indexable, queryable (prefer over JSON)
JSON -- PostgreSQL: text JSON (exact storage, no indexing) — rarely preferred
-- When to use JSONB: config, metadata, infrequently queried flexible attributes
-- When NOT to use JSONB: data you need to query, filter, or join on — make it relational
-- Enums
-- Option 1: PostgreSQL ENUM type (schema change to add values)
CREATE TYPE order_status AS ENUM ('pending', 'processing', 'shipped', 'delivered', 'cancelled');
-- Option 2: TEXT with CHECK constraint (easier to evolve, readable in dumps)
status TEXT NOT NULL CHECK (status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled'))
-- Option 3: Lookup table (most flexible, nullable with FK)
CREATE TABLE order_statuses (id TEXT PRIMARY KEY, label TEXT NOT NULL);
INSERT INTO order_statuses VALUES ('pending', 'Awaiting payment'), ...;
status TEXT NOT NULL REFERENCES order_statuses(id)
CREATE TABLE customers (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT NOT NULL UNIQUE -- UNIQUE enforces one account per email
);
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES customers(id) ON DELETE CASCADE,
-- ON DELETE: CASCADE | RESTRICT | NO ACTION | SET NULL | SET DEFAULT
-- CASCADE: deleting customer deletes their orders (use carefully)
-- RESTRICT: prevent deleting customer if they have orders (safer default)
-- SET NULL: set customer_id to null (requires nullable column)
total_cents INTEGER NOT NULL CHECK (total_cents >= 0),
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Always index the FK side
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Without this: DELETE FROM customers WHERE id = ? does a sequential scan of orders
CREATE TABLE products (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL
);
CREATE TABLE tags (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL UNIQUE
);
-- Junction table: named by combining the two entities
CREATE TABLE product_tags (
product_id UUID NOT NULL REFERENCES products(id) ON DELETE CASCADE,
tag_id UUID NOT NULL REFERENCES tags(id) ON DELETE CASCADE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (product_id, tag_id) -- Composite PK prevents duplicates
);
-- Index both directions (PK covers product_id → tag_id direction)
CREATE INDEX idx_product_tags_tag_id ON product_tags(tag_id);
-- This makes queries "which products have tag X?" efficient
-- Adjacency list (simple, but expensive for deep tree queries)
CREATE TABLE categories (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
parent_id UUID REFERENCES categories(id) ON DELETE RESTRICT, -- nullable = root
name TEXT NOT NULL
);
CREATE INDEX idx_categories_parent_id ON categories(parent_id);
-- Materialized path (fast reads, batch updates on move)
CREATE TABLE categories (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
path TEXT NOT NULL, -- '/electronics/computers/laptops'
name TEXT NOT NULL
);
CREATE INDEX idx_categories_path ON categories(path);
-- All descendants: WHERE path LIKE '/electronics/%'
-- Closure table (most flexible for tree queries)
CREATE TABLE category_ancestors (
ancestor_id UUID NOT NULL REFERENCES categories(id),
descendant_id UUID NOT NULL REFERENCES categories(id),
depth INTEGER NOT NULL, -- 0 = self
PRIMARY KEY (ancestor_id, descendant_id)
);
-- B-tree index (default) — equality, range, ORDER BY, BETWEEN
CREATE INDEX idx_orders_created_at ON orders(created_at DESC);
-- Composite index — field order matters
-- Rule: equality conditions first, then range conditions
-- Query: WHERE customer_id = ? AND created_at > ?
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at DESC);
-- This index can answer: WHERE customer_id = ? AND WHERE customer_id = ? AND created_at > ?
-- It CANNOT efficiently answer: WHERE created_at > ? alone
-- Partial index — index only rows matching a condition
-- Query pattern: WHERE deleted_at IS NULL AND status = 'active'
CREATE INDEX idx_orders_active ON orders(customer_id, created_at)
WHERE deleted_at IS NULL AND status IN ('pending', 'processing');
-- Smaller index, much faster for the specific use case
-- Covering index — includes columns to avoid table heap access
-- Query: SELECT id, status FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_customer_covering ON orders(customer_id) INCLUDE (status);
-- status is in the index, so PostgreSQL never touches the main table
-- GIN index — for array columns, JSONB, and full-text search
CREATE INDEX idx_users_tags ON users USING GIN(tags);
CREATE INDEX idx_products_attrs ON products USING GIN(attributes);
CREATE INDEX idx_posts_search ON posts USING GIN(to_tsvector('english', body));
-- Always add indexes CONCURRENTLY in production (avoids table lock)
CREATE INDEX CONCURRENTLY idx_name ON table(column);
status with only 3 values on a small table)CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES customers(id),
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled')),
total_cents INTEGER NOT NULL CHECK (total_cents >= 0),
discount_cents INTEGER NOT NULL DEFAULT 0 CHECK (discount_cents >= 0),
-- Business rule: discount can't exceed total
CHECK (discount_cents <= total_cents),
-- Exactly one of these is set (polymorphic reference — better as separate tables but this works)
payment_card_id UUID REFERENCES payment_cards(id),
payment_account_id UUID REFERENCES payment_accounts(id),
CHECK (
(payment_card_id IS NOT NULL)::integer +
(payment_account_id IS NOT NULL)::integer = 1
),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Multi-column uniqueness
CREATE UNIQUE INDEX uq_orders_idempotency
ON orders(customer_id, idempotency_key)
WHERE idempotency_key IS NOT NULL;
-- Generic audit log — works for any table
CREATE TABLE audit_logs (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
table_name TEXT NOT NULL,
record_id UUID NOT NULL,
operation TEXT NOT NULL CHECK (operation IN ('INSERT', 'UPDATE', 'DELETE')),
old_data JSONB,
new_data JSONB,
changed_by UUID REFERENCES users(id),
ip_address INET,
user_agent TEXT,
occurred_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_audit_table_record ON audit_logs(table_name, record_id, occurred_at DESC);
CREATE INDEX idx_audit_user ON audit_logs(changed_by, occurred_at DESC);
-- PostgreSQL trigger to auto-populate audit log
CREATE OR REPLACE FUNCTION audit_trigger_function()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO audit_logs (table_name, record_id, operation, old_data, new_data)
VALUES (
TG_TABLE_NAME,
COALESCE(NEW.id, OLD.id),
TG_OP,
CASE WHEN TG_OP = 'INSERT' THEN NULL ELSE row_to_json(OLD)::JSONB END,
CASE WHEN TG_OP = 'DELETE' THEN NULL ELSE row_to_json(NEW)::JSONB END
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Apply to a table
CREATE TRIGGER orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION audit_trigger_function();
-- Add to any table that needs soft delete
deleted_at TIMESTAMPTZ; -- NULL = active, timestamp = deleted
-- Partial index makes queries on active records efficient
CREATE INDEX idx_orders_active ON orders(customer_id, created_at)
WHERE deleted_at IS NULL;
-- PostgreSQL Row Security Policy (enforce soft delete at DB level)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY orders_hide_deleted ON orders
USING (deleted_at IS NULL);
-- Application-level: always filter
SELECT * FROM orders WHERE customer_id = ? AND deleted_at IS NULL;
-- "Delete"
UPDATE orders SET deleted_at = now() WHERE id = ?;
-- ANTI-PATTERN: EAV — looks flexible, is a performance disaster
CREATE TABLE entity_attributes (
entity_id UUID,
attr_name TEXT,
attr_value TEXT -- All values as strings = no type safety
);
-- BETTER: Use JSONB for truly dynamic attributes
CREATE TABLE products (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
attributes JSONB NOT NULL DEFAULT '{}' -- {"color": "red", "size": "M", "weight_g": 150}
);
CREATE INDEX idx_products_attrs ON products USING GIN(attributes);
-- Query: WHERE attributes->>'color' = 'red'
-- Query: WHERE (attributes->>'weight_g')::int > 100
-- ANTI-PATTERN: polymorphic (can't have FK constraint)
CREATE TABLE comments (
commentable_type TEXT, -- 'Post', 'Video', 'Article'
commentable_id UUID, -- No FK possible — DB can't enforce integrity
...
);
-- BETTER: separate tables (FK integrity preserved)
CREATE TABLE post_comments (post_id UUID REFERENCES posts(id), ...);
CREATE TABLE video_comments (video_id UUID REFERENCES videos(id), ...);
CREATE TABLE article_comments (article_id UUID REFERENCES articles(id), ...);
-- Or: null all but one FK (with check constraint)
CREATE TABLE comments (
post_id UUID REFERENCES posts(id),
video_id UUID REFERENCES videos(id),
article_id UUID REFERENCES articles(id),
CHECK (
(post_id IS NOT NULL)::int +
(video_id IS NOT NULL)::int +
(article_id IS NOT NULL)::int = 1
),
...
);
-- ANTI-PATTERN: comma-separated values in a column
CREATE TABLE users (
tags TEXT -- 'developer,admin,beta-user' — cannot index, cannot FK, hard to query
);
-- BETTER: array column (PostgreSQL)
CREATE TABLE users (
tags TEXT[] NOT NULL DEFAULT '{}'
);
CREATE INDEX idx_users_tags ON users USING GIN(tags);
-- Query: WHERE 'developer' = ANY(tags)
-- OR: proper junction table (best for entity references)
CREATE TABLE user_tags ( user_id UUID, tag_id UUID, PRIMARY KEY (user_id, tag_id) );
For normalization decision guides and advanced schema patterns, see:
references/normalization-guide.md — normal forms with examples, audit trail patterns, soft-delete strategies, and multi-tenancy schema designsreferences/advanced-patterns.md — time-series schema design, graph data in PostgreSQL, polymorphic associations, and event sourcing table layoutstesting
Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
tools
Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.
testing
This skill should be used when scientists need help with research problem selection, project ideation, troubleshooting stuck projects, or strategic scientific decisions. Use this skill when users ask to pitch a new research idea, work through a project problem, evaluate project risks, plan research strategy, navigate decision trees, or get help choosing what scientific problem to work on. Typical requests include "I have an idea for a project", "I'm stuck on my research", "help me evaluate this project", "what should I work on", or "I need strategic advice about my research".
development
Run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data. Use when analyzing RNA-seq, WGS/WES, or ATAC-seq data—either local FASTQs or public datasets from GEO/SRA. Triggers on nf-core, Nextflow, FASTQ analysis, variant calling, gene expression, differential expression, GEO reanalysis, GSE/GSM/SRR accessions, or samplesheet creation.