Name: database-design
Author: krzysztofsurdy

Database Design

Good database design determines the long-term maintainability, performance, and correctness of any data-driven application. Schema decisions made early are expensive to reverse later. Every table, column, index, and constraint should exist for a reason backed by access patterns and business rules.

Data Modeling Principles

Start from Access Patterns

Design tables around how the application reads and writes data, not around how entities look in a domain model. Two questions drive every schema decision:

What queries will run most frequently? - these determine table structure, indexes, and denormalization choices
What consistency guarantees does the data need? - these determine normalization level, constraints, and transaction boundaries

Entity Relationships

| Relationship | Implementation | When to Use | |---|---|---| | One-to-one | Foreign key with UNIQUE constraint on the child table | Splitting rarely-accessed columns into a separate table, or enforcing exactly-one semantics | | One-to-many | Foreign key on the child table referencing the parent | Orders to order items, users to addresses | | Many-to-many | Join table with composite primary key | Tags to articles, students to courses | | Many-to-many with attributes | Join table with its own columns beyond the two foreign keys | Enrollment with grade, membership with role | | Self-referential | Foreign key referencing the same table | Org charts, category trees, threaded comments |

Naming Conventions

Consistent naming prevents confusion across teams and tools:

Tables: plural nouns in snake_case (order_items, user_addresses)
Columns: singular snake_case describing the value (created_at, total_amount)
Foreign keys: <referenced_table_singular>_id (user_id, order_id)
Indexes: idx_<table>_<columns> (idx_orders_user_id_created_at)
Constraints: chk_<table>_<rule>, uq_<table>_<columns>, fk_<table>_<referenced>

Normalization vs Denormalization

Normal Forms

| Form | Rule | Violation Example | |---|---|---| | 1NF | Every column holds atomic values; no repeating groups | Storing comma-separated tags in a single column | | 2NF | Every non-key column depends on the entire primary key | In a composite-key table, a column depending on only part of the key | | 3NF | No non-key column depends on another non-key column | Storing both city and zip_code when zip determines city | | BCNF | Every determinant is a candidate key | A scheduling table where room determines building but room is not a key |

When to Denormalize

Normalization prevents anomalies but adds JOINs. Denormalize selectively when:

Read-heavy workloads dominate and JOIN cost is measurable in profiling
Reporting tables need pre-aggregated data that would otherwise require expensive queries
Caching a computed value avoids recalculating on every read (e.g., order_total stored on the order row)
Document-oriented access retrieves an entire aggregate in one read

Rules for safe denormalization:

Always keep the normalized source of truth - denormalized data is a derived cache
Define how and when the denormalized copy is updated (trigger, application event, batch job)
Monitor for drift between the source and the copy
Document why the denormalization exists and what access pattern it serves

Choosing a Database Type

| Type | Strengths | Fits When | |---|---|---| | Relational (PostgreSQL, MySQL) | ACID transactions, complex queries, mature tooling, JOINs | Structured data with relationships, transactional workloads, most CRUD applications | | Document (MongoDB, DynamoDB) | Flexible schema, nested data, horizontal scaling | Aggregates accessed as a unit, rapidly evolving schemas, per-tenant isolation | | Key-value (Redis, Memcached) | Sub-millisecond reads, simple data model | Session storage, caching, counters, rate limiting | | Column-family (Cassandra, ScyllaDB) | High write throughput, wide rows, linear scaling | Time-series, IoT telemetry, append-heavy workloads | | Graph (Neo4j, Neptune) | Traversal queries, relationship-centric data | Social networks, recommendation engines, fraud detection | | Time-series (TimescaleDB, InfluxDB) | Optimized for time-stamped data, automatic partitioning | Metrics, monitoring, financial tick data |

Polyglot persistence - using different databases for different parts of the same system - is valid when access patterns genuinely differ. It is not valid as a way to avoid learning one database well.

Indexing Fundamentals

Indexes accelerate reads at the cost of slower writes and additional storage. Every index must justify its existence through query patterns.

Index Types

| Type | Structure | Best For | |---|---|---| | B-tree | Balanced tree, sorted data | Equality and range queries, ORDER BY, most general-purpose indexing | | Hash | Hash table | Exact equality lookups only; no range support | | GiST | Generalized search tree | Spatial data, geometric queries, range types, nearest-neighbor | | GIN | Generalized inverted index | Full-text search, JSONB containment, array membership | | BRIN | Block range index | Large tables with naturally ordered data (timestamps, sequential IDs) |

Composite Index Design

The order of columns in a composite index matters. The leftmost prefix rule means a composite index on (a, b, c) supports queries filtering on (a), (a, b), or (a, b, c), but not (b, c) alone.

Column ordering guidelines:

Equality conditions first - columns compared with =
Range conditions last - columns compared with >, <, BETWEEN
Most selective column first among equals

Covering and Partial Indexes

Covering index: includes all columns the query needs, so the database reads only the index. Use INCLUDE (PostgreSQL) or just add columns to the index key.
Partial index: indexes only rows matching a condition, reducing size and write overhead. Ideal for querying a small subset of a large table (e.g., WHERE status = 'pending').

See Indexing Strategies Reference for detailed index types, EXPLAIN analysis, and anti-patterns.

Schema Evolution

Schema changes are inevitable. The question is whether they break running applications.

Backward-Compatible Changes (Safe)

Adding a new nullable column
Adding a new table
Adding a new index (may lock briefly on some engines)
Widening a column type (e.g., VARCHAR(50) to VARCHAR(100))

Breaking Changes (Require Migration Strategy)

Renaming or removing a column
Changing a column type in incompatible ways
Adding a NOT NULL constraint to an existing column with null data
Splitting or merging tables

The Expand-Contract Pattern

For breaking changes in production with zero downtime:

Expand - add the new structure alongside the old one
Migrate - backfill data from old to new, dual-write during transition
Switch - update application code to use the new structure
Contract - remove the old structure once nothing references it

See Migration Patterns Reference for zero-downtime strategies, rollback techniques, and multi-tool examples.

Partitioning and Sharding

Table Partitioning (Single Database)

| Strategy | How It Works | Use Case | |---|---|---| | Range | Rows split by value ranges (e.g., by month) | Time-series data, log tables, archival | | List | Rows split by discrete values (e.g., by region) | Multi-tenant data, geographic segmentation | | Hash | Rows distributed by hash of a column | Even distribution when no natural range exists |

Sharding (Multiple Databases)

Sharding distributes data across separate database instances. Use it only after single-instance optimizations (indexing, caching, read replicas) are exhausted.

Shard key selection criteria:

High cardinality - many distinct values to distribute evenly
Present in most queries - avoids scatter-gather across all shards
Stable - values that do not change after creation
Avoid hotspots - do not shard by a value that concentrates writes (e.g., current date)

Quick Reference: Common Design Mistakes

| Mistake | Consequence | Fix | |---|---|---| | No foreign key constraints | Orphaned rows, inconsistent data | Always define foreign keys unless there is a documented reason not to | | Over-indexing | Slow writes, wasted storage | Index only columns used in WHERE, JOIN, ORDER BY of actual queries | | Storing computed values without a refresh strategy | Stale data, silent bugs | Define update triggers, events, or batch jobs alongside any denormalization | | Using ENUM types for values that change | Schema migration for every new value | Use a lookup table with a foreign key instead | | Storing money as floating-point | Rounding errors | Use DECIMAL/NUMERIC or store as integer cents | | Missing created_at / updated_at timestamps | No auditability, difficult debugging | Add timestamp columns to every table by default | | Generic type + type_id polymorphism everywhere | No referential integrity, complex queries | Evaluate STI, CTI, or separate tables first |

Reference Files

| Reference | Contents | |---|---| | Modeling Patterns | Polymorphic associations (STI/CTI/TPT), soft deletes, audit trails, temporal data, self-referential trees, JSON columns | | Indexing Strategies | B-tree/hash/GiST/GIN details, composite index design, covering and partial indexes, EXPLAIN analysis, anti-patterns | | Migration Patterns | Version-based vs state-based migrations, expand-contract, data migrations, rollback strategies, multi-tool examples |

Integration with Other Skills

| Situation | Recommended Skill | |---|---| | Optimizing query performance and caching | Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for performance optimization patterns | | Designing domain models and aggregates | Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for clean architecture and DDD guidance | | Building APIs that expose database-backed resources | Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for API design principles | | Testing database interactions | Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for testing strategies |

Database Design

Data Modeling Principles

Start from Access Patterns

Design tables around how the application reads and writes data, not around how entities look in a domain model. Two questions drive every schema decision:

What queries will run most frequently? - these determine table structure, indexes, and denormalization choices
What consistency guarantees does the data need? - these determine normalization level, constraints, and transaction boundaries

Entity Relationships

Naming Conventions

Consistent naming prevents confusion across teams and tools:

Tables: plural nouns in snake_case (order_items, user_addresses)
Columns: singular snake_case describing the value (created_at, total_amount)
Foreign keys: <referenced_table_singular>_id (user_id, order_id)
Indexes: idx_<table>_<columns> (idx_orders_user_id_created_at)
Constraints: chk_<table>_<rule>, uq_<table>_<columns>, fk_<table>_<referenced>

Normalization vs Denormalization

Normal Forms

When to Denormalize

Normalization prevents anomalies but adds JOINs. Denormalize selectively when:

Read-heavy workloads dominate and JOIN cost is measurable in profiling
Reporting tables need pre-aggregated data that would otherwise require expensive queries
Caching a computed value avoids recalculating on every read (e.g., order_total stored on the order row)
Document-oriented access retrieves an entire aggregate in one read

Rules for safe denormalization:

Always keep the normalized source of truth - denormalized data is a derived cache
Define how and when the denormalized copy is updated (trigger, application event, batch job)
Monitor for drift between the source and the copy
Document why the denormalization exists and what access pattern it serves

Choosing a Database Type

Polyglot persistence - using different databases for different parts of the same system - is valid when access patterns genuinely differ. It is not valid as a way to avoid learning one database well.

Indexing Fundamentals

Indexes accelerate reads at the cost of slower writes and additional storage. Every index must justify its existence through query patterns.

Index Types

Composite Index Design

Column ordering guidelines:

Equality conditions first - columns compared with =
Range conditions last - columns compared with >, <, BETWEEN
Most selective column first among equals

Covering and Partial Indexes

Covering index: includes all columns the query needs, so the database reads only the index. Use INCLUDE (PostgreSQL) or just add columns to the index key.
Partial index: indexes only rows matching a condition, reducing size and write overhead. Ideal for querying a small subset of a large table (e.g., WHERE status = 'pending').

See Indexing Strategies Reference for detailed index types, EXPLAIN analysis, and anti-patterns.

Schema Evolution

Schema changes are inevitable. The question is whether they break running applications.

Backward-Compatible Changes (Safe)

Adding a new nullable column
Adding a new table
Adding a new index (may lock briefly on some engines)
Widening a column type (e.g., VARCHAR(50) to VARCHAR(100))

Breaking Changes (Require Migration Strategy)

Renaming or removing a column
Changing a column type in incompatible ways
Adding a NOT NULL constraint to an existing column with null data
Splitting or merging tables

The Expand-Contract Pattern

For breaking changes in production with zero downtime:

Expand - add the new structure alongside the old one
Migrate - backfill data from old to new, dual-write during transition
Switch - update application code to use the new structure
Contract - remove the old structure once nothing references it

See Migration Patterns Reference for zero-downtime strategies, rollback techniques, and multi-tool examples.

Partitioning and Sharding

Table Partitioning (Single Database)

Sharding (Multiple Databases)

Sharding distributes data across separate database instances. Use it only after single-instance optimizations (indexing, caching, read replicas) are exhausted.

Shard key selection criteria:

High cardinality - many distinct values to distribute evenly
Present in most queries - avoids scatter-gather across all shards
Stable - values that do not change after creation
Avoid hotspots - do not shard by a value that concentrates writes (e.g., current date)

Adoption

krzysztofsurdy/database-design

$ install --global

Security Scan Results

SKILL.md

Database Design

Data Modeling Principles

Start from Access Patterns

Entity Relationships

Naming Conventions

Normalization vs Denormalization

Normal Forms

When to Denormalize

Choosing a Database Type

Indexing Fundamentals

Index Types

Composite Index Design

Covering and Partial Indexes

Schema Evolution

Backward-Compatible Changes (Safe)

Breaking Changes (Require Migration Strategy)

The Expand-Contract Pattern

Partitioning and Sharding

Table Partitioning (Single Database)

Sharding (Multiple Databases)

Quick Reference: Common Design Mistakes

Reference Files

Integration with Other Skills

Related Skills

krzysztofsurdy/dispatching-agent-teams

krzysztofsurdy/agent-teams

krzysztofsurdy/using-ecosystem

krzysztofsurdy/plugin-creator

krzysztofsurdy/database-design

$ install --global

Security Scan Results

SKILL.md

Database Design

Data Modeling Principles

Start from Access Patterns

Entity Relationships

Naming Conventions

Normalization vs Denormalization

Normal Forms

When to Denormalize

Choosing a Database Type

Indexing Fundamentals

Index Types

Composite Index Design

Covering and Partial Indexes

Schema Evolution

Backward-Compatible Changes (Safe)

Breaking Changes (Require Migration Strategy)

The Expand-Contract Pattern

Partitioning and Sharding

Table Partitioning (Single Database)

Sharding (Multiple Databases)

Quick Reference: Common Design Mistakes

Reference Files

Integration with Other Skills

Related Skills

krzysztofsurdy/dispatching-agent-teams

krzysztofsurdy/agent-teams

krzysztofsurdy/using-ecosystem

krzysztofsurdy/plugin-creator