4. Data Model Definition

This step anchors the system in persistence. It explains what entities exist, how they relate, and where they should live.

What to Define

  • Core entities such as users, objects, sessions, events, messages, feeds, or counters.
  • Relationships between entities.
  • Primary keys, partition keys, and secondary indexes.
  • SQL versus NoSQL choice.
  • Data retention, archival, and deletion policies.
  • Which data belongs in object storage, cache, search, analytics, or a transactional database.
  • Which fields must be strongly consistent and which can lag.
  • Which data is immutable, append-only, or frequently updated.

Design Questions

  • What are the access patterns?
  • Which queries must be fast?
  • Which data needs strong consistency?
  • Which data can be denormalized?
  • How will the schema evolve over time?
  • Will this model need to support joins, fan-out, or aggregation?
  • Are there hot partitions or hot keys to avoid?

Data Modeling Choices

  • Use a relational schema when transactions and joins matter.
  • Use a key-value or document store when lookups dominate.
  • Use time-series or log-oriented storage when events arrive continuously.
  • Use object storage for large blobs such as images or video.
  • Use search indexes for text, filtering, and ranking.

SQL vs NoSQL Decision Tree

Choose SQL (RDBMS) When:

  • Structured data with rigid schema (users, transactions, orders).
  • Relational data with complex joins.
  • Need ACID transactions (banking, payments).
  • Developers familiar with SQL.
  • Examples: PostgreSQL, MySQL.

Choose NoSQL When:

  • Semi-structured or dynamic schema (JSON documents).
  • Very large data volumes (TB/PB scale).
  • Very high throughput (millions of writes/sec).
  • Eventual consistency acceptable.

NoSQL Types:

  • Key-Value: O(1) reads/writes (Redis, DynamoDB).
  • Document Store: JSON queries (MongoDB, CouchDB).
  • Wide Column Store: High write throughput (Cassandra, HBase).
  • Graph Database: Complex relationships (Neo4j).

Consistency Patterns (CAP Theorem)

You can guarantee 2 of 3: Consistency, Availability, Partition Tolerance. Since networks fail, choose Consistency or Availability:

Pattern Speed Reliability Use Case
Strong Consistency Slower High Transactions, banking
Eventual Consistency Faster Medium Feeds, DNS, S3
Weak Consistency Fastest Low Gaming, real-time

ACID vs BASE:

  • ACID (SQL): Atomicity, Consistency, Isolation, Durability → strong guarantees.
  • BASE (NoSQL): Basically Available, Soft state, Eventually consistent → higher throughput.

Output of This Step

  • A simple schema or entity model.
  • The database choice and why it fits the workload.
  • The read and write patterns the model must support.
  • The partitioning strategy and indexing approach.

Common Mistakes

  • Starting with tables before access patterns.
  • Ignoring partitioning and indexing.
  • Putting every kind of data into one store.
  • Designing a schema that is hard to scale or evolve.