SQL vs NoSQL and data modeling

Software & architecture · architecture · May 2024

SQL versus NoSQL is really a question about data model and access patterns, not a contest with a winner. Relational databases give flexible queries, joins, and strong transactions; NoSQL stores trade some of that for horizontal scale, schema flexibility, or a specialized shape. The right default is a relational database until a concrete requirement, scale, write throughput, schema churn, or a particular access pattern, pushes you elsewhere.

The deeper reason the choice matters is that relational systems let you decide queries later, since the planner can combine indexes and joins arbitrarily, whereas most NoSQL stores require you to model the data around the queries you already know you will run.

The relational model and ACID

Relational data is rows in tables linked by keys, normalized so each fact lives in one place. Normalization (up to third normal form) removes redundancy and the update anomalies it causes, at the cost of needing joins to reassemble data. On top of this, transactions provide ACID guarantees, so multi-row changes commit atomically and invariants hold under concurrency, which is exactly what you want for money, inventory, and anything with cross-entity rules.

The NoSQL families

  • Key-value (Redis, DynamoDB): lookups by key at huge scale, caching, sessions.
  • Document (MongoDB): flexible or nested documents and fast schema iteration.
  • Wide-column (Cassandra): massive write throughput and known query patterns.
  • Graph (Neo4j): relationship-heavy queries like social graphs and recommendations.

Normalization versus denormalization

Normalized schemas avoid redundancy but pay for it at read time with joins; denormalized schemas store redundant copies so a read is a single lookup, paying instead with write complexity and storage. The why behind most NoSQL modeling is exactly this trade: you precompute and duplicate so the common read is cheap, accepting that writes must update several copies and that consistency becomes the application's job.

Scaling and consistency

A relational database scales reads with replicas easily but scales writes hard, because a single primary serializes them; going further means partitioning or sharding, which sacrifices cheap cross-shard joins and transactions. NoSQL stores are built to partition from the start, which is why they scale writes, but distribution forces the CAP trade-off, so many default to eventual consistency and ask the application to tolerate stale reads.

Choosing

Reach for relational when you need joins, ad-hoc queries, and strong transactions, which is most applications. Reach for a specific NoSQL family when its shape matches your dominant access pattern at a scale a single primary cannot serve. In practice large systems are polyglot: Postgres for core data, Redis for caching, a search engine for full text, a warehouse for analytics.

Worked example

Modeling a social feed shows the trade-off concretely. A relational design joins users, follows, and posts at read time, which is clean but expensive for a celebrity with millions of followers. At scale you instead fan out on write: when someone posts, you append the post id to each follower's feed stored in a key-value or wide-column store keyed by user id. Reads then become a single lookup; the cost is more write work and duplicated data, the standard denormalization bargain.

Follow-up questions

  • Does NoSQL mean no transactions? No; many NoSQL stores offer limited transactions, but cross-partition ACID is still where relational databases shine.
  • What is denormalization and why do it? Storing redundant copies to avoid joins at read time, trading write complexity and storage for read speed.
  • How do you choose a partition or shard key? One that spreads load evenly and matches your main query, so reads hit one partition and no key becomes a hotspot.
  • Why is scaling writes harder than reads? Reads fan out to replicas, but writes must be serialized by a primary; scaling them needs partitioning or sharding.
  • What is polyglot persistence? Using several databases, each for what it does best, rather than forcing one store to serve every access pattern.

References

  1. Kleppmann, Designing Data-Intensive Applications (2017).
  2. Alex Xu, ByteByteGo System Design (vol. 1 and 2).