MongoDB · Stanley Jacob

MongoDB is a document database: it stores JSON-like documents (BSON) grouped into collections, with no fixed schema across the documents in a collection. That flexibility makes it fast to iterate on a data shape and natural for hierarchical data that would otherwise be spread across many relational tables.

The core modeling decision is embed versus reference. Data that is read together and owned by one parent (an order and its line items) is embedded in one document so a read is a single lookup; data shared across many parents (users referenced by many orders) is referenced by id, like a foreign key. Getting this right matters more than in a relational database, because MongoDB's join ($lookup) is not as cheap as a relational engine's.
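
A minimal sketch of the two shapes, using plain Python dicts in place of collections. The field names and the load_order_with_user helper are illustrative assumptions, not part of any MongoDB API:

```python
# Embedded: line items are owned by the order and live inside it.
order_embedded = {
    "_id": 1001,
    "user_id": 42,  # reference to a shared user document
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Referenced: users are shared by many orders, so they are stored once
# and pointed to by id, like a foreign key.
users = {42: {"_id": 42, "name": "Ada"}}
orders = {1001: order_embedded}


def load_order_with_user(order_id):
    """Resolve the reference by hand: one lookup per collection."""
    order = orders[order_id]        # first lookup: the items arrive with it
    user = users[order["user_id"]]  # second lookup: follow the reference
    return {**order, "user": user}


result = load_order_with_user(1001)
print(len(result["items"]), result["user"]["name"])
```

The embedded items cost nothing extra to read, while the referenced user costs a second lookup; that asymmetry is what the embed-or-reference decision weighs.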

How to use it

You insert documents and query by fields, including nested ones, and run analytics through the aggregation pipeline:

db.users.insertOne({ name: "Ada", roles: ["admin"], profile: { city: "NYC" } })

db.users.find({ "profile.city": "NYC", roles: "admin" })   // query a nested field

db.orders.aggregate([
  { $match: { status: "paid" } },
  { $group: { _id: "$user_id", total: { $sum: "$amount" } } },
  { $sort:  { total: -1 } },
  { $limit: 3 }
])

Key mechanics, and why

Collections are schemaless by default, which trades up-front rigidity for the ability to evolve fields without migrations, but pushes the burden of consistency onto the application (optional schema validation can take some of it back). Indexes (including on nested fields and arrays) are essential, since without them queries fall back to full collection scans. MongoDB scales horizontally by sharding on a shard key, so choosing a key that spreads load evenly and matches your queries is critical, mirroring the partition-key problem in any distributed store.
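
To make "spreads load evenly" concrete, here is a small simulation under stated assumptions: four hypothetical shards, fixed range boundaries, and md5 as a stand-in hash (not MongoDB's actual hashing or balancing logic). It compares where new writes land when the shard key grows monotonically:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def range_shard(key, boundaries):
    """Range sharding: a key goes to the first range whose upper bound exceeds it."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # last shard holds everything above the final boundary


def hashed_shard(key):
    """Hashed sharding stand-in: md5 is illustrative, not MongoDB's exact hash."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Boundaries were chosen when keys were below 3000; new keys keep growing.
boundaries = [1000, 2000, 3000]
new_keys = range(3000, 4000)  # e.g. an auto-incrementing id or a timestamp

range_load = Counter(range_shard(k, boundaries) for k in new_keys)
hashed_load = Counter(hashed_shard(k) for k in new_keys)

print("range :", dict(range_load))  # every new write lands on one shard
print("hashed:", dict(sorted(hashed_load.items())))
```

With the range layout every new write lands on the last shard, which is the hotspot pattern; hashing spreads the same keys across all four shards at the cost of turning range queries on that key into multi-shard reads.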

Trade-offs

You gain flexible schemas and easy horizontal scaling, and you give up the relational engine's cheap arbitrary joins and decades of query-planner maturity. Modern MongoDB supports multi-document transactions (on replica sets since 4.0 and on sharded clusters since 4.2), but if most of your work is multi-entity joins and strong invariants, a relational database is usually the better fit; if your data is document-shaped and read by its natural key, MongoDB is comfortable.

Internals worth knowing

MongoDB stores BSON (binary JSON) documents and runs on the WiredTiger storage engine, which provides document-level concurrency, snapshot isolation via MVCC, and on-disk compression. Every document has an _id that serves as the primary key; if you do not supply one, the driver generates a 12-byte ObjectId made of a 4-byte timestamp, a 5-byte random value unique to the process, and a 3-byte counter (older versions used machine and process identifiers in place of the random value). Its indexes parallel Postgres in spirit: single-field, compound (with a leftmost-prefix rule), multikey over array fields, text, geospatial, and TTL indexes that expire documents automatically.
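
The ObjectId layout can be unpacked with the standard library alone. This sketch assumes the layout in the current ObjectId specification (4-byte timestamp, 5-byte random value, 3-byte counter); the parse_object_id helper and the sample id are illustrative, and real code would normally use the driver's own ObjectId type:

```python
from datetime import datetime, timezone


def parse_object_id(hex_id):
    """Split a 24-hex-character ObjectId into its three fields."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),  # seconds since the Unix epoch
        "random": raw[4:9].hex(),                      # 5-byte per-process value
        "counter": int.from_bytes(raw[9:12], "big"),   # 3-byte incrementing counter
    }


parts = parse_object_id("507f1f77bcf86cd799439011")  # sample id, chosen for illustration
created = datetime.fromtimestamp(parts["timestamp"], tz=timezone.utc)
print(created.isoformat(), parts["random"], parts["counter"])
```

Because the timestamp occupies the leading bytes, ObjectIds sort roughly by creation time, which is convenient for ordering but makes a default _id a monotonically increasing value.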

Availability comes from replica sets: one primary takes writes and streams its oplog to secondaries (which can serve reads under a relaxed read preference), and the set elects a new primary on failure. Horizontal scale comes from sharding on a shard key, where data splits into chunks balanced across shards and routed by mongos. Read and write concern dial consistency, for example waiting for a majority of replicas to acknowledge a write.
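
As a rough illustration of what "a majority of replicas" means for write concern, the arithmetic looks like this. It is a simplification that counts every member as a voting, data-bearing node (real replica sets may include arbiters or non-voting members), and the function names are hypothetical:

```python
def majority(voting_members):
    """Smallest number of members that is more than half of the set."""
    return voting_members // 2 + 1


def is_majority_committed(acks, voting_members):
    """True once enough members (primary included) have acknowledged the write."""
    return acks >= majority(voting_members)


for n in (3, 5):
    print(n, "members -> majority of", majority(n))

print(is_majority_committed(acks=1, voting_members=3))  # primary only
print(is_majority_committed(acks=2, voting_members=3))  # primary plus one secondary
```

A write acknowledged by a majority survives the failure of any minority of members, which is why it is the usual choice when durability matters more than latency.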

Aggregation pipeline: composable stages ($match, $group, $lookup for joins, $facet) run in sequence.
Shard-key choice is hard to reverse (resharding a collection is possible since MongoDB 5.0, but it is an expensive operation) and decides whether a query hits one shard or scatter-gathers across all of them.
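
The difference between a targeted query and a scatter-gather can be sketched with a toy router. The shard names, the user_id shard key, and the md5-based placement are all assumptions made for illustration; mongos uses chunk metadata rather than this logic:

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names
SHARD_KEY = "user_id"                    # hypothetical shard key


def shard_for(value):
    """Map a shard-key value to one shard (md5 is an illustrative stand-in)."""
    digest = hashlib.md5(str(value).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def route(query):
    """Return the shards a router would need to contact for an equality query."""
    if SHARD_KEY in query:
        return [shard_for(query[SHARD_KEY])]  # targeted: one shard
    return list(SHARDS)                       # scatter-gather: all shards


targeted = route({"user_id": 42, "status": "paid"})
scattered = route({"status": "paid"})
print(len(targeted), len(scattered))
```

A query that includes the shard key touches one shard; the same filter without it must ask every shard and merge the results, so the key should appear in the most frequent queries.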

Using it from Python

from pymongo import MongoClient
db = MongoClient("mongodb://localhost:27017").mydb

db.users.insert_one({"name": "Ada", "roles": ["admin"], "profile": {"city": "NYC"}})
db.users.find_one({"profile.city": "NYC", "roles": "admin"})   # query a nested field

list(db.orders.aggregate([
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$user_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
    {"$limit": 3},
]))
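
To see what each stage of that pipeline computes, the same steps can be emulated over an in-memory list. This is only a model of the semantics with made-up sample data, not how the server executes a pipeline:

```python
from collections import defaultdict

orders = [  # made-up sample data
    {"user_id": 1, "status": "paid", "amount": 30},
    {"user_id": 2, "status": "paid", "amount": 50},
    {"user_id": 1, "status": "paid", "amount": 25},
    {"user_id": 3, "status": "refunded", "amount": 99},
    {"user_id": 4, "status": "paid", "amount": 10},
    {"user_id": 5, "status": "paid", "amount": 5},
]

# $match: keep only paid orders
paid = [o for o in orders if o["status"] == "paid"]

# $group: sum amount per user_id
totals = defaultdict(int)
for o in paid:
    totals[o["user_id"]] += o["amount"]
grouped = [{"_id": uid, "total": total} for uid, total in totals.items()]

# $sort by total descending, then $limit 3
top3 = sorted(grouped, key=lambda d: d["total"], reverse=True)[:3]
print(top3)
```

Each stage consumes the previous stage's output, which is why placing $match first matters: it shrinks the input before the more expensive grouping and sorting.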

Worked example

Embedding makes a common read a single document fetch (illustrative):

db.orders.findOne({ _id: 1001 })
// { _id: 1001, user_id: 42, status: "paid",
//   items: [ { sku: "A1", qty: 2 }, { sku: "B7", qty: 1 } ],
//   total: 59.97 }            // no join needed: the items live inside

In production

MongoDB fits content, catalogs, and anything document-shaped that iterates quickly. A useful cautionary data point comes from Discord, whose first message store was MongoDB; it served them early but they outgrew its single-replica-set performance and memory behavior and moved to Cassandra (Discord Engineering). The takeaway is to match the document model to the access patterns and choose the shard key before scale, not after.

Follow-up questions

Embed or reference? Embed data read together and owned by one parent; reference data shared across many parents, like a foreign key.
Why is the shard key so important? It determines data distribution; a poor key creates hotspots or forces scatter-gather queries across shards.
Does MongoDB have transactions? Yes, multi-document ACID transactions exist, but cross-shard transactions are costly, so designs lean on single-document atomicity.
Why index nested fields? Unindexed queries scan the whole collection; indexes on nested fields and arrays keep lookups fast.
When to pick relational over MongoDB? When you need many-table joins and strong cross-entity invariants rather than document-shaped access.
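
For the indexing question above, a hedged sketch of what creating such indexes might look like from Python. RecordingCollection is a stand-in so the example runs without a server; with pymongo you would pass real Collection objects, whose create_index accepts the same arguments. The sessions collection and its created_at field are hypothetical:

```python
class RecordingCollection:
    """Stand-in for a pymongo Collection so the sketch runs without a server."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **options):
        self.calls.append((keys, options))
        return str(keys)


def ensure_indexes(users, sessions):
    # Compound index on a nested field plus an array field (multikey).
    # By the leftmost-prefix rule it also serves queries on profile.city alone.
    users.create_index([("profile.city", 1), ("roles", 1)])
    # TTL index: documents expire roughly an hour after created_at.
    sessions.create_index("created_at", expireAfterSeconds=3600)


users, sessions = RecordingCollection(), RecordingCollection()
ensure_indexes(users, sessions)
print(users.calls, sessions.calls)
```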

References

MongoDB Documentation.
Kleppmann, Designing Data-Intensive Applications (2017).
Discord Engineering, How Discord Stores Trillions of Messages.