Document databases
What are document databases, and what do you need to know about them to succeed in your systems design interview?
In this post, we will learn about the motivation of why we created them, their data model, and some filtering questions to ask yourself before choosing a document database.
Motivation
Back in the day, there was a #NoSQL craze going on in the tech industry. SQL solutions were having scalability and performance issues in a distributed environment. The demand to move away from monolithic table-based solutions was high.
Initially, the hype was around key-value stores. They were great at addressing scalability issues and were very performant.
But soon enough, people were running into limitations around querying. Complex queries sucked on key-value stores.
This is where document databases come in. They usually come with hyped-up features. JSON (or similar) data models, advanced querying, great horizontal scalability, high availability, fault tolerance, data locality, latency, on-premise & cloud solutions, etc.
Now, document databases may sound great, but that’s not to say they are a replacement for key-value stores or any other database.
Document databases were just one of the various NoSQL solutions. It’s not necessarily better. It’s just a different tool. Like comparing a hammer to a saw, it just works better in different scenarios.
Before choosing this for your systems architecture, you need to recognize the system’s specifications that call for it.
Data Model
Data consists of document collections where individual documents can have multiple fields.
Envisioning a JSON structure is a pretty good example of this.
// Example from https://docs.mongodb.com/
{
  item: "journal",
  qty: 25,
  size: {
    h: 14,
    w: 21,
    uom: "cm"
  },
  status: "A"
}A collection would just be a collection of these.
// Example from https://docs.mongodb.com/
[
   { item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
   { item: "notebook", qty: 50, size: { h: 8.5, w: 11, uom: "in" }, status: "A" },
   { item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" },
   { item: "planner", qty: 75, size: { h: 22.85, w: 30, uom: "cm" }, status: "D" },
   { item: "postcard", qty: 45, size: { h: 10, w: 15.25, uom: "cm" }, status: "A" }
]Each document is referenced by its key which is called document reference. You can see the keys when exploring your database.
Filtering Questions
Here are some considerations that come into choosing a document database
ACID vs BASE
First, check if it makes sense to use a NoSQL solution.
See if you have any have requirements for ACID. If you do, there might be a strong consideration for a SQL solution, as most document databases do not provide ACID compliance.
There are some exceptions to this. CouchDB and AWS DynamoDB offer ACID compliance, but just because it offers it does not mean it’s built for it. For example, you can’t have ACID transactions across multiple servers with CouchDB. This may not be appropriate if you intend to heavily leverage distributed computing.
Storage Locality
One key difference with document databases is storage locality. This can be a huge performance benefit or a terrible drawback depending on your application use-cases.
Document databases will store their data as JSON, XML, or some binary variant like BSON. This means that the entire document is loaded on READ. If your application will often need access to the entire document, this is going to be a systematic advantage. This contrasts with the scenario where the data is split across multiple tables in a SQL solution, as SQL solutions will requiring joins or views to coagulate the data together.
On the flip side, if your system will often access only a small portion of the document, this feature becomes rather wasteful. The database will always load the entire document, even if you need a small portion of it.
As for writes, the entire document needs to be rewritten on writes. Even if you update a small portion of the document, the entire document will be rewritten. The only exception to this is where the document update doesn’t change the encoded size of a document. Therefore, it’s crucial that you keep your documents fairly small, and avoid having to increase the size of the document.
Will your system be able to work with that? Or will it need huge document files, which might work better when broken down into multiple SQL tables?
See if your system will benefit from the storage locality of a document database. What sort of data are you handling? How will the data be updated/accessed?
Dataset
What is your dataset looking like?
A general rule of thumb is this:
- document stores are great for one-to-many (tree-structure) relationship or even none
- relational stores are great for the average many-to-many relationships
- graph databases are great for complex many-to-many relationships
There is a popular acronym in design: KISS. Keep It Simple Stupid.
Try to see which one will yield simpler implementation and code.
You don’t want accidental complexity.
Accidental complexity is the complexity that arises not from solving the problem at hand, but from the method of implementation. It’s like when you use machine learning to check if a character is a number or an alphabetic letter.
In a document database, you want to keep your documents as shallow as possible. Deeply-nested documents are not ideal.
See if you might have complexities around deeply-nested documents with many-to-many relationships.
Example: many-to-many relationships
Here is a quick example of what a many-to-many relationship would look like.
Example from informit.
{
  { courseID: "C1667",
     title: "Introduction to Anthropology",
     instructor: "Dr. Margret Austin",
     credits: 3,
     enrolledStudents: ["S1837", "S3737", "S9825" ...
       "S1847"] },
  { courseID: "C2873",
     title: "Algorithms and Data Structures",
     instructor: "Dr. Susan Johnson",
     credits: 3,
     enrolledStudents: ["S1837","S3737", "S4321", "S9825"
       ... "S1847"] },
  ...Courses maintain a list of identifiers for students, that can then be used to access the student documents directly.
This minimized having to embed nested documents, which will lead to duplicate data.
Students can also maintain a list of identifiers for their courses.
{
 {studentID:"S1837",
   name: "Brian Nelson",
   gradYear: 2018,
   courses: ["C1667", "C2873","C3876"]},
 {studentID: "S3737",
   name: "Yolanda Deltor",
        gradYear: 2017,
        courses: [ "C1667","C2873"]},
    ...Consider how simpler it would be if you used something else, like an SQL database. What does the data look like? Will there be too many JOINs?
Think about which solution is simpler.
That’s all. Hope this helps!
Happy studying!
