Dan D Kim

Let's share stories

Wide-Column Databases

2021-06-07 Dan D. Kimsystem design

What is a wide-column datastore?

Formal definition

A wide-column store (or extensible record stores) is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a two-dimensional key–value store.

Source: Wikipedia

Casual definition

I love starting simple and gradually easing into the details. Here is an explanation from Szymon Warda that I really like.

Here is how data is represented in a key-value database:

Key-value pair

Here is the same idea in a wide-column database:

Wide-column data

We can look at wide-column stores as a 2-dimensional key-value store, where the first key is used as a row identifier and the second is used as a column identifier.

Key-value store Wide-column store
key = rowID, value = data key = (rowID, columnID), value = data

That was my ELI5 of what a wide-column database is. Now let’s see more closely at the data model.

Data model

They say a picture is a 1000 words. Here are my 1000 words. It shows the smallest components and how they build up to the full row-based data structure.

Wide-column data model showing column, super column, column family, and super column family

Multiple rows will look something like this:

Image credit: Akshay Pore

Data is stored in column cells, which gets grouped into column families. A group of families is linked to a row key. Each column family typically contains columns that are used together, so the columns are stored as a contiguous block on disk, enhancing performance.

You could see that a wide-column store is like a mix between a relational database and a document store. It still uses rows and columns like that of a relational database. And the column formats can be different for each row in the same table. This combines the regimented tabular structure of the relational model and the flexible data schema of the document model.

Motivation

Why did we come up with wide-column databases? As in, why couldn’t we just stick with key-value and relational databases?

Back in the ages, we had the following pain points:

  • relational model lacked flexibility due to its schema and structure, resulting in a lack of data compression.
  • key-value store had a lack of structure that prevented partial write queries i.e. update column value, add column.
  • key-value stores also cannot filter data by value

Wide-column databases have been in development for a while. But so have other databases, and now the distinctive pain points have changed.

Today, a lot of the pain points mentioned above are… debatable

  • relational model still lacks flexibility but data compression warrants a whole other discussion.
  • tools like Redis offer key-value stores with structure and partial updates. Is it performant though? Is it good enough for a petabyte-and-beyond scale?
  • tools like Redis offer filtering by value using secondary indexes. Is it great? Maybe, maybe not.

Pros and Cons

Here are some key benefits of a wide-column datastore

  • MASSIVE scalability. Petabyte-scale data. Even beyond petabytes.
  • consistent, performant performance on HEAVY write loads

Here are the drawbacks

  • inefficient updates
  • inefficient joins / aggregations

Use Cases

Situations where it’s ideal

  • WRITEs far exceed the READs
  • data is rarely updated
  • read access is known by a primary key
  • there is no need for joins or aggregates

Examples

  • financial data (time-series)
  • temperature monitoring (time-series)
  • log data
  • analytical data
  • IoT sensor data

That’s all. Hope you learned something.

Happy studying!