Wide-Column Databases

2021-06-07 Dan D. Kimsystem design

What is a wide-column datastore?

Formal definition

A wide-column store (or extensible record stores) is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a two-dimensional key–value store.

Source: Wikipedia

Casual definition

I love starting simple and gradually easing into the details. Here is an explanation from Szymon Warda that I really like.

Here is how data is represented in a key-value database:

Here is the same idea in a wide-column database:

We can look at wide-column stores as a 2-dimensional key-value store, where the first key is used as a row identifier and the second is used as a column identifier.

Key-value store	Wide-column store
key = rowID, value = data	key = (rowID, columnID), value = data

That was my ELI5 of what a wide-column database is. Now let’s see more closely at the data model.

Data model

They say a picture is a 1000 words. Here are my 1000 words. It shows the smallest components and how they build up to the full row-based data structure.

Multiple rows will look something like this:

Image credit: Akshay Pore

Data is stored in column cells, which gets grouped into column families. A group of families is linked to a row key. Each column family typically contains columns that are used together, so the columns are stored as a contiguous block on disk, enhancing performance.

You could see that a wide-column store is like a mix between a relational database and a document store. It still uses rows and columns like that of a relational database. And the column formats can be different for each row in the same table. This combines the regimented tabular structure of the relational model and the flexible data schema of the document model.

Motivation

Why did we come up with wide-column databases? As in, why couldn’t we just stick with key-value and relational databases?

Back in the ages, we had the following pain points:

relational model lacked flexibility due to its schema and structure, resulting in a lack of data compression.
key-value store had a lack of structure that prevented partial write queries i.e. update column value, add column.
key-value stores also cannot filter data by value

Wide-column databases have been in development for a while. But so have other databases, and now the distinctive pain points have changed.

Today, a lot of the pain points mentioned above are… debatable

relational model still lacks flexibility but data compression warrants a whole other discussion.
tools like Redis offer key-value stores with structure and partial updates. Is it performant though? Is it good enough for a petabyte-and-beyond scale?
tools like Redis offer filtering by value using secondary indexes. Is it great? Maybe, maybe not.