Graph Databases

2021-05-25 Dan D. Kim system design

What are graph databases, and what do you need to know about them to succeed in your systems design interview?

In this post, we will cover the motivation behind graph databases, their data model, and some filtering questions to ask yourself before choosing one.

Motivation

Document databases were great for one-to-many relationships (tree structure).

Relational databases were okay for average many-to-many relationships.

But neither was suitable for complex many-to-many relationships, especially when any node could be related to any other node.

Think about a social network like Facebook. Profiles, groups, posts, pictures, videos, pages, advertisements… They can all form a relationship with YOU. How would you go about representing this in a relational database? You are going to end up with a crazy number of JOINs and WITH RECURSIVE queries.

Graph databases to the rescue.

Data Model

Basics

There are different graph data models, but the foundation is the same.

Graph databases consist of two things: vertices (nodes) and edges (relationships).

Vertices can be anything: a Person, Location, City, BlogPost, or whatever type you have in your database. This flexibility is what makes graph databases powerful and suitable for complex datasets.

Edges define the relationships between vertices.

Property graph

The two popular data models are property graph and triple-store.

For the purposes of this blog post, I will use Neo4j as our graph database example, which uses the property graph model. It is used at Microsoft, Adobe, eBay, and Lyft.

In a property graph, a vertex will consist of the following:

identifier
set of incoming edges
set of outgoing edges
collection of properties

An edge will consist of the following:

identifier
vertex where the edge starts (tail vertex)
vertex where the edge ends (head vertex)
label
collection of properties
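
To make the two lists above concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the shape of the model, not how Neo4j stores data internally. The class names, the field names, and the `connect` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(eq=False)  # compare by identity; vertices and edges reference each other
class Edge:
    id: int
    tail: "Vertex"  # vertex where the edge starts
    head: "Vertex"  # vertex where the edge ends
    label: str
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass(eq=False)
class Vertex:
    id: int
    properties: dict[str, Any] = field(default_factory=dict)
    incoming: list[Edge] = field(default_factory=list, repr=False)
    outgoing: list[Edge] = field(default_factory=list, repr=False)


def connect(edge_id: int, tail: Vertex, head: Vertex, label: str, **properties: Any) -> Edge:
    """Hypothetical helper: create an edge and register it on both endpoints."""
    edge = Edge(edge_id, tail, head, label, properties)
    tail.outgoing.append(edge)
    head.incoming.append(edge)
    return edge


jared = Vertex(1, {"name": "Jared"})
google = Vertex(2, {"name": "Google"})
works_at = connect(10, jared, google, "WORKS_AT", since=2019)

# Each vertex can reach its edges in both directions without scanning a table.
outgoing_labels = [edge.label for edge in jared.outgoing]
incoming_names = [edge.tail.properties["name"] for edge in google.incoming]
```

Because every vertex keeps both its incoming and outgoing edges, a traversal can move in either direction from any starting point.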

Example

Simple Scenario

For those of us who haven’t worked with graph databases much, let’s go over what a basic scenario would look like.

Say we had the following relation:

We have two people nodes and two company nodes, and a relationship between them.

Here is the Cypher query for creating the above:

CREATE
  (Jared:Person {name: 'Jared', type: 'Person'}),
  (Alicia:Person {name: 'Alicia', type: 'Person'}),
  (Google:Company {name: 'Google', type: 'Company'}),
  (Facebook:Company {name: 'Facebook', type: 'Company'}),
  (Jared) -[:DATES]-> (Alicia),
  (Alicia) -[:DATES]-> (Jared),
  (Jared) -[:WORKS_AT]-> (Google),
  (Alicia) -[:WORKS_AT]-> (Facebook)

Here is the query to find the name of people working at Google:

MATCH
  (person) -[:WORKS_AT]-> (google:Company {name: 'Google'})
RETURN person.name

The query basically reads as follows: find every vertex (bound to person) that has an outgoing WORKS_AT edge to a vertex with the label Company and the name Google. For each such person, return the name property.
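
If the pattern matching feels abstract, the same lookup can be sketched in plain Python over the four vertices and four edges from the scenario. This is only an illustration of what the query asks for, not of how a graph database executes it. The dictionary layout, the lowercase ids, and the `people_working_at` function are assumptions made for the sketch.

```python
# Vertices keyed by a hypothetical id, each with a label and a name property.
vertices = {
    "jared": {"label": "Person", "name": "Jared"},
    "alicia": {"label": "Person", "name": "Alicia"},
    "google": {"label": "Company", "name": "Google"},
    "facebook": {"label": "Company", "name": "Facebook"},
}

# Edges as (tail, label, head) triples, mirroring the CREATE statement.
edges = [
    ("jared", "DATES", "alicia"),
    ("alicia", "DATES", "jared"),
    ("jared", "WORKS_AT", "google"),
    ("alicia", "WORKS_AT", "facebook"),
]


def people_working_at(company_name: str) -> list[str]:
    """Names of vertices with a WORKS_AT edge into the named Company vertex."""
    return [
        vertices[tail]["name"]
        for tail, label, head in edges
        if label == "WORKS_AT"
        and vertices[head]["label"] == "Company"
        and vertices[head]["name"] == company_name
    ]


print(people_working_at("Google"))  # ['Jared']
```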

At this point, you might be thinking “Well, this is useless, you could easily do this with a relational database”.

Sure. This example is way too simple to argue the benefits of a graph database. It was only meant to show you what your queries would look like, in case you have never worked with Cypher.

Graph databases begin to shine when you need to handle chains of edges.

Consider the next example.

Example - Multiple Edges

It’s the same as the above example, but this time we see that Google has an OWNED_BY relationship to Alphabet.

How can we find the name of people that are working under the umbrella company Alphabet?

MATCH
  (person) -[:WORKS_AT]-> () -[:OWNED_BY*0..]-> (alphabet:Company {name: 'Alphabet'})
RETURN person.name

The query now reads as follows: the person node has an outgoing WORKS_AT edge to a vertex. From that vertex, you can follow a chain of zero or more outgoing OWNED_BY edges until you reach a vertex with the label Company and the name Alphabet. For each such person, return the name property.
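
The "zero or more OWNED_BY edges" part is what does the heavy lifting. As a rough sketch of the idea in Python, assuming the same toy data plus the Google to Alphabet ownership edge, the traversal might look like this. The `works_at` and `owned_by` dictionaries and both function names are hypothetical.

```python
# Toy data: who works where, and which company is owned by which.
works_at = {"Jared": "Google", "Alicia": "Facebook"}
owned_by = {"Google": ["Alphabet"]}


def reachable_owners(company: str) -> set[str]:
    """The company itself plus everything reachable via zero or more OWNED_BY edges."""
    seen: set[str] = set()
    stack = [company]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against ownership cycles
        seen.add(current)
        stack.extend(owned_by.get(current, []))
    return seen


def people_under(umbrella: str) -> list[str]:
    """People whose employer is the umbrella company or is owned by it, at any depth."""
    return sorted(
        person
        for person, employer in works_at.items()
        if umbrella in reachable_owners(employer)
    )


print(people_under("Alphabet"))  # ['Jared']
print(people_under("Google"))    # ['Jared'] (the zero-hop case)
```

Cypher expresses this whole loop with the `*0..` suffix on the edge pattern.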

Hopefully this example gives you a sense of how useful graph databases can be for declaring queries that need to traverse the graph.

You can technically achieve the same thing with SQL’s WITH RECURSIVE syntax, but the query will be much longer, and developing and maintaining it will be an inefficient use of your time.
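
For comparison, here is a sketch of the relational version, run through Python's built-in sqlite3 module so it is easy to try. The two-table schema (`works_at`, `owned_by`) and the CTE name `under_umbrella` are assumptions made for this example. A real schema with many vertex and edge types would need far more tables and joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE works_at (person TEXT, company TEXT);
    CREATE TABLE owned_by (company TEXT, owner TEXT);
    INSERT INTO works_at VALUES ('Jared', 'Google'), ('Alicia', 'Facebook');
    INSERT INTO owned_by VALUES ('Google', 'Alphabet');
    """
)

# Start from the umbrella company, then repeatedly add every company it owns.
# UNION (rather than UNION ALL) drops duplicates, so ownership cycles terminate.
query = """
WITH RECURSIVE under_umbrella(company) AS (
    SELECT :umbrella
    UNION
    SELECT owned_by.company
    FROM owned_by
    JOIN under_umbrella ON owned_by.owner = under_umbrella.company
)
SELECT works_at.person
FROM works_at
JOIN under_umbrella ON works_at.company = under_umbrella.company
ORDER BY works_at.person
"""

names = [row[0] for row in conn.execute(query, {"umbrella": "Alphabet"})]
print(names)  # ['Jared']
```

Even in this toy case, the traversal has to be spelled out by hand, and each additional edge type adds another table and another join to maintain.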

Filtering Questions

So, when do you use a graph database?

Dataset

Remember the general rule of thumb:

document stores are great for one-to-many (tree-structure) relationships, or no relationships at all
relational stores are great for average many-to-many relationships
graph databases are great for complex many-to-many relationships

What does your data look like? Is your data disconnected, with relationships that do not matter? Then don’t use graph.

Are you expecting to only write to the database without much reading or analysis? Then don’t use graph.

Are you expecting to do a lot of bulk data scans, e.g. returning all people named Bob, instead of returning all people in RELATION to something? Then don’t use graph.

Is your data heavily interconnected? Do you need to handle complex many-to-many relationships? Then consider graph.

That’s all. Hope you learned something!

Happy studying!