Graph Databases
What are graph databases, and what do you need to know about them to succeed in your systems design interview?
In this post, we will cover the motivation behind graph databases, their data model, and some filtering questions to ask yourself before choosing one.
Motivation
Document databases were great for one-to-many relationships (tree structure).
Relational databases were okay for average many-to-many relationships.
But neither was suitable for complex many-to-many relationships, especially when each node was somehow related to every other node.
Think about a social network like Facebook. Profiles, groups, posts, pictures, videos, pages, advertisements… They can all form a relationship with YOU. How would you go about representing this in a relational database? You are going to end up with a crazy amount of JOINs and WITH RECURSIVE syntax.
Graph databases to the rescue.
Data Model
Basics
There are different graph data models, but the foundation is the same.
Graph databases consist of two things: vertices (nodes) and edges (relationships).
A vertex can be anything: a Person, Location, City, BlogPost, or whatever type you have in your database. This flexibility is what really makes graph databases powerful and suitable for complex datasets.
Edges are used to define the relationships between vertices.
Property graph
The two popular data models are property graph and triple-store.
For the purposes of this blog post, I will use Neo4j as our graph database example, which uses the property graph model. It’s the graph database being used at Microsoft, Adobe, eBay, and Lyft.
In a property graph, a vertex will consist of the following:
- identifier
- set of incoming edges
- set of outgoing edges
- collection of properties
An edge will consist of the following:
- identifier
- vertex where the edge starts (tail vertex)
- vertex where the edge ends (head vertex)
- label
- collection of properties
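To make the two lists above concrete, here is a minimal in-memory sketch of a vertex and an edge in Python. It only illustrates the shape of the data model; it is not how Neo4j stores data internally, and the class names, the field names, and the `connect` helper are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(eq=False)  # compare by identity; vertices and edges reference each other
class Vertex:
    id: int
    properties: dict[str, Any] = field(default_factory=dict)
    incoming: list["Edge"] = field(default_factory=list)  # edges whose head is this vertex
    outgoing: list["Edge"] = field(default_factory=list)  # edges whose tail is this vertex


@dataclass(eq=False)
class Edge:
    id: int
    tail: Vertex  # vertex where the edge starts
    head: Vertex  # vertex where the edge ends
    label: str
    properties: dict[str, Any] = field(default_factory=dict)


def connect(edge_id: int, tail: Vertex, head: Vertex, label: str) -> Edge:
    """Hypothetical helper: create an edge and register it on both endpoints."""
    edge = Edge(edge_id, tail, head, label)
    tail.outgoing.append(edge)
    head.incoming.append(edge)
    return edge


jared = Vertex(1, {"name": "Jared"})
google = Vertex(2, {"name": "Google"})
works_at = connect(10, jared, google, "WORKS_AT")
```

Because every vertex keeps both its incoming and outgoing edges, you can walk the graph in either direction from any starting point.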
Example
Simple Scenario
For those of us who haven’t come across graph databases much, let’s go over what a basic scenario would look like.
Say we had the following relation:
We have two people nodes and two company nodes, and a relationship between them.
Here is the Cypher query for creating the above:
CREATE
(Jared:Person {name: 'Jared', type: 'Person'}),
(Alicia:Person {name: 'Alicia', type: 'Person'}),
(Google:Company {name: 'Google', type: 'Company'}),
(Facebook:Company {name: 'Facebook', type: 'Company'}),
(Jared) -[:DATES]-> (Alicia),
(Alicia) -[:DATES]-> (Jared),
(Jared) -[:WORKS_AT]-> (Google),
(Alicia) -[:WORKS_AT]-> (Facebook)
Here is the query to find the names of people working at Google:
MATCH
(person) -[:WORKS_AT]-> (google:Company {name: 'Google'})
RETURN person.name
The query basically reads as follows: the person node has an outgoing WORKS_AT edge to a vertex with the label Company and the name Google. For each such person, return the name property.
At this point, you might be thinking “Well, this is useless, you could easily do this with a relational database”.
Sure. This example is way too simple to argue the benefits of a graph database. It was only meant to show you what your queries would look like, in case you have never worked with Cypher.
Graph databases begin to shine when you need to handle chains of edges.
Consider the next example.
Example - Multiple Edges
It’s the same as the above example, but this time we see that Google has an OWNED_BY relationship to Alphabet.
How can we find the names of people who work under the umbrella company Alphabet?
MATCH
(person) -[:WORKS_AT]-> () -[:OWNED_BY*0..]-> (alphabet:Company {name: 'Alphabet'})
RETURN person.name
The query now reads as follows: the person node has an outgoing WORKS_AT edge to a vertex. From that vertex, you can follow a chain of zero or more outgoing OWNED_BY edges until you reach a vertex with the label Company and the name Alphabet. For each such person, return the name property.
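If the variable-length OWNED_BY*0.. step feels abstract, here is a rough Python sketch of what "follow zero or more OWNED_BY edges" means on the example graph. It is only an illustration of the traversal idea, not how Neo4j executes the query; the edge triples and the function names (`build_index`, `reachable`, `people_under`) are assumptions made up for this example.

```python
from collections import defaultdict

# (tail, label, head) triples mirroring the example graph.
EDGES = [
    ("Jared", "WORKS_AT", "Google"),
    ("Alicia", "WORKS_AT", "Facebook"),
    ("Google", "OWNED_BY", "Alphabet"),
]


def build_index(edges):
    """Map (tail, label) to the list of head vertices."""
    out = defaultdict(list)
    for tail, label, head in edges:
        out[(tail, label)].append(head)
    return out


def reachable(start, label, out):
    """Vertices reachable from start via zero or more edges with this label."""
    seen = {start}  # zero hops: the start vertex itself counts
    stack = [start]
    while stack:
        vertex = stack.pop()
        for nxt in out.get((vertex, label), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def people_under(company, edges):
    """People whose employer is the company or is owned by it, directly or not."""
    out = build_index(edges)
    people = [
        tail
        for tail, label, head in edges
        if label == "WORKS_AT" and company in reachable(head, "OWNED_BY", out)
    ]
    return sorted(people)


print(people_under("Alphabet", EDGES))  # ['Jared']
```

Because the chain may have length zero, asking for Google directly also returns Jared, which matches the *0.. lower bound in the Cypher pattern.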
Hopefully the example can give you a sense of how useful graph databases can be when it comes to declaring queries that need to traverse the graph.
You can technically achieve the same thing with SQL’s WITH RECURSIVE syntax, but your query will be much longer, and developing and maintaining it will be an inefficient use of your time.
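For comparison, here is a sketch of the relational version using Python's built-in sqlite3 module. The schema (one `works_at` table and one `owned_by` table) is an assumption made up for this example; a real relational model would likely need a table per relationship type, which is part of why these queries grow quickly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE works_at (person TEXT, company TEXT);
    CREATE TABLE owned_by (company TEXT, owner TEXT);
    INSERT INTO works_at VALUES ('Jared', 'Google'), ('Alicia', 'Facebook');
    INSERT INTO owned_by VALUES ('Google', 'Alphabet');
""")

QUERY = """
WITH RECURSIVE under_umbrella(company) AS (
    SELECT ?                 -- zero hops: the umbrella company itself
    UNION
    SELECT o.company         -- one more OWNED_BY hop down the chain
    FROM owned_by AS o
    JOIN under_umbrella AS u ON o.owner = u.company
)
SELECT w.person
FROM works_at AS w
JOIN under_umbrella AS u ON w.company = u.company
ORDER BY w.person
"""


def people_under(umbrella):
    """Names of people working at the umbrella company or anything it owns."""
    return [row[0] for row in conn.execute(QUERY, (umbrella,))]


print(people_under("Alphabet"))  # ['Jared']
```

Even with only two relationship types, the recursive CTE takes noticeably more ceremony than the three-line Cypher query.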
Filtering Questions
So, when do you use a graph database?
Dataset
Remember the general rule of thumb:
- document stores are great for one-to-many (tree-structure) relationships, or no relationships at all
- relational stores are great for the average many-to-many relationships
- graph databases are great for complex many-to-many relationships
What does your data look like? Is your data disconnected, with relationships that do not matter? Then don’t use graph.
Are you expecting to only write to the database without much reading or analysis? Then don’t use graph.
Are you expecting to do a lot of bulk data scans, e.g. returning all people with the name Bob, instead of returning all people in RELATION to something? Then don’t use graph.
Is your data heavily interconnected? Do you need to handle complex many-to-many relationships? Then consider graph.
That’s all. Hope you learned something!
Happy studying!