
Neo4j’s founder Emil Eifrem shares a bit of history about how Neo4j was started. Way back in 1999, his team realized that the database they were using internally had a lot of connections between discrete data elements. Like many successful companies that grow out of a founder’s frustration with the status quo, Neo4j began its life out of the founding team’s frustration with a fundamental design problem in relational databases. The team started experimenting with various data models centered around graphs. Much to their dismay, they found no readily available graph database to store their connected data. Thus began the team’s journey into building a graph database from scratch, and Project Neo was born. What’s behind the name Neo4j? The 4 in Neo4j does not stand for a version number; all version numbers are appended after the word Neo4j. I found one folksy explanation on Stack Overflow that goes like this:

The Neo series of databases was developed in Sweden and attracted the ‘j’ suffix with the release of version 4 of the graph database. The ‘j’ is from the word ‘jätteträd’, literally “giant tree”, and was used to indicate the huge data structures that could now be stored.

Incidentally the ‘neo’ portion of the name is a nod to a Swedish pop artist Linus Ingelsbo, who goes by the name NEO. In Neo1 the example graph was singers and bands and featured the favourite artists of the developers, including NEO. This was changed to the movie data set in the version 4 release.

Others speculate that Neo refers to the character Neo from “The Matrix”, fighting the “relational tables”. It was recently announced that Neo4j would be called Neo5j as part of its latest branding exercise. In a recent blog post from the company, it was said that the j in Neo4j stands for Java, as the original graph database was written as a Java library.

Preface

The preface talks about the purpose of the book, i.e., to introduce graphs and graph databases to technology practitioners, including developers, database professionals, and technology decision makers. It also explains the main changes in the content of this book as compared to the first edition. The changes are mainly in the areas of Cypher syntax and modeling guidelines.

Introduction

A graph is nothing but a collection of edges and vertices. Having said that, there are many different ways in which a graph can be stored. One of the most popular graph models is the labeled property graph, which is characterized as follows (a small Cypher sketch appears after the list):

  • It contains nodes and relationships

  • Nodes contain properties (key-value pairs)

  • Nodes can be labeled with one or more labels

  • Relationships are named and directed, and always have a start and an end node

  • Relationships can also contain properties
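
To make the list above concrete, here is a minimal Cypher sketch of a labeled property graph. The Person and Company labels, the WORKS_AT relationship, and the property names are illustrative placeholders of mine, not examples taken from the book.

    // Two labeled nodes with properties, connected by a named, directed
    // relationship that itself carries a property
    CREATE (alice:Person {name: 'Alice'})
    CREATE (acme:Company {name: 'Acme'})
    CREATE (alice)-[:WORKS_AT {since: 2015}]->(acme)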

Rough Classification

Given the plethora of products in this space, it is better to have some sort of binning. The chapter bins the products into two groups:

  1. Technologies used primarily for transactional online graph persistence, typically accessed directly in real time from an application

  2. Technologies used primarily for offline graph analytics, typically performed as a series of batch steps

Broader Classification

The chapter also says that one of the other ways to slice the graph space is via graph data models, i.e., property graphs, RDF triples, and hypergraphs. In the appendix of the book, there is a much broader classification given for the NOSQL family of products. I will summarize the contents of the appendix in this section, as I found the appendix very well written.

Rise of NOSQL:

The NOSQL movement has arisen mainly because of three V’s:

  • Volume of data: There has been a deluge of semi-structured data in the recent decade, and storing it all in a structured relational format has been fraught with difficulties. Storing connections gives rise to complicated join queries for CRUD operations.

  • Velocity of data reads and writes, and of schema changes

  • Variety of data

Relational databases are known to provide ACID transactions: Atomic, Consistent, Isolated, Durable. NOSQL databases instead have BASE properties: Basic availability, Soft state, Eventual consistency. Basic availability means that the data store appears to work most of the time. Soft-state stores don’t have to be write-consistent, nor do replicas have to be mutually consistent all the time. Eventually consistent stores exhibit consistency at some later point.

Types of NOSQL Databases:

NOSQL databases can be divided into four types:

  • Document Stores: Document databases store and retrieve documents, just like an electronic filing cabinet. These stores act like key-value stores with some sort of indexing in place for quicker retrieval. Since the data model is one of disconnected entities, these stores tend to scale horizontally. Writes are atomic at the document level; support for writes across multiple documents is less mature. MongoDB and CouchDB are two examples of document stores.

  • Key-Value Stores: Key-value stores are cousins of the document store family, but their lineage comes from Amazon’s Dynamo database. They act like large, distributed hashmap data structures that store and retrieve opaque values by key. A client stores a data element by hashing a domain-specific key. The hash function is crafted such that it provides a uniform distribution across the available buckets, thereby ensuring that no single machine becomes a hotspot. Given the hashed key, the client can use that address to store the value in a corresponding bucket. These stores offer less insight into the data than document stores, since the values are opaque to the database; Riak, for instance, mitigates this by offering visibility into certain types of structured data. In any case, these stores are optimized for high availability and scale. Riak and Redis are two examples of key-value stores.

  • Column Family: These stores are modeled on Google’s BigTable. The data model is based on a sparsely populated table whose rows can contain arbitrary columns, the keys for which provide natural indexing. The four building blocks of a column-family store are the column, the super column, the column family, and the super column family. HBase is an example of a column-oriented database.

  • Graph Databases: All three of the previous types of databases are still aggregate stores. Querying them for insight into data at scale requires processing by some external application infrastructure, because aggregate stores are not built to deal with highly connected data. This is where graph databases step in. A graph database is an online, operational database management system with CRUD methods that expose a graph data model. Graph databases are generally built for use with transactional systems. There are two properties that are useful to understand while investigating graph technologies:

    • The underlying storage: Some graph databases use “native graph storage”, which is optimized and designed for storing and managing graphs. Not all graph database technologies use native graph storage; some serialize the graph data into a relational database, an object-oriented database, or another general-purpose store.

    • Processing engine: Some graph databases are capable of index-free adjacency, i.e., connected nodes physically point to each other in the database. For graph databases that leverage index-free adjacency, the authors use the term “native graph processing”.

    Besides adopting a specific approach to storage and processing, graph databases also adopt a specific data model. There are various models such as property graphs, hypergraphs, and triples. Hypergraphs are mainly useful for representing many-to-many relationships. Triple stores typically provide SPARQL capabilities to reason about and store RDF data. Triple stores generally do not support index-free adjacency and are not optimized for storing property graphs. To perform graph queries, triple stores must create connected structures from independent facts, which adds latency to each query. Hence the sweet spot for a triple store is analytics, where latency is a secondary consideration, rather than OLTP.

Power of Graph Databases:

The power of graph databases lies in performance, agility, and flexibility. Performance comes from the fact that a search can be localized to a portion of the graph. Flexibility comes from the fact that there is little impedance between the way the business communicates its requirements and the way the graph is modeled. Agility comes from the fact that graph databases are schema-free, and hence new connections can easily be accommodated.

Options for Storing Connected Data

This chapter takes a simple example of modeling friends and friend-of-friends relations via an RDBMS. SQL code is given to answer a few basic questions a user might have while looking at a social graph. The queries, even for simple questions, become very complex. It quickly becomes obvious to anyone reading this chapter that modeling connections via an RDBMS is challenging, for the following reasons:

  • Join tables add accidental complexity

  • Foreign key constraints add additional development and maintenance overhead

  • Sparse tables with nullable columns require special checking in code

  • Several expensive joins are needed to discover nested relationships

  • Reverse queries are difficult to execute as the SQL code becomes extremely complicated to write and maintain

NOSQL databases also do not scale well for storing connected data. By the very nature of document stores, key-value stores, and column-family stores, the lack of connections or relationships as first-class objects makes it difficult to achieve scale. Even though some sort of indexing is available in most NOSQL databases, they do not have the index-free adjacency feature. This implies that there is latency in querying connected data.

The chapter ends by showcasing a graph database to store connections, and it is abundantly clear by this point in the chapter that, by giving relationships the status of first-class objects, graph databases make it extremely easy to represent and maintain connected data. Unlike RDBMS and NOSQL stores, in which connected data requires developers to write data-processing logic in the application layer, graph databases offer a convenient and powerful alternative.
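
By way of contrast with the SQL joins the chapter walks through, here is a hedged sketch of what a friends-of-friends query might look like in Cypher; the Person label and FRIEND_OF relationship name are placeholders of mine rather than the book’s schema, and the syntax is that of the Neo4j versions the book covers.

    // Friends of Alice's friends, excluding Alice herself and her direct friends
    MATCH (p:Person {name: 'Alice'})-[:FRIEND_OF]-(friend)-[:FRIEND_OF]-(fof)
    WHERE fof <> p AND NOT (p)-[:FRIEND_OF]-(fof)
    RETURN DISTINCT fof.name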

Data Modeling with Graphs

The path from a real-life representation of data to a data model in the graph database world is short. In fact, one can use almost the same vocabulary to talk about various aspects while referring to graph data. This is unlike the RDBMS world, where one needs to translate the real-life representation into a logical model via the “normalization-denormalization” route. In the latter world, there is a semantic dissonance between the conceptualization of a model and the database’s instantiation of that model.

The chapter begins by introducing Cypher, an expressive graph database query language. Cypher is designed to be easily read and understood by developers, database professionals, and business stakeholders. Its ease of use derives from the fact that it is in accord with the way we intuitively describe graphs using diagrams. Cypher is composed of clauses, like most other query languages. The chapter introduces some of the main clauses such as MATCH, RETURN, WHERE, CREATE, MERGE, DELETE, SET, FOREACH, UNION, WITH, and START.
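
As a flavor of how those clauses fit together, here is a small illustrative query; the Author and Book labels, the WROTE relationship, and the published property are assumptions of mine rather than the book’s running example.

    // MATCH declares the pattern, WHERE filters it, RETURN shapes the output
    MATCH (a:Author)-[:WROTE]->(b:Book)
    WHERE b.published > 2010
    RETURN a.name, count(b) AS titles
    ORDER BY titles DESC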

There are two examples used in the chapter that serve to show the contrast between the steps involved in modeling data in the relational world and in the graph world. In the relational world, one takes the whiteboard model, creates an ER diagram, then a normalized representation of the model, and finally a denormalized form. Creating a normalized representation involves incorporating foreign key relationships to avoid redundancy. Once the database model has been adapted to a specific domain, there is additional denormalization work that needs to be done so as to suit the database engine and not the user. This entire sequence of steps creates an impedance between the real-life representation and the database representation. One of the justifications given for this effort is that it is a one-time investment that will pay off as the data grows. Sadly, given that data has three V’s in the current age, i.e., Volume, Velocity, and Variety, this argument no longer holds good. It is often seen that RDBMS data models undergo migrations as data structures change. What once seemed to be a solid top-down, robust approach falls apart quickly. In the graph world, the effort put into translating a whiteboard model into a database model is minimal, i.e., what you sketch on the whiteboard is what you store in the database. These two examples illustrate the fact that the conceptual-to-implementation dissonance is far less for a graph database than for an RDBMS.

The chapter also gives some guidelines for creating graphs so that they match up with the BASE properties of NOSQL databases. By the end of the chapter, a reader ends up with a good understanding of the Cypher query language. If one is a complete newbie to Neo4j, it might be better to take a pause here and experiment with Cypher before going ahead with the rest of the book.

Building a Graph Database Application

The basic guidelines for data modeling are listed below; a small Cypher sketch follows the list.

  • Use nodes to represent entities - that is, the things in our domain that are interesting to us

  • Use relationships to represent the relations between entities and to establish a semantic context for each entity

  • Use relationship direction to further clarify relationship semantics. Many relationships are asymmetric, and hence it makes sense to always model a relationship with a direction

  • Use node properties to represent entity attributes, plus any entity metadata

  • Use relationship properties to represent the strength, weight, or quality of a relationship, plus any relationship metadata

  • While modeling a node, make sure that it represents a thing in the domain and is not merely acting as a connection between two other nodes; connections are better expressed as relationships

  • It is always better to represent both a fine-grained and a coarse-grained relationship between two nodes. This helps in quickly querying at the coarse-grained level.

  • The key idea of using a graph to model the data is that the model can be built iteratively, adding nodes and relationships as understanding of the domain grows
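
As a rough illustration of the direction, property, and granularity guidelines above, here is a hedged Cypher sketch; the User and Address labels and the ADDRESS/HOME_ADDRESS relationship names are placeholders of my own, not prescriptions from the book.

    // A directed relationship with a property, plus a coarse-grained
    // ADDRESS relationship alongside a finer-grained HOME_ADDRESS one,
    // so queries can traverse at either level of detail
    CREATE (bob:User {name: 'Bob'})
    CREATE (home:Address {city: 'Malmo'})
    CREATE (bob)-[:ADDRESS]->(home)
    CREATE (bob)-[:HOME_ADDRESS {verified: true}]->(home)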

Application Architecture

There are two ways in which Neo4j can be used in an application. One is embedded mode, where Neo4j runs as a library inside the application; the other is server mode, accessed over the REST API.

Clustering

Writes to Neo4j can be directed either to the master or to a slave. This capability at the application level is an extremely powerful feature, and such features are not present in other graph databases.

Load Balancing

There is no native load balancing available in Neo4j; it is left to the network infrastructure. One needs to introduce a network-level load balancer to separate out the read and write queries. In fact, Neo4j exposes an API call to indicate whether a server is the master or a slave.

Since cache partitioning is an NP-hard problem, one can use a technique called “cache sharding” to route each query to the specific server that has the highest probability of already having the relevant part of the graph in its cache. The way to route and cache queries depends on the domain in which the graph database is being developed.

Testing

Debugging any function, in any language, requires sample input. In the case of testing a graph model, one needs to create a small graph of a few dozen nodes and relationships, so that this localized graph can be used to check for any anomalies in the model. It is always better to write a lot of simple tests checking various parts of the graph than to rely on one single universal test for the entire graph. As the graph evolves, an entire regression suite of tests can be built up.
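
A hedged sketch of what such a small test fixture might look like in Cypher; the Person label and FRIEND_OF relationship are placeholders, and in practice the fixture would be created and torn down inside the test harness.

    // Fixture: a three-node chain created just for this test
    CREATE (a:Person {name: 'A'})-[:FRIEND_OF]->(b:Person {name: 'B'}),
           (b)-[:FRIEND_OF]->(c:Person {name: 'C'})

    // Assertion query: A should reach C in exactly two hops
    MATCH (:Person {name: 'A'})-[:FRIEND_OF*2]->(c)
    RETURN c.name   // expected: 'C'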

Importing and Bulk Loading Data

Most deployments of any kind of database require some sort of content to be imported into the database. This section of the chapter talks about importing data from CSV files. The headers in the CSV file should be specified in such a way that they reflect whether the pertinent column is an ID, a label, a relationship type, or a property. Once the CSV files are in the relevant format, they can be imported via the neo4j-import command. One can also do a batch import via Cypher queries using LOAD CSV. In my line of work, I deal with a lot of RDF input files, and there is an open source stored procedure that one can use to import any large RDF file into Neo4j.
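
As a rough sketch of the header convention and the bulk loader invocation: the file names and column names below are hypothetical, while the :ID, :LABEL, :START_ID, :END_ID, and :TYPE markers are the ones documented for the Neo4j import tooling of that era.

    people.csv:
      personId:ID,name,:LABEL
      1,Alice,Person
      2,Bob,Person

    friendships.csv:
      :START_ID,:END_ID,:TYPE
      1,2,FRIEND_OF

    neo4j-import --into graph.db --nodes people.csv --relationships friendships.csv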

Doing a bulk import of any file should usually be preceded by an index-creation step. Indexes will make lookups faster during and after the load process. Creating indexes on any label/property combination is quite straightforward via Cypher commands. If the index is only helpful during the data-loading process, one can delete it after the bulk load is completed. For large datasets, it is suggested that one use the PERIODIC COMMIT clause so that the transaction is committed after a certain number of rows has been processed.
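
A hedged Cypher sketch of that sequence, assuming a hypothetical friendships.csv with from and to columns and Person nodes carrying an id property (syntax as of the Neo4j versions the book covers):

    // Index the lookup property first so the MATCHes during the load are fast
    CREATE INDEX ON :Person(id)

    // Commit in batches so a very large file does not build one huge transaction
    USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS FROM 'file:///friendships.csv' AS row
    MATCH (a:Person {id: row.from}), (b:Person {id: row.to})
    CREATE (a)-[:FRIEND_OF]->(b)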

Graphs in the Real World

Why Organizations Choose Graph Databases?

Some of the reasons that organizations choose graph databases are:

  • “Minutes to milliseconds” performance requirement: Using index-free adjacency, a graph database turns complex joins into fast graph traversals, thereby maintaining millisecond performance irrespective of the overall size of the dataset

  • Drastically accelerated development cycles: The graph model reduces the impedance mismatch between the technical and the business domains

  • Extreme business responsiveness: The schema-free nature of the graph database, coupled with the ability to simultaneously relate data elements in lots of different ways, allows a graph database solution to evolve as the business needs evolve

  • Enterprise ready: The technology must be robust, scalable, and transactional. There are graph databases in the market that provide all the -ilities, i.e., transactionality, high availability, horizontal read scalability, and storage of billions of triples

Common Use Cases

The chapter mentions the following as some of the most common use cases in the graph database world:

  1. Social Networks Modeling

  2. Recommendations (Inductive and Suggestive)

  3. Geospatial data

  4. Master Data Management

  5. Network and Data Center Management

  6. Authorization and Access Control

Graphs in the real world

There are three use cases that are covered in detail. Reading through these use cases gives one a good introduction to various aspects of data modeling, including writing good Cypher queries. Reading through the Cypher queries presupposes that the reader is slightly comfortable with Cypher. To get the best learning out of these examples, it is better to write out the Cypher queries yourself before looking at the constructed queries. One gets immediate feedback about the way to construct efficient Cypher queries.

Graph Database Internals

This chapter gives a detailed description of Neo4j graph database internals.


Native Graph Processing

A graph database has native processing capabilities if it exhibits a property called index-free adjacency. A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. This enables query times to be independent of the overall graph size.

A nonnative graph database uses global indexes to link nodes. These indexes add a layer of indirection to each traversal, thereby incurring computational cost. To traverse a network of m steps, the cost of the indexed approach is O(m log n), whereas the cost is O(m) for an implementation that uses index-free adjacency. To provide index-free adjacency, it is important to create an architecture that supports it. Neo4j has painstakingly built this over a decade, and one can see the results by querying a large graph. In my work, I have used pure triple stores with some sort of global indexing built on top of them. The performance of Neo4j is a zillion times better than the databases that I have worked on; hence I am a big fan of Neo4j.

Native Graph Storage

Neo4j stores data in a number of different store files. There are separate store files for nodes, relationships, and properties. The node store is a fixed-size record store; each node record is 9 bytes in length. Node IDs are created in such a way that node lookup is O(1) instead of O(log n). The constituents of a node record are 1) the ID of the first relationship, 2) the ID of the first property, and 3) the label store of the node. Relationships are also stored in fixed-length records whose constituents are 1) the start node ID, 2) the end node ID, and 3) pointers to the previous and next relationships.

With fixed-size records, traversals are implemented simply by chasing pointers around a data structure. The basic idea of searching in Neo4j boils down to locating the first record in the relationship chain and then chasing the various pointers. On top of this structure, Neo4j also has a caching layer that further increases its performance.

Programmatic APIs

Neo4j exposes three types of API for a developer:

  1. Kernel API

  2. Core API

  3. Traverser API

The Core API allows developers to fine-tune their queries so that they exhibit high affinity with the underlying graph. A well-written Core API query is often faster than any other approach. The downside is that such queries can be verbose, requiring considerable developer effort. Moreover, their high affinity with the underlying graph makes them tightly coupled to its structure; when the graph structure changes, they can often break. Cypher can be more tolerant of structural changes; features such as variable-length paths help mitigate variation and change. The Traversal Framework is both more loosely coupled than the Core API (because it allows the developer to declare informational goals) and less verbose, and as a result a query written using the Traversal Framework typically requires less developer effort than the equivalent written using the Core API. Because it is a general-purpose framework, however, the Traversal Framework tends to perform marginally less well than a well-written Core API query.
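
To illustrate the kind of structural tolerance referred to above, here is a small hedged Cypher sketch using a variable-length path; the Person label and FRIEND_OF relationship are placeholders of mine.

    // Anyone reachable through one to three FRIEND_OF hops matches,
    // so the query survives changes in how deep the chain happens to be
    MATCH (p:Person {name: 'Alice'})-[:FRIEND_OF*1..3]->(other)
    RETURN DISTINCT other.name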

Nonfunctional characteristics

Evaluation

One of the ways to evaluate the performance of a database is via 1) the number of transactions that can be handled in an ACID way and 2) the number of read and write queries that can be processed per unit time.

Transactions in Neo4j are semantically identical to traditional database transactions. Writes occur within the same transaction context, with write locks being taken for consistency purposes on any nodes and relationships involved in the transaction. The book describes the transaction implementation as follows:

The transaction implementation in Neo4j is conceptually straightforward. Each transaction is represented as an in-memory object whose state represents writes to the database. This object is supported by a lock manager, which applies write locks to nodes and relationships as they are created, updated, and deleted. On transaction rollback, the transaction object is discarded and the write locks released, whereas on successful completion the transaction is committed to disk.

Scale

For a platform like Neo4j, one cannot talk about scale purely in terms of the number of transactions per second. One needs to think about scale along at least three different axes.

  • Capacity: The current release of Neo4j can support single graphs having tens of billions of nodes, relationships, and properties. The Neo4j team has publicly expressed the intention to support 100B+ nodes/relationships/properties in a single graph as part of its roadmap.

  • Latency: The architecture of Neo4j makes performance almost constant irrespective of the size of the graph. Most queries follow a pattern whereby an index is used to find a starting node and the remainder of the traversal is pointer chasing. Performance depends not on the size of the graph but on the amount of data being queried.

  • Throughput: Neo4j has constant-time performance irrespective of the graph size. We might think that extensive reads and writes will degrade the performance of the entire database. However, typical read and write queries operate on a localized portion of the graph, and hence there is scope to optimize at the application level.

Holy grail of scalability

The future goal of most graph databases is to be able to partition a graph across multiple machines without application-level intervention, so that read and write access to the graph can be scaled horizontally. In the general case this is known to be an NP-hard problem, and thus impractical to solve.

Predictive Analysis with Graph Theory

The chapter starts off with a basic description of depth-first and breadth-first graph traversal mechanisms. Most useful algorithms aren’t purely breadth-first but are informed to some extent. Breadth-first search underpins numerous classical graph algorithms, including Dijkstra’s algorithm. The last chapter talks about several graph-theoretic concepts that can be used to analyze networks.
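
As a taste of how such traversals surface in everyday Cypher, here is a hedged sketch using the built-in shortestPath function; the Person and FRIEND_OF names are again placeholders of mine.

    // Unweighted shortest path between two people, found breadth-first
    MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}),
          p = shortestPath((a)-[:FRIEND_OF*]-(b))
    RETURN length(p) AS hops, [n IN nodes(p) | n.name] AS route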

Takeaway

The book gives a good introduction to various aspects of any generic graph database, even though most of the content is Neo4j specific. This book is 200 pages long, but it packs in so much content that one needs to read slowly to thoroughly understand the various aspects of a graph database. If you have been exposed to native triple stores, then this book will be massively useful for getting an idea of Neo4j’s implementation of the property-graph model – an architecture that makes all the CRUD operations on a graph database very efficient.