Learning Cypher : Summary
Cypher is a query language for the Neo4j graph database. The basic model in Neo4j can be described as follows:
-
Each node can have a number of relationships with other nodes
-
Each relationship goes from one node either to another node or to the same node
-
Both nodes and relationships can have properties, and each property has a name and a value
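To make the model concrete, here is a minimal sketch in Cypher. The labels (Person), relationship types (KNOWS, FOLLOWS) and property names are hypothetical and chosen only for illustration; they are not taken from the book.

```cypher
// Two nodes with properties, one relationship that carries a property,
// and one relationship that goes from a node back to the same node.
CREATE (alice:Person {name: 'Alice', born: 1985}),
       (bob:Person {name: 'Bob'}),
       (alice)-[:KNOWS {since: 2010}]->(bob),
       (alice)-[:FOLLOWS]->(alice)
```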
Cypher was first introduced in Nov 2013 and since then the popularity of graph databases as a category has taken off. The following visual shows the pivotal moment:
Given the popularity of Cypher, Neo4j made the language open source in October 2015 through the openCypher project. Neo4j's founders say the rationale behind the decision was that a common query syntax could then be followed across all graph databases. Cypher provides a declarative syntax that is readable and powerful, and it can express a rich set of graph patterns to be recognized in a graph.
Via Neo4j’s blog:
Cypher is the closest thing to drawing on a white board with a keyboard. Graph databases are whiteboard friendly; Cypher makes them keyboard friendly.
Given that Cypher has become open source and has the potential to become the de facto standard in the graph database segment, it is important for anyone working with graph data to be familiar with the syntax. Since the syntax resembles SQL and has some Pythonic elements in the way queries are formulated, it can be picked up easily by reading a few articles. Do you really need a book for it? Not necessarily. Having said that, this book reads like a long tutorial and is not dense. It might be worth one’s time to read it to get a nice tour of the various aspects of Cypher.
Chapter 1 : Querying Neo4j effectively with Pattern Matching
Querying a graph database through an API is usually very tedious. I have had this experience first hand while working on a graph database that had ONLY an API interface to obtain graph data. SPARQL is a relief in such situations, but it has a learning curve. I would not call it steep, but the syntax is a little different, and writing effective SPARQL queries means getting used to thinking in triples, i.e. in subject-predicate-object terms. Cypher, on the other hand, is a declarative query language, i.e. it focuses on what the result should look like rather than on the methods for obtaining it. It is also human-readable and expressive.
The first part of the chapter starts with instructions to set up a new Neo4j instance. Neo4j can be run as a standalone server, with the client making API calls, or as an embedded component in an application. For learning purposes, working with a standalone server is the most convenient option, as you have a ready console to test out sample queries. The second part of the chapter introduces a few key elements of Cypher, such as
-
MATCH
-
RETURN
-
() for nodes
-
[] for relations
-
-> for directions
-
-- for matching relationships in either direction
-
Filtering matches via specifying node labels and properties
-
Filtering relationships via specifying relationship labels and properties
-
OPTIONAL MATCH to match optional paths
-
Assigning the entire paths to a variable
-
Passing parameters to Cypher queries
-
Using built in functions such as allShortestPaths
-
Matching paths that connect nodes via a variable number of hops
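The elements listed above can be sketched with a few short queries. The labels, relationship types and properties (Person, Movie, KNOWS, ACTED_IN, name, title, since) are assumptions made for illustration. The `$name` parameter syntax is the one used by current Neo4j versions; releases from the book's era wrote parameters as `{name}`.

```cypher
// Nodes go in (), relationships in [], and -> gives the direction.
// Labels and properties in the pattern filter the matches.
MATCH (p:Person {name: 'Alice'})-[r:KNOWS]->(friend:Person)
RETURN friend.name, r.since;

// -- matches a relationship regardless of its direction.
MATCH (p:Person {name: 'Alice'})--(other)
RETURN other;

// OPTIONAL MATCH keeps every person; m is null when no movie matches.
MATCH (p:Person)
OPTIONAL MATCH (p)-[:ACTED_IN]->(m:Movie)
RETURN p.name, m.title;

// A whole path assigned to a variable, connecting nodes over 1 to 3 hops,
// with the starting name passed in as a parameter.
MATCH path = (a:Person {name: $name})-[:KNOWS*1..3]->(b:Person)
RETURN path;

// Built-in function: every shortest path of at most 5 hops between two people.
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
MATCH p = allShortestPaths((a)-[:KNOWS*..5]-(b))
RETURN p;
```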
Chapter 2 : Filter, Aggregate and Combine Results
This chapter introduces several Cypher statements that can be used to extract summary statistics of various nodes and relationships in a graph. The following are the Cypher keywords explained in this chapter
-
WHERE for text and value comparisons
-
IN to filter based on certain values
-
“item identifier IN collection WHERE rule” pattern that can be used to work with collections. This pattern is similar to a list comprehension in Python
-
LIMIT and SKIP for pagination purposes. The examples do not use ORDER BY, which is needed for the pages to come back in a stable, repeatable order
-
ORDER BY for sorting results
-
COALESCE function to work around null values
-
COUNT(*) and COUNT(property value) - Subtle difference between the two is highlighted
-
Aggregation functions such as MIN, MAX and AVG
-
COLLECT to gather all the values of properties in a certain path pattern
-
CASE WHEN ELSE pattern for conditional expressions
-
WITH to separate query parts
-
UNION and UNION ALL
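A rough sketch of how these keywords look in practice follows. The graph schema is hypothetical: Person nodes with name, born, city and a list-valued tags property, plus Actor and Director labels. `STARTS WITH` is an operator from Neo4j releases newer than the book, so the book's own examples may be written differently.

```cypher
// WHERE with value and text comparisons, plus IN against a list of values.
MATCH (p:Person)
WHERE p.born >= 1980 AND p.name STARTS WITH 'A' AND p.city IN ['Rome', 'Milan']
RETURN p.name;

// "identifier IN collection WHERE rule", much like a Python list comprehension.
MATCH (p:Person)
WHERE any(tag IN p.tags WHERE tag = 'graph')
RETURN p.name, [t IN p.tags WHERE t STARTS WITH 'g'] AS gTags;

// Pagination: ORDER BY fixes the order so that SKIP/LIMIT pages are repeatable.
// coalesce supplies a default where the city property is missing.
MATCH (p:Person)
RETURN p.name AS name, coalesce(p.city, 'unknown') AS city
ORDER BY name
SKIP 20 LIMIT 10;

// count(*) counts rows; count(p.city) ignores rows where city is null.
MATCH (p:Person)
RETURN count(*) AS people, count(p.city) AS withCity,
       min(p.born) AS earliest, max(p.born) AS latest, avg(p.born) AS meanBirthYear;

// WITH separates query parts; collect gathers values; CASE picks a label.
MATCH (p:Person)-[:KNOWS]->(f:Person)
WITH p, collect(f.name) AS friends
WHERE size(friends) > 1
RETURN p.name, friends,
       CASE WHEN size(friends) > 10 THEN 'hub' ELSE 'regular' END AS kind;

// UNION removes duplicate rows; UNION ALL would keep them.
MATCH (a:Actor) RETURN a.name AS name
UNION
MATCH (d:Director) RETURN d.name AS name;
```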
Chapter 3 : Manipulating the Database
This chapter talks about Create, Update and Delete operations on various nodes and relations. The Cypher keywords explained in the chapter are
-
CREATE used to create nodes, relationships and paths
-
CREATE UNIQUE
-
SET for changing properties and labels
-
MERGE to check for an existing pattern and create the pattern if it does not exist in the database
-
ON MATCH SET and ON CREATE SET for setting properties during merge operations
-
REMOVE for removing properties and labels
-
DELETE
-
FOREACH pattern to loop through nodes in a path
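The write clauses above might look as follows on a hypothetical graph of Person and Movie nodes; all labels and property names here are assumptions. CREATE UNIQUE is left out of the sketch because, as far as I know, it was dropped from later Neo4j releases in favour of MERGE.

```cypher
// CREATE a node.
CREATE (m:Movie {title: 'Example'});

// MERGE matches the pattern if it exists and creates it otherwise.
// ON CREATE SET / ON MATCH SET set properties depending on which case occurred.
MERGE (p:Person {name: 'Alice'})
ON CREATE SET p.created = timestamp()
ON MATCH SET p.lastSeen = timestamp();

// SET adds or changes properties and labels; REMOVE takes them away.
MATCH (p:Person {name: 'Alice'})
SET p.city = 'Rome', p:Reviewer
REMOVE p.born, p:Trainee;

// DELETE a relationship.
MATCH (p:Person {name: 'Bob'})-[r:KNOWS]->()
DELETE r;

// FOREACH loops over the nodes of a path and updates each one.
MATCH path = (a:Person {name: 'Alice'})-[:KNOWS*1..3]->(b:Person {name: 'Carol'})
FOREACH (n IN nodes(path) | SET n.visited = true);
```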
By the end of this chapter, any reader should be fairly comfortable in executing CRUD queries. A query can comprise three phases:
-
READ : This is the phase where you read data from the graph using MATCH and OPTIONAL MATCH clauses
-
WRITE : This is the phase where you modify the graph using CREATE, MERGE, SET and the other write clauses
-
RETURN : This is the phase where you choose what to return to the caller
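As an illustration of the three phases in a single query, here is a small sketch. The City label and the LIVES_IN and VISITED relationship types are made up for the example.

```cypher
// READ: find the person and, if present, the city they live in.
MATCH (a:Person {name: 'Alice'})
OPTIONAL MATCH (a)-[:LIVES_IN]->(home:City)
// WRITE: make sure the city and the relationship exist.
MERGE (c:City {name: 'Rome'})
MERGE (a)-[:VISITED]->(c)
// RETURN: choose what goes back to the caller.
RETURN a.name, home.name, c.name;
```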
Improving Performance
This chapter mentions the following guidelines for creating queries in Neo4j :
-
Use parameterized queries: wherever possible, write queries with parameters, which allows the engine to reuse the execution plan of the query. This takes advantage of the fact that the Neo4j engine can cache the query
-
Avoid unnecessary clauses such as DISTINCT based on the background information of the graph data
-
Use direction wherever possible in match clauses
-
Use a specific depth value while searching for varying length paths
-
Profile queries so that the server does not get inundated by inefficient query construction
-
Whenever a large number of nodes carry a certain label, it is better to create an index on the properties used for lookups. In fact, while importing a large RDF dataset it is always better to create indices on certain types of nodes.
-
Use uniqueness constraints if you are worried about duplicate property values
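A few of these guidelines can be sketched as queries. The index and constraint statements below use the syntax of recent Neo4j releases (4.4 and later, to my knowledge), which differs from the syntax in use when the book was written. The names person_name and person_email_unique, and the Person, name and email identifiers, are hypothetical.

```cypher
// Parameterized query: the same cached plan can serve any value of $name.
MATCH (p:Person {name: $name})
RETURN p;

// Explicit direction and a bounded depth for a variable-length path.
MATCH (p:Person {name: $name})-[:KNOWS*1..3]->(f:Person)
RETURN f.name;

// PROFILE runs the query and reports the operators and rows it touched.
PROFILE
MATCH (p:Person {name: $name})-[:KNOWS]->(f:Person)
RETURN f.name;

// Index on a property of a heavily populated label.
CREATE INDEX person_name IF NOT EXISTS
FOR (p:Person) ON (p.name);

// Uniqueness constraint to prevent duplicate property values.
CREATE CONSTRAINT person_email_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.email IS UNIQUE;
```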
Chapter 4 : Migrating from SQL
The chapter talks about the various tasks involved in migrating data from an RDBMS to a graph database. There are three main tasks in migrating from SQL to a graph database:
-
Migrating the schema from RDBMS to Neo4j
-
Migrating the data from tables to Neo4j
-
Migrating queries to let your application continue working
It is better to start with an ER diagram that is close to the white-board representation of the data. Since a graph database can represent a white-board sketch more closely than the table structure mess (primary keys, foreign keys, cardinality), one can quickly figure out the nodes and relationships needed for the graph data. For migrating the actual data, one needs to export the data into the relevant CSV files and load them into Neo4j. The structure of the CSV files to be generated depends on the labels, nodes and relationships of the graph database schema. Migrating queries from the RDBMS world to the graph database world is far easier, as Cypher is a declarative syntax, and the various business requirement queries are far quicker to code in Cypher.
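One way to do the CSV step in current Neo4j versions is the LOAD CSV clause; the book may well take a different import route, so treat this only as a sketch. The file names people.csv and orders.csv, their column headers, and the Person and Order labels are all hypothetical.

```cypher
// people.csv is assumed to have the header: id,name
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (p:Person {id: toInteger(row.id)})
SET p.name = row.name;

// orders.csv is assumed to have the header: order_id,person_id
// The foreign key column becomes a relationship.
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MATCH (p:Person {id: toInteger(row.person_id)})
MERGE (o:Order {id: toInteger(row.order_id)})
MERGE (p)-[:PLACED]->(o);

// A migrated query. The SQL version would be roughly:
//   SELECT p.name, COUNT(*) FROM person p
//   JOIN orders o ON o.person_id = p.id GROUP BY p.name
MATCH (p:Person)-[:PLACED]->(o:Order)
RETURN p.name, count(o) AS orders;
```

The join and the GROUP BY both disappear: the relationship pattern replaces the join, and Cypher groups implicitly by the non-aggregated columns in RETURN.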
Chapter 5 : Operators and Functions
The last section of the book contains a laundry list of operators and functions that one can use in creating a Cypher query. It is more like a cheat sheet, but with elaborate explanations of the various Cypher keywords.
Takeaway
This book gives a quick introduction to all the relevant keywords needed to construct a Cypher query. In fact it is fair to describe the contents of the book as a long tutorial with sections and subsections that can quickly bring a Cypher novice up to speed.