Cypher

Cypher was the fifth query system I worked on at Neo4j.

When I joined Neo4j in 2007, it was an embedded database engine that ran inside a Java application. Querying was done through an API for enumerating relationships from or to a node, and through a slightly higher-level node.traverse(...) method. That method followed relationships with selected types in either breadth-first or depth-first order, with callback functions controlling when to stop descending a path and which nodes to include in the result.
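To make the callback style concrete, here is a minimal Python sketch of that kind of traversal. It is an illustration only, not the Neo4j Java API: the function name `traverse`, the `stop` and `include` callbacks, and the dictionary-based graph are all assumptions made for this example.

```python
from collections import deque

def traverse(graph, start, rel_types, order="breadth",
             stop=lambda node, depth: False,
             include=lambda node, depth: True):
    """Follow relationships of the selected types from `start`.

    graph: dict mapping node -> list of (rel_type, neighbor) pairs (hypothetical shape).
    stop(node, depth): return True to stop descending below this node.
    include(node, depth): return True to put this node in the result.
    """
    frontier = deque([(start, 0)])
    seen = set()
    result = []
    while frontier:
        # A queue gives breadth-first order, a stack gives depth-first order.
        node, depth = frontier.popleft() if order == "breadth" else frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        if include(node, depth):
            result.append(node)
        if stop(node, depth):
            continue
        for rel_type, neighbor in graph.get(node, []):
            if rel_type in rel_types and neighbor not in seen:
                frontier.append((neighbor, depth + 1))
    return result

# Example data, invented for this sketch.
example_graph = {
    "a": [("KNOWS", "b"), ("LIKES", "x")],
    "b": [("KNOWS", "c")],
    "c": [("KNOWS", "d")],
}
friends_within_two = traverse(
    example_graph, "a", {"KNOWS"},
    stop=lambda node, depth: depth >= 2,    # do not descend past depth 2
    include=lambda node, depth: depth > 0,  # leave out the start node
)
```

Here the two callbacks play the roles described above: one decides where to stop descending, the other decides which visited nodes end up in the result.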

Before Cypher

The first query system I worked on was an API for interacting with the embedded Neo4j engine from a separate Java process. At first it mostly replicated the basic API for enumerating the relationships of a node. That was good enough for tooling, but inefficient for actual querying.

To improve on that, I designed a system based on sets of nodes and relationships that could be expanded and filtered step by step. Each step was lazy: no request was sent to the database process until the contents of a set were actually needed. At that point, a binary query representation was assembled and sent to the server, which returned the requested set. The project was shut down before it reached users. At the time, the founders at Neo4j were strongly opposed to query languages in general. Even a binary query language conflicted with the early Neo4j design philosophy that the database should live inside the Java process.
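The lazy behaviour can be sketched in a few lines of Python. Everything here is hypothetical: the class names, the step names, and the use of plain tuples where the real system assembled a binary representation. The point is only that building a set costs nothing, and a single request is sent when the contents are first needed.

```python
class FakeServer:
    """Stands in for the separate database process (assumption for this sketch)."""

    def __init__(self, edges, props):
        self.edges = edges    # list of (source, rel_type, target)
        self.props = props    # node -> dict of properties
        self.requests = 0     # counts round trips

    def execute(self, steps):
        self.requests += 1
        nodes = set()
        for step in steps:
            if step[0] == "start":
                nodes = set(step[1])
            elif step[0] == "expand":
                nodes = {t for (s, r, t) in self.edges if s in nodes and r == step[1]}
            elif step[0] == "filter":
                _, key, value = step
                nodes = {n for n in nodes if self.props.get(n, {}).get(key) == value}
        return nodes


class LazyNodeSet:
    """A set of nodes described by steps; nothing is sent until it is iterated."""

    def __init__(self, server, steps):
        self._server = server
        self._steps = tuple(steps)
        self._cache = None

    def expand(self, rel_type):
        return LazyNodeSet(self._server, self._steps + (("expand", rel_type),))

    def filter(self, key, value):
        return LazyNodeSet(self._server, self._steps + (("filter", key, value),))

    def __iter__(self):
        if self._cache is None:
            # The whole query is assembled and sent in one request.
            self._cache = self._server.execute(self._steps)
        return iter(self._cache)


server = FakeServer(
    edges=[("alice", "KNOWS", "bob"), ("alice", "KNOWS", "carol"),
           ("alice", "WORKS_AT", "acme")],
    props={"bob": {"city": "Malmo"}, "carol": {"city": "London"}},
)
query = LazyNodeSet(server, [("start", ("alice",))]).expand("KNOWS").filter("city", "Malmo")
requests_before = server.requests   # still 0: only a description has been built
friends_in_malmo = sorted(query)    # first use triggers exactly one request
```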

The second query system I worked on reached customers around 2009. It was a more powerful evolution of the node.traverse(...) API. For that work I introduced the notion of paths into Neo4j’s traversal framework. By evaluating predicates on a path rather than only on a single node or relationship, it became possible to remember how a node had been reached and to express multi-step predicates naturally.
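A small Python sketch shows why predicates over paths matter. The names and the tuple encoding of a path are assumptions for illustration, not the traversal framework's actual types. The same node can be reached in two ways, and only a predicate that sees the whole path can tell them apart.

```python
def paths_from(graph, start, max_length):
    """Yield every path of up to `max_length` relationships from `start`.

    A path is a tuple (node, rel_type, node, rel_type, ..., node).
    graph: dict mapping node -> list of (rel_type, neighbor) pairs (hypothetical shape).
    """
    stack = [(start,)]
    while stack:
        path = stack.pop()
        yield path
        if len(path) // 2 < max_length:
            for rel_type, neighbor in graph.get(path[-1], []):
                if neighbor not in path[::2]:   # do not revisit nodes on this path
                    stack.append(path + (rel_type, neighbor))


def reached_via_friend(path):
    """Multi-step predicate: a KNOWS step followed by a WORKS_AT step."""
    return path[1::2] == ("KNOWS", "WORKS_AT")


# Example data, invented for this sketch: acme is reachable directly and through bob.
company_graph = {
    "alice": [("KNOWS", "bob"), ("WORKS_AT", "acme")],
    "bob": [("WORKS_AT", "acme")],
}
matches = [p for p in paths_from(company_graph, "alice", 2) if reached_via_friend(p)]
```

A predicate that looked only at the end node would accept both routes to `acme`; the path-based predicate keeps only the one that goes through a friend.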

Beyond that, the enhanced traversal framework defined a state machine for traversal steps, similar in spirit to my earlier set-based query API. Each step could define both which relationships to follow from a node and which step to evaluate next. When I left Neo4j in 2021, this API was still an important tool used by Neo4j support engineers to build efficient embedded procedures for customer logic. Cypher did not fully recover that level of expressive power until regular path queries entered GQL.
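The state-machine idea can be illustrated with a short Python sketch. The table layout and names below are assumptions made for this example, not the framework's real interface. Each step says which relationship types to follow and which step applies at the node reached, which is enough to express a pattern such as "any number of KNOWS steps, then one WORKS_AT step".

```python
def run_steps(graph, start, transitions, first_step, accepting):
    """Traverse `graph` under a step table.

    transitions: dict step -> {rel_type: next_step} (hypothetical encoding).
    accepting: steps whose nodes belong in the result.
    """
    seen = {(start, first_step)}
    frontier = [(start, first_step)]
    results = set()
    while frontier:
        node, step = frontier.pop()
        if step in accepting:
            results.add(node)
        for rel_type, neighbor in graph.get(node, []):
            next_step = transitions.get(step, {}).get(rel_type)
            # Track (node, step) pairs so cycles in the graph terminate.
            if next_step is not None and (neighbor, next_step) not in seen:
                seen.add((neighbor, next_step))
                frontier.append((neighbor, next_step))
    return results


# Example data, invented for this sketch; note the KNOWS cycle between alice and bob.
social_graph = {
    "alice": [("KNOWS", "bob"), ("WORKS_AT", "initech")],
    "bob": [("KNOWS", "carol"), ("KNOWS", "alice")],
    "carol": [("WORKS_AT", "acme")],
}
step_table = {
    "person": {"KNOWS": "person", "WORKS_AT": "employer"},
    "employer": {},
}
employers = run_steps(social_graph, "alice", step_table, "person", {"employer"})
```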

REST, Gremlin, and the road to Cypher

The original philosophy of keeping the database confined to a Java process eventually broke down. As the NoSQL movement gathered momentum and REST interfaces became the fashionable way to expose data systems, Neo4j followed that trend. This led to the third and fourth query systems I worked on.

In 2009 we started building an HTTP API for Neo4j. Relationship types mapped naturally onto URL path segments, so I explored how far that idea could be taken. The harder part was expressing predicates on properties, especially comparisons between properties on different nodes or relationships along a path.

At that point the work split in two directions.

One direction was to lean further into URL-shaped queries, at least for an interesting subset of graph queries. I proposed using XPath, since it resembles HTTP URLs closely enough to feel like a natural fit. Peter Neubauer, one of Neo4j’s co-founders, knew Marko Rodriguez and brought him in to prototype the idea. That prototype became the first version of Gremlin. Peter and Marko later co-founded TinkerPop. Over time, the TinkerPop ecosystem became part of a competing graph stack. Later versions of Gremlin replaced XPath with a Groovy DSL and ended up somewhat resembling the set-based query system I had built for Neo4j in 2007.

The other direction was to send JavaScript to the server to express traversal steps and predicates. That was the direction Neo4j chose to pursue internally, and it became the fourth query system I worked on.

Cypher begins

Sending JavaScript to the server was not something the Neo4j engineers, myself included, found elegant. The main problem was security: evaluating arbitrary JavaScript inside the database server was a serious risk. Gremlin was not much better in that respect, since arbitrary Groovy had the same basic problem.

Several of us argued that Neo4j should design a query language of its own, but management did not favor the idea. Eventually Andrés Taylor ignored that guidance and built a prototype anyway. He called the language Cypher, after the traitor in The Matrix, on the theory that such a bad name would have to be changed before release.

The design of Cypher was grounded in the diagrams I used to draw on whiteboards to describe how a graph should be traversed to answer a query. Our colleague Mattias Finné would often transcribe those diagrams into ASCII art inside code comments as we implemented the logic using the traversal framework. Cypher made those diagrams executable.

In the early development of Cypher, Andrés led implementation while I led much of the language design. Stefan Plantikow joined shortly afterward and became part of the early core of that work.

Early language design

One of the first substantial internal projects to use Cypher was a Neo4j performance testing system that I built together with Alistair Jones. We stored test configurations and executions in a Neo4j database. That work led me to propose what we later called linear composition of queries, implemented through the WITH clause. It allowed a query to perform aggregation and then continue querying within the same statement, avoiding client round trips that did nothing except hand identifiers back to the server for the next step.

Lookup was another major limitation in early Cypher. The main way to locate data was by internal identifiers, but real applications typically want lookup by domain values. If you want to find a user, you want to use something like an email address or an application-specific user id.

At the time, a common pattern in Neo4j was to navigate from a node with a well-known internal id to some organizing node, and from there traverse relationships to candidate user nodes before filtering by user id. That worked reasonably well in the embedded setting, especially with helper libraries that structured the graph as a lookup tree from a well-known anchor node.

We could have exposed that approach in Cypher for reads, but writes were much harder. When creating or updating a node, there was no way to tell what kind of node it was, and therefore no principled way to determine which well-known structure it should belong to.

To solve that, I proposed and drove a change to the Neo4j data model that introduced labels on nodes. Labels made node kinds explicit, which in turn made it possible to associate nodes with the right lookup structures. In implementation terms, we did not continue very far with the original well-known-node approach. Instead, Neo4j moved toward a more conventional database architecture and added indexes for looking up nodes by label and property.
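The role labels play on the write path can be sketched as follows. This is a toy Python model with hypothetical names, not how Neo4j implements its indexes. The label is what lets the write path decide whether a node belongs in a given lookup structure.

```python
from collections import defaultdict


class LabelPropertyIndex:
    """Toy index over nodes with a given label and property key."""

    def __init__(self, label, key):
        self.label = label
        self.key = key
        self._entries = defaultdict(set)   # property value -> node ids

    def on_node_written(self, node_id, labels, props):
        # Without labels there would be no principled way to make this decision.
        if self.label in labels and self.key in props:
            self._entries[props[self.key]].add(node_id)

    def lookup(self, value):
        return set(self._entries.get(value, ()))


user_by_email = LabelPropertyIndex("User", "email")
user_by_email.on_node_written(1, {"User"}, {"email": "ada@example.com"})
user_by_email.on_node_written(2, {"Product"}, {"email": "ada@example.com"})  # wrong label, not indexed
user_by_email.on_node_written(3, {"User"}, {"name": "no email"})             # missing key, not indexed
found = user_by_email.lookup("ada@example.com")
```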

openCypher and GQL

Cypher continued to evolve after that. Neo4j formed an internal Cypher Language Group, initially consisting of Andrés, Stefan, Petra Selmer, and me.

Later we started an effort called openCypher to make Cypher available beyond Neo4j. Design documents from that phase are available in the openCypher archives:

Cypher Improvement Proposals
Additional CIPs I wrote, which never completed the openCypher process because attention shifted toward GQL standardization

openCypher had some success in encouraging adoption by other vendors, but Neo4j ultimately aimed higher: an ISO/IEC standard query language for graph databases to stand alongside SQL for relational systems. That led to the GQL effort, where I was heavily involved on behalf of Neo4j and still participate.