How to reduce the gap between OLTP and OLAP graph solutions with Spark-Tinkerpop

By Fabiana Lanotte

OLTP versus OLAP graph solutions

Graphs are all around us. They can model countless real-world phenomena, from the social to the scientific, including engineering, biology, medical systems, IoT systems, and e-commerce systems. They let us structure entities (the graph's nodes) and the relationships among them (the graph's edges) in a natural way. As an example, a website can easily be represented as a graph: in a simple approach, web pages are the nodes and hyperlinks are the relationships among them. Graph theory algorithms can then be applied to extract new knowledge. For example, on a website graph one can discover how information propagates among nodes, or organize web pages into clusters that share similar topics and are tightly connected.
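To make the website example concrete, here is a minimal sketch in plain Scala, with no graph library involved. The page names and links are invented for illustration. It builds an adjacency list from a list of hyperlinks and then runs two simple graph computations: the set of pages reachable from a starting page, and the number of incoming links per page.

```scala
object WebsiteGraph {
  // Hypothetical pages (nodes) and hyperlinks (directed edges: source -> target).
  val links: Seq[(String, String)] = Seq(
    "home"   -> "blog",
    "home"   -> "about",
    "blog"   -> "post-1",
    "blog"   -> "about",
    "post-1" -> "home"
  )

  // Adjacency list: each page mapped to the set of pages it links to.
  val outLinks: Map[String, Set[String]] =
    links.groupBy(_._1).map { case (src, es) => src -> es.map(_._2).toSet }

  // Pages reachable from `start` by following hyperlinks (breadth-first search).
  def reachable(start: String): Set[String] = {
    @annotation.tailrec
    def loop(frontier: Set[String], seen: Set[String]): Set[String] =
      if (frontier.isEmpty) seen
      else {
        val next = frontier.flatMap(p => outLinks.getOrElse(p, Set.empty[String])) -- seen
        loop(next, seen ++ next)
      }
    loop(Set(start), Set(start))
  }

  // Number of incoming hyperlinks per page, a crude measure of importance.
  val inDegree: Map[String, Int] =
    links.groupBy(_._2).map { case (dst, es) => dst -> es.size }
}
```

Reachability is a toy version of the "how information propagates" question; clustering by topic would need page content as well and is out of scope for this sketch.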

For this reason, graph databases have gained a lot of popularity in recent years. Unlike traditional relational databases (or other storage paradigms, such as document databases), graph databases store entities together with their direct relationships (e.g. as an adjacency matrix or an incidence matrix) instead of inferring connections among entities through costly join operations. In other words, relationships are first-class citizens in a graph database.

Nowadays, different and complementary Big Data solutions provide either on-line transaction processing (OLTP) or on-line analytical processing (OLAP) on graphs. OLTP systems are characterized by a large number of short on-line transactions, each involving a small portion of the graph (e.g. inserting new nodes or edges, or running simple queries). These solutions emphasize very fast query processing and data integrity in multi-access environments, and their effectiveness is measured in transactions per second. Titan, OrientDB and Neo4j are examples of OLTP tools. OLAP systems instead perform complex analyses (e.g. graph traversals, data aggregation) over extremely large graphs. Spark GraphX, Apache Giraph and SparkGraphComputer are examples of OLAP tools.
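The difference between the two workloads can be sketched on a toy in-memory graph. This is plain Scala written for illustration only; it does not use the API of any of the tools named above. The OLTP-style operations touch one or two vertices, while the OLAP-style operation has to scan every edge.

```scala
object WorkloadSketch {
  // A tiny immutable directed graph: vertex id -> ids of its out-neighbours.
  final case class Graph(adj: Map[Long, Set[Long]]) {

    // OLTP-style write: touches only the two endpoints of the new edge.
    def addEdge(src: Long, dst: Long): Graph = {
      val withSrc = adj.updated(src, adj.getOrElse(src, Set.empty[Long]) + dst)
      Graph(if (withSrc.contains(dst)) withSrc else withSrc.updated(dst, Set.empty[Long]))
    }

    // OLTP-style read: a lookup on a single vertex.
    def neighbours(v: Long): Set[Long] = adj.getOrElse(v, Set.empty[Long])

    // OLAP-style analysis: scans every edge to aggregate the in-degree of every vertex.
    def inDegrees: Map[Long, Int] = {
      val zero: Map[Long, Int] = adj.keys.map(v => v -> 0).toMap
      adj.values.flatten.foldLeft(zero) { (acc, dst) =>
        acc.updated(dst, acc.getOrElse(dst, 0) + 1)
      }
    }
  }

  // Three short "transactions" build the graph 1 -> 2, 1 -> 3, 2 -> 3.
  val g: Graph = Graph(Map.empty[Long, Set[Long]])
    .addEdge(1L, 2L)
    .addEdge(1L, 3L)
    .addEdge(2L, 3L)
}
```

On a real deployment the first two operations would be served by the graph database, and the third by a distributed engine that partitions the edge scan across a cluster.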

Although OLTP and OLAP tools are both fundamental components of the Big Data field, existing infrastructures are not yet mature enough to integrate OLAP systems with OLTP graph solutions.

In this post, we present Spark-Tinkerpop, a Scala API for converting a graph database, such as a Titan, Neo4j, OrientDB or GraphSON database, into GraphX format, and vice versa. It adds a further layer of abstraction on top of Tinkerpop, which is necessary for bridging the gap between the various graph-vendor implementations and the underlying Spark implementation. All that Spark-Tinkerpop needs is a bijection that converts the Spark vertex and edge types into a Map[AnyRef].

Spark-Tinkerpop

In the following we show a snapshot of the object TitanGraphProvider, which provides 'native' communication between Spark and Titan. By bypassing some of the Tinkerpop APIs, this connector can read and write data directly through the underlying graph engine.

Let's try a simple usage example.

Suppose we have a GraphX graph of connected people as described here and we want to store it as a Titan graph.

As said before, the only thing we need is a conversion strategy for translating the types Relationship (i.e. GraphX edges) and Person (i.e. GraphX vertices) into Map types. For this reason, we define the classes RelationshipRawPropSetArrow and RelationshipPropSetArrow for converting a Relationship into a TinkerPropMap[AnyRef], and the classes PersonRawPropSetArrow and PersonPropSetArrow for converting Person instances.
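The idea behind such a conversion can be sketched in plain Scala. Everything below is an assumption made for illustration: the `Bijection` trait, the `PropMap` alias, and the fields of `Person` and `Relationship` are hypothetical and are not taken from the Spark-Tinkerpop sources. The point is only that each type gets a two-way mapping to a property map, and that converting there and back returns the original value.

```scala
object ConversionSketch {
  // Hypothetical stand-in for the property-map type; the real TinkerPropMap may differ.
  type PropMap = Map[String, AnyRef]

  // Hypothetical vertex and edge payloads; the field names are assumptions.
  final case class Person(name: String, age: Int)
  final case class Relationship(kind: String)

  // A two-way conversion: from(to(a)) must give back a.
  trait Bijection[A] {
    def to(a: A): PropMap
    def from(m: PropMap): A
  }

  object PersonBijection extends Bijection[Person] {
    // Int is boxed explicitly so that it fits into a Map of AnyRef values.
    def to(p: Person): PropMap =
      Map("name" -> p.name, "age" -> Int.box(p.age))

    def from(m: PropMap): Person =
      Person(
        m("name").asInstanceOf[String],
        m("age").asInstanceOf[java.lang.Integer].intValue
      )
  }

  object RelationshipBijection extends Bijection[Relationship] {
    def to(r: Relationship): PropMap = Map("kind" -> r.kind)
    def from(m: PropMap): Relationship = Relationship(m("kind").asInstanceOf[String])
  }
}
```

A quick round trip such as `PersonBijection.from(PersonBijection.to(Person("Alice", 30)))` should give back the original `Person`. If a field is dropped or renamed on one side only, the mapping is no longer a bijection and data would be lost when moving between the two graph representations.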

After defining these classes, we expose an instance of each as an implicit value in an object GraphArrow, which then holds all the implicit parameters that TitanGraphProvider requires for the conversion process.
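The mechanism at work here is ordinary Scala implicit resolution, which the following self-contained sketch illustrates. The names `ToProps`, `Converters` and `exportGraph` are hypothetical: `Converters` plays the role the post gives to GraphArrow, and `exportGraph` stands in for a provider method. Neither reflects the actual signatures of TitanGraphProvider.

```scala
object ImplicitsSketch {
  type PropMap = Map[String, AnyRef]

  // Hypothetical payload types, reduced to one field each.
  final case class Person(name: String)
  final case class Relationship(kind: String)

  // Hypothetical one-way converter, enough to show implicit resolution.
  trait ToProps[A] { def apply(a: A): PropMap }

  // One object that gathers all implicit converters in a single place.
  object Converters {
    implicit val personToProps: ToProps[Person] =
      new ToProps[Person] { def apply(p: Person): PropMap = Map("name" -> p.name) }

    implicit val relationshipToProps: ToProps[Relationship] =
      new ToProps[Relationship] { def apply(r: Relationship): PropMap = Map("kind" -> r.kind) }
  }

  // Stand-in for a provider method: the caller never passes a converter explicitly,
  // the compiler selects one per type from the implicit scope.
  def exportGraph[V, E](vertices: Seq[V], edges: Seq[E])(
      implicit vc: ToProps[V], ec: ToProps[E]): (Seq[PropMap], Seq[PropMap]) =
    (vertices.map(v => vc(v)), edges.map(e => ec(e)))

  // Bringing the converters into scope is all the call site has to do.
  import Converters._

  val (vertexProps, edgeProps) =
    exportGraph(Seq(Person("Alice"), Person("Bob")), Seq(Relationship("knows")))
}
```

Keeping the implicit values in one object means that supporting a new vertex or edge type only requires adding one more implicit value, without touching the code that performs the export.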

Finally, we create a main class that uses TitanGraphProvider and GraphArrow.

That’s it. By playing with this API one can combine the high scalability of the Titan graph database with a powerful and resilient graph computing system such as GraphX.