Recently, I had the chance to install, modify, compile, and run Apache Giraph; a system that processes batch-jobs (analytics tasks) over large-scale graphs (e.g., shortest path, PageRank). Giraph is deemed an open source implementation of Pregel which was introduced by Google in the first place. To achieve scalability, Giraph is built as a layer on top of Hadoop to leverage its parallelism and cluster management capabilities. Giraph loads graph data stored in Hadoop HDFS into the memory of a cluster of servers. Therefore, the designated graph processing algorithm executes as a giant map phase with no reduce phase involved to avoid unnecessary data shuffling across the network. During the map phase, each worker asynchronously communicates with other workers using Netty to exchange messages during a graph processing job.

A salient feature of Giraph is its simple, yet effective, programming model. A Giraph programmer has to think like a "Vertex". In other words, the programmer implements a method, called `compute()`

, that is invoked for each visited (active) vertex. An example of the `compute() `

method for the single source shortest path algorithm is as follows:

public void compute(Iterable

messages) {

double minDist = Double.MAX_VALUE;

for (DoubleWritable message : messages) {

minDist = Math.min(minDist, message.get());

}

if (minDist < getValue().get()) {

setValue(new DoubleWritable(minDist));

for (Edgeedge : getEdges()) {

double distance = minDist + edge.getValue().get();

sendMessage(edge.getTargetVertexId(),

new DoubleWritable(distance));

}

}

voteToHalt();

}

In Giraph, a graph processing job consists of a set of iterations, called SuperSteps. Initially, all graph vertices are active. Therefore, only a subset of graph vertices are active in subsequent supersteps. Giraph employs a bulk synchronous parallel computation approach to implement a barrier between consecutive supersteps in order to synchronize computation among all graph workers. Each vertex performs local computation and/or send messages to other vertices inside the `compute()`

method. The `compute()`

method eventually invokes the `voteToHalt()`

method to broadcast that the current vertex is done with computation. The graph processing job terminates only if no more vertices are to be visited.