About Giraph: Large-Scale (Big) Graph Analytics

| No Comments

Recently, I had the chance to install, modify, compile, and run Apache Giraph; a system that processes batch-jobs (analytics tasks) over large-scale graphs (e.g., shortest path, PageRank). Giraph is deemed an open source implementation of Pregel which was introduced by Google in the first place. To achieve scalability, Giraph is built as a layer on top of Hadoop to leverage its parallelism and cluster management capabilities. Giraph loads graph data stored in Hadoop HDFS into the memory of a cluster of servers. Therefore, the designated graph processing algorithm executes as a giant map phase with no reduce phase involved to avoid unnecessary data shuffling across the network. During the map phase, each worker asynchronously communicates with other workers using Netty to exchange messages during a graph processing job.

A salient feature of Giraph is its simple, yet effective, programming model. A Giraph programmer has to think like a "Vertex". In other words, the programmer implements a method, called compute(), that is invoked for each visited (active) vertex. An example of the compute() method for the single source shortest path algorithm is as follows:

public void compute(Iterable messages) {
double minDist = Double.MAX_VALUE;
for (DoubleWritable message : messages) {
minDist = Math.min(minDist, message.get());
if (minDist < getValue().get()) {
setValue(new DoubleWritable(minDist));
for (Edge edge : getEdges()) {
double distance = minDist + edge.getValue().get();
new DoubleWritable(distance));

In Giraph, a graph processing job consists of a set of iterations, called SuperSteps. Initially, all graph vertices are active. Therefore, only a subset of graph vertices are active in subsequent supersteps. Giraph employs a bulk synchronous parallel computation approach to implement a barrier between consecutive supersteps in order to synchronize computation among all graph workers. Each vertex performs local computation and/or send messages to other vertices inside the compute() method. The compute() method eventually invokes the voteToHalt() method to broadcast that the current vertex is done with computation. The graph processing job terminates only if no more vertices are to be visited.

Leave a comment

About this Entry

This page contains a single entry by M. Sarwat published on July 2, 2013 11:00 PM.

Playing with Twitter Data (Retrieve Top 10 hashtags) was the previous entry in this blog.

Geographical Changes Matter is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.


Powered by Movable Type 4.31-en