SAGA Hadoop

Introduction

Hadoop has become a fixture where big data is concerned, but it has been difficult to use in HPC and HTC cluster environments. That is increasingly unfortunate as more and more new algorithms assume Hadoop is an option. I tried SAGA-Hadoop first, though it should be noted that myHadoop from U. of Indiana sought to remedy this a few years ago. If anyone has experience with both or would like to comment on the differences, I'd appreciate it.

SAGA Hadoop

SAGA-Hadoop installs, configures and executes Hadoop on clusters running batch schedulers for which SAGA has adapters. At least it’s headed in that direction.

SAGA is the Simple API for Grid Applications, an Open Grid Forum standard (GFD.90) for interfacing with diverse cluster batch scheduling systems. It is a large and complex standard, so we'll leave it at that for now. For our purposes, suffice it to say that it works with PBS.

The Bliss project provides Python bindings for SAGA. Like many Python APIs, it takes a minimalist approach, not covering the entire standard and demonstrating a strong preference for simplicity and brevity.

As SAGA-Hadoop's helpful introductory blog post describes, that is exactly what it does: it stands up and runs Hadoop for you.

But Can This Be Interesting on the OSG?

To make this plausible for the OSG:

      • We have to be able to automate the process entirely
      • It would be really good if there were a practical way to use command line tools as the map and reduce steps in a MapReduce computation.

So this is the story of finding those things out.

Automating Installation

The RENCI Blueridge cluster runs the PBS job manager. I scripted the installation, and nothing about it is pretty yet, except that it demonstrates automating the setup of Hadoop on an HPC cluster.

There’s Python involved so my first step was to set up a virtualenv to manage our project’s dependencies:

wget --timestamping \
   https://raw.github.com/pypa/virtualenv/master/virtualenv.py
python virtualenv.py venv
source venv/bin/activate
pip install bliss uuid

uuid is a dependency that, for whatever reason, had to be named explicitly to pip.

The uuid module calls ifconfig. Putting ifconfig on the PATH changed nothing, so we force the issue by editing the module to use an absolute path. The copy is in our virtualenv, so it's entirely ours to edit:

sed --in-place=.orig                       \
     s,\'ifconfig\',\'/sbin/ifconfig\',     \
     venv/lib/python2.4/site-packages/uuid.py
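
A quick sanity check that the patch took (not part of the original build script; just something I'd run by hand):

# The virtualenv is already active at this point; uuid1() should now
# succeed using /sbin/ifconfig to read the node's MAC address.
python -c "import uuid; print uuid.uuid1()"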

Make sure JAVA_HOME is set.
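
Concretely, that plus fetching Hadoop itself looks something like this; the JAVA_HOME path is illustrative and the tarball URL is my assumption, since the post doesn't say where Hadoop came from:

# Point JAVA_HOME at the cluster's JDK (path is illustrative).
export JAVA_HOME=/usr/lib/jvm/java-1.6.0
# Fetch and unpack Hadoop 1.0.0 alongside the scripts (URL assumed).
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.0/hadoop-1.0.0.tar.gz
tar -xzf hadoop-1.0.0.tar.gz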

Then, after unzipping Hadoop, we:

      • Get the source code for SAGA-Hadoop
      • Edit it to specify the login node for our cluster
      • Edit the bootstrap script to alter the Hadoop configs after installation. The edit:
        • Deletes an unnecessary os.makedirs() call
        • Corrects the makedirs for log_dir to be for self.job_log_dir
        • Uncomments the configuration of the Hadoop data node
        • Changes the network interface to the locally correct eth0
        • Sets JAVA_HOME
        • Sets HADOOP_HEAPSIZE
svn co https://svn.cct.lsu.edu/repos/saga-projects/applications/SAGAHadoop/saga-hadoop
sed --in-place=.orig \
    -e "s,india,br0," saga-hadoop/launcher.py
HADOOP_ENV=hadoop-1.0.0/conf/hadoop-env.sh
sed --in-place=.orig \
    -e "s,os.makedirs(job_dir),," \
    -e "s,os.makedirs(log_dir),os.makedirs(self.job_log_dir)," \
    -e "s,\<!--,," \
    -e "s,\-\-\>,," \
    -e "s,eth1,eth0," \
    -e "s,tar \-xzf hadoop.tar.gz,tar -xzf hadoop.tar.gz; echo export JAVA_HOME=$JAVA_HOME >> $HADOOP_ENV; echo export HADOOP_HEAP     _SIZE=2000 >> $HADOOP_ENV; cat $HADOOP_ENV," \
    saga-hadoop/bootstrap_hadoop.py

This is all saved as a script called build.

Running it is relatively uninteresting: it installs the virtualenv and the Python dependencies, then applies the edits described above.
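
For completeness, assuming the commands above were saved verbatim:

chmod +x build
./build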

Running Hadoop

We save these commands to a script called start:

source venv/bin/activate
./saga-hadoop/bootstrap_hadoop.py

Running it logs plenty of Hadoop information to the console and starts the server.
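
At this point it's worth confirming the overlay actually came up. Two quick checks, assuming the bootstrap unpacked Hadoop under work/hadoop-1.0.0 as the streaming step below implies:

# The Hadoop daemons show up as Java processes.
jps
# Ask the JobTracker for the list of running jobs; an empty list still
# proves the MapReduce layer is answering.
work/hadoop-1.0.0/bin/hadoop job -list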

Command Line MapReduce with Hadoop Streaming

The command line is the lingua franca of the OSG. There’s nothing finer for figuring out a problem quickly than the command line. So how are we going to run MapReduce jobs from the command line – especially in this dynamically created environment?

Hadoop Streaming lets us run programs of our choice as the map and reduce steps, but first we need an additional Java archive to make it work. The following commands go in a file called stream. This first batch fetches the JAR and puts it in the right place:

jar_url=http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-streaming/1.0.0/hadoop-streaming-1.0.0.jar
lib=work/hadoop-1.0.0/share/hadoop/contrib/streaming
mkdir -p $lib
curl $jar_url  > $lib/hadoop-streaming-1.0.0.jar

Once that’s in place, we get rid of the old output directory (not likely to be relevant for an OSG job) and run the streaming job:

inputs=$1
output=$2
mapper=$3
reducer=$4

rm -rf $output   # Hadoop will not overwrite an existing output directory

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/contrib/streaming/hadoop-streaming-1.0.0.jar \
     -input $inputs \
     -output $output \
     -mapper $mapper \
     -reducer $reducer \
     -jobconf mapred.reduce.tasks=2

We execute the MapReduce job with whatever’s passed in. Then we create a trivial program called script to be our map operator:

#!/bin/bash
# Emit 101 lines; the script ignores its input, so every map task
# produces the same output.
for x in $(seq 0 100); do
    echo pattern $x
done

Next, we run stream like this:

./stream in out script wc

This produces the word count of our mapper output, our 101 lines times three:

 [scox@br0:~/dev/saga-hadoop]$ cat out/part-00000
     303     606    3609

because the map script is executed once per file in our input directory (in):

 in
 |-- a
 |-- b
 `-- c
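
For reference, an input directory like that can be recreated with something like this; the file contents are irrelevant (the map script ignores its input), and the names are just the ones shown above:

# Three input files mean three map tasks.
mkdir -p in
for f in a b c; do
    echo "seed for $f" > in/$f
done
# Only needed if the overlay is configured to use HDFS rather than the
# local filesystem:
# $HADOOP_HOME/bin/hadoop fs -put in in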

Summary

We’ve just done a completely automated installation, configuration, and execution of a small Hadoop cluster overlay on top of a PBS HPC cluster. We also saw automation for installing Hadoop Streaming and running a test MapReduce job using command line programs as the map and reduce operators.

The overall stack, from bottom to top: the PBS-managed HPC cluster, SAGA (via the Bliss Python bindings), the SAGA-Hadoop launcher and bootstrap, Hadoop itself, and Hadoop Streaming jobs driven by command line programs.

It is probably clear that much more would be necessary to configure and deploy a genuinely useful cluster, particularly interfacing it cleanly with a PBS job context. That part will have to wait for another day. And again, if anyone reading this can comment on myHadoop, I’d appreciate the insight.



One Response to SAGA Hadoop

  1. I tried both myHadoop from U. of Indiana and SAGA-Hadoop. As a new user of Hadoop, I found myHadoop more tightly coupled to the application: the Hadoop cluster is alive only while the application is running, and once the application completes the cluster goes down automatically. SAGA-Hadoop provides similar capabilities but decouples the application from the Hadoop cluster setup. It just sets up the Hadoop cluster and lets the user take care of running applications on it, which allows multiple jobs to run against the same Hadoop cluster.
