Hadoop has become a fixture where big data is concerned, but it has been difficult to use in HPC and HTC cluster environments. That’s increasingly unfortunate as more and more new algorithms assume Hadoop is an option. I tried SAGA-Hadoop first, but it should be noted that myHadoop, from Indiana University, sought to remedy this a few years ago. If anyone has experience with both or would like to comment on the differences, I’d appreciate it.
SAGA-Hadoop installs, configures and executes Hadoop on clusters running batch schedulers for which SAGA has adapters. At least it’s headed in that direction.
SAGA is the Simple API for Grid Applications and an Open Grid Forum standard (GFD.90) for interfacing with diverse cluster batch scheduling systems. It is a large and complex standard so we’ll leave it at that for now. For our purposes, suffice to say that it works with PBS.
The Bliss project provides Python bindings for SAGA. Like many Python APIs, it takes a minimalist approach, not covering the entire standard and demonstrating a strong preference for simplicity and brevity.
As the helpful introductory blog post above (SAGA-Hadoop) describes, it runs Hadoop.
But Can This Be Interesting on the OSG?
To make this plausible for the OSG:
- We have to be able to automate the process entirely
- It would be really good if there were a practical way to use command line tools as the map and reduce steps in a MapReduce computation.
So this is the story of finding those things out.
The RENCI Blueridge cluster runs the PBS job manager. I scripted the installation; nothing about it is pretty yet, except that it demonstrates automated setup of Hadoop on an HPC cluster.
There’s Python involved, so my first step was to set up a virtualenv to manage our project’s dependencies:
wget --timestamping \
  https://raw.github.com/pypa/virtualenv/master/virtualenv.py
python virtualenv.py venv
source venv/bin/activate
pip install bliss uuid
uuid is a dependency that, for whatever reason, needed to be explicitly named to pip.
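Before going further, it’s worth confirming that the dependencies actually import inside the virtualenv. This is a hedged sanity check of my own, not part of the original build script, and it assumes the active virtualenv provides a `python` on the path:

```shell
# Report each dependency as ok or MISSING without aborting the script.
for m in bliss uuid; do
  if python -c "import $m" 2>/dev/null; then
    echo "$m: ok"
  else
    echo "$m: MISSING"
  fi
done
```

A MISSING line here means pip didn’t install what you thought it did, which is cheaper to learn now than mid-bootstrap.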
The uuid module shells out to ifconfig, and putting ifconfig in the path changed nothing, so we force the issue by editing the module file. Since it lives in our virtualenv, the copy is entirely ours:
sed --in-place=.orig \
  s,\'ifconfig\',\'/sbin/ifconfig\', \
  venv/lib/python2.4/site-packages/uuid.py
Make sure JAVA_HOME is set.
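A small guard makes that requirement explicit. This is my own addition, and the example path in the message is an assumption, not Blueridge’s actual Java location:

```shell
# Verify JAVA_HOME points at a usable JDK before bootstrapping Hadoop.
check_java_home() {
  if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME ok: $JAVA_HOME"
  else
    echo "JAVA_HOME unset or invalid; try e.g. export JAVA_HOME=/usr/java/default" >&2
    return 1
  fi
}
check_java_home || true   # report, but don't kill an interactive shell
```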
Then, after unzipping Hadoop, we
- Get the source code for SAGA-Hadoop
- Edit it to specify the login node for our cluster
- Edit the bootstrap script to alter the Hadoop configs after installation. The edit:
- Deletes an unnecessary os.makedirs() call
- Corrects the makedirs for log_dir to use self.job_log_dir
- Uncomments the configuration of the Hadoop data node
- Alters the network interface to the locally correct eth0
- Sets JAVA_HOME
- Sets HADOOP_HEAP_SIZE
svn co https://svn.cct.lsu.edu/repos/saga-projects/applications/SAGAHadoop/saga-hadoop
sed --in-place=.orig \
  -e "s,india,br0," \
  saga-hadoop/launcher.py
HADOOP_ENV=hadoop-1.0.0/conf/hadoop-env.sh
sed --in-place=.orig \
  -e "s,os.makedirs(job_dir),," \
  -e "s,os.makedirs(log_dir),os.makedirs(self.job_log_dir)," \
  -e "s,\<!--,," \
  -e "s,\-\-\>,," \
  -e "s,eth1,eth0," \
  -e "s,tar \-xzf hadoop.tar.gz,tar -xzf hadoop.tar.gz; echo export JAVA_HOME=$JAVA_HOME >> $HADOOP_ENV; echo export HADOOP_HEAP_SIZE=2000 >> $HADOOP_ENV; cat $HADOOP_ENV," \
  saga-hadoop/bootstrap_hadoop.py
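One caveat with this approach: a sed pattern that doesn’t match fails silently. A quick check that the substitutions actually landed can save a confusing debugging session later. The helper below is my own addition, not part of the SAGA-Hadoop sources:

```shell
# Fail loudly if an expected string is absent from an edited file.
assert_edited() {   # usage: assert_edited <file> <pattern>
  if grep -q "$2" "$1"; then
    echo "ok: '$2' present in $1"
  else
    echo "EDIT FAILED: '$2' not found in $1" >&2
    return 1
  fi
}

# Only meaningful after the checkout and sed steps have run.
if [ -d saga-hadoop ]; then
  assert_edited saga-hadoop/launcher.py br0
  assert_edited saga-hadoop/bootstrap_hadoop.py eth0
fi
```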
This is all saved as a script called build.
Running it is relatively uninteresting. It installs the virtualenv, Python dependencies and edits the config files as you’d expect.
We save these commands to a script called start:
source venv/bin/activate
./saga-hadoop/bootstrap_hadoop.py
Running it logs plenty of Hadoop information to the console and starts the server.
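Rather than eyeballing the console output, we can poll for the services directly. The port below assumes Hadoop 1.x defaults (50030 is the JobTracker web UI); adjust for your configuration. This is a sketch of my own, not something SAGA-Hadoop provides:

```shell
# Poll until a TCP port accepts connections, or give up after N tries.
# Uses bash's /dev/tcp device, so we shell out to bash explicitly.
wait_for_port() {   # usage: wait_for_port <host> <port> [tries]
  host=$1; port=$2; tries=${3:-30}
  i=0
  while [ $i -lt $tries ]; do
    if bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

wait_for_port localhost 50030 2 \
  && echo "JobTracker web UI is up" \
  || echo "JobTracker web UI not reachable yet"
```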
Command Line MapReduce with Hadoop Streaming
The command line is the lingua franca of the OSG. There’s nothing finer for figuring out a problem quickly than the command line. So how are we going to run MapReduce jobs from the command line – especially in this dynamically created environment?
Hadoop Streaming lets us run programs of our choice as the map and reduce steps, but first we need an additional Java archive to make it work. The following commands go in a file called stream. This first batch gets the JAR file and copies it to the right location:
jar_url=http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-streaming/1.0.0/hadoop-streaming-1.0.0.jar
lib=work/hadoop-1.0.0/share/hadoop/contrib/streaming
mkdir -p $lib
curl $jar_url > $lib/hadoop-streaming-1.0.0.jar
Once that’s in place, we get rid of the old output directory (not likely to be relevant for an OSG job), then
inputs=$1
output=$2
mapper=$3
reducer=$4
rm -rf $output
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/contrib/streaming/hadoop-streaming-1.0.0.jar \
  -input $inputs \
  -output $output \
  -mapper $mapper \
  -reducer $reducer \
  -jobconf mapred.reduce.tasks=2
We execute the MapReduce job with whatever’s passed in. Then we create a trivial program called script to be our map operator:
for x in $(seq 0 100); do
  echo pattern $x
done
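Before involving Hadoop at all, the Streaming contract can be exercised locally: the map side amounts to feeding an input split to the mapper on stdin and collecting its stdout. A quick emulation of one split, with the mapper recreated inline so the snippet stands alone (this mapper happens to ignore its input entirely):

```shell
# Recreate the trivial mapper.
cat > script <<'EOF'
for x in $(seq 0 100); do
  echo pattern $x
done
EOF

# One split's worth of mapper output, counted the way our reducer (wc) will:
sh script | wc    # 101 lines, 202 words, 1102 characters
```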
Next, we run stream like this:
./stream in out script wc
which produces wc’s counts (lines, words, characters) over three copies of our mapper’s 101 output lines:
[scox@br0:~/dev/saga-hadoop]$ cat out/part-00000
303     606    3609
because the mapper is executed once per file in our input directory (in):
in
|-- a
|-- b
`-- c
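The tripling can be reproduced locally too: Streaming launches a map task per input file (strictly, per split), and since our mapper ignores its input, three files mean three copies of its output. The emulation below is self-contained, using a hypothetical scratch directory and a copy of the mapper. It counts 3306 characters where the real job reported 3609; the difference, I believe, is Streaming’s tab key/value separator, one per each of the 303 reducer input lines:

```shell
# Hypothetical scratch setup mirroring the post's input directory.
mkdir -p in
for f in a b c; do : > in/$f; done

# A copy of the trivial mapper.
cat > mapper.sh <<'EOF'
for x in $(seq 0 100); do
  echo pattern $x
done
EOF

# Emulate one map task per input file; the mapper ignores its stdin.
for f in in/*; do sh mapper.sh < "$f"; done | wc   # 303 lines, 606 words
```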
We’ve just done a completely automated installation, configuration and execution of a small Hadoop cluster overlay on top of a PBS HPC cluster. We also automated the installation of Hadoop Streaming and ran a test MapReduce job using command line programs as map and reduce operators.
The overall stack looks like this:
It is probably clear that much more would be necessary to properly configure and deploy a useful cluster, particularly interfacing it properly to a PBS job context. That part will have to wait for another day. And again, if anyone reading this can comment on myHadoop, I’d appreciate the insight.