On the Open Science Grid Trail

The Open Science Grid (OSG) is a distributed, heterogeneous network of computing clusters. Its infrastructure and protocols allow members to submit high-throughput compute jobs for remote execution. All use is authenticated and authorized via a public key infrastructure (PKI) that associates jobs with a user and the virtual organization (VO) they belong to.

Build and use OSG cyberinfrastructure. There are posts on submitting and managing job workflows and on installing OSG components like Compute Elements (CEs), as well as infrastructure for the Engage VO.

High Throughput Parallel Computing investigates using emerging multi-core architectures on the OSG and Blueridge to run performance-intensive systems, especially molecular dynamics. This work is done with the OSG HTPC collaboration.

Data Grids play an increasingly important role in grid computing. We’ll be investigating the expanding role of iRODS as a system for transparent distribution of grid computing data artifacts.

Visualize the OSG using the OSG Map. Nodes in the graph are clickable, as is information in the left navigation pane. Use the search field to search for specific sites. Expand sub-clusters to see their attributes.

Cyberinfrastructure sustainability is a critical challenge. Work on this site makes extensive use of continuous integration as an approach to improve reliability, sustainability and ultimately, repeatability.


Duke Physics MPI at RENCI Blueridge with Grayson and Pegasus 4.0


Duke Physics will be running MPI jobs on RENCI's Blueridge. The model is new and expected to grow. It's been built with cluster-specific libraries, so it'll be executed on Blueridge for the foreseeable future.

At the same time, we’d like the workflow to be easy to visualize, maintain, extend and debug. So we’ll use Pegasus 4.0 and the Grayson modelling and debugging tools to start the workflow with a single, simple component.


For now, we’re keeping this very simple. A script is the only executable. There’s one input file and one output file, both of which are tar archives. The generic workflow module looks like this:

We’ll be launching this from RENCI’s Engage submit host, engage-submit3.renci.org, to the RENCI-Blueridge cluster. To do this, we’ll provide a context file that configures inputs, executables and outputs for this context. Here’s the configuration:

The first column imports blueridge.blueridge-basic from the standard library. This contains objects like MPI_8, which specify the correct Globus RSL for an 8-way job on Blueridge.

It also imports cpic3d-flow. This is the model shown above.

In the second column, we make cpic3d.sh an MPI_8 job. This means that when it's submitted, it will be submitted with the appropriate RSL.
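For illustration, the Globus RSL an MPI_8 object carries might look something like the fragment below. The exact attributes come from the standard library and aren't shown in this post, so treat these as assumptions rather than the actual definition:

```
globusrsl = (jobtype=mpi)(count=8)(host_count=1)
```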

We also tag it as a local executable. This object’s description contains this JSON to tag its origin:

 {
   "type"      : "abstract",
   "urlPrefix" : "gsiftp://${FQDN}/${appHome}/bin",
   "site"      : "${clusterId}"
 }

The output DAX will contain an executable tag pointing at the physical file name (PFN) of the executable to use. That PFN will use this URL:


and specify the site as RENCI-Blueridge.

The local-input and local objects use very similar syntax to designate the locations of the input and output files.

Using Grayson and Pegasus should make it easier to run, debug, grow and change this workflow. This may include pointing it at different clusters as the needs of the research team change, making it hierarchical, and so on.


Source Code:

The easiest way to get the workflow is to check it out from SVN:

 svn co https://renci-ci.svn.sourceforge.net/svnroot/renci-ci/trunk/duke/cpic3d


Set up the environment:

cd duke/cpic3d
source ./setup.sh

As with all OSG job submissions, make sure you have a valid grid proxy. If you don't have one, get one with:

voms-proxy-init --voms Engage --valid 24:00

Then execute the workflow as follows:

./submit blueridge --execute

This will generate the Pegasus artifacts for the workflow and submit it. It will also launch the Pegasus command line monitoring console to track the workflow.


The pegasus-status console will show which jobs have completed and their status.

All workflow outputs are available in work/outputs.



The Duke Physics community will want to edit bin/cpic3d.sh. Currently, it contains:


set -x


echo $app/bin/cpic3dmpi.mvapich2_gnu-1.6

tar cvzf output.tar.gz $app/run/*.dat

exit 0

So, to point at the most recent executable, it will be necessary to change the $app variable.

This file is staged to the cluster with each execution of the workflow so it offers a place to make things a bit more flexible.


See work/outputs for the output file.

Monitoring and Debugging

Launching the job also starts monitoring the workflow with a command like this:

pegasus-status -l /home/scox/dev/cpic3d/daxen-blueridge/work/scox/pegasus/cpic3d-flow/20120625T094503-0400

The directory specified is the unique run directory for this execution.

That directory can also be used with the pegasus-analyzer command as follows:

pegasus-analyzer -d <directory>

to provide a summary of the workflow.

For now, more detailed debugging information is available in the files in the run directory. As the project moves on, we’ll explore additional tools for debugging workflows.


SAGA Hadoop


Hadoop has become a fixture where big data is concerned, but it has been difficult to use in HPC and HTC cluster environments. This is unfortunate, as an increasing number of new algorithms assume Hadoop is an option. I tried SAGA-Hadoop first, but it should be noted that myHadoop from U. of Indiana sought to remedy this a few years ago. If someone has experience with both or would like to comment on the differences, that would be appreciated.

SAGA Hadoop

SAGA-Hadoop installs, configures and executes Hadoop on clusters running batch schedulers for which SAGA has adapters. At least it’s headed in that direction.

SAGA is the Simple API for Grid Applications, an Open Grid Forum standard (GFD.90) for interfacing with diverse cluster batch scheduling systems. It is a large and complex standard, so we'll leave it at that for now. For our purposes, suffice it to say that it works with PBS.

The Bliss project provides Python bindings for SAGA. Like many Python APIs, it takes a minimalist approach, not covering the entire standard and demonstrating a strong preference for simplicity and brevity.

As the helpful introductory blog post above (SAGA-Hadoop) describes, it runs Hadoop.

But Can this be Interesting on the OSG?

To make this plausible for the OSG:

      • We have to be able to automate the process entirely
      • It would be really good if there were a practical way to use command line tools as the map and reduce steps in a MapReduce computation.

So this is the story of finding those things out.

Automating Installation

The RENCI Blueridge cluster runs the PBS job manager. I scripted the installation; nothing about it is pretty yet, except that it demonstrates automated setup of Hadoop on an HPC cluster.

There’s Python involved so my first step was to set up a virtualenv to manage our project’s dependencies:

wget --timestamping \
python virtualenv.py venv
source venv/bin/activate
pip install bliss uuid

uuid is a dependency that, for whatever reason, needed to be explicitly named to pip.

The uuid module uses ifconfig. Putting ifconfig in the path changed nothing, so we force the issue by editing the module file. Again, it's in our virtualenv, so the copy is entirely ours:

sed --in-place=.orig                       \
     s,\'ifconfig\',\'/sbin/ifconfig\',     \

Make sure JAVA_HOME is set.

Then, after unzipping Hadoop, we

      • Get the source code for SAGA-Hadoop
      • Edit it to specify the login node for our cluster
      • Edit the bootstrap script to alter the Hadoop configs after installation. It
        • Deletes an unnecessary os.makedirs() call
        • Corrects the makedirs for log_dir to be for self.job_log_dir
        • Uncomments the configuration of the Hadoop data node
        • Alters the network interface to the locally correct eth0
        • Sets JAVA_HOME
        • Sets HADOOP_HEAP_SIZE
svn co https://svn.cct.lsu.edu/repos/saga-projects/applications/SAGAHadoop/saga-hadoop
sed --in-place=.orig \
    -e "s,india,br0," saga-hadoop/launcher.py
sed --in-place=.orig \
    -e "s,os.makedirs(job_dir),," \
    -e "s,os.makedirs(log_dir),os.makedirs(self.job_log_dir)," \
    -e "s,\<!--,," \
    -e "s,\-\-\>,," \
    -e "s,eth1,eth0," \
    -e "s,tar \-xzf hadoop.tar.gz,tar -xzf hadoop.tar.gz; echo export JAVA_HOME=$JAVA_HOME >> $HADOOP_ENV; echo export HADOOP_HEAP_SIZE=2000 >> $HADOOP_ENV; cat $HADOOP_ENV," \

This is all saved as a script called build.

Running it is relatively uninteresting. It installs the virtualenv, Python dependencies and edits the config files as you’d expect.

Running Hadoop

We save these commands to a script called start:

source venv/bin/activate

Running it logs plenty of Hadoop information to the console and starts the server.

Command Line MapReduce with Hadoop Streaming

The command line is the lingua franca of the OSG. There’s nothing finer for figuring out a problem quickly than the command line. So how are we going to run MapReduce jobs from the command line – especially in this dynamically created environment?

Hadoop Streaming lets us run programs of our choice as the map and reduce steps. But first we need an additional Java archive to make it work. The following commands go in a file called stream. This first batch gets the JAR file and copies it to the right location:

lib=work/hadoop-1.0.0/share/hadoop/contrib/streaming
mkdir -p $lib
curl $jar_url  > $lib/hadoop-streaming-1.0.0.jar

Once that’s in place, we get rid of the old output directory (not likely to be relevant for an OSG job), then


rm -rf out

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/contrib/streaming/hadoop-streaming-1.0.0.jar \
     -input $inputs \
     -output $output \
     -mapper $mapper \
     -reducer $reducer \
     -jobconf mapred.reduce.tasks=2
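The variables above ($inputs, $output, $mapper, $reducer) aren't defined in the excerpt. Presumably stream maps its positional arguments, as in this sketch (the variable names are assumptions inferred from the invocation `./stream in out script wc`):

```shell
# Hypothetical argument handling at the top of ./stream,
# inferred from the invocation: ./stream in out script wc
inputs=$1
output=$2
mapper=$3
reducer=$4
```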

We execute the map reduce with whatever’s passed in. Then, we create a trivial program called script to be our map operator:

for x in $(seq 0 100); do
    echo pattern $x
done

Next, we run stream like this:

./stream in out script wc

This produces the word count of our 101 output lines, times three,

 [scox@br0:~/dev/saga-hadoop]$ cat out/part-00000
     303     606    3609

because the stream is executed once per file in our input directory (in):

 |-- a
 |-- b
 `-- c
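Reproducing that layout is trivial. Since the map operator (script) ignores its input and generates its own lines, the input files only need to exist:

```shell
# Create the input directory; file contents are irrelevant because
# the mapper generates its output regardless of input.
mkdir -p in
touch in/a in/b in/c
```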


We’ve just done a completely automated install, configuration and execution of a small Hadoop cluster overlay on top of a PBS HPC cluster. We also saw automation for installing Hadoop Streaming and running a test MapReduce job using command line programs as the map and reduce operators.

The overall stack looks like this:

It is probably clear that much more would be necessary to properly configure and deploy a useful cluster, particularly interfacing it properly to a PBS job context. That part will have to wait for another day. And again, if anyone reading this can comment on myHadoop, I’d appreciate the insight.

Posted in Compute Grids, Data Grids, Hadoop, High Throughput Computing (HTC), High Throughput Parallel Computing (HTPC)

NAMD with PBS and Infiniband on NERSC Dirac


NAMD simulates molecular motion, especially of large molecules, so it’s often used to simulate molecular docking problems. One particularly interesting class of docking problem is the interaction of protein molecules with other molecules such as the cell membrane. The enormous number of atoms involved in these simulations limits what we’re able to learn about how proteins interact with and shape their environments, because more atoms require more computing power. So we’re investigating using GPU-accelerated nodes in a shared memory cluster to speed up simulation time.

This post describes running NAMD in a multi-node configuration on NERSC’s Dirac to determine whether we want to build out a Pegasus workflow executing in this mode through the OSG compute element. The process is, as usual with MPI codes using cluster interconnects, highly cluster-specific. The next step is to determine whether it’s worth it and what our alternatives are.


If you’re having a hard time running NAMD in a PBS environment over an Infiniband interconnect, you are not alone. The NAMD release notes come right to the point:

“Writing batch job scripts to run charmrun in a queueing system can be challenging.”

Several links, in addition to the release notes cited above, provided useful insights.

And without further delay, here’s the approach that worked on Dirac. Mileage on your cluster may vary.

set -x
set -e

# build a node list file based on the PBS
# environment in a form suitable for NAMD/charmrun

echo group main > $nodefile
nodes=$( cat $PBS_NODEFILE )
for node in $nodes; do
   echo host $node >> $nodefile
done

# find the cluster's mpiexec
MPIEXEC=$(which mpiexec)

# Tell charmrun to use all the available nodes, the nodelist built  above and the cluster's MPI.
CHARMARGS="+p32 ++nodelist $nodefile"
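For reference, $PBS_NODEFILE lists each allocated node once per process slot, so the generated nodelist repeats hosts. For a two-node allocation with two processes per node it would look something like this (hostnames illustrative):

```
group main
host dirac47
host dirac47
host dirac48
host dirac48
```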

As an additional wrinkle, we want to run the GPU accelerated version. That’s why we use the +idlepoll argument to NAMD.

After setting NAMD_HOME, the command to execute NAMD is:

${NAMD_HOME}/charmrun \
${CHARMARGS} ++mpiexec ++remote-shell \
${MPIEXEC} ${NAMD_HOME}/namd2 +idlepoll <input_file>

The beginning of NAMD’s output looks like this:

Info: 1 NAMD 2.8 Linux-x86_64-ibverbs-CUDA 16 dirac48 stevecox
Info: Running on 16 processors, 16 nodes, 2 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.025701 s
Pe 5 sharing CUDA device 0 first 0 next 6
Pe 5 physical rank 5 binding to CUDA device 0 on dirac48: 'Tesla C1060' Mem: 4095MB Rev: 1.3
Pe 10 sharing CUDA device 0 first 8 next 11
Pe 10 physical rank 2 binding to CUDA device 0 on dirac47: 'Tesla C1060' Mem: 4095MB Rev: 1.3
Pe 8 sharing CUDA device 0 first 8 next 9
Pe 8 physical rank 0 binding to CUDA device 0 on dirac47: 'Tesla C1060' Mem: 4095MB Rev: 1.3
Pe 2 sharing CUDA device 0 first 0 next 3
Did not find +devices i,j,k,... argument, using all
Pe 2 physical rank 2 binding to CUDA device 0 on dirac48: 'Tesla C1060' Mem: 4095MB Rev: 1.3

Of particular importance, note that there is a pre-built executable specific to ibverbs-CUDA; that is, it works with Infiniband-connected clusters with CUDA-accelerated nodes.

These are the parameters of the dirac_reg queue:

[stevecox@cvrsvc01 namd]$ qstat -Qf dirac_reg
Queue: dirac_reg
 queue_type = Execution
 Priority = 10
 max_user_queuable = 500
 total_jobs = 39
 state_count = Transit:0 Queued:4 Held:27 Waiting:0 Running:8 Exiting:0
 acl_user_enable = False
 resources_max.nodect = 12
 resources_max.walltime = 06:00:00
 resources_min.nodect = 1
 resources_default.walltime = 00:05:00
 mtime = 1323823829
 resources_assigned.nodect = 34
 max_user_run = 2
 enabled = True
 started = True

So to test jobs, I ran qsub like this:

qsub -I -q dirac_reg -l walltime=06:00:00 -l nodes=4:ppn=8

The -I parameter tells qsub to start an interactive job. The walltime parameter overrides the very low default walltime. Finally, nodes tells PBS how many cluster nodes to use and ppn specifies the processes per node to start.

After debugging, I ran the script like this:

qsub -q dirac_reg -l walltime=06:00:00 -l nodes=4:ppn=8 ./callnamd


I did three runs with 2, 4, and 8 nodes. The interesting performance number for a NAMD run is days/ns or days of computation time required per nanosecond of simulation.

[stevecox@cvrsvc01 ~]$ grep -i days dev/dukechem/osg/namd/run.* | sed -e "s,.txt,," -e "s,.*run.,,"
2way:Info: Initial time: 16 CPUs 0.0617085 s/step 0.357109 days/ns 91.0601 MB memory
2way:Info: Initial time: 16 CPUs 0.0613538 s/step 0.355057 days/ns 94.0546 MB memory
2way:Info: Initial time: 16 CPUs 0.0619225 s/step 0.358348 days/ns 94.7324 MB memory
2way:Info: Benchmark time: 16 CPUs 0.0620334 s/step 0.35899 days/ns 94.8284 MB memory
2way:Info: Benchmark time: 16 CPUs 0.0621472 s/step 0.359648 days/ns 95.09 MB memory
2way:Info: Benchmark time: 16 CPUs 0.0620733 s/step 0.359221 days/ns 95.162 MB memory
4way:Info: Initial time: 32 CPUs 0.0472537 s/step 0.273459 days/ns 83.8981 MB memory
4way:Info: Initial time: 32 CPUs 0.0470766 s/step 0.272434 days/ns 84.8605 MB memory
8way:Info: Initial time: 64 CPUs 0.0406125 s/step 0.235026 days/ns 81.0847 MB memory
8way:Info: Initial time: 64 CPUs 0.0406405 s/step 0.235188 days/ns 82.1035 MB memory
8way:Info: Initial time: 64 CPUs 0.0407004 s/step 0.235534 days/ns 82.2474 MB memory
8way:Info: Benchmark time: 64 CPUs 0.0407453 s/step 0.235794 days/ns 82.3482 MB memory
8way:Info: Benchmark time: 64 CPUs 0.040858 s/step 0.236447 days/ns 82.3975 MB memory
8way:Info: Benchmark time: 64 CPUs 0.0406536 s/step 0.235264 days/ns 82.4038 MB memory
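The benchmark numbers above imply modest scaling: going from 2-way (16 CPUs) to 8-way (64 CPUs), a 4x increase in CPUs, yields only about a 1.5x speedup. A quick check using the benchmark days/ns figures from the runs above:

```shell
# Ratio of 2-way to 8-way benchmark days/ns (figures from the grep output above).
awk 'BEGIN { printf "%.2fx\n", 0.35899 / 0.235794 }'
# prints 1.52x
```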

Here are some details of NERSC Dirac’s configuration:

Dirac is a 50 GPU node cluster connected with QDR IB.  Each GPU node also contains 2 Intel 5530 2.4 GHz, 8MB cache, 5.86GT/sec QPI Quad core Nehalem processors (8 cores per node) and 24GB DDR3-1066 Reg ECC memory.

  • 44 nodes: 1 NVIDIA Tesla C2050 (code-named Fermi) GPU with 3GB of memory and 448 parallel CUDA processor cores.
  • 4 nodes: 1 C1060 NVIDIA Tesla GPU with 4GB of memory and 240 parallel CUDA processor cores.
  • 1 node: 4 NVIDIA Tesla C2050 (Fermi) GPUs, each with 3GB of memory and 448 parallel CUDA processor cores.
  • 1 node: 4 C1060 NVIDIA Tesla GPUs, each with 4GB of memory and 240 parallel CUDA processor cores.

Here are results from earlier runs on a cluster with far fewer GPUs but a configuration in which accelerated nodes contain four Nvidia Teslas (like one of the Dirac nodes):

  • 4CPU: 0.998798 days/ns
  • 8CPU: 0.565848 days/ns
  • And with the production sample at 8CPU: 0.288802 days/ns
While these findings are preliminary, indications are that having four GPUs on a single node makes a substantial performance difference.
Posted in Engage VO, GPGPU, High Throughput Computing (HTC), High Throughput Parallel Computing (HTPC), multicore, NAMD, OSG, Uncategorized

Virtual Machines for OSG


The ability to run a virtual machine with a self-contained computing environment has major advantages. Users can

    • Choose the operating system that’s best for the application
    • Execute programs that require elevated privileges
    • Install any software they need
    • Dynamically configure machine attributes like the number of cores to suit the host environment

The Engage VO is beginning to see researchers new to OSG whose default mode of operation is to spin up a VM on EC2. They quickly get used to having complete control of the computing environment.


This capability has been explored for the Open Science Grid before. Clemson built Kestrel, which supports KVM-based virtualization with an XMPP communication architecture. STAR then used Kestrel with great success. Clemson now also provides OneCloud, based on OpenNebula.

There’s also been work by Brian Bockelman, Derek Weitzel and others to configure virtual machines running Condor to join the submit host’s pool. Infrastructure background for that work and lots of great information is available at the team’s blog.

Recently, I’ve had new Engage users who are heavy users of virtualization. As mentioned before, they tend to assume control over the environment. This background can make the need to specially prepare executables for the OSG by static compilation and other packaging seem onerous. Many Engage users, it should be added, have input and output file sizes in the low number of gigabytes and are not familiar with High Throughput Computing or a command line approach to virtualization.

They asked if it was possible to run virtual machines on the OSG so I set out to look for an approach that would allow researchers to

      • Create virtual machines on their desktops using simple graphical tools
      • Deploy virtual machines onto the OSG
      • Transfer input files to and from the virtual machine
      • Avoid complex interactions with HTC plumbing like configuring X.509 certs, Condor, etc.

Virtualization on RENCI-Blueberry

Since virtualization is not a currently supported technology on the OSG, step one is to create a small area in which we change that. RENCI’s Blueberry is a new cluster, made of older machines, that we’ve recently brought online. It’s a ROCKS, CentOS 5.7, Torque/PBS cluster with a small number of virtualization-capable nodes.

Here’s an overview of changes we made to the cluster:

First, we installed these packages on the virtualization capable nodes:

        • Libvirt: A library of low level capabilities supporting virtualization.
        • QEMU: Virtual machine emulation layer
        • KVM: A virtualization kernel module
        • XMLStarlet: A command line XSLT engine

We configured QEMU to allow the engage user to execute virsh.

Then we created a Torque/PBS queue called virt, grouping the upgraded machines.

A new GlideinWMS group was added on the Engage VO frontend. Jobs deploying virtual machines are decorated with the following modifications:
        • Job Requirements: && (CAN_RUN_VIRTUAL_MACHINE == TRUE)
        • Job Attributes: +RequiresVirtualization=TRUE
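In a Condor submit file, those decorations amount to something like the following sketch (the pre-existing requirements expression is whatever the job already declares):

```
requirements            = (...existing requirements...) && (CAN_RUN_VIRTUAL_MACHINE == TRUE)
+RequiresVirtualization = TRUE
```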

A new resource was added to the GlideinWMS factory with RSL pointing at the virt queue on RENCI-Blueberry.

Creating and Running a VM

Virtual machines were created using virt-manager, the Virtual Machine Manager. It’s a graphical application providing a wizard-like interface for creating and managing VMs.

We used the command line virsh tool to export an XML description of the running virtual machine. Then the XML description and the disk image (the large file containing all of the VMs data) were moved to the Engage submit host.

The OSG job was designed to

      • Download the virtual machine’s XML description
      • Determine the number of CPUs on the machine
      • Modify the XML description to specify
        • The appropriate number of CPUs
        • The correct location for the VM image file
      • Download the virtual machine
      • Execute the virtual machine
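The steps above can be sketched as a shell wrapper. The URLs, file names and XPath expressions below are assumptions (the post only tells us xmlstarlet and virsh are available on the virt nodes), so the cluster-specific pieces are left commented out:

```shell
#!/bin/sh
# Sketch of the VM-deploying OSG job (names and URLs are hypothetical).
set -e

# Download the VM's XML description and disk image (hypothetical URLs).
# wget -q http://submit.example.org/vm/guest.xml
# wget -q http://submit.example.org/vm/guest.img

# Determine the number of CPUs on the worker node.
cpus=$(grep -c '^processor' /proc/cpuinfo)
echo "worker has $cpus CPUs"

# Patch the XML description: CPU count and the local path of the image.
# xmlstarlet ed --inplace -u '/domain/vcpu' -v "$cpus" guest.xml
# xmlstarlet ed --inplace -u '/domain/devices/disk/source/@file' \
#     -v "$PWD/guest.img" guest.xml

# Boot the guest.
# virsh create guest.xml
```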

This works. Jobs configured to run in the GlideinWMS virt group on the Engage submit node map to glideins on RENCI-Blueberry. There, the jobs download the XML config and the image, make the needed edits and spawn the virtual machine.

Getting Work to the Virtual Machine

Now, if you’ve tried to do this kind of thing before, you realize this is where things get tricky.

When the virtual machine launches, it has no idea what to do. This is part of the reason that some previous approaches put Condor on the machine. That way, it can join an existing Condor pool and has all the good things that Condor brings us in terms of file transfer, matching and so on. But getting credentials into the virtual machine securely to allow it to join the Engage pool is tricky. If you know how to do that, please leave a comment.

Alternatives … and OS Versions

Now, in principle, there are two other ways to do this that would work fine. If we can get files onto and off of the machine, it would be OK to transfer them into and out of the worker node the old-fashioned way: globus-url-copy. So here are two other mechanisms for file exchange between a host and a guest:

Shared Host/Guest Filesystem: More recent versions of Libvirt/QEMU/KVM support sharing filesystems between the host and the guest. In this model, the guest’s XML description can specify a directory on the host that should be mounted within the guest. But, as I mentioned, the RENCI-Blueberry cluster runs CentOS 5.7, so only a significantly older version of the virtualization stack is supported. We discussed upgrading to a newer version, but that would prevent this solution from being generally reproducible on the OSG.

Guestfish: Next, there’s libguestfs and the associated interactive shell, guestfish. Guestfish lets you mount a disk image in user space; that is, there’s no need for root privileges. It also has convenient wrapper scripts for copying a file into and out of an image. But, again, it requires a version of CentOS significantly greater than 5.7.

From this angle, it looks like VMs on OSG could be an everyday occurrence if it were not for very low OS version numbers.

File Sharing REST API

Before giving the approach up for dead, I decided to try something off the beaten path.

Beanstalk: I installed Beanstalk on the submit node. Beanstalk is a very simple message queue. You can put messages on the queue and take them off. Queues can be named; Beanstalk, oddly enough, calls them tubes. Beanstalk has no notion of authentication, which isn’t great.

Beanstalkc: This is the Python client for Beanstalk.

Box.net: One of many file sharing sites with a REST API.


A command line Box.net authentication script does the token negotiation with Box.net using wget and curl.

Then, Box.net URLs are published into the Engage event queue.

When virtual machines run, they install and run the boot script.

It installs the Beanstalk client and reads a single item from the event queue; the item’s URL is then downloaded from Box.net and processed. Queues are appropriately named so that different users and jobs never collide.

Finally, it converts the download URL to an upload URL and publishes the results of the run via the file sharing API.

So at the end of my run of three VMs on RENCI-Blueberry, there were three files waiting for me at Box.net.


I invite comments on how others have secured communication to a VM on OSG. I’d love to hear.

In particular, as mentioned above, I’d love to hear how others have gotten X.509 credentials onto a VM in this environment.

Anyone else running VMs on the OSG?


Grayson – Workflow for the Hybrid Grid


In a recent post about running NAMD on NERSC’s Dirac GPU cluster, I mentioned there would be a discussion of how we’re managing hierarchical workflows. To do that, we need to introduce Grayson.

Grayson is a workflow design and execution environment for the OSG and XSEDE. It applies principles of model-driven architecture to high throughput computing in response to challenges we’ve seen researchers encounter in the design, execution, troubleshooting and reuse of workflows. From a design perspective, Grayson borrows ideas and approaches from pioneering tools like Kepler, Taverna and Triana while aggressively reusing HTC infrastructure including Pegasus, GlideinWMS, Condor and Globus to create an environment that addresses the user community and challenges the OSG Engage Virtual Organization serves.


For our purposes, a workflow is the coordination plan for a set of computational jobs in a high-throughput computing environment used to answer a scientific question. Typically, workflows choreograph the interactions of a large number of programs and their input and output files. The programs are often written in different languages and may require substantial infrastructure support to facilitate their execution at a remote host. For example, we stage in MPI support executables and libraries to run many programs. Workflows in a high-throughput computing setting transfer inputs and executables to the compute host, execute the job and transfer outputs back to the submit host.


In practice, workflows become enormously complex to define, extend, comprehend and target to emerging architectures. The Engage Virtual Organization has been working for over four years to broaden the reach of the Open Science Grid to non-High Energy Physics researchers. In that time we’ve seen a series of challenges researchers face in using the infrastructure to its full potential:

  • Hardware Architecture Heterogeneity: A single workflow may have components requiring widely divergent hardware architectures. This is increasingly true as emerging parallel architectures like GPUs and multicore CPUs become pervasive in HPC and HTPC computing. Some jobs will require single-core CPUs, others will require multicore CPUs and others will require GPUs. It is desirable, from a performance perspective, to have highly granular control over the selection of hardware for the execution of a job. But it is difficult and undesirable from configuration management and reuse perspectives to have what is conceptually a single and highly abstract process cluttered with and tightly coupled to this level of detail. Further, when the same workflow can usefully be targeted at multiple heterogeneous hardware configurations, the problem is no longer practical to manage manually and requires tools and approaches to facilitate this increasingly vital usage pattern.
  • Conceptual Complexity: In other sub-fields of software development, design patterns have played a key role in allowing groups to work on large, complex software systems by creating and integrating separately developed and tested components. These components, or modules, enable teams to decompose large problems. Design patterns also contribute key semantic models which serve as common conceptual tools as well as a framework for others to understand complex systems. Examples include the Chain of Responsibility pattern, Inversion of Control and Aspect Oriented Programming. Tools for the conceptual composition of large workflows are sorely needed. Currently, teams of scientists have few tools that support a practical model of component oriented development or reuse. Those tools that do exist to support these capabilities are rarely used in the context of high throughput computing with the largest infrastructures such as the Open Science Grid (OSG).
  • Environmental Complexity: In general, the software stack required to use and debug high throughput computing workflows is enormously complex. The skills, patterns and habits required to effectively troubleshoot workflows are also very different from those of most researchers. Tools are needed that open large scale computational infrastructures to practical usage by non-expert research communities. Researchers shouldn’t need to develop expertise in Condor, GlideinWMS, Pegasus and Globus to design, run and debug a workflow.
  • Reusability: Few existing workflow systems focus on a practical pattern of component reusability. Software development in general has thrived by making component orientation the basis of all development. Separate compilation is taken for granted in all major programming languages. That is, it must be possible to compose a program from separate files. The files must be able to reference other files and to reference symbols and constructs in those separately compiled units. It’s not hard to see how the absence of this capability would cripple productivity. Existing approaches to developing high-throughput computing workflows emphasize the creation of large scripts with little thought to modularity or reuse. They are often tightly coupled to a cluster, batch scheduling system or other infrastructure component. Approaches are needed that encourage researchers to isolate the smallest reusable modules of their workflows and define these in a way that makes reuse within and beyond the originating team the most natural approach.
  • Execution Control and Analysis: Once a workflow is launched and executing it can spawn thousands of jobs. These jobs, especially during development, will fail. The process of discovering the specific cause of the failure of a job within a large workflow is tedious and error prone in proportion to the complexity of the workflow. As workflows become more complex – for example hierarchical workflows containing serial and parallel steps on multiple hardware architectures – it becomes prohibitively expensive to debug and analyze these systems. Large directory hierarchies are manually navigated to find the appropriate log file from which error information must be extracted. Tools are needed to provide visualization of the execution, to mine data to present answers regarding problems directly to the researcher, provide visual analytic tools and provide rapid navigation for the researcher to artifacts enabling in-depth research.

Submit Host

The submit host is a machine managed by an OSG virtual organization (VO) which lets VO members submit computation workloads to OSG resources. In general a submit host – and this is true of the Engage Submit Host – will run:

    • Condor: The Condor scheduler coordinates submission of jobs via Globus to compute sites. Condor also provides the DAGMAN workflow planner which lets users model job precedence and retry semantics.
    • Globus: Globus is used as the job submission protocol between Condor and OSG compute elements (CE).
    • GlideinWMS: GlideinWMS is the OSG standard pilot-based job submission engine. It submits glideins to compute nodes which then consume user job workloads efficiently.
    • Pegasus WMS: Pegasus provides a layer of abstraction above grid services allowing researchers to model workflows as abstract processes independent of underlying computational infrastructure.

Usage Models


In conventional use, researchers log in to the submit host via SSH and execute shell scripts to submit jobs. These scripts

    • Ensure the user has a valid grid proxy, prompting to create one if not
    • Generate a DAGMAN submit file specifying the workflow. This is shell code doing variable replacement in static text to generate the file.
    • Generate Condor submit files for each job in the DAG. This is more shell script generating text.
    • Place input, output and log files at known locations
    • Specify wrapper scripts to be run at the compute node which, in turn, will
      • Stage in executables
      • Stage in input files
      • Manually manage and track program execution status
      • Stage output files back to the submit host
    • Submit the DAG to Condor
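
The shell scripting pattern above amounts to variable substitution into static text. Here is a minimal sketch in Python of the same idea – the file names and template fields are illustrative, not taken from any actual Engage script:

```python
# Sketch of the text-templating step conventional submit scripts perform
# with sed or heredocs. Template fields and file names are hypothetical.
from string import Template

CONDOR_SUBMIT = Template("""\
universe   = grid
executable = $wrapper
arguments  = $args
output     = logs/$jobid.out
error      = logs/$jobid.err
log        = logs/$jobid.log
queue
""")

def write_submit_file(jobid, wrapper, args):
    """Substitute values into the static template and write <jobid>.sub."""
    path = f"{jobid}.sub"
    with open(path, "w") as f:
        f.write(CONDOR_SUBMIT.substitute(jobid=jobid, wrapper=wrapper, args=args))
    return path

def write_dag(jobs, deps):
    """jobs: {jobid: submit_file}; deps: [(parent, child), ...]"""
    lines = [f"JOB {j} {sub}" for j, sub in jobs.items()]
    lines += [f"PARENT {p} CHILD {c}" for p, c in deps]
    with open("workflow.dag", "w") as f:
        f.write("\n".join(lines) + "\n")
```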

Workflow progress can be followed by tailing the log file for the Condor DAG.

If an error occurs during execution, the user searches the log files for causes.


Pegasus reads an XML specification of the workflow and generates a number of artifacts to do the majority of the steps outlined in the approach above. Specifically, it generates a DAG, submits it to Condor, stages inputs and executables to compute nodes and stages outputs back to the submit host. It also provides tools for monitoring the workflow as well as some additional command line support for determining the reasons for a workflow’s failures. Generating Pegasus workflows – especially complex hierarchical ones – involves writing a substantial amount of code in Python, Java or Perl to create the DAX and the site, replica and transformation catalogs. As such, the user must have substantial knowledge of the programming language and each of these formats.
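
For a sense of what such generator code produces, here is a minimal sketch using only the standard library to emit DAX-shaped XML. A real workflow would use the Pegasus DAX API; the job attributes below are illustrative:

```python
# Stdlib-only sketch of DAX generation: build a <job> element with an
# argument string and input/output <uses> declarations.
import xml.etree.ElementTree as ET

def make_job(jobid, namespace, name, args, inputs, outputs):
    """Build one DAX job element with its file dependencies."""
    job = ET.Element("job", id=jobid, namespace=namespace, name=name)
    ET.SubElement(job, "argument").text = args
    for f in inputs:
        ET.SubElement(job, "uses", name=f, link="input")
    for f in outputs:
        ET.SubElement(job, "uses", name=f, link="output")
    return job

adag = ET.Element("adag", name="namd-flow")
adag.append(make_job("1n3", "namd-flow.0", "namd",
                     "--slice=0 --runLength=100",
                     ["namd.tar.gz"], ["out-0.tar.gz"]))
dax = ET.tostring(adag, encoding="unicode")
```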

Grayson Architecture

At the highest level, Grayson consists of three components:

    • Editor: The yEd graph editor from yWorks is used for drawing and annotating workflow graphs.
    • Compiler: The Grayson compiler enables a range of features to facilitate the component oriented development of complex hierarchical workflows. Some of these include:
      • Separate compilation and reference
      • Functional Programming (FP)
      • Aspect Oriented Programming (AOP)
      • Generation of all code required for a Pegasus workflow (DAX & catalogs)
    • Execution Environment: The execution environment is a web based graphical interface using HTML5 capabilities including asynchronous event notification and the Scalable Vector Graphics (SVG) protocol to render a visual model of an executing workflow enhanced with live or historical execution events.

Architecturally, Grayson is a distributed system using AMQP and Node.js for asynchronous event distribution in addition to Apache with mod_wsgi and Django for conventional web tier services.

It interfaces to Pegasus by generating Pegasus DAX workflows as well as site, replica and transformation catalogs. It submits these to Pegasus for execution. It then monitors execution and reports events to the user interface via the AMQP/Node.js event pipeline.

On the client, Grayson uses jQuery and HTML5 features through RaphaelJS and Socket.IO to provide a rich and highly interactive user experience.

Figure 1 : Grayson Architecture


Users define workflow components using yEd. Each component is drawn as a graph and annotated with JSON (JavaScript Object Notation). Notations tell Grayson the type of object each graph node represents. External models can be imported and their elements referenced from other graphs.

Figure 2 : Grayson Editor – yEd

Models are packaged using the Grayson compiler. The compiler detects the locations of all referenced models and archives everything needed by the execution environment into a compressed archive.  The user uploads the archive to the execution environment which compiles the model into artifacts appropriate for Pegasus and submits the workflow. From there, execution is controlled by Condor and DAGMAN as in a normal Pegasus workflow execution.
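
The packaging step can be pictured as collecting a model and everything it references into one archive. A hypothetical sketch – the file names and the flat reference list are simplifications of what the compiler actually resolves:

```python
# Illustrative packaging sketch: archive a model plus the models it
# references. Reference resolution is assumed to have already happened.
import os
import tarfile

def package(model_path, references):
    """Bundle a model and its referenced models into one .tar.gz."""
    out = os.path.splitext(model_path)[0] + ".tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        tar.add(model_path, arcname=os.path.basename(model_path))
        for ref in references:
            tar.add(ref, arcname=os.path.basename(ref))
    return out
```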

Events resulting from the execution are detected by the Celery asynchronous task queue and transmitted to subscribed clients first as AMQP messages then, by the Node.js Socket.IO pipeline to the WebSocket client. Event notifications are mapped to graphical elements to create a dynamic, visual execution monitoring and debugging experience. Job execution as well as file transfer status are all visually represented.
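
The final hop of that pipeline maps each event onto a graphical element. The sketch below shows the idea only – it is not Grayson’s actual message schema:

```python
# Hypothetical event-to-graphics mapping: a workflow event becomes a
# state change on the corresponding graph node in the client.
import json

STATUS_COLOR = {"submit": "yellow", "execute": "blue",
                "success": "green", "failure": "red"}

def to_ui_update(event):
    """Translate a workflow event into a visual update for one graph node."""
    return {"node": event["job"], "fill": STATUS_COLOR[event["type"]]}

msg = json.dumps(to_ui_update({"job": "namd", "type": "success"}))
```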

Figure 3 : Grayson Execution Environment – Lambda Uber

Much of the power and significance of Grayson stems from its use of functional and aspect oriented programming to allow researchers to create and effectively execute complex hierarchical workflows. In the case of the example below, the lambda-flow sub-workflow has been targeted by a chain of map operators. The map operators recursively create instances of their target workflow according to the operator’s annotations. As a result, in the executing instance below, there are 21 total instances of lambda-flow.
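
The multiplicity arithmetic is simple: chained map operators multiply instance counts. A small sketch, where the fanouts (7 and 3) are hypothetical values chosen only to reproduce a total of 21:

```python
# Illustrative expansion of a chain of map operators: each operator clones
# its target once per index, so instance counts multiply along the chain.
import itertools

def expand_instances(target, fanouts):
    """Enumerate the instance ids produced by chained map operators."""
    index_sets = [range(n) for n in fanouts]
    return [f"{target}-{'.'.join(map(str, idx))}"
            for idx in itertools.product(*index_sets)]

instances = expand_instances("lambda-flow", [7, 3])  # 7 * 3 = 21 instances
```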

The expression of this multiplicity in the model is succinct, manageable and informative. Similarly, the visual representation in the execution environment condenses the multiplicity into a manageable display. Users can see the grid of workflow instances by hovering over a sub-workflow, and the selected workflow is highlighted. As a result, display performance and visual complexity are not on the order of the actual size of the workflow but on the order of its generalized structure, which will typically be orders of magnitude smaller.

Figure 4 : Grayson Execution Environment – Lambda Uber – Hierarchical

Once a sub-workflow has been selected the user may select the workflow node to navigate to the tab containing the graph of that sub-workflow. All events pertaining to the selected instance will be rendered onto the graph so that the view is always contextually driven by the user’s selection.

Selecting a job in the workflow renders a dialog containing detailed information about the job’s execution including:

      • General information like the job’s execution host, duration and exit status
      • Standard error and output
      • The Pegasus workflow log
      • The Condor DAGMAN log the job was a part of

A detail view provides extensive information about the workflow’s execution.

Figure 5 : Grayson Execution Environment – Lambda Uber – Hierarchical Drill-down

The detail view renders a tabbed view of a variety of additional statistics of interest. One tab contains a tabular view of execution events such as job submission, execution and exit status. The transfer view renders a tabular view of each transfer event with assorted statistics. The monitor view shows the output of pegasus-status for the workflow. Finally, the files tab provides a hierarchical file browser targeted at the root of the workflow execution directory hierarchy.

Next Steps

Grayson is in alpha. Many things will likely change soon. To date, it has been used to compose a few moderately complex sample applications and one workflow for an Engage researcher. That workflow addresses interesting challenges of High Throughput Parallel Computing (HTPC) and hierarchical workflow in general which we’ll discuss more later.

Some major next steps involve migrating Grayson from Pegasus 3.0.2 to 3.1.0. The latest version adds improvements to Pegasus STAMPEDE which promise to greatly improve what is currently the weakest link in Grayson – the ability to rigorously and programmatically scan a workflow for events. STAMPEDE presents a Python interface (among others) to a relational database schema of events for a workflow. Migrating to this will eliminate ad hoc methods improvised to work against 3.0.2.
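
Since STAMPEDE stores events in a relational schema, event scanning reduces to queries. A hedged sketch against SQLite – the table and column names below are placeholders, not the real STAMPEDE schema:

```python
# Illustrative query over a relational event store. The job_instance
# table and its columns are hypothetical stand-ins for the real schema.
import sqlite3

def failed_jobs(db_path):
    """Return (job_id, exitcode) for every job that exited non-zero."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT job_id, exitcode FROM job_instance WHERE exitcode != 0"
        ).fetchall()
    finally:
        con.close()
```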

Significant progress has been made on the Continuous Integration environment for Grayson. A battery of automated tests ensures that the compiler generates expected outputs for a given set of inputs. The environment also uses a headless WebKit browser to execute a variety of automated user interface tests. In addition to testing, documentation, packaging and code coverage are all automated. Still, this area is key to quality and development scalability and will be a place we invest on an ongoing basis.

Another need is for a product management system. This will track features, existing and future, serving as a guide for development, quality assurance, documentation and support. Coming soon.

There’s much more to say about Grayson but this will serve as an introduction.

Posted in Uncategorized | Leave a comment

CAMx on OSG with UNC Environmental Science and Engineering

Dr. William Vizuete’s team at the UNC department of Environmental Science and Engineering evaluates and uses a number of air quality models which are, in turn, used by regulatory agencies to understand air quality issues. From Dr. Vizuete’s research page:

“Using high performance computers and three dimensional models to simulate the atmosphere, I am working to improve our understanding of the formation of atmospheric air pollution. These computer models improve our understanding of the extremely complex chemical and physical processes that occur in the atmosphere. A better understanding of the atmosphere gives us the knowledge to improve the tools and methods that policy makers use to make effective control strategies to clean the air above our dirtiest cities.”

Two models, CAMx by ENVIRON and CMAQ are heavily used.

We chose to start with CAMx because word on the street is that it’s easier to build than CMAQ. Well, it’s all relative.

Here’s the new automated build process for CAMx. It uses the Intel Fortran compiler version 11.1 and builds a statically linked x86_64 binary.

Also built in that process is a program called avgdif from the ENVIRON website. It is a standalone Fortran program that computes the average difference between the outputs of two runs, which allows us to validate execution of the CAMx test cases. It is slightly modified to allocate enough memory for this model (maxx and maxy are set to 120 near the top of the file). Finally, there’s a custom makefile to make it work with the Intel compiler.

CAMx has for several versions been OpenMP enabled. This is well suited for the OSG’s HTPC node concurrency model since the job can be configured to take advantage of all the CPUs on a node but does not require the additional complexity of a statically linked MPI launcher.

The new job is being tested now.  A GlideinWMS submit script has been set up to launch the job to various HTPC compute venues. The job itself will:

  • Fetch binaries from the RENCI continuous integration environment
  • Fetch test cases from ENVIRON’s website
  • Execute the test
  • Execute avgdif on the outputs
  • Return the results of avgdif
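
The steps above can be sketched as a compute-node wrapper. The URLs and file names below are placeholders, not the actual RENCI or ENVIRON locations:

```python
# Hedged sketch of the compute-node wrapper: fetch, unpack, run, compare.
# All URLs and file names are hypothetical.
import subprocess
import tarfile
import urllib.request

def fetch_and_unpack(url, dest):
    """Download an archive and unpack it in the working directory."""
    urllib.request.urlretrieve(url, dest)
    with tarfile.open(dest) as tar:
        tar.extractall()

def run(cmd):
    """Run a step, capturing its output for return to the submit host."""
    return subprocess.run(cmd, capture_output=True, text=True)

def main():
    fetch_and_unpack("https://ci.example.org/camx-bin.tar.gz", "camx.tar.gz")
    fetch_and_unpack("https://environ.example.org/testcase.tar.gz", "test.tar.gz")
    run(["./CAMx", "CAMx.in"])                      # execute the test case
    result = run(["./avgdif", "ref.avg", "run.avg"])  # compare against reference
    with open("avgdif.out", "w") as f:
        f.write(result.stdout)                      # returned to submit host
```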

Once this is done, the environmental engineering community will be able to give it a try once they’ve joined the Engage VO.

Posted in Uncategorized | Tagged | Leave a comment

Pegasus WMS and NERSC Dirac

The Engage VO has recently begun working with Duke Chemistry on a protein folding simulation that’s taking us to interesting places.

In particular, deep inside the National Energy Research Scientific Computing Center (NERSC) and, specifically, to their 50 node General Purpose Graphics Processor (GPGPU) cluster, Dirac. You know you’re having a good time when the cluster you’re using is at a cutting edge DOE research lab and they consider it experimental.

As noted in previous posts, performance for many molecular dynamics applications is greatly improved on GPUs. We’re using NAMD which provides pre-built binaries for GPUs. The binaries ship with dynamic libraries for CUDA, its only external dependency. As such, it’s practical for use on the Open Science Grid since it can be relocated to new clusters trivially.

NAMD, like Amber and many scientific codes, especially ones that simulate natural phenomena, supports restart. That is, output from one run can be used as input to another to continue the calculation. This is important since ultimately we’re modeling a biological system that doesn’t have a particular endpoint – it just keeps going. It also has implications for the shape of our workflow. In this case we have one computation activity – execute NAMD – which we’d like to repeat, feeding the input from one run into the next.
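
The restart pattern shapes the workflow into a chain where run i consumes the output of run i-1. A small sketch of how such a chain of job descriptions might be enumerated (the field and file names are illustrative, patterned on the DAX fragment later in this post):

```python
# Illustrative restart chain: each job's output archive becomes an extra
# input of the next job, continuing the simulation where it left off.
def restart_chain(n_runs):
    """Describe a linear chain of restartable NAMD runs."""
    jobs = []
    for i in range(n_runs):
        inputs = ["namd.tar.gz"]
        if i > 0:
            inputs.append(f"out-{i-1}.tar.gz")  # restart from previous run
        jobs.append({"id": f"slice-{i}",
                     "inputs": inputs,
                     "output": f"out-{i}.tar.gz"})
    return jobs
```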

We used Pegasus WMS as the workflow execution engine. There are lots of reasons for this but here are some of the basics:

  • It’s state of the art for creating a useful layer of workflow abstraction over the OSG’s runtime infrastructure.
  • The support list is incredibly helpful.
  • It allows me to use my personal OSG/DOE X.509 user credential which is required by NERSC policy.

Each compute job stages in

  • The GPU accelerated version of NAMD
  • The input data files
  • A statically linked version of MPICH2
  • The output file from the previous run, if one exists

Output is written from this job, archived and transferred back to the submit host.

One of the interesting aspects of this is how to generate workflows with these characteristics. Each compute job is somewhat logically self contained. We’d like to be able to repeat them in chains and otherwise organize them hierarchically – for example – to do parallel simulations, each of which consists of a chain of compute jobs. More on this in a later post.

Another challenge, from a workflow maintenance and configuration perspective is how to get very hardware specific in our job specifications while keeping the workflow abstract. Here’s the job specification for running NAMD on Dirac’s Tesla GPUs.

<!-- part 3: Definition of all jobs/dags/daxes (at least one) -->
<job id="1n3" namespace="namd-flow.0" name="namd">
<argument>--model=ternarycomplex119819 --config=ternarycomplex_popcwimineq-05 --slice=0 --namdType=CUDA --runLength=100</argument>
<profile namespace="globus" key="queue">dirac_reg</profile>
<profile namespace="globus" key="xcount">8:tesla</profile>
<profile namespace="globus" key="maxWallTime">240</profile>
<profile namespace="globus" key="jobType">mpi</profile>
<profile namespace="globus" key="host_xcount">1</profile>
<profile namespace="condor" key="x509userproxy">/home/scox/dev/grayson/var/proxy/x509_proxy_scox</profile>
<uses name="mpich2-static-1.1.1p1.tar.gz" link="input"/>
<uses name="cpuinfo" link="input"/>
<uses name="beratan-0.tar.gz" link="input"/>
<uses name="namd.tar.gz" link="input"/>
<uses name="out-0.tar.gz" link="output"/>
</job>

Again, more on how we’re managing this in a later post.

Posted in Uncategorized | Leave a comment