High Throughput Parallel Molecular Dynamics on OSG

The Goal

RENCI is working with researchers interested in running high throughput parallel molecular dynamics simulations on the Open Science Grid (OSG).

Amber9 PMEMD

The program we’d like to execute is called PMEMD (Particle Mesh Ewald Molecular Dynamics). PMEMD is a high-performance, parallel component of the larger Amber9 molecular dynamics package. It is designed as a faster counterpart to Amber9’s sander and relies heavily on the Message Passing Interface (MPI) to achieve that performance.

HTPC

High Throughput Parallel Computing (HTPC) is an emerging computing model on the Open Science Grid driven by the increasing prevalence of multi-core processors. While the OSG has traditionally served large numbers of serial jobs, it is increasingly feasible to execute small MPI jobs on the OSG. In general, an HTPC job will

  • Reserve the whole machine, that is, an entire worker node in OSG terminology.
  • Launch an MPI job using
    • A statically compiled version of mpiexec
    • Shared memory rather than the cluster-native interconnect

The HTPC group has identified a set of clusters and procedures to enable this model. The work below takes advantage of those findings.

This is in many ways a continuation of earlier work done in the area of MPI on OSG. Some new aspects of this approach include

  • The use of the RENCI CyberInfrastructure (RCI) scripting framework that generalizes aspects of the OSGMM advanced job example. This approach reduces the amount of new code a researcher must develop to run a grid job and increases reuse of job management infrastructure.
  • Modular, statically compiled binary versions of MPI launch programs (like mpiexec and mpirun) from multiple MPI implementations are made available via a web based artifact repository. This is one of the recommended future directions cited in the paper.
  • Integration of effort with the High Throughput Parallel Computing (HTPC) community.

Approach Overview

At a high level, the job is submitted to OSG from a submit node. In this context, this occurs via OSGMM and Condor on the Engage submit node at RENCI. It is targeted at subclusters with multi-core architectures and processor models matching machines where the program is known to run successfully.

When the program runs at a compute node, it reserves an entire worker node and fetches the RCI script library, statically compiled versions of PMEMD, and both of the targeted MPI execution environments (MPICH and MPICH2). It then invokes PMEMD via an mpiexec command, running as many parallel processes as there are processors on the machine.
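
In shell terms, the worker-node portion of the job boils down to something like the sketch below. The repository URL, artifact names, and PMEMD input files are placeholders here, not the actual contents of the RCI script.

#!/bin/bash
# Fetch the statically linked artifacts from the artifact repository (URL and names are placeholders).
REPO=http://artifacts.example.org/htpc
wget -q $REPO/pmemd.static $REPO/mpiexec.static
chmod +x pmemd.static mpiexec.static

# Use every core on the reserved worker node.
NPROCS=$(grep -c ^processor /proc/cpuinfo)

# Run PMEMD in parallel over shared memory; -O/-i/-o/-p/-c name example Amber input and output files.
./mpiexec.static -n "$NPROCS" ./pmemd.static -O -i mdin -o mdout -p prmtop -c inpcrd

# Stage the results back to the submit host via GridFTP (destination is a placeholder).
globus-url-copy file://$PWD/mdout gsiftp://submit.example.org/results/mdout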

When execution completes, the script GridFTPs the outputs back to the submit host. The following figure illustrates these interactions:


Architecture Context and Approach

This effort is influenced by a set of primary factors that are functions of the architecture of the program (PMEMD) and the architecture of the execution platform (OSG).

Broadly speaking, these are

  • Heterogeneity of the Open Science Grid.
  • Complexities of delivering Message Passing Interface (MPI) based systems.
  • Sustainability of the resulting approach.

Heterogeneity

The OSG is a highly distributed and heterogeneous network of computing resources. Machines use many different CPU vendors and architectures, operating systems and batch schedulers. Historically, MPI programs have been compiled natively on a specific architecture to achieve optimal performance at the cost of a very high degree of coupling to a specific hardware architecture and often to a specific cluster.

Methods for establishing the set of resources advertised in the OSG that are compatible with a given MPI job are under development. One common approach is to specifically target sites participating in the HTPC collaboration.
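
In the meantime, a pragmatic (if blunt) sketch is to whitelist subclusters where the job is already known to run, using the same GLUE attribute that appears in the Blueridge requirements later in this post. The second subcluster ID below is illustrative only:

requirements = ((TARGET.GlueSubClusterUniqueID == "RENCI-Blueridge-RENCI-Blueridge") || (TARGET.GlueSubClusterUniqueID == "Example-HTPC-Subcluster"))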

Message Passing Interface (MPI)

MPI is a specification of an application programming interface (API). The API abstracts the mechanisms that allow the processes of a parallel program to share information. This is achieved in a variety of ways: some implementations use shared memory, others use interconnect technologies like InfiniBand and Myrinet. Ultimately, the program’s source code is compiled against a specific MPI implementation. For our purposes, that MPI implementation

  • Must be one with which PMEMD is compatible.
  • Should rely as little as possible on highly specialized hardware like interconnects.
  • Is unlikely to be available on the OSG compute resource the job lands on.
  • Must be statically linked so that executables carry all their dependencies (a quick check appears after this list).
  • Generally requires a separate application launching program (mpiexec).
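
A quick check of the static-linking requirement is to inspect the built binary for dynamic dependencies (pmemd here stands in for whichever executable is being verified):

# A fully static executable causes ldd to report "not a dynamic executable".
ldd ./pmemd
# file(1) should likewise describe the binary as statically linked.
file ./pmemd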

The application was built against a number of MPI libraries, which resulted in two approaches to this aspect of the problem. In the first, PMEMD is compiled against the MVAPICH and MVAPICH2 MPI libraries on the RENCI Blueridge cluster. In the second, PMEMD is compiled against statically compiled versions of the MPICH and MPICH2 MPI libraries for execution on OSG.

Blueridge Native

For the Blueridge scenario, the executable was built for optimal performance on a 2000+ core cluster with high-speed interconnects. As described above, the resulting executable runs only on that cluster. Program execution for the set of data used generally completes in under three minutes.

Currently, our researcher is able to execute the suite on Blueridge via this mechanism. Unfortunately, job submission occurs directly through Globus at the moment, so these executions are not logged in the OSG statistics. This will be remedied once the Blueridge CE comes fully online.

OSG Execution

For the OSG scenario, two MPI libraries (MPICH-1.2.7p1 and MPICH2-1.1.1p1) and PMEMD were compiled and linked statically using the Intel Fortran compiler. All three components were also added to a continuous integration environment: each is automatically built whenever a code check-in occurs, the source is searchable online, and all binary artifacts are published to the RENCI software artifact repository. When the job executes at an OSG compute resource, the statically linked binary artifacts are downloaded and invoked.
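
For reference, the MPICH2 half of such a build might look roughly like the following; the install prefix and compiler choices are illustrative, and the PMEMD build itself follows Amber9’s own configure scripts, so it is not reproduced here.

# Configure MPICH2 to produce static libraries using the Intel compilers (illustrative flags).
./configure --prefix=$HOME/mpich2-1.1.1p1-static --disable-shared CC=icc F77=ifort F90=ifort
make
make install
# PMEMD is then linked against these libraries with the compiler's -static flag
# so that the final executable carries no runtime library dependencies.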

The end result is that the statically linked PMEMD executable and associated MPI libraries are assembled and executed on a whole machine via the OSG interface. The following figure shows the Blueridge cluster executing PMEMD on a single node:

The highly utilized node in the upper left is running PMEMD.

Targeting Blueridge

Targeting the RENCI Blueridge cluster consists of two modifications to the Condor submit script. First, Globus RSL specific to the Torque workload manager is included, requesting eight processors (xcount=8) on a single host (host_xcount=1):

(jobtype=single)(xcount=8)(host_xcount=1)(maxWallTime=5600)

Then the RENCI Blueridge cluster is targeted via the Condor job requirements:

... && ((TARGET.GlueSubClusterUniqueID == "RENCI-Blueridge-RENCI-Blueridge")) 
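
Putting the pieces together, a simplified Condor-G submit description for this case might look roughly like the following; the gatekeeper contact string, wrapper script, and file names are placeholders.

universe      = grid
# Placeholder gatekeeper contact string.
grid_resource = gt2 gatekeeper.example.org/jobmanager-pbs
# Wrapper script along the lines of the sketch shown earlier.
executable    = run-pmemd.sh
output        = pmemd.out
error         = pmemd.err
log           = pmemd.log
globusrsl     = (jobtype=single)(xcount=8)(host_xcount=1)(maxWallTime=5600)
requirements  = ((TARGET.GlueSubClusterUniqueID == "RENCI-Blueridge-RENCI-Blueridge"))
queue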

A number of launching strategies were tried, with mixed results. In the first, the MPICH build of pmemd is configured through two environment variables and run directly, without mpiexec. It turns out that, for this version of MPICH, if these environment variables are set:

export MPIRUN_DEVICE=ch_shmem
export MPICH_NP=${actualprocs}

then mpiexec is not needed at all; pmemd can be invoked directly, as sketched below.
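
A minimal sketch of that direct launch follows; the input and output file names are placeholders:

# With the shared-memory device and process count set above, no launcher is needed.
./pmemd -O -i mdin -o mdout -p prmtop -c inpcrd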

MPICH2 supports a number of launch models, including the message passing daemon (MPD) as well as mpiexec variants known as Hydra and Gforker. MPD requires starting a long-running MPD process and using a few other tools to set up the environment for the job. This is not an optimal fit for the OSG environment due to the additional complexity, so these options were abandoned.

The current job configuration uses the Gforker version of MPICH2’s mpiexec program. The following figure shows the text output of an execution at a worker node with each phase of execution described:

Execution on Blueridge using MPICH2 takes around seven minutes.
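
For completeness, the Gforker launch itself is conceptually just an mpiexec invocation over the reserved cores. In a default MPICH2 install the Gforker launcher is typically named mpiexec.gforker (an assumption about the install layout), and the pmemd arguments are again placeholders:

# Launch one PMEMD process per core on the eight-core worker node using the Gforker mpiexec.
./mpiexec.gforker -n 8 ./pmemd -O -i mdin -o mdout -p prmtop -c inpcrd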

Targeting Prairiefire

The Prairiefire cluster at Nebraska (UNL) uses Condor. Its configuration requires different RSL. Here’s what we came up with after working with the system administrators at UNL:

 (condorsubmit=('+RequiresWholeMachine' TRUE)('Requirements' 'isWholeMachineSlot=?=TRUE && TotalSlots == 9'))

With this configuration the job runs on 8 cores in about 15 minutes.

Community and Sustainability

Work is under way to ensure that artifacts from this effort continue to be available to multiple collaborating communities:

RENCI staff developing the solution need to be able to quickly get on the same page. Answers should be readily forthcoming to questions like: What’s the design, where’s the source, how are the experiments conducted, etc. Information transparency makes it much easier to access expert knowledge, for example in the area of MPI.

The Science Community interested in running this program and other MPI systems should have ready access to reusable artifacts without human intervention. Reuse scenarios may be macro reuse such as appropriation of the entire approach targeted at new data or micro reuse of individual artifacts such as the statically linked MPI program launchers (mpiexec).

The OSG HTPC Team has experience with developing solutions like this so it’s important that the approach and artifacts be shared with that group.

Over the longer term, artifacts should continue to be built continuously and to remain publicly accessible. We want systems in place to minimize the adverse impact of time and transitions of ownership.

To achieve these sustainability goals, the system’s objectives and design will continue to be documented here. Its components are built via a continuous integration (CI) process and the artifacts deployed to the RENCI software artifact repository.

Status

Amber 9 PMEMD can now be targeted to execute on OSG clusters in an HTPC configuration.

There is a working Globus RSL configuration for the RENCI Blueridge cluster. This cluster has not been configured in any particular way to support HTPC. It runs the OSG VDT 1.2.13 stack and the PBS/Torque workload manager.

It has also been successfully targeted to Nebraska’s Prairiefire system. Prairiefire is a Condor system and requires custom RSL discussed above.

Do you have MPI based code that could take advantage of 8-way parallelism in an HTPC context?
