NAMD with PBS and Infiniband on NERSC Dirac

Overview

NAMD simulates molecular motion, especially of large molecules, so it's often used for molecular docking problems. One particularly interesting class of docking problem is the interaction of proteins with other molecules such as the cell membrane. The enormous number of atoms involved in these simulations limits what we can learn about how proteins interact with and shape their environments, because more atoms require more computing power. So we're investigating GPU-accelerated nodes in a shared-memory cluster to speed up simulation time.

This post describes running NAMD in a multi-node configuration on NERSC Dirac to determine whether we want to build out a Pegasus workflow executing in this mode through the OSG compute element. The process is, as usual with MPI codes using cluster interconnects, highly cluster-specific. The next step is to determine whether it's worth it and what our alternatives are.

Approach

If you’re having a hard time running NAMD in a PBS environment over an Infiniband interconnect, you are not alone. The NAMD release notes come right to the point:

“Writing batch job scripts to run charmrun in a queueing system can be challenging.”

These links, in addition to the release notes cited above, provide useful insights:

And without further delay, here’s the approach that worked on Dirac. Mileage on your cluster may vary.
#!/bin/bash

set -x
set -e

# build a node list file based on the PBS
# environment in a form suitable for NAMD/charmrun

nodefile=$TMPDIR/$PBS_JOBID.nodelist
echo group main > $nodefile
nodes=$( cat $PBS_NODEFILE )
for node in $nodes; do
   echo host $node >> $nodefile
done
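# The resulting file has the form charmrun's ++nodelist option expects:
# "group main" followed by one "host <name>" line per entry in
# $PBS_NODEFILE (so each node appears once per requested slot).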

# find the cluster's mpiexec
MPIEXEC=$(which mpiexec)

# Tell charmrun to use all the available processors (+p32 matches the
# nodes=4:ppn=8 request used below) and the nodelist built above; the
# cluster's MPI is passed in later via ++mpiexec ++remote-shell.
CHARMARGS="+p32 ++nodelist $nodefile"

As an additional wrinkle, we want to run the GPU-accelerated version of NAMD. That's why we pass the +idlepoll argument to namd2.

After setting NAMD_HOME, the command to execute NAMD is:

${NAMD_HOME}/charmrun \
  ${CHARMARGS} ++mpiexec ++remote-shell ${MPIEXEC} \
  ${NAMD_HOME}/namd2 +idlepoll <input_file>
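
Putting the pieces together, the complete submit script (the callnamd script referenced below) looks roughly like this. Treat it as a sketch: NAMD_HOME and the input file name simulation.namd are placeholders for your installation and simulation, and +p is derived from $PBS_NODEFILE here rather than hard-coded to 32.

#!/bin/bash

set -x
set -e

# Build a node list from the PBS environment in the form charmrun expects.
nodefile=$TMPDIR/$PBS_JOBID.nodelist
echo group main > $nodefile
for node in $( cat $PBS_NODEFILE ); do
   echo host $node >> $nodefile
done

# Find the cluster's mpiexec.
MPIEXEC=$(which mpiexec)

# One Charm++ process per PBS slot (32 for nodes=4:ppn=8).
NP=$(wc -l < $PBS_NODEFILE)
CHARMARGS="+p${NP} ++nodelist $nodefile"

# Placeholder: point NAMD_HOME at the ibverbs-CUDA build.
NAMD_HOME=${NAMD_HOME:-$HOME/NAMD_2.8_Linux-x86_64-ibverbs-CUDA}

# Placeholder input file: substitute your simulation's config file.
${NAMD_HOME}/charmrun \
  ${CHARMARGS} ++mpiexec ++remote-shell ${MPIEXEC} \
  ${NAMD_HOME}/namd2 +idlepoll simulation.namd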

The beginning of NAMD’s output looks like this:

Info: 1 NAMD 2.8 Linux-x86_64-ibverbs-CUDA 16 dirac48 stevecox
Info: Running on 16 processors, 16 nodes, 2 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.025701 s
Pe 5 sharing CUDA device 0 first 0 next 6
Pe 5 physical rank 5 binding to CUDA device 0 on dirac48: 'Tesla C1060' Mem: 4095MB Rev: 1.3
Pe 10 sharing CUDA device 0 first 8 next 11
Pe 10 physical rank 2 binding to CUDA device 0 on dirac47: 'Tesla C1060' Mem: 4095MB Rev: 1.3
Pe 8 sharing CUDA device 0 first 8 next 9
Pe 8 physical rank 0 binding to CUDA device 0 on dirac47: 'Tesla C1060' Mem: 4095MB Rev: 1.3
Pe 2 sharing CUDA device 0 first 0 next 3
Did not find +devices i,j,k,... argument, using all
Pe 2 physical rank 2 binding to CUDA device 0 on dirac48: 'Tesla C1060' Mem: 4095MB Rev: 1.3

Of particular importance: note that there is a pre-built executable specific to ibverbs-CUDA, that is, one built for InfiniBand-connected clusters with CUDA-accelerated nodes.
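
For reference, NAMD_HOME in the script above simply points at wherever that distribution was unpacked. The path below follows the standard release naming but is an assumption about your installation:

export NAMD_HOME=$HOME/NAMD_2.8_Linux-x86_64-ibverbs-CUDA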

These are the parameters of the dirac_reg queue:

[stevecox@cvrsvc01 namd]$ qstat -Qf dirac_reg
Queue: dirac_reg
 queue_type = Execution
 Priority = 10
 max_user_queuable = 500
 total_jobs = 39
 state_count = Transit:0 Queued:4 Held:27 Waiting:0 Running:8 Exiting:0
 acl_user_enable = False
 resources_max.nodect = 12
 resources_max.walltime = 06:00:00
 resources_min.nodect = 1
 resources_default.walltime = 00:05:00
 mtime = 1323823829
 resources_assigned.nodect = 34
 max_user_run = 2
 enabled = True
 started = True

So to test jobs, I ran qsub like this:

qsub -I -q dirac_reg -l walltime=06:00:00 -l nodes=4:ppn=8

The -I parameter tells qsub to start an interactive job. The walltime parameter overrides the very low default walltime. Finally, nodes tells PBS how many cluster nodes to use, and ppn specifies the number of processes to start per node.

After debugging, I ran the script like this:

qsub -q dirac_reg -l walltime=06:00:00 -l nodes=4:ppn=8 ./callnamd
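
As an aside, PBS also accepts these resource requests as #PBS directives at the top of the script itself, which keeps the qsub invocation short. A minimal sketch using the same queue and limits as above:

#!/bin/bash
#PBS -q dirac_reg
#PBS -l walltime=06:00:00
#PBS -l nodes=4:ppn=8
# ... rest of the callnamd script follows ...

The job then submits with just: qsub ./callnamd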

Results

I did three runs with 2, 4, and 8 nodes. The interesting performance number for a NAMD run is days/ns or days of computation time required per nanosecond of simulation.

[stevecox@cvrsvc01 ~]$ grep -i days dev/dukechem/osg/namd/run.* | sed -e "s,.txt,," -e "s,.*run.,,"
2way:Info: Initial time: 16 CPUs 0.0617085 s/step 0.357109 days/ns 91.0601 MB memory
2way:Info: Initial time: 16 CPUs 0.0613538 s/step 0.355057 days/ns 94.0546 MB memory
2way:Info: Initial time: 16 CPUs 0.0619225 s/step 0.358348 days/ns 94.7324 MB memory
2way:Info: Benchmark time: 16 CPUs 0.0620334 s/step 0.35899 days/ns 94.8284 MB memory
2way:Info: Benchmark time: 16 CPUs 0.0621472 s/step 0.359648 days/ns 95.09 MB memory
2way:Info: Benchmark time: 16 CPUs 0.0620733 s/step 0.359221 days/ns 95.162 MB memory
4way:Info: Initial time: 32 CPUs 0.0472537 s/step 0.273459 days/ns 83.8981 MB memory
4way:Info: Initial time: 32 CPUs 0.0470766 s/step 0.272434 days/ns 84.8605 MB memory
8way:Info: Initial time: 64 CPUs 0.0406125 s/step 0.235026 days/ns 81.0847 MB memory
8way:Info: Initial time: 64 CPUs 0.0406405 s/step 0.235188 days/ns 82.1035 MB memory
8way:Info: Initial time: 64 CPUs 0.0407004 s/step 0.235534 days/ns 82.2474 MB memory
8way:Info: Benchmark time: 64 CPUs 0.0407453 s/step 0.235794 days/ns 82.3482 MB memory
8way:Info: Benchmark time: 64 CPUs 0.040858 s/step 0.236447 days/ns 82.3975 MB memory
8way:Info: Benchmark time: 64 CPUs 0.0406536 s/step 0.235264 days/ns 82.4038 MB memory
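
The days/ns values make the scaling easy to compare. As a rough convenience sketch (not part of the original workflow; values hard-coded from the runs above), the speedup over the 2-node case works out to roughly 1.3x at 4 nodes and 1.5x at 8 nodes:

awk 'BEGIN { split("2 4 8", n); split("0.359 0.273 0.235", d);
  for (i = 1; i <= 3; i++) printf "%s nodes: %.2fx\n", n[i], d[1] / d[i] }'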

Here are some details of NERSC Dirac’s configuration:

Dirac is a 50 GPU node cluster connected with QDR IB.  Each GPU node also contains 2 Intel 5530 2.4 GHz, 8MB cache, 5.86GT/sec QPI Quad core Nehalem processors (8 cores per node) and 24GB DDR3-1066 Reg ECC memory.

  • 44 nodes: 1 NVIDIA Tesla C2050 (code-named Fermi) GPU with 3GB of memory and 448 parallel CUDA processor cores.
  • 4 nodes: 1 NVIDIA Tesla C1060 GPU with 4GB of memory and 240 parallel CUDA processor cores.
  • 1 node: 4 NVIDIA Tesla C2050 (Fermi) GPUs, each with 3GB of memory and 448 parallel CUDA processor cores.
  • 1 node: 4 NVIDIA Tesla C1060 GPUs, each with 4GB of memory and 240 parallel CUDA processor cores.

Here are results from earlier runs on a cluster with far fewer GPUs but a configuration in which accelerated nodes contain four Nvidia Teslas (like one of the Dirac nodes):

  • 4CPU: 0.998798 days/ns
  • 8CPU: 0.565848 days/ns
  • And with the production sample at 8CPU: 0.288802 days/ns

Conclusions

While these findings are preliminary, indications are that having four GPUs on a single node makes a substantial performance difference.