Whole Genome Sequencing and Pegasus

There’s a group at RENCI working on next-generation sequencing (NGS) technologies, and whole genome sequencing (WGS) in particular. I’ve been helping them get a new workflow running on our Blueridge cluster.

As a first step, they joined the OSG Engage VO. This gives them access to the new Engage submit node, its GlideinWMS submission engine, and the Pegasus Workflow Management System it hosts.

The whole genome sequencing workflow is a multi-step process involving about half a dozen different executables. It runs on Blueridge and Kure, both local clusters, but there’s strong interest in making the workflow portable for execution on other systems, including the Open Science Grid.

The new users were provisioned on the submit node and their DOE certificates were obtained according to the usual Engage process.

Then I worked with the team to create a Pegasus DAX representing the workflow. Our main challenges were:

  • Debugging: We had to get used to searching the Pegasus logs for job status. Pegasus creates a directory for each workflow submission, and each job in the workflow produces a series of separate files there. One of these is an XML file containing details of the execution, including fully qualified paths to executables and files, as well as the standard output, standard error and exit status code of the job. These files are indispensable for debugging.
  • Files: Some of the files used in this workflow were pre-staged on the local cluster to allow us to get the workflow up and running. The DAX file elements needed to specify site=’local’ attributes to tell Pegasus not to try to stage the files.
  • Environment: We initially had the user executing the workflow mapped to the generic engage user. Because we were adapting the workflow to use a legacy data setup, this ran into trouble with Unix file permissions. I reconfigured the user’s DN to map to their cluster-local user so we could proceed.
  • Pegasus and Scripts: One of the most difficult problems to track down was caused by something very simple. One of the scripts in the workflow had appropriate ownership and execute privileges but failed on every execution. It turned out that it was missing the #!/bin/bash line at the top of the script. So look out for that if you’ve exhausted other debugging avenues.
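
Since the missing #!/bin/bash line cost us so much time, here is a minimal sketch of a check that could be run over a directory of workflow scripts before submitting. It is not part of Pegasus; the function names (lacks_shebang, find_scripts_missing_shebang) and the idea of skipping ELF binaries by their magic bytes are my own assumptions, and the check is only a heuristic.

```python
import os
import stat

# Assumption: compiled Linux executables start with the ELF magic bytes,
# so they are skipped rather than reported as scripts without a shebang.
ELF_MAGIC = b"\x7fELF"


def lacks_shebang(path):
    """Return True if `path` is an executable regular file that is
    neither a '#!' script nor an ELF binary (a heuristic, not a proof)."""
    try:
        st = os.stat(path)
    except OSError:
        return False  # broken symlink or unreadable entry: skip it
    if not stat.S_ISREG(st.st_mode) or not (st.st_mode & 0o111):
        return False  # not a regular file, or no execute bit set
    try:
        with open(path, "rb") as fh:
            head = fh.read(4)
    except OSError:
        return False
    return not head.startswith(b"#!") and head != ELF_MAGIC


def find_scripts_missing_shebang(root):
    """Walk `root` and return a sorted list of suspicious executables."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if lacks_shebang(path):
                hits.append(path)
    return sorted(hits)
```

Calling find_scripts_missing_shebang on the directory holding the workflow's executables (whatever that path is on your cluster) returns the files worth a second look. An empty list means every executable file found starts with either #! or the ELF magic.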

The next steps are to prepare the workflow, or components of it, for execution on OSG. This will involve:

  • Collecting the required input files into archives that are entirely portable, i.e. free of symbolic links and user privilege issues.
  • Assessing the size of the data to determine the best way to provision it to compute nodes.
  • Assessing the executables to see if any would benefit from a high-throughput parallel computing (HTPC) treatment.
  • Selecting OSG resources appropriate for the task.
  • Altering the Pegasus site configuration, and the process generally, to work with GlideinWMS and target OSG.
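
For the first item, one possible approach is sketched below using Python's standard tarfile module. This is an illustration under my own assumptions, not what we have settled on: the function name make_portable_archive, the choice of gzip compression, and the policy of resetting ownership and permissions are all hypothetical. The idea is to store the contents that symbolic links point to rather than the links themselves, and to strip user-specific ownership from every entry.

```python
import os
import tarfile


def _normalize(info):
    """Strip user-specific ownership and set world-readable permissions."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    # Keep entries executable if they are directories or were executable
    # by their owner; everything else becomes plain read-only data.
    info.mode = 0o755 if (info.isdir() or info.mode & 0o100) else 0o644
    return info


def make_portable_archive(src_dir, archive_path):
    """Pack `src_dir` into a gzip-compressed tar file at `archive_path`.

    dereference=True makes tarfile store the target of each symbolic
    link instead of the link itself, so the archive unpacks the same
    way on a machine where the link targets do not exist.
    """
    top = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, "w:gz", dereference=True) as tar:
        tar.add(src_dir, arcname=top, filter=_normalize)
```

One caveat with this sketch: following links copies the data they point to, so an input directory that links to large shared reference files will produce a correspondingly large archive. That feeds directly into the data-size assessment in the second item.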

More on this soon.
