Grid Jobs with OSG Matchmaker

Introduction

Thanks to Mats for the very helpful advanced grid job example provided in the OSG Matchmaker documentation. I’m writing this post to help me understand what I’ve seen while working with it. It also discusses tools that may help to (a) reuse the architecture of the approach and (b) simplify monitoring of running jobs for researchers who don’t want to know about the structure of the example.

It Isn’t Easy Being a Grid Job

A central challenge of running complex grid applications is managing the schedule, failures and diverse outcomes affecting individual compute jobs.

A lot of bad things can happen to a job on its way to a compute node. It can get routed to a site that fails. It can get routed to a site that puts it in a queue where it starves to death.

Once at the compute node, the job may find itself on a poorly performing or badly configured site. Required services advertised as available may not function correctly. The site may be fine, but the job may be preempted by a higher priority job.

We’d like to be able to define policies ahead of time that avoid adverse outcomes where possible and respond to the unavoidable ones in a way that makes the process as fault tolerant as possible. Specifically, we want to:

Specify the order of execution of parts of the job.
Clearly specify the job’s platform requirements.
Define policies for common failure conditions.
Specify standard activities to run before and after each job.
Centrally manage and monitor execution.
Schedule jobs for sites with a good track record.

Condor, Globus, DAGs and OSG Matchmaker … Together

The advanced job example illustrates approaches to addressing these concerns. Here is an overview of the main elements of the toolkit at work in the solution:

Execution Order: Handled with DAGs. A Condor Directed Acyclic Graph (DAG) is an abstract representation of a set of jobs. DAGs:
- control order of execution
- support dependency modeling and retry semantics
- provide a high level monitoring overview of grid application execution.
Job Requirements: Handled with Globus RSL and Condor ClassAds, which specify the required execution platform in substantial detail, including memory, disk and CPU specifications.
Failure Conditions: Multiple approaches here:
- Retry semantics expressed in the Condor submit file.
- Retry semantics expressed in the DAG.
- The DAGMan rescue DAG mechanism creates a DAG for failed jobs.
- Explicit management of exit status avoids having a job report success when its outputs are actually invalid.
Pre and Post Jobs: The requirements of a grid job may indicate a set of activities that must happen before and after each component job. The example attaches both pre and post scripts to each node in the DAG. These execute at the submit host.
Centralized Management: All status information of the executing job is available at the submit node. The DAG log provides a high level overview. The output of each individual grid job can also be examined for granular information on what’s happening at a particular compute node.
Match Optimization: OSG Matchmaker (OSGMM) is a Condor service which acts as a meta-scheduler. It makes information about the reliability and track record of grid resources available to Condor via the ClassAd mechanism, which Condor uses to do the actual matching of a job to a site.

Here’s a diagram of the moving parts described above:

A DAGMan input file (the master DAG) is created in step 1. The DAG consists of a list of DAG-nodes. Each node details execution of the pre-script at the submit node, the job at the remote host and the post-script at the submit node. There is one DAG-node per data input file.
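The one-node-per-input-file layout can be sketched in a few lines of Python. This is only an illustration, not the actual generator: the function names are hypothetical, and the line layout and script options (-f, -d, -r, -j) are copied from the master DAG that job_submit prints later in this post. Numbering the nodes in sorted order of the input file names is an assumption.

```python
from pathlib import Path


def dag_node(job_id, run_dir, bin_dir, retries=7):
    """Return the DAGMan lines for one node: job, pre-script, post-script, retry."""
    run_id = Path(run_dir).name  # e.g. "20100818_1630"
    common = f"-d {run_dir} -r {run_id} -j {job_id}"
    return "\n".join([
        f"JOB job_{job_id} {run_dir}/{job_id}.submit.txt",
        f"SCRIPT PRE job_{job_id} {bin_dir}/pre-script -f job_initialize {common}",
        f"SCRIPT POST job_{job_id} {bin_dir}/post-script -f job_shutdown {common}",
        f"RETRY job_{job_id} {retries}",
    ])


def master_dag(input_files, run_dir, bin_dir):
    """Build the master DAG text: one node per input file, numbered from 1."""
    nodes = [
        dag_node(job_id, run_dir, bin_dir)
        for job_id, _ in enumerate(sorted(input_files), start=1)
    ]
    return "\n".join(nodes) + "\n"
```

Because the nodes declare no PARENT/CHILD relationships, DAGMan is free to run them all concurrently; the DAG is used here for its pre-script, post-script and retry handling rather than for ordering.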

Associated with each DAG-node is a single Condor submit file. This is where machine architecture considerations are expressed via RSL, and where job scheduling priorities and failure behaviors are specified via OSGMM and Condor ClassAds.

The grid job is then submitted to Condor in step 2. Condor uses OSGMM information to do matchmaking for jobs. Once jobs are at a remote grid site, they execute via a wrapper script. The wrapper script first downloads executables and the input data for that portion of the job onto the compute node. This is step 4.

Then the wrapper executes the job itself, collecting its output (5). The exit status of the job is carefully handled to ensure it accurately reflects the real execution status. An explicit marker is written to the job’s output to allow the post-script at the submit node to take the appropriate action. The job results are then transferred back to the submit node (6).
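To make the marker idea concrete, here is a minimal sketch of what a post-script check might look like. It assumes the marker is the "=== RUN SUCCESSFUL ===" line that appears in the job output shown later in this post; the function names are hypothetical and the real post-script may do considerably more.

```python
from pathlib import Path

# Marker text as it appears in the job.out files shown in this post.
SUCCESS_MARKER = "=== RUN SUCCESSFUL ==="


def run_succeeded(job_out):
    """True only if the output file is readable and contains the success marker."""
    try:
        text = Path(job_out).read_text(errors="replace")
    except OSError:
        return False  # missing or unreadable output counts as failure
    return any(line.strip() == SUCCESS_MARKER for line in text.splitlines())


def post_script_status(job_out):
    """Exit code to hand back to DAGMan: 0 for success, 1 for failure."""
    return 0 if run_succeeded(job_out) else 1
```

DAGMan treats a non-zero exit from a node's POST script as a failure of the node, so with a RETRY line in the DAG a missing marker leads to the node being run again.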

All of the job’s activities can be monitored from the submit node.

The interaction described here largely follows the Advanced Job example provided in the OSG Matchmaker documentation.

Generalizing the Advanced Job

The advanced job example does a number of key things that lots of jobs will want to do:

create temporary storage at the compute node
fetch executables
fetch data to operate on
run the executable against that data, capturing output
carefully record exit status
transmit results back to the submit node
evaluate status and retry as appropriate
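The compute-node part of this list can be sketched as a single function. This is an illustration under stated assumptions, not the wrapper from the example: fetch is a hypothetical callable standing in for the real transfer mechanism, the failure marker text is invented, and results are "transmitted" by a local copy.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_wrapped(fetch, executable, data_name, results_dir):
    """Stage in, run, record status, stage out and clean up one job.

    fetch(name, dest_dir) is a hypothetical callable that places the named
    item in dest_dir and returns its local path.
    """
    results_dir = Path(results_dir)
    work_dir = Path(tempfile.mkdtemp(prefix="job."))  # temporary storage
    try:
        exe = fetch(executable, work_dir)   # fetch the executable
        data = fetch(data_name, work_dir)   # fetch the data to operate on
        out_path = work_dir / "app.stdouterr"
        with open(out_path, "w") as out:    # capture stdout and stderr together
            status = subprocess.call(
                [str(exe), str(data)], stdout=out, stderr=subprocess.STDOUT
            )
        results_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(out_path, results_dir / "app.stdouterr")  # send results back
        # The failure marker text below is an assumption made for this sketch.
        marker = "=== RUN SUCCESSFUL ===" if status == 0 else f"=== RUN FAILED ({status}) ==="
        (results_dir / "job.out").write_text(marker + "\n")   # explicit status record
        return status
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)  # cleanup always runs
```

The last item in the list, evaluating status and retrying, is left out on purpose: in the architecture described above it happens back at the submit node, in the post-script and the DAG's retry handling.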

It would be good to capture what is general about this process in a way that simplifies job submission while making the quality elements of the approach available to less expert users. This section describes some tools for doing that. They are part of the RENCI CI project.

Initialize

Source the submit.sh script to initialize the job execution environment.

scox@engage-submit:~/pmemd$ . app/bin/submit.sh
--(dbg): pointing to latest run: /home/scox/pmemd/runs/20100818_1629
--(dbg): top_dir  : /home/scox/pmemd
--(dbg): base_url : gsiftp://engage-submit.renci.org/home/scox/pmemd
--(dbg): run_dir  : /home/scox/pmemd/runs/20100818_1629

Submit

Use the job_submit command to:

Verify that a valid grid proxy is in place so jobs don’t fail authentication
Establish a grid proxy if one is not present
Create one Condor submit file per job
Create one master DAG for DAGMan
- Ordering the jobs
- Wrapping each in a pre and post script
- Producing a log of overall job execution
Create a new subdirectory of runs containing all of this job’s outputs
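The per-run subdirectory step might be sketched as follows. The YYYYMMDD_HHMM naming is inferred from the run directories that appear in the output in this post; the function names are hypothetical, and the real scripts may handle this differently.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def new_run_dir(top_dir: Path, now: datetime) -> Path:
    """Create <top_dir>/runs/<YYYYMMDD_HHMM> and return its path."""
    run_dir = top_dir / "runs" / now.strftime("%Y%m%d_%H%M")
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to reuse an existing run
    return run_dir


def latest_run_dir(top_dir: Path) -> Optional[Path]:
    """Return the most recent run; timestamped names sort chronologically."""
    runs = sorted(p for p in (top_dir / "runs").glob("*") if p.is_dir())
    return runs[-1] if runs else None
```

Keeping each run in its own timestamped directory means the outputs of earlier runs are never overwritten, and "the latest run" is simply the last name in sorted order.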

scox@engage-submit:~/pmemd$ job_submit
--(inf): verified valid grid proxy...
--(inf): set up run environment...
--(inf): run directory: /home/scox/pmemd/runs/20100818_1630
--(inf): generating job submit files...
--(inf):    generate job 1 processing /home/scox/pmemd/app/in/1.txt
--(inf):    generate job 2 processing /home/scox/pmemd/app/in/1.txt~
--(inf):    generate job 3 processing /home/scox/pmemd/app/in/2.txt
--(inf):    generate job 4 processing /home/scox/pmemd/app/in/2.txt~
--(inf):    generate job 5 processing /home/scox/pmemd/app/in/3.txt
--(inf):    generate job 6 processing /home/scox/pmemd/app/in/3.txt~
--(inf):    generate job 7 processing /home/scox/pmemd/app/in/4.txt
--(inf):    generate job 8 processing /home/scox/pmemd/app/in/4.txt~
--(inf): submitting master DAG:
JOB job_1 /home/scox/pmemd/runs/20100818_1630/1.submit.txt
SCRIPT PRE job_1 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 1
SCRIPT POST job_1 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 1
RETRY job_1 7
JOB job_2 /home/scox/pmemd/runs/20100818_1630/2.submit.txt
SCRIPT PRE job_2 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 2
SCRIPT POST job_2 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 2
RETRY job_2 7
JOB job_3 /home/scox/pmemd/runs/20100818_1630/3.submit.txt
SCRIPT PRE job_3 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 3
SCRIPT POST job_3 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 3
RETRY job_3 7
JOB job_4 /home/scox/pmemd/runs/20100818_1630/4.submit.txt
SCRIPT PRE job_4 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 4
SCRIPT POST job_4 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 4
RETRY job_4 7
JOB job_5 /home/scox/pmemd/runs/20100818_1630/5.submit.txt
SCRIPT PRE job_5 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 5
SCRIPT POST job_5 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 5
RETRY job_5 7
JOB job_6 /home/scox/pmemd/runs/20100818_1630/6.submit.txt
SCRIPT PRE job_6 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 6
SCRIPT POST job_6 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 6
RETRY job_6 7
JOB job_7 /home/scox/pmemd/runs/20100818_1630/7.submit.txt
SCRIPT PRE job_7 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 7
SCRIPT POST job_7 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 7
RETRY job_7 7
JOB job_8 /home/scox/pmemd/runs/20100818_1630/8.submit.txt
SCRIPT PRE job_8 /home/scox/pmemd/app/bin/pre-script -f job_initialize -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 8
SCRIPT POST job_8 /home/scox/pmemd/app/bin/post-script -f job_shutdown -d /home/scox/pmemd/runs/20100818_1630 -r 20100818_1630 -j 8
RETRY job_8 7

Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : /home/scox/pmemd/runs/20100818_1630/master.dag.condor.sub
Log of DAGMan debugging messages                 : /home/scox/pmemd/runs/20100818_1630/master.dag.dagman.out
Log of Condor library output                     : /home/scox/pmemd/runs/20100818_1630/master.dag.lib.out
Log of Condor library error messages             : /home/scox/pmemd/runs/20100818_1630/master.dag.lib.err
Log of the life of condor_dagman itself          : /home/scox/pmemd/runs/20100818_1630/master.dag.dagman.log

Condor Log file for all jobs of this DAG         : /home/scox/pmemd/alljobs.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2858593.
-----------------------------------------------------------------------
scox@engage-submit:~/pmemd$

Monitor – 1 – The DAG

The output above shows the DAG the script created and submitted. The job_dag command can then be used to tail the log of the DAG’s execution. The excerpts below, from the beginning of the execution log, show the successful execution of the jobs’ pre-scripts and the number of jobs in various states.

scox@engage-submit:~/pmemd$ job_dag
8/18 16:30:16 DAG Lockfile will be written to /home/scox/pmemd/runs/20100818_1630/master.dag.lock
8/18 16:30:16 DAG Input file is /home/scox/pmemd/runs/20100818_1630/master.dag
8/18 16:30:16 Rescue DAG will be written to /home/scox/pmemd/runs/20100818_1630/master.dag.rescue
8/18 16:30:16 All DAG node user log files:
8/18 16:30:16   /home/scox/pmemd/alljobs.log (Condor)
8/18 16:30:16 Parsing /home/scox/pmemd/runs/20100818_1630/master.dag ...
8/18 16:30:16 Dag contains 8 total jobs
8/18 16:30:16 Truncating any older versions of log files...
8/18 16:30:16 MultiLogFiles: truncating older version of /home/scox/pmemd/alljobs.log
8/18 16:30:16 Sleeping for 12 seconds to ensure ProcessId uniqueness
8/18 16:30:28 Bootstrapping...
8/18 16:30:28 Number of pre-completed nodes: 0
8/18 16:30:28 Running PRE script of Node job_1...
8/18 16:30:28 Running PRE script of Node job_2...
8/18 16:30:28 Running PRE script of Node job_3...
8/18 16:30:28 Running PRE script of Node job_4...
8/18 16:30:28 Running PRE script of Node job_5...
8/18 16:30:28 Running PRE script of Node job_6...
8/18 16:30:28 Running PRE script of Node job_7...
8/18 16:30:28 Running PRE script of Node job_8...
8/18 16:30:28 Registering condor_event_timer...
8/18 16:30:28 PRE Script of Node job_1 completed successfully.
8/18 16:30:28 PRE Script of Node job_7 completed successfully.
8/18 16:30:28 PRE Script of Node job_6 completed successfully.
8/18 16:30:28 PRE Script of Node job_3 completed successfully.
8/18 16:30:28 PRE Script of Node job_2 completed successfully.
8/18 16:30:28 PRE Script of Node job_5 completed successfully.
8/18 16:30:28 PRE Script of Node job_4 completed successfully.
8/18 16:30:28 PRE Script of Node job_8 completed successfully.
8/18 16:30:29 Submitting Condor Node job_1 job(s)...
8/18 16:30:29 submitting: condor_submit -a dag_node_name' '=' 'job_1 -a +DAGManJobId' '=' '2858593 -a DAGManJobId' '=' '2858593 -a submit_event_notes' '=' 'DAG' 'Node:' 'job_1 -a +DAGParentNodeNames' '=' '"" /home/scox/pmemd/runs/20100818_1630/1.submit.txt
8/18 16:30:29 From submit: Submitting job(s).
8/18 16:30:29 From submit: Logging submit event(s).
[snip]...
8/18 16:30:29 Event: ULOG_SUBMIT for Condor Node job_1 (2858594.0)
8/18 16:30:29 Number of idle job procs: 1
8/18 16:30:29 Event: ULOG_SUBMIT for Condor Node job_7 (2858595.0)
8/18 16:30:29 Number of idle job procs: 2
8/18 16:30:29 Event: ULOG_SUBMIT for Condor Node job_6 (2858596.0)
8/18 16:30:29 Number of idle job procs: 3
8/18 16:30:29 Event: ULOG_SUBMIT for Condor Node job_3 (2858597.0)
8/18 16:30:29 Number of idle job procs: 4
8/18 16:30:29 Event: ULOG_SUBMIT for Condor Node job_2 (2858598.0)
8/18 16:30:29 Number of idle job procs: 5
8/18 16:30:29 Of 8 nodes total:
8/18 16:30:29  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
8/18 16:30:29   ===     ===      ===     ===     ===        ===      ===
8/18 16:30:29     0       0        5       0       3          0        0
8/18 16:30:34 Submitting Condor Node job_5 job(s)...
[snip]...
8/18 16:30:34 Event: ULOG_SUBMIT for Condor Node job_8 (2858601.0)
8/18 16:30:34 Number of idle job procs: 8
8/18 16:30:34 Of 8 nodes total:
8/18 16:30:34  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
8/18 16:30:34   ===     ===      ===     ===     ===        ===      ===
8/18 16:30:34     0       0        8       0       0          0        0
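For researchers who only want the headline numbers, the node status tables in this log are easy to pull out with a script. The sketch below assumes the table layout shown above (a header row, a row of "===" and then a row of counts, each prefixed by a date and a time); the function name is hypothetical.

```python
import re

COLUMNS = ["Done", "Pre", "Queued", "Post", "Ready", "Un-Ready", "Failed"]
HEADER = re.compile(r"Done\s+Pre\s+Queued\s+Post\s+Ready\s+Un-Ready\s+Failed")


def node_status_snapshots(log_text):
    """Return one {column: count} dict per status table found in the log."""
    lines = log_text.splitlines()
    snapshots = []
    for i, line in enumerate(lines):
        if HEADER.search(line) and i + 2 < len(lines):
            # The counts sit two lines below the header; drop the date and time fields.
            fields = lines[i + 2].split()[2:]
            if len(fields) == len(COLUMNS) and all(f.isdigit() for f in fields):
                snapshots.append(dict(zip(COLUMNS, map(int, fields))))
    return snapshots
```

The last snapshot in the list is the current state of the DAG; a non-zero "Failed" count there is the thing to watch for.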

Monitor – 2 – The Grid

Once the job is executing we can use condor_grid_overview to see the disposition of each element of the DAG:

scox@engage-submit:~/pmemd$ condor_grid_overview | grep scox
2858593    (DAGMan)         scox                               Running     condor_dagman     0:05:31
2858596      |-job_6        scox         UmissHEP              Running     submit.sh         0:04:40
2858598      |-job_2        scox         UmissHEP              Running     submit.sh         0:03:16
2858599      |-job_5        scox         UFlorida-PG           Pending     submit.sh         0:05:13
2858600      |-job_4        scox         Purdue-Steele         Pending     submit.sh         0:05:13
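The same output can be summarized per site with a short script. This sketch assumes the seven-column layout of the node rows above (id, node name, owner, site, status, command, time); rows without a site, such as the DAGMan row itself, are skipped. The function name is hypothetical.

```python
from collections import Counter


def jobs_by_site(overview_text):
    """Count DAG node jobs per (site, status) from condor_grid_overview output."""
    counts = Counter()
    for line in overview_text.splitlines():
        fields = line.split()
        # Node rows: id, "|-name", owner, site, status, command, time.
        if len(fields) == 7 and fields[1].startswith("|-"):
            site, status = fields[3], fields[4]
            counts[(site, status)] += 1
    return counts
```

Run against the listing above, this would report two running jobs at UmissHEP and one pending job each at UFlorida-PG and Purdue-Steele.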

Monitor – 3 – Individual Jobs

The job_out_tail command can be used to view the standard output and error of the jobs. Each output item below details the life cycle of a remote job’s execution at a grid execution node.

scox@engage-submit:~/pmemd$ job_out_tail
==> /home/scox/pmemd/runs/20100818_1630/logs/1/job.out <==
--(inf): work dir  : /state/partition1/wn-temp/job.gkdLH16279
--(inf): start dir : /osg/home/engage/gram_scratch_YIFIRN2xnv
--(inf): stage data into work dir: /state/partition1/wn-temp/job.gkdLH16279
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/bin/
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/in/1.txt
--(inf): sourced job.sh
--(inf): staging out /state/partition1/wn-temp/job.gkdLH16279/app.stdouterr
--(inf): staging out /state/partition1/wn-temp/job.gkdLH16279/app.env
--(inf): executing job cleanup...
=== RUN SUCCESSFUL ===

==> /home/scox/pmemd/runs/20100818_1630/logs/2/job.out <==

==> /home/scox/pmemd/runs/20100818_1630/logs/3/job.out <==
--(inf): work dir  : /state/partition1/wn-temp/job.zNCvk16572
--(inf): start dir : /osg/home/engage/gram_scratch_qz3WPQ8CxC
--(inf): stage data into work dir: /state/partition1/wn-temp/job.zNCvk16572
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/bin/
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/in/2.txt
--(inf): sourced job.sh
--(inf): staging out /state/partition1/wn-temp/job.zNCvk16572/app.stdouterr
--(inf): staging out /state/partition1/wn-temp/job.zNCvk16572/app.env
--(inf): executing job cleanup...
=== RUN SUCCESSFUL ===

==> /home/scox/pmemd/runs/20100818_1630/logs/4/job.out <==

==> /home/scox/pmemd/runs/20100818_1630/logs/5/job.out <==

==> /home/scox/pmemd/runs/20100818_1630/logs/6/job.out <==

==> /home/scox/pmemd/runs/20100818_1630/logs/7/job.out <==
--(inf): work dir  : /state/partition1/wn-temp/job.nuwCL16422
--(inf): start dir : /osg/home/engage/gram_scratch_kxvv5y0ePO
--(inf): stage data into work dir: /state/partition1/wn-temp/job.nuwCL16422
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/bin/
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/in/4.txt
--(inf): sourced job.sh
--(inf): staging out /state/partition1/wn-temp/job.nuwCL16422/app.stdouterr
--(inf): staging out /state/partition1/wn-temp/job.nuwCL16422/app.env
--(inf): executing job cleanup...
=== RUN SUCCESSFUL ===

==> /home/scox/pmemd/runs/20100818_1630/logs/8/job.out <==
--(inf): work dir  : /state/partition1/wn-temp/job.LWoBz12214
--(inf): start dir : /osg/home/engage/gram_scratch_lC4Nur3TLx
--(inf): stage data into work dir: /state/partition1/wn-temp/job.LWoBz12214
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/bin/
--(inf): getting gsiftp://engage-submit.renci.org/home/scox/pmemd/app/in/4.txt~
--(inf): sourced job.sh
--(inf): staging out /state/partition1/wn-temp/job.LWoBz12214/app.stdouterr
--(inf): staging out /state/partition1/wn-temp/job.LWoBz12214/app.env
--(inf): executing job cleanup...
=== RUN SUCCESSFUL ===