OSG Engage VO Submit Host
RENCI’s new Engage VO submit host is now online.
The new submit host uses Condor for job scheduling. Condor provides the DAGMan workflow manager, which lets users define precedence relationships and retry semantics for jobs within a larger workflow. Condor is a basic building block of both GlideinWMS and Pegasus WMS.
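For a flavor of how DAGMan expresses these relationships, here is a minimal sketch of a DAG file; the job names and .sub files are hypothetical placeholders, not Engage examples:

# diamond.dag -- minimal DAGMan sketch; submit file names are placeholders
JOB setup setup.sub
JOB analyze_a analyze_a.sub
JOB analyze_b analyze_b.sub
JOB merge merge.sub
# Precedence: setup runs first, then the two analyses, then merge
PARENT setup CHILD analyze_a analyze_b
PARENT analyze_a analyze_b CHILD merge
# Retry semantics: re-run a failed analysis up to 3 times
RETRY analyze_a 3
RETRY analyze_b 3

A workflow like this is submitted with condor_submit_dag diamond.dag.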
GlideinWMS is the OSG-recommended pilot-based job submission system. GlideinWMS monitors the Condor pool and creates glideins (pilot jobs) on remote systems, inside which user jobs run. In most cases, this makes jobs easier to manage and increases scalability.
Pegasus provides a layer of abstraction over Condor, Globus and other grid services. Workflow developers define an abstract workflow graph in XML. Pegasus interprets this graph to create Condor jobs which, in turn, use Condor-G to submit jobs to sites for execution. Pegasus excels at late binding of an abstract workflow to a concrete set of resources.
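To give a sense of the abstract format, here is a heavily abbreviated, hypothetical sketch of a workflow XML (DAX) file; element details vary across Pegasus versions, and the job and file names are placeholders:

<adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.2" name="example">
  <!-- Abstract jobs name logical transformations and files, not physical paths -->
  <job id="ID0000001" name="preprocess">
    <argument>-i <file name="f.a"/> -o <file name="f.b"/></argument>
    <uses name="f.a" link="input"/>
    <uses name="f.b" link="output"/>
  </job>
  <job id="ID0000002" name="analyze">
    <argument>-i <file name="f.b"/> -o <file name="f.c"/></argument>
    <uses name="f.b" link="input"/>
    <uses name="f.c" link="output"/>
  </job>
  <!-- Precedence: analyze depends on preprocess -->
  <child ref="ID0000002">
    <parent ref="ID0000001"/>
  </child>
</adag>

At planning time Pegasus binds this abstract graph to concrete sites and file locations, which is what makes the late binding described above possible.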
The original Engage submit node used OSG MatchMaker to target and manage Condor-G jobs. The new submit node uses GlideinWMS, which requires several changes to the submit scripts; a simplified sketch of the difference follows. The Getting Started section below provides the path to example GWMS scripts for reference. For a detailed look at some of the implications of moving to GWMS for the submit files, see this tutorial. The Engage team will assist users with these changes to make migration as smooth as possible.
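As a rough illustration of the change (a sketch, not the actual Engage templates): an OSG MatchMaker workflow typically produced Condor-G grid-universe jobs naming a specific gatekeeper, while a GlideinWMS job uses the vanilla universe and lets the glidein pool choose the site:

# Old style: Condor-G grid universe (gatekeeper address is a placeholder)
universe = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs

# New style: GlideinWMS vanilla universe; the pool matches jobs to glideins
universe = vanilla
requirements = (Arch == "X86_64")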
Here are some basics to get you started using the new submit host. The commands described are available to every user upon login.
- Host: Log in to this host: engage-submit3
- Proxy Cert: Use the following command to initialize a proxy before running jobs:
voms-proxy-init -valid 500:00 -voms Engage
- Examples: See /home/rynge/osg/generic-glideinwms-example for example GlideinWMS submit scripts
- FYI, the script will prompt for a password if your proxy is not initialized
- Monitor: Use these tools to monitor your job:
The submit node provides a number of tools for monitoring system utilization in aggregate, as well as tools for understanding where individual users’ jobs are running and other characteristics about them.
Browse. Charts provide an overview of system activity, presenting an aggregate view of the system’s recent performance. The view can be refined to show activity on a single GlideinWMS Front End (FE) group, such as bigmem or htpc; by default it shows activity across all groups. A large number of granular criteria can be selected for jobs within each group, and the interactive chart lets the viewed time period be controlled with the mouse.
By Group. This view shows activity broken down by group. It’s very helpful for understanding usage patterns of the system overall, though perhaps less relevant to individual users’ daily activities.
grid_overview. For a better sense of how an individual user’s jobs are doing, use grid_overview. It provides a hierarchical view of a DAG and shows:
- Job ID: The Condor assigned unique id of the job
- DAG: The role of the job as a DAG or a job within a DAG
- Owner: The user who launched the job
- Resource: The resource where the job is executing
- Status: Current status of the job
- Command: The command executed within the job at the remote host
- Starts: Number of times the job has been executed
- TimeInState: Amount of time the job has been in the current state.
[scox@engage-submit3:~/dev/cec24/cone-osg/cone-elas-0.35]$ grid_overview
condor status overview @ 2011-04-27 17:02:46.192475

ID         DAG              Owner        Resource               Status      Command                    Starts TimeInState
========== ================ ============ ====================== =========== ========================== ====== ===========
...
7939       (DAGMan)         scox                                Running     condor_dagman              1      2:18:12
7940       |-job_1          scox         Purdue                 Running     remote-pdb-wrapper         1      2:17:52
7941       |-job_2          scox         Purdue                 Running     remote-pdb-wrapper         1      2:17:52
7942       |-job_3          scox         Purdue                 Running     remote-pdb-wrapper         1      2:17:52
7943       |-job_4          scox         Purdue                 Running     remote-pdb-wrapper         1      2:17:52
...
7952       |-job_13         scox         Purdue                 Running     remote-pdb-wrapper         1      2:17:32
7953       |-job_14         scox         Purdue                 Running     remote-pdb-wrapper         1      2:17:32
...
7982       |-job_43         scox         Purdue                 Running     remote-pdb-wrapper         1      2:16:51
7983       |-job_44         scox         Purdue                 Running     remote-pdb-wrapper         1      2:16:51
7984       |-job_45         scox         Purdue                 Running     remote-pdb-wrapper         1      2:16:51
7985       |-job_46         scox         Purdue                 Running     remote-pdb-wrapper         1      2:16:51
8143       (DAGMan)         scox                                Running     condor_dagman              1      0:23:50
8164       |-job_21         scox         Purdue                 Running     21.sh                      0      0:10:28
8173       |-job_30         scox         SPRACE                 Running     30.sh                      0      0:09:26
8175       |-job_32         scox         SPRACE                 Running     32.sh                      0      0:09:26
8180       |-job_37         scox         UConn                  Running     37.sh                      1      0:07:07
8181       |-job_38         scox         Purdue                 Running     38.sh                      2      0:06:11
8182       |-job_39         scox         Purdue                 Running     39.sh                      2      0:06:11
...
8192       |-job_49         scox         UNESP                  Running     49.sh                      1      0:07:05
8193       |-job_50         scox         UNESP                  Running     50.sh                      1      0:07:06

Site                      Total Subm  Stage Pend  Run   Other Rank  Comment
========================= ===== ===== ===== ===== ===== ===== ===== ==========================================
RENCI                     0     0     0     0     0     0     0     ( 0) jobs executing on ( 3) glideins
UConn                     1     0     0     0     1     0     0     ( 1) jobs executing on ( 7) glideins
UCSD                      0     0     0     0     0     0     0     ( 0) jobs executing on ( 13) glideins
Purdue                    54    0     0     0     54    0     0     ( 54) jobs executing on ( 68) glideins
Clemson                   1     0     0     0     1     0     4     ( 1) jobs executing on ( 10) glideins
Nebraska                  0     0     0     0     0     0     0     ( 0) jobs executing on ( 0) glideins
UNESP                     2     0     0     0     2     0     0     ( 2) jobs executing on ( 10) glideins
SPRACE                    2     0     0     0     2     0     0     ( 2) jobs executing on ( 27) glideins
Florida                   0     0     0     0     0     0     0     ( 0) jobs executing on ( 0) glideins
Like the old submit host, the new one provides scratch space (/scratch) for data files that overflow the user quota. Also like the old host, scratch space is periodically purged; see the FAQ below for the cleanup policy.
Frequently Asked Questions (FAQ)
Here’s a collection of frequently asked questions and answers.
Q. What’s the longest job I can run?
A. Jobs are limited to a wall time of <= 22 hours.
Q. How much memory can a normal job request?
A. In general, jobs can request up to about 1.7 GB of memory.
Q. I have a job that needs more memory than normally available. Are there other choices?
A. Target big memory machines by making the following changes to your job (a combined sketch appears after the note below):
1. Add this to the job's requirements:
(GLIDEIN_MaxMemMBs >= 4000)
2. Add this attribute to the job’s submit file:
+RequiresBigMem = True
NOTE: This will limit the number of sites your job can run on, so it’s not a good choice if the job can run with less memory.
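Putting both changes together, a complete submit file might look like the following sketch (executable and file names are hypothetical):

universe = vanilla
executable = bigmem_job.sh
# 1. Only match glideins advertising at least 4 GB of memory
requirements = (GLIDEIN_MaxMemMBs >= 4000)
# 2. Attribute used to route the job to big-memory resources
+RequiresBigMem = True
output = bigmem_job.out
error = bigmem_job.err
log = bigmem_job.log
queue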
Q: How do I ensure my job only matches sites with the Protein Data Bank installed?
A: Add this to the job’s requirements:
(Engage_Data_PDB == true)
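If the job already has a requirements expression, AND the clauses together, as in this sketch (the memory clause is purely illustrative):

requirements = (Engage_Data_PDB == true) && (GLIDEIN_MaxMemMBs >= 4000)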
Q: What Java versions are available to Engage jobs?
A: Java 1.5 and 1.6. Paths to each on compute nodes are:
jdk 1.6.0_25: $OSG_APP/engage/jdk1.6.0_25
jdk 1.5.0_09: $OSG_APP/engage/jdk1.5.0_09
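A job’s wrapper script can select one of these JDKs on the compute node. Here is a minimal hypothetical sketch; the script name, jar, and main class are placeholders:

#!/bin/sh
# wrapper.sh -- select jdk 1.6 from $OSG_APP and run a Java program
JAVA_HOME=$OSG_APP/engage/jdk1.6.0_25
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH
# Placeholder jar and main class
java -cp myapp.jar org.example.Main "$@"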
Q: What is the scratch cleanup policy?
A: Files older than thirty days in /scratch are deleted daily. Please make appropriate arrangements to archive data.
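To spot files approaching the cutoff before the purge does, something like the following can help (a sketch; it assumes your files live under a directory you own in /scratch):

# List your scratch files not modified in the last 25 days (hypothetical path)
find /scratch/$USER -type f -mtime +25 -ls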
Q: What are the hardware specifications for the machine?
A:
Processors: 12 hyper-threaded cores; Intel Xeon @ 2.8 GHz
Memory: 48 GB 1333 MHz RAM, 12 MB cache per core
Disk: 5 TB local disk
Network: two 1 Gbit Ethernet cards configured as one bonded interface
OS: CentOS 5.6