OSG Engage VO Submit Host
RENCI’s new Engage VO submit host is now online.
Condor
The new submit host uses Condor for job scheduling. Condor provides the DAGMan workflow manager, which defines precedence relationships and retry semantics for jobs within a larger workflow. Condor is also a foundational component of GlideinWMS and Pegasus WMS.
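As a sketch of how DAGMan expresses precedence and retries, a hypothetical two-node workflow file might look like this (the job names and submit-file names are illustrative, not from an actual Engage workflow):

```
# diamond.dag - hypothetical DAGMan input file
# Each JOB line names a Condor submit description file.
JOB A a.sub
JOB B b.sub
# Precedence: B starts only after A completes successfully.
PARENT A CHILD B
# Retry semantics: re-run B up to three times if it fails.
RETRY B 3
```

Such a file would be submitted with condor_submit_dag, which runs the condor_dagman scheduler job seen later in the grid_overview output.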
GlideinWMS
GlideinWMS is the OSG-recommended pilot-based job submission system. GlideinWMS monitors the Condor pool and submits glide-ins (pilot jobs) to remote systems, which join the pool and run user jobs. In most cases, this makes jobs easier to manage and improves scalability.
Pegasus WMS
Pegasus provides a layer of abstraction over Condor, Globus and other grid services. Workflow developers define an abstract workflow graph in XML. Pegasus interprets this graph to create Condor jobs which, in turn, use Condor-G to submit jobs to sites for execution. Pegasus excels at late binding of an abstract workflow to a concrete set of resources.
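To make "abstract workflow graph in XML" concrete, a minimal fragment in the spirit of the Pegasus DAX 3.x schema might look like the following. This is a hedged sketch: the job names, file names, and IDs are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical abstract workflow (DAX) fragment; names are illustrative. -->
<adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.2" name="example">
  <!-- A job is described abstractly: no site or physical paths. -->
  <job id="ID0000001" name="preprocess">
    <uses name="input.dat" link="input"/>
    <uses name="mid.dat" link="output"/>
  </job>
  <job id="ID0000002" name="analyze">
    <uses name="mid.dat" link="input"/>
  </job>
  <!-- Precedence: analyze runs only after preprocess completes. -->
  <child ref="ID0000002">
    <parent ref="ID0000001"/>
  </child>
</adag>
```

Pegasus's planner turns a graph like this into concrete Condor/Condor-G jobs bound to whatever resources are available at planning time, which is the late binding described above.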
Migration Process
The original Engage submit node used OSG Match Maker to target and manage Condor-G jobs. The new submit node uses GlideinWMS. This requires several changes to submit scripts. The Getting Started section below provides the path to example GWMS scripts for reference. For a detailed look at some of the implications of moving to GWMS for the submit files, see this tutorial. The Engage team will assist users with these changes to make migration as smooth as possible.
Getting Started
Here are some basics to get you started using the new submit host. The commands described are available to every user upon login.
- Host: Log in to this host:
  engage-submit3.renci.org
- Proxy Cert: Use the following command to initialize a proxy before running jobs:
  voms-proxy-init -valid 500:00 -voms Engage
- Scripts: See /home/rynge/osg/generic-glideinwms-example
  FYI, the script will prompt for a password if your proxy is not initialized.
- Monitor: Use these tools to monitor your jobs:
  - Grid Overview:
    [scox@engage-submit3:~]$ grid_overview
    From a browser: web grid_overview
  - Condor Q:
    condor_q <username>
    condor_q -better-analyze <username>
- Web Interface: Grid Overview
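For orientation before looking at the example directory: at its core, a GlideinWMS job is an ordinary vanilla-universe Condor submit file, since the glide-ins present themselves as regular pool slots. The following is an illustrative sketch only; all file names are placeholders, not the contents of the actual example scripts.

```
# job.sub - illustrative vanilla-universe submit file for the glidein pool.
# File names below are placeholders; see the example directory for specifics.
universe   = vanilla
executable = myjob.sh
output     = myjob.out
error      = myjob.err
log        = myjob.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

A job like this would be submitted with condor_submit job.sub and then tracked with condor_q or grid_overview as shown above.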
Monitoring
The submit node provides a number of tools for monitoring system utilization in aggregate.
It also provides tools for understanding where individual user jobs are running and other characteristics about them.
Browse. Charts provide an overview of system activity. The image at right shows an aggregate view of the system’s recent performance. It can be refined to show activity on a single GlideinWMS Front End (FE) group such as bigmem or htpc. By default it shows activity across groups.
This view provides the ability to select a large number of granular criteria about jobs within each group. Also, the chart is interactive and allows the viewed time period to be controlled with the mouse.
By Group. This view shows activity broken down by group. It’s very helpful in understanding usage patterns of the system overall, though perhaps less relevant to individual users’ daily activities.
grid_overview. For a better sense of how an individual user’s jobs are doing, use grid_overview. It provides a hierarchical view of a DAG and shows:
- Job ID: The Condor assigned unique id of the job
- DAG: The role of the job as a DAG or a job within a DAG
- Owner: The user who launched the job
- Resource: The resource where the job is executing
- Status: Current status of the job
- Command: The command executed within the job at the remote host
- Starts: Number of times the job has been executed
- TimeInState: Amount of time the job has been in the current state.
[scox@engage-submit3:~/dev/cec24/cone-osg/cone-elas-0.35]$ grid_overview
condor status overview @ 2011-04-27 17:02:46.192475
ID         DAG              Owner        Resource               Status      Command                    Starts TimeInState
========== ================ ============ ====================== =========== ========================== ====== ===========
...
7939       (DAGMan)         scox                                Running     condor_dagman                   1     2:18:12
7940       |-job_1          scox         Purdue                 Running     remote-pdb-wrapper              1     2:17:52
7941       |-job_2          scox         Purdue                 Running     remote-pdb-wrapper              1     2:17:52
7942       |-job_3          scox         Purdue                 Running     remote-pdb-wrapper              1     2:17:52
7943       |-job_4          scox         Purdue                 Running     remote-pdb-wrapper              1     2:17:52
...
7952       |-job_13         scox         Purdue                 Running     remote-pdb-wrapper              1     2:17:32
7953       |-job_14         scox         Purdue                 Running     remote-pdb-wrapper              1     2:17:32
...
7982       |-job_43         scox         Purdue                 Running     remote-pdb-wrapper              1     2:16:51
7983       |-job_44         scox         Purdue                 Running     remote-pdb-wrapper              1     2:16:51
7984       |-job_45         scox         Purdue                 Running     remote-pdb-wrapper              1     2:16:51
7985       |-job_46         scox         Purdue                 Running     remote-pdb-wrapper              1     2:16:51
8143       (DAGMan)         scox                                Running     condor_dagman                   1     0:23:50
8164       |-job_21         scox         Purdue                 Running     21.sh                           0     0:10:28
8173       |-job_30         scox         SPRACE                 Running     30.sh                           0     0:09:26
8175       |-job_32         scox         SPRACE                 Running     32.sh                           0     0:09:26
8180       |-job_37         scox         UConn                  Running     37.sh                           1     0:07:07
8181       |-job_38         scox         Purdue                 Running     38.sh                           2     0:06:11
8182       |-job_39         scox         Purdue                 Running     39.sh                           2     0:06:11
...
8192       |-job_49         scox         UNESP                  Running     49.sh                           1     0:07:05
8193       |-job_50         scox         UNESP                  Running     50.sh                           1     0:07:06

Site                      Total Subm  Stage Pend  Run   Other Rank  Comment
========================= ===== ===== ===== ===== ===== ===== ===== ==========================================
RENCI                         0     0     0     0     0     0     0 (  0) jobs executing on (  3) glideins
UConn                         1     0     0     0     1     0     0 (  1) jobs executing on (  7) glideins
UCSD                          0     0     0     0     0     0     0 (  0) jobs executing on ( 13) glideins
Purdue                       54     0     0     0    54     0     0 ( 54) jobs executing on ( 68) glideins
Clemson                       1     0     0     0     1     0     4 (  1) jobs executing on ( 10) glideins
Nebraska                      0     0     0     0     0     0     0 (  0) jobs executing on (  0) glideins
UNESP                         2     0     0     0     2     0     0 (  2) jobs executing on ( 10) glideins
SPRACE                        2     0     0     0     2     0     0 (  2) jobs executing on ( 27) glideins
Florida                       0     0     0     0     0     0     0 (  0) jobs executing on (  0) glideins
Policy
Like the old submit host, the new one provides scratch space for data files that overflow the user quota. Also like the old submit host, scratch space is periodically purged.
Frequently Asked Questions (FAQ)
Here’s a collection of frequently asked questions and answers.
——————————————————————————————————————————–
Q. What’s the longest job I can run?
A. Jobs are limited to a wall time of <= 22 hours.
——————————————————————————————————————————–
Q. How much memory can a normal job request?
A. In general, jobs can request <= ~1.7GB of memory.
——————————————————————————————————————————–
Q. I have a job that needs more memory than normally available. Are there other choices?
A. Target big memory machines by making the following changes to your job:
1. Add this to the job's requirements:
(GLIDEIN_MaxMemMBs >= 4000)
2. Add this to the job’s body:
+RequiresBigMem = True
NOTE: This will limit the number of sites your job can run on, so it’s not a good choice if the job can run with less memory.
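Put together, the two additions above would sit in a job's submit file roughly like this. Only the requirements and +RequiresBigMem lines come from the answer above; the surrounding lines are an illustrative sketch with placeholder names.

```
# Illustrative submit-file fragment targeting big-memory glideins.
# Only the requirements and +RequiresBigMem lines are from the FAQ answer.
universe     = vanilla
executable   = bigmem_job.sh
requirements = (GLIDEIN_MaxMemMBs >= 4000)
+RequiresBigMem = True
queue
```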
——————————————————————————————————————————–
Q: How do I ensure my job only matches sites with the Protein Data Bank installed?
A: Add this to the job’s requirements:
(Engage_Data_PDB == true)
——————————————————————————————————————————–
Q: What Java versions are available to Engage jobs?
A: Java 1.5 and 1.6. Paths to each on compute nodes are:
jdk 1.6.0_25: $OSG_APP/engage/jdk1.6.0_25
jdk 1.5.0_09: $OSG_APP/engage/jdk1.5.0_09
——————————————————————————————————————————–
Q: What is the scratch cleanup policy?
A: Files older than thirty days in /scratch are deleted daily. Please make appropriate arrangements to archive data.
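The purge described above presumably amounts to a daily find over /scratch. The following is a hypothetical sketch of such a cleanup, not the actual maintenance job; purge_scratch is an invented name.

```shell
# Hypothetical sketch of the stated scratch policy: remove regular files
# under the given directory whose contents have not been modified in more
# than 30 days. The real cleanup job on engage-submit3 is not published.
purge_scratch() {
    # -mtime +30 matches files last modified more than 30 days ago
    find "$1" -type f -mtime +30 -delete
}
```

Running something like purge_scratch /scratch from a daily cron entry would match the stated policy.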
——————————————————————————————————————————–
Q: What are the hardware specifications for the machine?
A:
Processors: 12 hyper-threaded cores; Intel Xeon @ 2.8 GHz
Memory: 48 GB 1333 MHz RAM, 12 MB cache per core
Disk: 5 TB local disk
Network: 2 x 1 Gbit Ethernet cards configured as one bonded interface
OS: CentOS 5.6
Comments
Is there a command to list the scientific tools installed on engage-submit?
Hi Poornima,
engage-submit3 (ES3) serves the Engage Virtual Organization. As such, it’s used by researchers in a wide variety of fields. In current practice, research teams copy executables to the submit node and stage them to compute nodes as part of their jobs. That is, the software provided by the Engage VO is the infrastructure for job execution and management, such as Condor, GlideinWMS and Pegasus. Discipline-specific scientific tools are up to the researcher.
Now, if you have a particular tool that you’d like to use at Engage sites, we can look into copying that tool to the sites supporting Engage users. We have maintenance jobs that do this, so that when you develop jobs, they don’t need to stage in the executables each time.
Let me know if you have more questions.
Thanks,
Steve
Thanks for the article. Where can I ask more questions, specifically about GlideinWMS usage with OSG?
Hi Ketan, please feel free to post questions here. There’s a good chance we’ll end up answering someone else’s question in the process.
Steve
Here is my question: I am trying to use GlideinWMS to submit many jobs, typically more than 20 at a time. However, I consistently see that only 4 jobs reach the R (running) status; the rest always stay in the I (idle) status. This happens when submitting from the host managed by our team.
However, when I submit the same jobs from the Engage submit host, many jobs reach the R state quickly. What could be the reason? Are there any configuration settings I am missing?
Thanks.
Hi Ketan,
There are many reasons jobs might not run so it’s not possible to say without more information.
What does the output of condor_q -better-analyze say for the job ids that stay idle?
It’s also useful to check the GridManager.log, as this often contains useful information on Condor-related failures.
Consulting the glideinwms log files is also a good idea to make sure the front end is communicating effectively with the factory.
Steve
Thanks for these leads, Steve. I ran -better-analyze on both platforms and realized that the Engage submit host has access to a much wider resource base than our host does. For instance, here are the outputs from Engage vs. our host for a single idle GWMS job:
The engage host:
2093407.019: Run analysis summary. Of 3018 machines,
57 are rejected by your job’s requirements
1276 reject your job because of their own requirements
16 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
1576 match but will not currently preempt their existing job
0 match but are currently offline
93 are available to run your job
Our host:
390889.017: Run analysis summary. Of 4 machines,
0 are rejected by your job’s requirements
0 reject your job because of their own requirements
4 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
Something is clearly missing from the interface we have. Could you give some more clues on what configuration is required to properly set up a GlideinWMS environment? Or if you have any pointers to mailing lists or documents, that would be very helpful.
Thanks, Ketan.
Glad to hear that was useful.
The GlideinWMS documentation is very good. I’d start there:
http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html
Your result suggests that your GlideinWMS frontend is not configured correctly to talk to a factory. Again, see the documentation on how to do this.
Ultimately, you’ll need to work with the administrators of a GlideinWMS factory. We use the one at UCSD.
Hi Steve, One more question:
With GlideinWMS, is there a way to end up on a particular host or a list of hosts? And further, is it possible to blacklist some hosts, for example because the environment is not suitable for the application?
Ketan, add a predicate like this to the requirements of the Condor submit file:
GLIDEIN_Site == "UCSD"
The != operator will exclude the specified site. Use a parenthesized expression with && (or ||) operators to build a list of sites to target or avoid. Of course, these are OSG sites, not individual hosts. I can’t immediately think of a way to target a specific host (as in a particular compute node), but I’m pretty sure that’s not what you were asking.
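For example, hypothetical requirements expressions along those lines might look like the following (the site names are illustrative; combine these with any existing requirements using &&):

```
# Run only at UCSD
requirements = (GLIDEIN_Site == "UCSD")

# Run at either of two sites
requirements = ((GLIDEIN_Site == "UCSD") || (GLIDEIN_Site == "Purdue"))

# Avoid two sites (blacklist)
requirements = ((GLIDEIN_Site != "UCSD") && (GLIDEIN_Site != "Purdue"))
```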
Thanks Steve. Is OSG_APP shared site-wide? I ask because we have our app installed on a select number of hosts, so we want to end up only on those hosts in order to access our application. Any suggestions?
Yes – by convention, $OSG_APP is defined in the shell executing the compute job at the worker node. For further details, see https://twiki.grid.iu.edu/bin/view/ArchivedDocumentation/OSG/OSG080/StorageParameterOsgApp .
It is mounted to the worker nodes so you don’t need to be concerned about the specific compute host the job executes on.
Thanks Steve, one possibly last question: is there a published catalogue that lists the sites and their short names for a particular VO? My VO is Engage. I searched the OSG site but could not find one.
Thanks.
Is it possible to specify a wall-time in the job description of a gwms job?
Thanks.