Engage Submit Host


RENCI’s new OSG Engage VO submit host is now online.

Condor

The new submit host uses Condor for job scheduling. Condor is the basic building block of both GlideinWMS and Pegasus WMS. It also provides the DAGMan workflow manager, which lets a workflow define precedence relationships and retry semantics for the jobs it contains.
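
For example, a minimal DAG file that runs job A before jobs B and C, retrying B on failure, looks like this (the job and submit-file names are hypothetical; a.sub, b.sub and c.sub are ordinary Condor submit files):

# diamond.dag -- a minimal DAGMan workflow sketch
JOB A a.sub
JOB B b.sub
JOB C c.sub
# A must complete before B and C start
PARENT A CHILD B C
# resubmit B up to 3 times if it fails
RETRY B 3

Submit the workflow with: condor_submit_dag diamond.dag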

GlideinWMS

GlideinWMS is the OSG-recommended pilot-based job submission system. GlideinWMS monitors the Condor pool and submits pilot jobs (glide-ins) to remote sites; when a glide-in starts, it joins the pool and runs waiting user jobs. In most cases, this makes jobs easier to manage and improves scalability.

Pegasus WMS

Pegasus provides a layer of abstraction over Condor, Globus and other grid services. Workflow developers define an abstract workflow graph in XML. Pegasus interprets this graph to create Condor jobs which, in turn, use Condor-G to submit jobs to sites for execution. Pegasus excels at late binding of an abstract workflow to a concrete set of resources.
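
As a rough sketch of the idea (the exact elements and attributes depend on the DAX schema version, and the job and file names here are hypothetical), an abstract workflow with two dependent jobs looks something like:

<adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.2" name="example">
  <job id="ID0000001" name="preprocess">
    <argument>-i <file name="f.a"/> -o <file name="f.b"/></argument>
    <uses name="f.a" link="input"/>
    <uses name="f.b" link="output"/>
  </job>
  <job id="ID0000002" name="analyze">
    <argument>-i <file name="f.b"/> -o <file name="f.c"/></argument>
    <uses name="f.b" link="input"/>
    <uses name="f.c" link="output"/>
  </job>
  <child ref="ID0000002">
    <parent ref="ID0000001"/>
  </child>
</adag>

Pegasus plans this abstract graph against its site and transformation catalogs to produce the concrete Condor jobs.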

Migration Process

The original Engage submit node used OSG Match Maker to target and manage Condor-G jobs. The new submit node uses GlideinWMS, which changes the submit scripts in several ways; a minimal GWMS-style submit file is sketched below. The Getting Started section below provides the path to example GWMS scripts for reference. For a detailed look at some of the implications of moving to GWMS for the submit files, see this tutorial. The Engage team will assist users with these changes to make migration as smooth as possible.
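
The practical difference for most users is that GWMS jobs are plain vanilla-universe Condor jobs submitted to the local pool, rather than grid-universe Condor-G jobs aimed at a specific site. A minimal sketch (file names hypothetical):

# hello.sub -- minimal vanilla-universe job for the glidein pool
universe   = vanilla
executable = hello.sh
output     = hello.out
error      = hello.err
log        = hello.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue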

Getting Started

Here are some basics to get you started using the new submit host. The commands described are available to every user upon login.

  • Host: Log in to this host:
    • engage-submit3.renci.org
  • Proxy Cert: Use the following command to initialize a proxy before running jobs:
    • voms-proxy-init -valid 500:00 -voms Engage
  • Scripts:
    • See /home/rynge/osg/generic-glideinwms-example
    • FYI, the script will prompt for a password if your proxy is not initialized

  • Monitor: Use the tools described in the Monitoring section below to track your jobs.

Monitoring

The submit node provides a number of tools for monitoring system utilization in aggregate.

It also provides tools for understanding where individual user jobs are running and other characteristics about them.

Browse. Charts provide an overview of system activity, giving an aggregate view of the system’s recent performance. The view can be refined to show activity on a single GlideinWMS Front End (FE) group, such as bigmem or htpc; by default it shows activity across all groups.

This view also lets you filter on a large number of granular criteria about jobs within each group, and the chart is interactive: the viewed time period can be controlled with the mouse.

By Group. This view shows activity broken down by group. It’s very helpful in understanding usage patterns of the system overall, though perhaps less relevant to individual users’ daily activities.

grid_overview. For a better sense of how an individual user’s jobs are doing, use grid_overview. It provides a hierarchical view of a DAG and shows:

  • Job ID: The Condor assigned unique id of the job
  • DAG: Whether the entry is a DAG (a DAGMan instance) or a job within a DAG
  • Owner: The user who launched the job
  • Resource: The resource where the job is executing
  • Status: Current status of the job
  • Command: The command executed within the job at the remote host
  • Starts: Number of times the job has been executed
  • TimeInState: Amount of time the job has been in the current state
[scox@engage-submit3:~/dev/cec24/cone-osg/cone-elas-0.35]$ grid_overview 

condor status overview @ 2011-04-27 17:02:46.192475

ID         DAG              Owner        Resource               Status      Command                    Starts TimeInState
========== ================ ============ ====================== =========== ========================== ====== ===========
...
7939       (DAGMan)         scox                                Running     condor_dagman                  1   2:18:12
7940         |-job_1        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
7941         |-job_2        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
7942         |-job_3        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
7943         |-job_4        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
...
7952         |-job_13       scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:32
7953         |-job_14       scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:32
...
7982         |-job_43       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
7983         |-job_44       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
7984         |-job_45       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
7985         |-job_46       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
8143       (DAGMan)         scox                                Running     condor_dagman                  1   0:23:50
8164         |-job_21       scox         Purdue                 Running     21.sh                          0   0:10:28
8173         |-job_30       scox         SPRACE                 Running     30.sh                          0   0:09:26
8175         |-job_32       scox         SPRACE                 Running     32.sh                          0   0:09:26
8180         |-job_37       scox         UConn                  Running     37.sh                          1   0:07:07
8181         |-job_38       scox         Purdue                 Running     38.sh                          2   0:06:11
8182         |-job_39       scox         Purdue                 Running     39.sh                          2   0:06:11
...
8192         |-job_49       scox         UNESP                  Running     49.sh                          1   0:07:05
8193         |-job_50       scox         UNESP                  Running     50.sh                          1   0:07:06

Site                      Total  Subm Stage  Pend  Run  Other Rank Comment
========================= ===== ===== ===== ===== ===== ===== ===== ==========================================
RENCI                         0     0     0     0     0     0     0 (   0) jobs executing on (   3) glideins
UConn                         1     0     0     0     1     0     0 (   1) jobs executing on (   7) glideins
UCSD                          0     0     0     0     0     0     0 (   0) jobs executing on (  13) glideins
Purdue                       54     0     0     0    54     0     0 (  54) jobs executing on (  68) glideins
Clemson                       1     0     0     0     1     0     4 (   1) jobs executing on (  10) glideins
Nebraska                      0     0     0     0     0     0     0 (   0) jobs executing on (   0) glideins
UNESP                         2     0     0     0     2     0     0 (   2) jobs executing on (  10) glideins
SPRACE                        2     0     0     0     2     0     0 (   2) jobs executing on (  27) glideins
Florida                       0     0     0     0     0     0     0 (   0) jobs executing on (   0) glideins

Policy

Like the old submit host, the new one provides scratch space for data files that overflow the user quota. Also like the old host, scratch space is periodically purged.

Frequently Asked Questions (FAQ)

Here’s a collection of frequently asked questions and answers.

----------------------------------------

Q. What’s the longest job I can run?

A. Jobs are limited to a wall time of <= 22 hours.

----------------------------------------

Q. How much memory can a normal job request?

A. In general, jobs can request at most about 1.7 GB of memory.

----------------------------------------

Q. I have a job that needs more memory than normally available. Are there other choices?

A. Target big memory machines by making the following changes to your job:

1. Add this to the job's requirements:

(GLIDEIN_MaxMemMBs >= 4000)

2. Add this to the job’s body:

+RequiresBigMem = True

NOTE: This will limit the number of sites your job can run on so it’s not a good choice if the job can run with less memory.
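
Put together, the relevant part of a big-memory submit file looks like this (only the requirements clause and the +RequiresBigMem attribute come from the answer above; the rest is a generic sketch with hypothetical names):

universe     = vanilla
executable   = bigmem_task.sh
requirements = (GLIDEIN_MaxMemMBs >= 4000)
+RequiresBigMem = True
log          = bigmem.log
queue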

----------------------------------------

Q: How do I ensure my job only matches sites with the Protein Data Bank installed?

A: Add this to the job’s requirements:

(Engage_Data_PDB == true)
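
In a submit file this goes on the requirements line, for example:

requirements = (Engage_Data_PDB == true)

or combined with other predicates, e.g. the big-memory clause from the previous answer:

requirements = (Engage_Data_PDB == true) && (GLIDEIN_MaxMemMBs >= 4000)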

----------------------------------------

Q: What Java versions are available to Engage jobs?

A: Java 1.5 and 1.6. Paths to each on compute nodes are:

jdk 1.6.0_25: $OSG_APP/engage/jdk1.6.0_25
jdk 1.5.0_09: $OSG_APP/engage/jdk1.5.0_09
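
A simple way to use one of these is a wrapper script that the job runs on the compute node, along these lines (the wrapper and jar names are hypothetical):

#!/bin/sh
# run-java.sh -- select jdk 1.6 on the compute node
JAVA_HOME=$OSG_APP/engage/jdk1.6.0_25
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH
exec java -jar my-app.jar "$@"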

----------------------------------------

Q: What is the scratch cleanup policy?

A: Files older than thirty days in /scratch are deleted daily. Please make appropriate arrangements to archive data.

----------------------------------------

Q: What are the hardware specifications for the machine?

A:

Processors: 12 hyper-threaded cores; Intel Xeon @ 2.8 GHz
Memory:     48 GB 1333 MHz RAM, 12 MB cache per core
Disk:       5 TB local disk
Network:    2 x 1 Gbit Ethernet cards configured as 1 bonded interface
OS:         CentOS 5.6

Conclusion

The new submit node with GlideinWMS puts Engage in a stronger position to serve the science community. We’re working on tools to make the migration as seamless as possible and the future as simple and productive as possible. Please feel free to leave comments and questions.

14 Responses to Engage Submit Host

  1. Poornima Pochana says:

    Is there a command to list the scientific tools installed on engage-submit?

    • stevencox says:

      Hi Poornima,

      engage-submit3 (ES3) serves the Engage Virtual Organization, so it’s used by researchers in a wide variety of fields. In current practice, research teams copy executables to the submit node and stage them to compute nodes as part of their jobs. That is, the software provided by the Engage VO is the infrastructure for job execution and management, such as Condor, GlideinWMS and Pegasus; discipline-specific science tools are up to the researcher.

      Now, if you have a particular tool that you’d like to use at Engage sites we can look into copying that tool to sites supporting Engage users for you. We have maintenance jobs that will do this so that when you develop jobs, they don’t need to stage in the executables each time.

      Let me know if you have more questions.

      Thanks,

      Steve

  2. Ketan says:

    Thanks for the article. Where can I ask more questions specifically about glideinWMS usage with OSG?

  3. Ketan says:

    Here is my question: I am trying to use glideinwms to submit multiple jobs, usually more than 20 at a time. However, I consistently see that only 4 jobs get the R status and the rest are always in I status. This happens when submitting from the host managed by our team.

    However, when I submit the same job from the engage submit host, I do get many jobs in the R state quickly. What could be the reason? Are there any configuration settings that I am missing?

    Thanks.

    • stevencox says:

      Hi Ketan,

      There are many reasons jobs might not run so it’s not possible to say without more information.

      What does the output of condor_q -better-analyze say for the job ids that stay idle?

      It’s also worth checking GridManager.log, as this often contains useful information on Condor-related failures.

      Consulting the glideinwms log files is also a good idea to make sure the front end is communicating effectively with the factory.

      Steve

      • Ketan says:

        Thanks for these leads, Steve. I ran -better-analyze on both platforms and realized that the engage submit host has access to a much wider resource base than our host does. For instance, the following are the outputs from engage vs. our host for a single idle gwms job:

        The engage host:
        2093407.019: Run analysis summary. Of 3018 machines,
        57 are rejected by your job’s requirements
        1276 reject your job because of their own requirements
        16 match but are serving users with a better priority in the pool
        0 match but reject the job for unknown reasons
        1576 match but will not currently preempt their existing job
        0 match but are currently offline
        93 are available to run your job

        Our host:
        390889.017: Run analysis summary. Of 4 machines,
        0 are rejected by your job’s requirements
        0 reject your job because of their own requirements
        4 match but are serving users with a better priority in the pool
        0 match but reject the job for unknown reasons
        0 match but will not currently preempt their existing job
        0 are available to run your job

        There clearly seems to be something missing from our interface. Could you give some more clues on what configuration is required to properly set up a glideinwms environment? Or if you have any pointers to mailing lists or documents, that would be very helpful.

        Thanks, Ketan.

      • stevencox says:

        Glad to hear that was useful.

        The GlideinWMS documentation is very good. I’d start there:

        http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html

        Your result suggests that your glideinwms frontend is not configured correctly to talk to a backend. Again, see the documentation on how to do this.

        Ultimately, you’ll need to work with the administrators of a GlideinWMS factory. We use the one at UCSD.

  4. Ketan says:

    Hi Steve, One more question:

    With GlideinWMS, is there a way one can end up on a particular host or a list of hosts? And, further, is it possible to blacklist some hosts, for instance because the environment is not suitable for one’s applications?

    • stevencox says:

      Ketan, add a predicate like this to the requirements of the Condor submit file:

      GLIDEIN_Site == "UCSD"

      The != operator will exclude the specified site. To build a list of sites to target or avoid, use a parenthesized expression: join == terms with || to target several sites, or != terms with && to avoid several. Of course, these are OSG sites and not really hosts. I can’t immediately think of a way to target a specific host (as in a particular compute node), but I’m pretty sure that’s not what you were asking.
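
      For example (the site names here are just placeholders):

      # target one of a list of sites
      requirements = (GLIDEIN_Site == "UCSD") || (GLIDEIN_Site == "Purdue")

      # avoid specific sites
      requirements = (GLIDEIN_Site != "SiteA") && (GLIDEIN_Site != "SiteB")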

  5. Ketan says:

    Thanks Steve, one possibly last question: is there a published catalogue that lists the sites and their short names for a particular VO? My VO is Engage. I tried searching the OSG site but could not find one.

    Thanks.

  6. Ketan says:

    Is it possible to specify a wall-time in the job description of a gwms job?

    Thanks.
