Engage Submit Host


RENCI’s new OSG Engage VO submit host is now online.

Condor

The new submit host uses Condor for job scheduling. Condor is the basic building block of both GlideinWMS and Pegasus WMS. It also provides the DAGMan workflow manager, which lets a workflow define precedence relationships and retry semantics for the jobs it contains.
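
For example, a minimal DAG file that runs job A before jobs B and C, retrying B on failure, looks like this (the job and submit-file names are hypothetical; a.sub, b.sub and c.sub are ordinary Condor submit files):

# diamond.dag -- a minimal DAGMan workflow sketch
JOB A a.sub
JOB B b.sub
JOB C c.sub
# A must complete before B and C start
PARENT A CHILD B C
# resubmit B up to 3 times if it fails
RETRY B 3

Submit the workflow with: condor_submit_dag diamond.dag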

GlideinWMS

GlideinWMS is the OSG-recommended pilot-based job submission system. GlideinWMS monitors the Condor pool and submits pilot jobs (glide-ins) to remote sites; when a glide-in starts, it joins the pool and runs waiting user jobs. In most cases, this makes jobs easier to manage and improves scalability.

Pegasus WMS

Pegasus provides a layer of abstraction over Condor, Globus and other grid services. Workflow developers define an abstract workflow graph in XML. Pegasus interprets this graph to create Condor jobs which, in turn, use Condor-G to submit jobs to sites for execution. Pegasus excels at late binding of an abstract workflow to a concrete set of resources.
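
As a rough sketch of the idea (the exact elements and attributes depend on the DAX schema version, and the job and file names here are hypothetical), an abstract workflow with two dependent jobs looks something like:

<adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.2" name="example">
  <job id="ID0000001" name="preprocess">
    <argument>-i <file name="f.a"/> -o <file name="f.b"/></argument>
    <uses name="f.a" link="input"/>
    <uses name="f.b" link="output"/>
  </job>
  <job id="ID0000002" name="analyze">
    <argument>-i <file name="f.b"/> -o <file name="f.c"/></argument>
    <uses name="f.b" link="input"/>
    <uses name="f.c" link="output"/>
  </job>
  <child ref="ID0000002">
    <parent ref="ID0000001"/>
  </child>
</adag>

Pegasus plans this abstract graph against its site and transformation catalogs to produce the concrete Condor jobs.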

Migration Process

The original Engage submit node used OSG Match Maker to target and manage Condor-G jobs. The new submit node uses GlideinWMS, which changes the submit scripts in several ways; a minimal GWMS-style submit file is sketched below. The Getting Started section below provides the path to example GWMS scripts for reference. For a detailed look at some of the implications of moving to GWMS for the submit files, see this tutorial. The Engage team will assist users with these changes to make migration as smooth as possible.
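
The practical difference for most users is that GWMS jobs are plain vanilla-universe Condor jobs submitted to the local pool, rather than grid-universe Condor-G jobs aimed at a specific site. A minimal sketch (file names hypothetical):

# hello.sub -- minimal vanilla-universe job for the glidein pool
universe   = vanilla
executable = hello.sh
output     = hello.out
error      = hello.err
log        = hello.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue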

Getting Started

Here are some basics to get you started using the new submit host. The commands described are available to every user upon login.

  • Host: Log in to this host:
    • engage-submit3.renci.org
  • Proxy Cert: Use the following command to initialize a proxy before running jobs:
    • voms-proxy-init -valid 500:00 -voms Engage
  • Scripts:
    • See /home/rynge/osg/generic-glideinwms-example
    • FYI, the script will prompt for a password if your proxy is not initialized

  • Monitor: Use the tools described in the Monitoring section below to track your jobs.

Monitoring

The submit node provides a number of tools for monitoring system utilization in aggregate.

It also provides tools for understanding where individual user jobs are running and other characteristics about them.

Browse. Charts provide an overview of system activity, giving an aggregate view of the system’s recent performance. The view can be refined to show activity on a single GlideinWMS Front End (FE) group, such as bigmem or htpc; by default it shows activity across all groups.

This view also lets you filter on a large number of granular criteria about jobs within each group, and the chart is interactive: the viewed time period can be controlled with the mouse.

By Group. This view shows activity broken down by group. It’s very helpful in understanding usage patterns of the system overall, though perhaps less relevant to individual users’ daily activities.

grid_overview. For a better sense of how an individual user’s jobs are doing, use grid_overview. It provides a hierarchical view of a DAG and shows:

  • Job ID: The Condor assigned unique id of the job
  • DAG: Whether the entry is a DAG (a DAGMan instance) or a job within a DAG
  • Owner: The user who launched the job
  • Resource: The resource where the job is executing
  • Status: Current status of the job
  • Command: The command executed within the job at the remote host
  • Starts: Number of times the job has been executed
  • TimeInState: Amount of time the job has been in the current state
[scox@engage-submit3:~/dev/cec24/cone-osg/cone-elas-0.35]$ grid_overview 

condor status overview @ 2011-04-27 17:02:46.192475

ID         DAG              Owner        Resource               Status      Command                    Starts TimeInState
========== ================ ============ ====================== =========== ========================== ====== ===========
...
7939       (DAGMan)         scox                                Running     condor_dagman                  1   2:18:12
7940         |-job_1        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
7941         |-job_2        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
7942         |-job_3        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
7943         |-job_4        scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:52
...
7952         |-job_13       scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:32
7953         |-job_14       scox         Purdue                 Running     remote-pdb-wrapper             1   2:17:32
...
7982         |-job_43       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
7983         |-job_44       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
7984         |-job_45       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
7985         |-job_46       scox         Purdue                 Running     remote-pdb-wrapper             1   2:16:51
8143       (DAGMan)         scox                                Running     condor_dagman                  1   0:23:50
8164         |-job_21       scox         Purdue                 Running     21.sh                          0   0:10:28
8173         |-job_30       scox         SPRACE                 Running     30.sh                          0   0:09:26
8175         |-job_32       scox         SPRACE                 Running     32.sh                          0   0:09:26
8180         |-job_37       scox         UConn                  Running     37.sh                          1   0:07:07
8181         |-job_38       scox         Purdue                 Running     38.sh                          2   0:06:11
8182         |-job_39       scox         Purdue                 Running     39.sh                          2   0:06:11
...
8192         |-job_49       scox         UNESP                  Running     49.sh                          1   0:07:05
8193         |-job_50       scox         UNESP                  Running     50.sh                          1   0:07:06

Site                      Total  Subm Stage  Pend  Run  Other Rank Comment
========================= ===== ===== ===== ===== ===== ===== ===== ==========================================
RENCI                         0     0     0     0     0     0     0 (   0) jobs executing on (   3) glideins
UConn                         1     0     0     0     1     0     0 (   1) jobs executing on (   7) glideins
UCSD                          0     0     0     0     0     0     0 (   0) jobs executing on (  13) glideins
Purdue                       54     0     0     0    54     0     0 (  54) jobs executing on (  68) glideins
Clemson                       1     0     0     0     1     0     4 (   1) jobs executing on (  10) glideins
Nebraska                      0     0     0     0     0     0     0 (   0) jobs executing on (   0) glideins
UNESP                         2     0     0     0     2     0     0 (   2) jobs executing on (  10) glideins
SPRACE                        2     0     0     0     2     0     0 (   2) jobs executing on (  27) glideins
Florida                       0     0     0     0     0     0     0 (   0) jobs executing on (   0) glideins

Policy

Like the old submit host, the new one provides scratch space for data files that overflow the user quota. Also like the old host, scratch space is periodically purged.

Frequently Asked Questions (FAQ)

Here’s a collection of frequently asked questions and answers.

----------------------------------------

Q. What’s the longest job I can run?

A. Jobs are limited to a wall time of <= 22 hours.

----------------------------------------

Q. How much memory can a normal job request?

A. In general, jobs can request at most about 1.7 GB of memory.

----------------------------------------

Q. I have a job that needs more memory than normally available. Are there other choices?

A. Target big memory machines by making the following changes to your job:

1. Add this to the job's requirements:

(GLIDEIN_MaxMemMBs >= 4000)

2. Add this to the job’s body:

+RequiresBigMem = True

NOTE: This will limit the number of sites your job can run on so it’s not a good choice if the job can run with less memory.
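
Put together, the relevant part of a big-memory submit file looks like this (only the requirements clause and the +RequiresBigMem attribute come from the answer above; the rest is a generic sketch with hypothetical names):

universe     = vanilla
executable   = bigmem_task.sh
requirements = (GLIDEIN_MaxMemMBs >= 4000)
+RequiresBigMem = True
log          = bigmem.log
queue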

----------------------------------------

Q: How do I ensure my job only matches sites with the Protein Data Bank installed?

A: Add this to the job’s requirements:

(Engage_Data_PDB == true)
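
In a submit file this goes on the requirements line, for example:

requirements = (Engage_Data_PDB == true)

or combined with other predicates, e.g. the big-memory clause from the previous answer:

requirements = (Engage_Data_PDB == true) && (GLIDEIN_MaxMemMBs >= 4000)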

----------------------------------------

Q: What Java versions are available to Engage jobs?

A: Java 1.5 and 1.6. Paths to each on compute nodes are:

jdk 1.6.0_25: $OSG_APP/engage/jdk1.6.0_25
jdk 1.5.0_09: $OSG_APP/engage/jdk1.5.0_09
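
A simple way to use one of these is a wrapper script that the job runs on the compute node, along these lines (the wrapper and jar names are hypothetical):

#!/bin/sh
# run-java.sh -- select jdk 1.6 on the compute node
JAVA_HOME=$OSG_APP/engage/jdk1.6.0_25
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH
exec java -jar my-app.jar "$@"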

----------------------------------------

Q: What is the scratch cleanup policy?

A: Files older than thirty days in /scratch are deleted daily. Please make appropriate arrangements to archive data.

----------------------------------------

Q: What are the hardware specifications for the machine?

A:

Processors: 12 hyper-threaded cores; Intel Xeon @ 2.8 GHz
Memory:     48 GB 1333 MHz RAM, 12 MB cache per core
Disk:       5 TB local disk
Network:    2 x 1 Gbit Ethernet cards configured as 1 bonded interface
OS:         CentOS 5.6

Conclusion

The new submit node with GlideinWMS puts Engage in a stronger position to serve the science community. We’re working on tools to make the migration as seamless as possible and the future as simple and productive as possible. Please feel free to leave comments and questions.

14 Responses to Engage Submit Host

  1. Poornima Pochana says:

    Is there a command to list the scientific tools installed on engage-submit?

    • stevencox says:

      Hi Poornima,

      engage-submit3 (ES3) serves the Engage Virtual Organization, so it’s used by researchers in a wide variety of fields. In current practice, research teams copy executables to the submit node and stage them to compute nodes as part of their jobs. That is, the software provided by the Engage VO is the infrastructure for job execution and management, such as Condor, GlideinWMS and Pegasus; discipline-specific science tools are up to the researcher.

      Now, if you have a particular tool that you’d like to use at Engage sites we can look into copying that tool to sites supporting Engage users for you. We have maintenance jobs that will do this so that when you develop jobs, they don’t need to stage in the executables each time.

      Let me know if you have more questions.

      Thanks,

      Steve

  2. Ketan says:

    Thanks for the article. Where can I ask more questions specifically about glideinWMS usage with OSG?

  3. Ketan says:

    Here is my question: I am trying to use glideinwms to submit multiple jobs, usually more than 20 at a time. However, I consistently see that only 4 jobs get the R status and the rest are always in I status. This happens when submitting from the host managed by our team.

    However, when I submit the same job from the engage submit host, I do get many jobs in the R state quickly. What could be the reason? Are there any configuration settings that I am missing?

    Thanks.

    • stevencox says:

      Hi Ketan,

      There are many reasons jobs might not run so it’s not possible to say without more information.

      What does the output of condor_q -better-analyze say for the job ids that stay idle?

      It’s also worth checking GridManager.log, as this often contains useful information on Condor-related failures.

      Consulting the glideinwms log files is also a good idea to make sure the front end is communicating effectively with the factory.

      Steve

      • Ketan says:

        Thanks for these leads, Steve. I ran -better-analyze on both platforms and realized that the engage submit host has access to a much wider resource base than our host does. For instance, the following are the outputs from engage vs. our host for a single idle gwms job:

        The engage host:
        2093407.019: Run analysis summary. Of 3018 machines,
        57 are rejected by your job’s requirements
        1276 reject your job because of their own requirements
        16 match but are serving users with a better priority in the pool
        0 match but reject the job for unknown reasons
        1576 match but will not currently preempt their existing job
        0 match but are currently offline
        93 are available to run your job

        Our host:
        390889.017: Run analysis summary. Of 4 machines,
        0 are rejected by your job’s requirements
        0 reject your job because of their own requirements
        4 match but are serving users with a better priority in the pool
        0 match but reject the job for unknown reasons
        0 match but will not currently preempt their existing job
        0 are available to run your job

        There clearly seems to be something missing from our interface. Could you give some more clues on what configuration is required to properly set up a glideinwms environment? Or if you have any pointers to mailing lists or documents, that would be very helpful.

        Thanks, Ketan.

      • stevencox says:

        Glad to hear that was useful.

        The GlideinWMS documentation is very good. I’d start there:

        http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html

        Your result suggests that your glideinwms frontend is not configured correctly to talk to a backend. Again, see the documentation on how to do this.

        Ultimately, you’ll need to work with the administrators of a GlideinWMS factory. We use the one at UCSD.

  4. Ketan says:

    Hi Steve, One more question:

    With GlideinWMS, is there a way one can end up on a particular host or a list of hosts? And, further, is it possible to blacklist some hosts, for instance because the environment is not suitable for one’s applications?

    • stevencox says:

      Ketan, add a predicate like this to the requirements of the Condor submit file:

      GLIDEIN_Site == "UCSD"

      The != operator will exclude the specified site. To build a list of sites to target or avoid, use a parenthesized expression: join == terms with || to target several sites, or != terms with && to avoid several. Of course, these are OSG sites and not really hosts. I can’t immediately think of a way to target a specific host (as in a particular compute node), but I’m pretty sure that’s not what you were asking.
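
      For example (the site names here are just placeholders):

      # target one of a list of sites
      requirements = (GLIDEIN_Site == "UCSD") || (GLIDEIN_Site == "Purdue")

      # avoid specific sites
      requirements = (GLIDEIN_Site != "SiteA") && (GLIDEIN_Site != "SiteB")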

  5. Ketan says:

    Thanks Steve, one possibly last question: is there a published catalogue that lists the sites and their short names for a particular VO? My VO is Engage. I tried searching the OSG site but could not find one.

    Thanks.

  6. Ketan says:

    Is it possible to specify a wall-time in the job description of a gwms job?

    Thanks.
