A Tier-3 Open Science Grid Compute Element

I. RENCI Blueridge as Compute Element

RENCI’s new Open Science Grid (OSG) Compute Element (CE) will accept job submissions to be executed on the Blueridge cluster.

A CE presents a uniform interface other OSG sites can use to submit jobs. Major functional components of a CE include:

  • Authentication: The means of verifying the identity of a grid user. VOMS and GUMS are the two commonly used alternatives for mapping a public key infrastructure (PKI) credential, such as an X.509 certificate, to a local user identity.
  • Job Submission: Globus GRAM is the standard job submission interface between OSG clusters. Users submit jobs with a remote batch system like Condor or a client tool like globus-job-run, which communicates with the CE using the GRAM protocol.
  • File Transmission: The CE provides Globus GridFTP for this function. Jobs use GridFTP to transmit data into and out of the compute element.
  • Batch System: The Virtual Data Toolkit (VDT) supports adapters from Globus to a number of common batch systems including Condor, Torque/PBS and Sun Grid Engine.
  • Storage: Compute elements must make provision for storage of applications and data used by submitted jobs. For larger CEs such as Tier-1s this is usually done with a Storage Element (SE) component running systems like BeStMan, Hadoop and dCache. For our Tier-3 system we’ll be using mounted partitions.
  • Monitoring: Gratia is used to gather and report compute element statistics.
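
To make the job submission and file transmission components a little more concrete, here is a minimal sketch of what a client-side smoke test against the CE might look like. It only prints the commands rather than running them. The helper names (gram_contact, gsiftp_url, show_commands) and the /tmp/input.dat path are made up for illustration; globus-job-run and globus-url-copy are the standard Globus client tools.

```shell
# gram_contact HOST BATCH: print a GRAM contact string such as
# host.example.org/jobmanager-pbs (hypothetical helper).
gram_contact() {
    printf '%s/jobmanager-%s\n' "$1" "$2"
}

# gsiftp_url HOST PATH: print a GridFTP URL for an absolute PATH on HOST
# (hypothetical helper).
gsiftp_url() {
    printf 'gsiftp://%s%s\n' "$1" "$2"
}

# show_commands HOST: print, but do not run, one GRAM job submission and one
# GridFTP upload. /tmp/input.dat is a placeholder file name.
show_commands() {
    echo "globus-job-run $(gram_contact "$1" pbs) /bin/hostname"
    echo "globus-url-copy file:///tmp/input.dat $(gsiftp_url "$1" /tmp/input.dat)"
}
```

Calling show_commands engage-ce.renci.org prints the two commands. Running them for real requires a valid grid proxy on the submitting machine.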

A few words about Blueridge:

  • Topology: Two external gateway nodes, two login nodes and 128 compute nodes
  • Batch System: Torque/PBS
  • Operating System: CentOS 5.5
  • Cluster Management: ROCKS

II. Planning and Installation Process

These are the steps to install the CE. They reflect the actions we’ll be taking for this cluster as opposed to normative requirements for installing a CE in general, which are described in the OSG CE installation guide:

  • We’ll need a user with a valid personal DOEGrids certificate.
  • Create a virtual machine running CentOS 5.5.
    • 10GB free space writable by root
    • Root access
    • PBS batch system installed
  • Request host and service certificates for the machine.
  • Configure the firewall for a CE running GRAM, GridFTP and related services.
  • Ensure the hostname is set to the fully qualified domain name (FQDN)
  • Ensure the host is running the network time protocol (NTP)
  • Install the CE:
    • Pacman: Download, install and configure the Virtual Data Toolkit (VDT) package manager, pacman.
    • Worker Node Client: Install executable components required at every compute node in the cluster.
    • Compute Element: Install modules common to the OSG Compute Element head node architecture.
    • Job Manager: Install a job manager. This is an adapter between Globus and a specific batch system. Supported batch systems include Condor, Sun Grid Engine and PBS. Blueridge runs Torque/PBS so we’ll be installing jobmanager-pbs. PBS must be installed and on the path before the job manager package can be installed.
    • Certificate Authority: Set up a certificate authority for the host.
    • Post Installation steps
      • Add host and service certificates and keys to /etc/grid-security
      • Configure file permissions correctly for certificates and keys
      • Execute the VDT post installation script.
    • Configure the CE: Create the config.ini configuration file used by the VDT to configure CE components including Globus, the batch manager and Gratia.
      • Generate: Generate a configuration file by substituting environment specific parameters into the configuration template.
      • Apply: Execute the configure-osg script which propagates the settings to each configured OSG component in the VDT.
      • VOMS: Configure EDG mkgridmap to append local users to the grid-mapfile.
    • Start the CE: Use vdt-control to list configured services, then use the start option with the force flag to start them. Note: The force flag overwrites any existing init scripts for those services. Also, processes resembling VDT processes are killed (with kill) to prevent port conflicts after a disorderly shutdown. Neither behavior is desirable long term, so for now the design assumes a dedicated VM for the CE.
  • Verify the CE installation. Do this as a non-root user.
    • Obtain a valid VOMS proxy using voms-proxy-init
    • Verify the proxy with voms-proxy-info
    • Verify that the user is in the grid-mapfile
    • Execute ${VDT_LOCATION}/verify/site_verify.pl
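
A couple of the prerequisites above (the FQDN hostname and the certificate file permissions) are easy to get wrong and easy to check up front. Here is a rough sketch of such a preflight check. The function names are hypothetical, the FQDN test is only a heuristic that does not consult DNS, and the modes 444 and 400 are an assumption based on the permissions commonly used for Globus host credentials. It relies on GNU stat, as found on CentOS.

```shell
# is_fqdn NAME: succeed when NAME looks fully qualified, meaning it contains
# at least one dot and has no empty labels. Heuristic only.
is_fqdn() {
    case "$1" in
        ""|.*|*.|*..*) return 1 ;;
        *.*) return 0 ;;
        *) return 1 ;;
    esac
}

# perms_ok FILE MODE: succeed when FILE is a regular file whose octal
# permission bits are exactly MODE (uses GNU stat).
perms_ok() {
    [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "$2" ]
}

# preflight: report problems on stderr. The expected modes are assumptions.
preflight() {
    is_fqdn "$(hostname -f)" || echo "hostname is not fully qualified" >&2
    perms_ok /etc/grid-security/hostcert.pem 444 || echo "check hostcert.pem mode" >&2
    perms_ok /etc/grid-security/hostkey.pem 400 || echo "check hostkey.pem mode" >&2
}
```

Run preflight as root on the CE host before starting the install. It only reports; it never changes anything.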

III. Automating Installation

There are many steps to installing a CE and the VDT is a complex system. To better manage this, the installation has been substantially automated via a set of bash scripts that execute the steps described above in the order they are presented. The following noteworthy caveats apply:

  • init.d: The script uses the force option to vdt-control when installing services. If there were other versions of globus, apache or tomcat on the system, this would overwrite the existing init scripts with the VDT versions. Again, this assumes the CE has a dedicated machine.
  • Processes: The scripts kill globus, httpd and other processes before starting the CE. This assumes the CE has a dedicated machine.
  • Certificates: The process cannot succeed without certificates, so the end-to-end execution below can only happen after they are available.
  • Configuration Assumptions: The implemented steps apply to one very simple and specific CE configuration. They could be extended to handle a number of other scenarios. Major assumptions of the current configuration include:
    • Torque/PBS batch system
    • VOMS and edg-mkgridmap used for authentication
    • Disk storage as opposed to an OSG storage element (SE)

The script is provided for informational purposes only and still needs significant work. Initialization and execution look like this:

  • Initialize: This loads functions implementing the tools:
    • [root@engage-ce:/osg/2]$ source ci/bin/environment.sh
    • [root@engage-ce:/osg/2]$ renci_ci_grid_tools
  • Execute: In bash, use command completion with the “renci_” and “osg_” namespaces to see available functions. The following command installs a CE following the process documented above:
    • [root@engage-ce:/osg/2]$ osg_install_all --purge
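
Tab completion works interactively, but the same listing is available from a script through the bash builtin compgen. The sketch below is bash-specific. The three *_demo functions are hypothetical stand-ins for whatever the real scripts define, and list_functions is a made-up helper name.

```shell
# Hypothetical stand-ins for functions that sourcing the tools would define.
osg_install_demo() { echo "install"; }
osg_verify_demo()  { echo "verify"; }
renci_path_demo()  { echo "path"; }

# list_functions PREFIX: print the names of all defined shell functions that
# start with PREFIX, one per line, sorted. Requires bash (compgen builtin).
list_functions() {
    compgen -A function "$1" | sort
}
```

For example, list_functions osg_ prints the names of the two osg_ demo functions and leaves out the renci_ one.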

IV. Log of a Full Installation

Here are a few excerpts from the output of an execution of the OSG CE installation script:

Installation

[root@engage-ce:/osg]$ osg_install_all --purge > osg-2-install.log 2>&1
–(dbg): purge option selected.
–(inf): clean: removing preexisting directories…
–(inf):    –removing /osg/2/pacman-3.28
–(inf):    –removing /osg/2/osg-1.2.11
–(inf):    –removing /osg/2/wn-1.2.11
–(inf):    –creating /osg/2

Pacman:

–(inf): starting install at Fri Aug 13 17:36:11 EDT 2010
–(inf): osg_install_pacman…
–(inf): installing pacman…
–2010-08-13 17:36:12–  http://atlas.bu.edu/~youssef/pacman/sample_cache/tarballs/pacman-3.28.tar.gz
Resolving atlas.bu.edu… 192.5.207.10
Connecting to atlas.bu.edu|192.5.207.10|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 856237 (836K) [application/x-gzip]
Server file no newer than local file `pacman-3.28.tar.gz’ — not retrieving.
pacman-3.28/
pacman-3.28/democache/
pacman-3.28/democache/Package-E.pacman
pacman-3.28/democache/Package-C.pacman
pacman-3.28/democache/Python.pacman
[snip]…
pacman-3.28/src/SnapshotCache.py
pacman-3.28/src/Username.pyc
pacman-3.28/src/md5sum.pyc
pacman-3.28/src/MirrorCache.py
pacman-3.28/src/Version.pyc
pacman-3.28/src/LaunchBrowser.py
pacman-3.28/src/LocalTarballAccess.pyc
pacman-3.28/bin/
pacman-3.28/bin/pacman
–(inf): initializing pacman vdt package installation manager…
–(pacman): Pacman version:  3.28
–(pacman): Python version:  2.4.3 (#1, Sep  3 2009, 15:37:12)
–(pacman): Your platform [CentOS-5]
–(pacman): Your current architecture is i686
–(inf): pacman installation complete…
–(inf): osg_pacman_init…
–(inf): initializing pacman vdt package installation manager…
–(pacman): Pacman version:  3.28
–(pacman): Python version:  2.4.3 (#1, Sep  3 2009, 15:37:12)
–(pacman): Your platform [CentOS-5]
–(pacman): Your current architecture is i686

Worker Node Client:

–(inf): osg_install_worker_node_client…
[snip]…
gLite-Data-Delegation-API found in http://vdt.cs.wisc.edu/vdt_200_cache…
Downloading glite-data-delegation-api-2.0.0-7.x86_rhap_5.tar.gz…
gLite-Data-Util-C found in http://vdt.cs.wisc.edu/vdt_200_cache…
Downloading gridsite-shared-1.1.18-1.x86_rhap_5.tar.gz…
gLite-Service-Discovery-File-C found in http://vdt.cs.wisc.edu/vdt_200_cache…
Downloading glite-service-discovery-file-c-2.1.2-2.x86_rhap_5.tar.gz…
gLite-Service-Discovery-BDII found in http://vdt.cs.wisc.edu/vdt_200_cache…
Downloading glite-service-discovery-bdii-2.2.2-2.x86_rhap_5.tar.gz…
Downloading glite-data-util-c-1.2.3-1.x86_rhap_5.tar.gz…
Downloading glite-data-delegation-client-2.0.0-5.x86_rhap_5.tar.gz…
Downloading glite-fts-client-3.7.2-1.x86_rhap_5.tar.gz…
Downloading edg-gridftp-client-1.2.8.x86_rhap_5.tar.gz…
Downloading osg-version-1.2.12.tar.gz…
The OSG Worker Node Client package OSG version 1.2.12 has been installed.
–(dbg): renci_ci_path…
–(dbg): purge=/osg/2/wn-1.2.11
–(dbg): commit=true
–(inf):    –: /osg/2/pacman-3.28/bin
–(inf):    –: /osg/pbs
–(inf):    –: /osg/2/osg-1.2.11/osg/bin
–(inf):    –: /osg/2/osg-1.2.11/bwctl/bin
[snip]…
–(inf):    –: /osg/2/osg-1.2.11/vdt/bin
–(inf):    –: /osg/2/condor/bin
–(inf):    –: /osg/2/condor/sbin
–(inf):    –: /usr/kerberos/sbin
–(inf):    –: /usr/kerberos/bin
–(inf):    –: /usr/local/sbin
–(inf):    –: /usr/sbin
–(inf):    –: /sbin
–(inf):    –: /usr/bin
–(inf):    –: /bin
–(inf):    –: /opt/local/bin
–(inf):    –: /opt/local/sbin
–(inf):    –: /root/bin
–(inf):    –: /osg/2/globus-5.0.2/bin
–(inf):    –: /home/scox/app/globus-5.0.2/bin
–(inf): committing
–(dbg): path: :/osg/2/pacman-3.28/bin::/osg/pbs::/osg/2/osg-1.2.11/osg/bin::/osg/2/osg-1.2.11/bwctl/bin::/osg/2/osg-1.2.11/owamp/bin::/osg/2/osg-1.2.11/npad/bin::/osg/2/osg-1.2.11/ndt/bin::/osg/2/osg-1.2.11/subversion/bin::/osg/2/osg-1.2.11/apache/bin::/osg/2/osg-1.2.11/dccp/bin::/osg/2/osg-1.2.11/condor-cron/wrappers::/osg/2/osg-1.2.11/srm-client-lbnl/bin::/osg/2/osg-1.2.11/srm-client-fermi/sbin::/osg/2/osg-1.2.11/srm-client-fermi/bin::/osg/2/osg-1.2.11/gums/scripts::/osg/2/osg-1.2.11/cert-scripts/bin::/osg/2/osg-1.2.11/edg/sbin::/osg/2/osg-1.2.11/gip/bin::/osg/2/osg-1.2.11/prima/bin::/osg/2/osg-1.2.11/curl/bin::/osg/2/osg-1.2.11/glite/sbin::/osg/2/osg-1.2.11/glite/bin::/osg/2/osg-1.2.11/ant/bin::/osg/2/osg-1.2.11/jdk1.6/bin::/osg/2/osg-1.2.11/mysql5/bin::/osg/2/osg-1.2.11/wget/bin::/osg/2/osg-1.2.11/logrotate/sbin::/osg/2/osg-1.2.11/gpt/sbin::/osg/2/osg-1.2.11/globus/bin::/osg/2/osg-1.2.11/globus/sbin::/osg/2/osg-1.2.11/vdt/sbin::/osg/2/osg-1.2.11/vdt/bin::/osg/2/condor/bin::/osg/2/condor/sbin::/usr/kerberos/sbin::/usr/kerberos/bin::/usr/local/sbin::/usr/sbin::/sbin::/usr/bin::/bin::/opt/local/bin::/opt/local/sbin::/root/bin::/osg/2/globus-5.0.2/bin::/home/scox/app/globus-5.0.2/bin::
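
One thing worth noting in the debug line above: the entries are separated by double colons. If that string were exported as PATH verbatim, each empty entry would be treated by the shell as the current directory, which is usually not what you want for root. The following is a sketch of a path cleanup that drops empty entries, duplicates and anything under a purged prefix. It is not the actual renci_ci_path implementation, and clean_path is a hypothetical name.

```shell
# clean_path PATH_STRING [PURGE_PREFIX]
# Print PATH_STRING with empty entries and duplicates removed and with any
# entry equal to or below PURGE_PREFIX dropped. Original order is preserved.
# The body runs in a subshell so the IFS and noglob changes do not leak out.
clean_path() (
    purge="${2:-}"
    out=""
    IFS=:
    set -f
    for dir in $1; do
        # Skip empty entries (they would mean "current directory").
        [ -n "$dir" ] || continue
        # Skip entries at or below the purge prefix, if one was given.
        if [ -n "$purge" ]; then
            case "$dir" in "$purge"|"$purge"/*) continue ;; esac
        fi
        # Skip entries that were already emitted.
        case ":$out:" in *":$dir:"*) continue ;; esac
        out="${out:+$out:}$dir"
    done
    printf '%s\n' "$out"
)
```

Something like PATH=$(clean_path "$PATH" /osg/2/wn-1.2.11) would then apply it to the live environment.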

CE Software

–(inf): osg_install_ce…
–(inf): installing osg compute element software…
[snip]…
Downloading syslog-ng-2.0.10-x86_rhap_5.tar.gz…
Downloading site-verify-2.0.0-4.tar.gz…
Downloading ndt-3.5.0.x86_rhap_5.tar.gz…
Downloading npad-client-1.5.5.x86_rhap_5.tar.gz…
Downloading owamp-3.1.x86_rhap_5.tar.gz…
Downloading bwctl-1.3.x86_rhap_5.tar.gz…
Downloading site-web-page-2.0.0-5.tar.gz…
Downloading auto-vdt-1.2.12.tar.gz…
Downloading vo-package-32.tar.gz…
Downloading configure_osg-2.0.0-4.tar.gz…
Downloading osg-version-1.2.12.tar.gz…
Pacman Installation of OSG-1.2.12 Complete

Job Manager – PBS:

–(inf): osg_install_jobmanager_pbs…
[snip]…
Downloading gratia-pbs-probe-1.06.15i-1-x86_rhap_5.tar.gz…
Gratia-PBS-Probe has been installed.
Installing osg-version…
osg-version has been installed.
Globus-PBS-Setup has been installed.

CA Setup

–(inf): osg_ca_setup…
–(inf): executing VDT CA manager to setup CA…
Setting CA Certificates for VDT installation at ‘/osg/2/osg-1.2.11’
Setup completed successfully.
–(inf): enableing certificate update service…
running ‘vdt-register-service –name vdt-update-certs –enable’… ok

Post Install:

–(inf): osg_post_install…
–(inf): Initializing VDT environment…
–(inf): Installing Globus certificates…
–(inf): listing /etc/grid-security…
/etc/grid-security/hostkey.pem
/etc/grid-security/vomsdir/vdt_empty.pem
/etc/grid-security/hostcert.pem
/etc/grid-security/http/httpkey.pem
/etc/grid-security/http/httpcert.pem
/etc/grid-security/containerkey.pem
/etc/grid-security/containercert.pem
/etc/grid-security/host-backup/hostkey.pem
/etc/grid-security/host-backup/hostcert.pem
/etc/grid-security/host-backup/containerkey.pem
/etc/grid-security/host-backup/containercert.pem
/bin/chown: `globus:globus’: invalid user
–(inf): Executing VDT post install script…
Starting…
Configuring PRIMA… Done.
Configuring EDG-Make-Gridmap… Done.
Configuring PRIMA-GT4… Done.
Done.

Configure:

–(inf): osg_configure_ce…
–(inf):    backing up /osg/2/osg-1.2.11/monitoring/config.ini
–(inf):    applying values in /home/scox/dev/grid/bin/../resources/osg/engage-ce-conf.txt via template /home/scox/dev/grid/bin/../resources/osg/config.ini.template…
–(inf):    executing configure-osg…
running ‘vdt-register-service –name gratia-pbs –enable’… ok
running ‘vdt-register-service –name mysql5 –enable’… ok
running ‘vdt-register-service –name gsiftp –enable’… ok
running ‘vdt-register-service –name globus-gatekeeper –enable’… ok
running ‘vdt-register-service –name globus-ws –enable’… ok
The following consumer subscription has been installed:
HOST:    http://is2.grid.iu.edu:14001
TOPIC:   OSG_CE
DIALECT: RAW
running ‘vdt-register-service –name tomcat-55 –enable’… ok
running ‘vdt-register-service –name apache –enable’… ok
The following consumer subscription has been installed:
HOST:    http://is1.grid.iu.edu:14001
TOPIC:   OSG_CE
DIALECT: RAW
running ‘vdt-register-service –name tomcat-55 –enable’… ok
running ‘vdt-register-service –name apache –enable’… ok
The following consumer subscription has been installed:
HOST:    https://osg-ress-1.fnal.gov:8443/ig/services/CEInfoCollector
TOPIC:   OSG_CE
DIALECT: OLD_CLASSAD
running ‘vdt-register-service –name tomcat-55 –enable’… ok
running ‘vdt-register-service –name apache –enable’… ok
running ‘vdt-register-service –name edg-mkgridmap –enable’… ok
running ‘vdt-register-service –name gums-host-cron –disable’… ok
PRIMA for GT4 web services has been disabled
You will now be using a grid-mapfile for authorization.
Modifications to the /etc/sudoers file are still required.
You will need to restart the /etc/init.d/globus-ws container
to effect the changes.
Using /osg/2/osg-1.2.11/osg/etc/config.ini for configuration information
Running /osg/2/osg-1.2.11/edg/sbin/edg-mkgridmap, this process may take some time to query vo and gums servers
running ‘vdt-register-service –name vdt-rotate-logs –enable’… ok
Configure-osg completed successfully
–(inf):    verifying created configurations…
Using /osg/2/osg-1.2.11/osg/etc/config.ini for configuration information
Configuration verified successfully
–(inf): configuring local grid-mapfile …
–local-gridmap-entry: “/DC=org/DC=doegrids/OU=People/CN=Steven Cox 318595” scox
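
The "applying values ... via template" step in the log above is plain parameter substitution. Here is a minimal sketch of how such a step could work, assuming a values file of KEY=VALUE lines and a template that marks substitution points as @KEY@. Both conventions are assumptions for illustration, as is the render_template name; the real engage-ce-conf.txt and config.ini.template may use a different format.

```shell
# render_template VALUES_FILE TEMPLATE_FILE
# VALUES_FILE: KEY=VALUE lines; blank lines and lines starting with # are
# ignored. Keys are assumed to contain only letters, digits and underscores.
# TEMPLATE_FILE: text in which each @KEY@ is replaced by its VALUE.
# The rendered text is written to stdout.
render_template() (
    script=""
    while IFS='=' read -r key value; do
        case "$key" in ""|\#*) continue ;; esac
        # Escape characters that are special in a sed replacement: \ & and
        # the | used as the delimiter below.
        esc=$(printf '%s' "$value" | sed 's/[\\&|]/\\&/g')
        script="${script}s|@${key}@|${esc}|g
"
    done < "$1"
    sed "$script" "$2"
)
```

Redirecting the output to config.ini and then running configure-osg would correspond to the Generate and Apply steps described earlier.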

Start CE:

–(inf): osg_on…
–(inf): killing all processes matching patterns: httpd globus
–(inf): remaining processes:
Service                 | Type   | Desired State
————————+——–+————–
fetch-crl               | cron   | do not enable
vdt-rotate-logs         | cron   | enable
vdt-update-certs        | cron   | enable
gris                    | init   | do not enable
globus-gatekeeper       | inetd  | enable
gsiftp                  | inetd  | enable
mysql5                  | init   | enable
globus-ws               | init   | enable
gums-host-cron          | cron   | do not enable
MLD                     | init   | do not enable
condor-cron             | init   | do not enable
apache                  | init   | enable
tomcat-55               | init   | enable
gratia-pbs              | cron   | enable
edg-mkgridmap           | cron   | enable
skipping cron service ‘fetch-crl’ — marked as disabled
enabling cron service vdt-rotate-logs… ok
enabling cron service vdt-update-certs… ok
skipping init service ‘gris’ — marked as disabled
enabling inetd service globus-gatekeeper… ok
enabling inetd service gsiftp… ok
enabling init service mysql5… ok
enabling init service globus-ws… ok
skipping cron service ‘gums-host-cron’ — marked as disabled
skipping init service ‘MLD’ — marked as disabled
skipping init service ‘condor-cron’ — marked as disabled
enabling init service apache… ok
enabling init service tomcat-55… ok
enabling cron service gratia-pbs… ok
enabling cron service edg-mkgridmap… ok
–(inf): end at Fri Aug 13 17:48:59 EDT 2010. duration: 0:12:48
 

Verify CE

And here’s an execution of the CE verify script.
–(inf): verified valid grid proxy…
“/DC=org/DC=doegrids/OU=People/CN=Steven Cox 318595” scox
–(inf): verified user scox is in the grid-mapfile
===============================================================================
Info: Site verification initiated at Fri Aug 13 21:51:30 2010 GMT.
===============================================================================
——————————————————————————-
———- Begin engage-ce.renci.org at Fri Aug 13 21:51:30 2010 GMT ———-
——————————————————————————-
Checking prerequisites needed for testing: PASS
Checking for a valid proxy for scox@engage-ce.renci.org: PASS
Checking if remote host is reachable: PASS
Checking for a running gatekeeper: YES; port 2119
Checking authentication: PASS
Checking ‘Hello, World’ application: PASS
Checking remote host uptime: PASS
17:51:33 up 1 day,  4:59,  4 users,  load average: 1.13, 1.24, 0.92
Checking remote Internet network services list: PASS
Checking remote Internet servers database configuration: PASS
Checking for GLOBUS_LOCATION: /osg/2/osg-1.2.11/globus
Checking expiration date of remote host certificate: Aug  6 20:49:06 2011 GMT
Checking for gatekeeper configuration file: YES
/osg/2/osg-1.2.11/globus/etc/globus-gatekeeper.conf
Checking users in grid-mapfile, if none must be using Prima: alice,cdf,cigi,compbiogrid,dayabay,des,dosar,engage,fermilab,geant4,glow,gluex,gpn,grase,gridunesp,grow,hcc,i2u2,icecube,ilc,jdem,ligo,mis,nanohub,nwicg,nysgrid,ops,osg,osgedu,samgrid,sbgrid,scox,star,usatlas1,uscms01
Checking for remote globus-sh-tools-vars.sh: YES
Checking configured grid services: PASS
jobmanager,jobmanager-fork,jobmanager-pbs
Checking for OSG osg-attributes.conf: YES
Checking scheduler types associated with remote jobmanagers: PASS
jobmanager is of type fork
jobmanager-fork is of type fork
jobmanager-pbs is of type pbs
Checking for paths to binaries of remote schedulers: PASS
Path to pbs binaries is /osg/pbs
Checking remote scheduler status: PASS
pbs : 0 jobs running, 0 jobs idle/pending
Checking if Globus is deployed from the VDT: YES; version 2.0.0p19
Checking for OSG version: NO
Checking for OSG grid3-user-vo-map.txt: YES
Checking for OSG site name: UNAVAILABLE
Checking for OSG $GRID3 definition: /osg/2/osg-1.2.11
Checking for OSG $OSG_GRID definition: /osg/2/wn-1.2.11
Checking for OSG $APP definition: /osg/2/osg-app
Checking for OSG $DATA definition: /osg/2/osg-data
Checking for OSG $TMP definition: /osg/2/osg-data
Checking for OSG $WNTMP definition: /osg/2/condor_scratch
Checking for OSG $OSG_GRID existence: PASS
Checking for OSG $APP existence: PASS
Checking for OSG $DATA existence: PASS
Checking for OSG $TMP existence: PASS
Checking for OSG $APP writability: FAIL
Checking for OSG $DATA writability: PASS
Checking for OSG $TMP writability: PASS
Checking for OSG $APP available space: 4.342 GB
Checking for OSG $DATA available space: 4.342 GB
Checking for OSG $TMP available space: 4.342 GB
Checking for OSG additional site-specific variable definitions: YES
MountPoints
SAMPLE_LOCATION default /SAMPLE-path
SAMPLE_SCRATCH devel /SAMPLE-path
Checking for OSG execution jobmanager(s): engage-ce.renci.org/jobmanager-condor
Checking for OSG utility jobmanager(s): engage-ce.renci.org/jobmanager
Checking for OSG sponsoring VO: engage
Checking for OSG policy expression: NONE
Checking for OSG setup.sh: YES
Checking for OSG $Monalisa_HOME definition: /osg/2/osg-1.2.11/MonaLisa
Checking for MonALISA configuration: PASS
key ml_env vars:
FARM_NAME = engage-ce.renci.org
FARM_HOME = /osg/2/osg-1.2.11/MonaLisa/Service/VDTFarm
FARM_CONF_FILE = /osg/2/osg-1.2.11/MonaLisa/Service/VDTFarm/vdtFarm.conf
SHOULD_UPDATE = false
key ml_properties vars:
lia.Monitor.group = Test
lia.Monitor.useIPaddress = undef
MonaLisa.ContactEmail = root@engage-ce.renci.org
Checking for a running MonALISA: NO
MonALISA does not appear to be running
Checking for a running GANGLIA gmond daemon: NO
gmond does not appear to be running
Checking for a running GANGLIA gmetad daemon: NO
gmetad does not appear to be running
Checking for a running gsiftp server: YES; port 2811
Checking gsiftp (local client, local host -> remote host): PASS
Checking gsiftp (local client, remote host -> local host): PASS
Checking that no differences exist between gsiftp’d files: PASS
——————————————————————————-
———– End engage-ce.renci.org at Fri Aug 13 21:52:42 2010 GMT ———–
——————————————————————————-
===============================================================================
Info: Site verification completed at Fri Aug 13 21:52:42 2010 GMT.
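
The verify output is long, and the lines that need attention (for example the $APP writability FAIL above) are easy to miss. Here is a small sketch of a filter that pulls out only the checks ending in FAIL, NO or UNAVAILABLE. The failed_checks name is made up, and treating NO and UNAVAILABLE as worth a look is my own assumption rather than anything site_verify.pl defines.

```shell
# failed_checks [FILE...]
# Print only the "Checking ..." lines whose result is FAIL, NO or UNAVAILABLE.
# Reads stdin when no FILE is given. Exit status is 1 when nothing matches,
# following the usual grep convention.
failed_checks() {
    grep -E '^Checking .*: (FAIL|NO|UNAVAILABLE)$' "$@"
}
```

Piping the output of site_verify.pl through failed_checks gives a short list to work through. Lines such as "YES; port 2119" or plain PASS results are filtered out.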