Showing posts with label univa. Show all posts
Showing posts with label univa. Show all posts


Update to Grid Engine Job Submission Verifier (JSV) Using Go

A while ago, I posted about a job submission verifier (JSV) for Univa Grid Engine to try to handle job submissions which had less than ideal resource requests by leveraging cgroups. It was based on Daniel Gruber's JSV Go API.

In the 3+ years since that post, we had stopped using the JSV for one reason or another (including a Univa issue with cgroups and interaction with a specific kernel version), and just manually dealt with issues that came up by communicating with the users. Since then, as well, Daniel has updated API to be more Go-like. And we have had a fairly bad round of multithreaded programs submitted as serial jobs using up to 64 threads on our 64-core nodes.

So, I dusted off the old code, refreshed it, and reduced its scope to just deal with two cases: serial jobs, and multithreaded jobs. These types are jobs are defined either by a lack of PE (serial jobs), or a finite set of PEs (multithreaded).

There still is a deficiency in that the JSV cannot really deal with slot ranges. In Grid Engine, it is possible to request a range of slots for jobs, e.g. “-pe multithread 4-12” which would allow a job to be assigned any number of slots from 4 to 12. This is useful for busy clusters and users who would rather their jobs run slower than wait for the full 12 slots to open up.

Anyway, the JSV code is pretty straightforward. Find it here:

Together with this, UGE must be configured to have cgroups enabled (see your documentation). Here is the setup on our cluster -- the freezer functionality is disabled as there may be an issue in the interaction with RHEL 6 kernels:

cgroups_params   cgroup_path=/cgroup cpuset=true mount=true \
                 killing=true freezer=false freeze_pe_tasks=false \
                 forced_numa=true h_vmem_limit=true \
                 m_mem_free_hard=true m_mem_free_soft=true \

The JSV code is short enough that I include it here:

 * Requires

package main

import (

func jsv_on_start_function() {

func job_verification_function() {
    // Set binding on serial jobs (i.e. no PE) to "linear:1
    var modified_p bool = false
    if !jsv.IsParam("pe_name") {
        jsv.SetParam("binding_strategy", "linear_automatic")
        jsv.SetParam("binding_type", "set")
        jsv.SetParam("binding_amount", "1")
        jsv.SetParam("binding_exp_n", "0")
        modified_p = true
    } else {
        pe_name, _ := jsv.GetParam("pe_name")

        /* XXX the "shm" PE is the single-node multicore PE
         *     change this to the equivalent for your site; 
         *     the "matlab" PE is identically defined to the "shm" PE
         * XXX note that this does not properly deal with a range of number of slots;
         *     it just takes the max value of the range 
        if (strings.EqualFold("shm", pe_name) || strings.EqualFold("matlab", pe_name)) {
            pe_max, _ := jsv.GetParam("pe_max")
            jsv.SetParam("binding_strategy", "linear_automatic")
            jsv.SetParam("binding_type", "set")
            jsv.SetParam("binding_amount", pe_max)
            jsv.SetParam("binding_exp_n", "0")
            modified_p = true

    if modified_p {
        jsv.Correct("Job was modified")
    } else {
        jsv.Correct("Job was not modified")


func main() {
    jsv.Run(true, job_verification_function, jsv_on_start_function)


Abaqus integration for Univa Grid Engine (update)

My last post about Abaqus integration with Univa Grid Engine (UGE) had one disadvantage: it did not use qrsh to launch the slave MPI processes. As a result, job resource usage accounting was inaccurate. To fix this, certain parallel environment (PE) settings need to be corrected, and the rsh command that Abaqus uses for launching MPI slaves needs to be set to the wrapper rsh script.

The PE settings which worked worked for me -- see also sge_pe(5)

pe_name                abaqus
slots                  99999
user_lists             NONE
xuser_lists            NONE
start_proc_args        /cm/shared/apps/sge/var/default/common/pescripts/
stop_proc_args         NONE
allocation_rule        $round_robin
control_slaves         TRUE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     TRUE
daemon_forks_slaves    TRUEmaster_forks_slaves    FALSE

And the updated PE script (again, setting the mp_host_list is optional). The rsh command is actually the rsh wrapper shell script, which then calls qrsh.

#!/usr/bin/env python
import sys, os

### PE startup script aka prologue to set up Abaqus MPI "hostfile"
### Based on documented env file format

machinefile = os.environ['PE_HOSTFILE']
abaqenvfile = "abaqus_v6.env"

machinelines = []
with open(machinefile, "ro") as mf:
    for l in mf:
        lsplit = l.split()
        machinelines.append( [lsplit[0], int(lsplit[1])] )

with open(abaqenvfile, "wo") as envfile:
    envfile.write("mp_rsh_command='/cm/shared/apps/sge/univa/mpi/rsh -n -l %U %H %C'\n")
    envfile.write("mp_host_list=%s\n" % (str(machinelines)))


SSSD setup in a Bright Cluster

UPDATE 2015-05-28: Also, on the master node, the User Portal web application needs a setting in /etc/openldap/ldap.conf:

    TLS_REQCERT      never

At my current job, we use Bright Cluster Manager and Univa Grid Engine on RHEL 6.5. We were seeing issues where submitted jobs ended up in an "Error" state, especially if many jobs were submitted in a short period, either an array or a shell script loop running qsub iteratively. The error reason was:

    can't get password entry for user "juser". Either the user does not exist or NIS error!

However, logging into the assigned compute node and running "id" or even some C code to do user lookups passed.

By default, our installation used nslcd for LDAP lookups. Univa suggested switching to SSSD (System Security Services Daemon) as Red Hat had phased out nslcd. The Fedora site has a good overview.

The switch to using SSSD turned out to be fairly easy, with some hidden hiccups. Running authconfig-tui and keeping the existing settings, and then hitting "OK" immediately turned off nslcd and started up sssd, instead. All the attendant changes were made, too: chkconfig settings, /etc/nsswitch.conf. However, I found that users could not change passwords on the login nodes. They could login, but the passwd command failed with "system offline".

Turns out, SSSD requires an encrypted connection to the LDAP server for password changes. This is a security requirement so that the new password is not sent in the clear from the client node to the LDAP server. (See this forum post by sgallagh.) This means an SSL certificate needs to be created. Self-signed will work if the following line is added to /etc/sssd/sssd.conf:

     ldap_tls_reqcert = never

To create the self-signed cert:
     root # cd /etc/pki/tls/certs
     certs # make slapd.pem
     certs # chown ldap:ldap slapd.pem

Then, edit /cm/local/apps/openldap/etc/slapd.conf to add the following lines:
    TLSCACertificateFile /etc/pki/tls/certs/ca-bundle.crt
    TLSCertificateFile /etc/pki/tls/certs/slapd.pem
    TLSCertificateKeyFile /etc/pki/tls/certs/slapd.pem

Also, make sure there is a section - my config did not have access to shadowLastChange:
    access to attrs=loginShell,shadowLastChange
        by group.exact="cn=rogroup,dc=cm,dc=cluster" read
        by self write
        by * read

Then, restart the ldap service.

UPDATE Adding some links to official Red Hat documentation:


Exclusive host access in Grid Engine

UPDATE - 2014-03-28: Univa Grid Engine 8.1.7 (and possibly earlier) has a simpler way to set this up. One just needs to define the "exclusive" complex, without setting up a separate exclusive queue with a forced complex:

#name       shortcut   type  relop   requestable consumable default  urgency 
exclusive   excl       BOOL  EXCL    YES         YES        0        1000

Unfortunately, I just discovered a slight deficiency with this approach. That complex must be attached to specific hosts. This means modifying each exec host using "qconf -me hostname".

Here at Drexel's URCF, we use Univa Grid Engine (Wikipedia article). One of the requirements that frequently comes up is for jobs to have exclusive access to the compute hosts that the jobs occupy. A common reason is that a job may need more memory per process than is available on any single host.

Some resource managers and schedulers like Torque allow one to reserve nodes exclusively. In Grid Engine (GE), it is not a built-in feature. However, there is a way to accomplish the same thing. This post expands a little on Dan Gruber's blog post, for people like me who are new to GE.

Here, we assume there is only one queue named all.q. And we have two host groups: @intelhosts, and @amdhosts.

One can create a Boolean resource, a.k.a. complex, named "exclusive", which can be requested. That resource is forced to have the value TRUE in a new queue called exclusive.q so that only jobs that requests "exclusive" will be sent to that queue.

exclusive    excl   BOOL  ==  FORCED  NO     0    0

Once the complex is created, create a new queue named exclusive.q which spans all hosts, with a single slot per host. Set it to be subordinate to all.q -- this means that if there are any jobs in all.q on a host, exclusive.q on that host is suspended. And set the "exclusive" Boolean complex to be TRUE.

    qname     exclusive.q
    slots     1
    subordinate_list  all.q=1
    complex_values exclusive=TRUE

Modify all.q and set it subordinate to exclusive.q -- this ensures that if there is a job in exclusive.q on a host, all.q on that host is suspended:

    subordinate_list    exclusive.q=1

And now, to get a parallel job on Intel hosts, the job script would have something like this:

    #$ -l excl 
    #$ -q *@@intelhosts
    #$ -pe mvapich 128