Showing posts with label high performance computing. Show all posts
Showing posts with label high performance computing. Show all posts


Julia vs. C++

When I first heard about Julia, various claims that it was as fast as a compiled language like C or Fortran. However, here are two articles going into some detail: 

  • This first from 2016 by Victor Zverovich talks about startup times and memory footprint. The examples used are small, so may not be indicative of performance for larger more numerically-intensive code. 
  • This second from 2019 by Eduardo Alvarez takes a very close look at what is needed to get C++-like speed from Julia. The way to do that is not to take advantage of Python-like syntax, and instead code Julia like it was C++.
It seems like Julia would be good for a Python scientific programmer, who wants to increase performance greatly, but not switch to the somewhat more complicated programming involved in using C++.


Still more on SSSD in Bright Cluster Manager - cache file

It has been a few years since I got SSSD to work in Bright Cluster Manager 6, and I just figured out one little thing that has been an annoyance for a few years. There has been a spurious group hanging on: it has the same GID as an existing group, but a different group name.

Since Bright CM 6 did not handle SSSD out of the box, it also did not handle the SSSD cache file. More accurately, it did not ignore the file in the software image, and the grabimage command would grab the image to the provisioning server and then propagate it to nodes in the category.

The fix is simple: add /var/lib/sss/db/* to the various exclude list settings in the category.

To reset the cache:
    service sssd stop
    /bin/rm -f /var/lib/sss/db/cache_default.ldb
    service sssd start

I did try "sss_cache -E" which is supposed to clear the cache, but found that it did not work as I expected: the spurious group still appeared with "getent group".


Docker containers for high performance computing

I have started to see more science application authors/groups provide their applications as Docker images. Having only a passing acquaintance with Docker (and with containers in general), I found this article at The New Stack useful: Containers for High Performance Computing (Joab Jackson).

The vendors of the tools I use -- Bright Cluster Manager, Univa Grid Engine -- have incorporated support for containers. It is good to read some independent information about the role of containers in HPC.

Christian Kniep of Docker points out some issues with the interaction of HPC and Docker, and came up with a preliminary solution (a proxy for Docker Engine) to address the issues. HPC commonly makes use of specific hardware (e.g. GPUs, InfiniBand), and this is counter to Docker's hardware-agnostic approach. Also, HPC workflows may rely on shared resources (e.g. i/o to a shared filesystem).


Update to Grid Engine Job Submission Verifier (JSV) Using Go

A while ago, I posted about a job submission verifier (JSV) for Univa Grid Engine to try to handle job submissions which had less than ideal resource requests by leveraging cgroups. It was based on Daniel Gruber's JSV Go API.

In the 3+ years since that post, we had stopped using the JSV for one reason or another (including a Univa issue with cgroups and interaction with a specific kernel version), and just manually dealt with issues that came up by communicating with the users. Since then, as well, Daniel has updated API to be more Go-like. And we have had a fairly bad round of multithreaded programs submitted as serial jobs using up to 64 threads on our 64-core nodes.

So, I dusted off the old code, refreshed it, and reduced its scope to just deal with two cases: serial jobs, and multithreaded jobs. These types are jobs are defined either by a lack of PE (serial jobs), or a finite set of PEs (multithreaded).

There still is a deficiency in that the JSV cannot really deal with slot ranges. In Grid Engine, it is possible to request a range of slots for jobs, e.g. “-pe multithread 4-12” which would allow a job to be assigned any number of slots from 4 to 12. This is useful for busy clusters and users who would rather their jobs run slower than wait for the full 12 slots to open up.

Anyway, the JSV code is pretty straightforward. Find it here:

Together with this, UGE must be configured to have cgroups enabled (see your documentation). Here is the setup on our cluster -- the freezer functionality is disabled as there may be an issue in the interaction with RHEL 6 kernels:

cgroups_params   cgroup_path=/cgroup cpuset=true mount=true \
                 killing=true freezer=false freeze_pe_tasks=false \
                 forced_numa=true h_vmem_limit=true \
                 m_mem_free_hard=true m_mem_free_soft=true \

The JSV code is short enough that I include it here:

 * Requires

package main

import (

func jsv_on_start_function() {

func job_verification_function() {
    // Set binding on serial jobs (i.e. no PE) to "linear:1
    var modified_p bool = false
    if !jsv.IsParam("pe_name") {
        jsv.SetParam("binding_strategy", "linear_automatic")
        jsv.SetParam("binding_type", "set")
        jsv.SetParam("binding_amount", "1")
        jsv.SetParam("binding_exp_n", "0")
        modified_p = true
    } else {
        pe_name, _ := jsv.GetParam("pe_name")

        /* XXX the "shm" PE is the single-node multicore PE
         *     change this to the equivalent for your site; 
         *     the "matlab" PE is identically defined to the "shm" PE
         * XXX note that this does not properly deal with a range of number of slots;
         *     it just takes the max value of the range 
        if (strings.EqualFold("shm", pe_name) || strings.EqualFold("matlab", pe_name)) {
            pe_max, _ := jsv.GetParam("pe_max")
            jsv.SetParam("binding_strategy", "linear_automatic")
            jsv.SetParam("binding_type", "set")
            jsv.SetParam("binding_amount", pe_max)
            jsv.SetParam("binding_exp_n", "0")
            modified_p = true

    if modified_p {
        jsv.Correct("Job was modified")
    } else {
        jsv.Correct("Job was not modified")


func main() {
    jsv.Run(true, job_verification_function, jsv_on_start_function)


scikit-learn with shared CBLAS and BLAS

If you have your own copies of BLAS and CBLAS installed as shared libraries, the default build of scikit-learn may end up not finding which depends on.

You may, when doing "from sklearn import svm",  get an error like:

from . import libsvm, liblinearImportError: /usr/local/blas/lib64/ undefined symbol: cgemv_

To fix it, modify the private _build_utils module:


---    2016-11-08 16:19:49.920389034 -0500
+++ 2016-11-08 15:58:42.456085829 -0500
@@ -27,7 +27,7 @@

     blas_info = get_info('blas_opt', 0)
     if (not blas_info) or atlas_not_found(blas_info):
-        cblas_libs = ['cblas']
+        cblas_libs = ['cblas', 'blas']
         blas_info.pop('libraries', None)
         cblas_libs = blas_info.pop('libraries', [])


Optimized zlib

I wasn't aware of optimized versions of zlib, the free patent-unencumbered compression library, until today. I ran across Juho Snellman's comparison benchmarks of vanilla zlib, CloudFlare zlib, Intel zlib, and zlib-ng. The upshot is that CloudFlare's optimizations seem to be the best performing. And its decompression times were 75% of the vanilla version, while Intel and zlib-ng ran at 98% and 99% respectively. This would be a clear win for read-intensive workflows, such as some bioinformatics workflows.

Read more about how CloudFlare was contacted by the Institute of Cancer Research in London to help improve zlib at this blog post by Vlad Krasnov. Intel has an overview of zlib in its Intel Performance Primitives (IPP) product. And this is zlib-ng's GitHub repo.


High Performance Python

At PyCon 2012, Ian Oszvald showed how to write high performance Python. Key is understanding performance using profiling. In his introductory remarks, he tells how he came to work in Python after years of doing industry AI research using C++. It's the same reason I started using Python extensively, and I've known several other people who adopted Python generally for the same reason:
I was more productive at the end of the first day using Python to parse SAX than I was after 5 years as being senior dev using C++
Anyway, he has a blog post about his talk, with the slides and links to further material. The source is at github: get it by doing
git clone git://
The first sort of case review he gives is converting old Fortran Xray diffraction code to Python/Cython, and then optimizing the Python in the first day getting an order of magnitude speedup. Further optimization was done using other tools, getting to a final speedup of 300 on the pure Python numpy code.

As with all performance tuning, the key is profiling the code to understand exactly where the code spends its time.