Showing posts with label python.


Python packaging example with commandline script and module-level constant

The Python packaging landscape has evolved a bit since setuptools. I wanted to have a ready-made example for a common use case: a Python module that provides one or more commandline scripts, and uses module-level constants.

I based mine on the Python Packaging Tutorial. It’s available on GitHub.

Improvements to be made include specifying requirements.
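For reference, the key piece for the commandline-script use case is the [project.scripts] table in pyproject.toml. The package and function names below are illustrative placeholders, not the actual contents of my repo:

```toml
[project]
name = "example-pkg"
version = "0.1.0"
# This is also where requirements would be specified, e.g.:
# dependencies = ["requests >= 2.0"]

[project.scripts]
# Installs a commandline script "example-cli" which invokes
# the main() function in example_pkg/cli.py
example-cli = "example_pkg.cli:main"
```

Module-level constants need no special packaging support; they simply live in the module and are importable as usual.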


Notes on building Genomic Data Commons gdc-client

The National Cancer Institute’s Genomic Data Commons (GDC) produces gdc-client, a tool which facilitates data transfer to and from their data repository; it is open sourced on GitHub.

My first pass at building it gave an error while trying to build lxml without Cython:

      building 'lxml.etree' extension
      creating build/temp.linux-x86_64-cpython-311
      creating build/temp.linux-x86_64-cpython-311/src
      creating build/temp.linux-x86_64-cpython-311/src/lxml
      gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -DCYTHON_CLINE_IN_TRACEBACK=0 -I/usr/include/libxml2 -Isrc -Isrc/lxml/includes -I/home/chind/Src/gdc-client/venv/include -I/home/chind/opt/include/python3.11 -c src/lxml/etree.c -o build/temp.linux-x86_64-cpython-311/src/lxml/etree.o -w
      src/lxml/etree.c:289:12: fatal error: longintrepr.h: No such file or directory
        289 |   #include "longintrepr.h"
            |            ^~~~~~~~~~~~~~~
      compilation terminated.
      Compile failed: command '/usr/bin/gcc' failed with exit code 1

The fix was to build and install lxml from source, using Cython. Note that Cython < 3 is required, i.e. a 0.29.x release.

Once lxml 4.4.2 was installed manually, following the gdc-client build instructions was successful, and the gdc-client script was created.

For more detail, see this Gist.


Python csv.DictReader

I use Python csv a lot, especially csv.DictReader. This article on Python csv.DictReader at Python Pool was useful, particularly for showing how to subclass csv.DictReader to get case-insensitive column names.
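The gist of the technique, as a minimal sketch (my own reconstruction, not the article's exact code): override the fieldnames property so it normalizes case.

```python
import csv
import io

class CaseInsensitiveDictReader(csv.DictReader):
    """DictReader that lower-cases field names, so column-name
    lookups do not depend on the case used in the header row."""
    @property
    def fieldnames(self):
        names = csv.DictReader.fieldnames.fget(self)
        return [n.lower() for n in names] if names else names

# The header uses mixed case; the row keys come out lower-cased.
data = io.StringIO("Name,AGE\nAlice,30\n")
row = next(CaseInsensitiveDictReader(data))
print(row['name'], row['age'])  # Alice 30
```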


Scripting Bright Cluster Manager 9.0 with Python

It has been more than 6 years since the previous post about using the Python API to script Bright Cluster Manager (CM). Time for an update.

I have to do the same as before: change the “category” on a whole bunch of nodes.

N.B. the Developer Manual has some typos which make it look like you can specify objects as strings of their names, e.g. cluster.get_by_type('Node'); that does not work as written.


Dealing with Excel CSV/TSV files in UTF-16LE encoding, and an invisible character

Motivation: I am trying to read a TSV file produced by Blackboard courseware, using Python's built-in csv library and a simple method:

import csv

with open('myfile.tsv', 'r', encoding='utf-16le') as cf:
    cr = csv.DictReader(cf, dialect='excel-tab')

    for row in cr:
        print(row)
What I got for the fieldnames was:

['\ufeff"Last Name"', 'First Name']

I.e. the "Last Name" field got munged with \ufeff or U+FEFF in normal Unicode notation.

and the rows looked like:

OrderedDict([('\ufeff"Last Name"', 'Doe'), ('First Name', 'Alice')])

Just using vim to look at the file, there seemed to be nothing weird about the first line, which contains the column names. The culprit, U+FEFF, is a ZERO WIDTH NO-BREAK SPACE character used as a Byte Order Mark (BOM). It allows reading processes, e.g. file(1)/magic(5), to figure out the byte order of the file.

But it messes up the csv.DictReader parsing of the field names.

Turns out, there is an easy fix. You can just modify the field names directly:

cr = csv.DictReader(cf, dialect='excel-tab')
cr.fieldnames[0] = 'Last Name'

And the output is fixed, too:

OrderedDict([('Last Name', 'Doe'), ('First Name', 'Alice')])
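An alternative worth noting: Python's generic 'utf-16' codec (as opposed to 'utf-16le') uses the BOM to detect byte order and strips it, so the field names come through clean. A small self-contained sketch, with an in-memory file standing in for the Blackboard TSV:

```python
import csv
import io

# Simulate what Excel/Blackboard write: a BOM followed by UTF-16-LE text.
text = '"Last Name"\tFirst Name\r\nDoe\tAlice\r\n'
raw = b'\xff\xfe' + text.encode('utf-16-le')

# Decoding with 'utf-16' (no 'le') consumes the BOM itself, so
# '\ufeff' never reaches csv.DictReader.
cf = io.StringIO(raw.decode('utf-16'))
cr = csv.DictReader(cf, dialect='excel-tab')
rows = list(cr)
print(cr.fieldnames)  # ['Last Name', 'First Name']
```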


Python multiprocessing ignores cgroups

At work, I noticed a Python job that was causing a large overload condition. The job requested 16 CPU cores, and was running gensim.models.ldamulticore requesting 15 workers. However, the load on that server indicated an overload of many times over. This is despite a cgroups cpuset restricting the number of cores for that process to 16.

It turns out, gensim.models.ldamulticore uses Python multiprocessing. That module decides how many worker processes to run based on the total number of CPU cores on the machine, not the number the process is actually allowed to use. This completely bypasses the limitations imposed by cgroups.

There is currently an open enhancement request to add a new function to multiprocessing for requesting the number of usable CPU cores rather than the total number of CPU cores.
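The mismatch is easy to demonstrate on a cgroup-restricted Linux host: multiprocessing.cpu_count() reports the machine total, while the process's scheduler affinity mask (which a cpuset does restrict) reports what is actually usable. A sketch:

```python
import multiprocessing
import os

# Total cores on the machine -- what multiprocessing sizes its pools by.
total = multiprocessing.cpu_count()

# Cores this process is actually allowed to run on; on Linux this
# reflects cgroup cpusets.  os.sched_getaffinity is Linux-only,
# hence the fallback.
try:
    usable = len(os.sched_getaffinity(0))
except AttributeError:
    usable = total

print("total:", total, "usable:", usable)

if __name__ == "__main__":
    # Sizing the pool from the affinity mask respects the cpuset.
    with multiprocessing.Pool(processes=usable) as pool:
        print(pool.map(abs, [-1, -2, -3]))
```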


Proposed fix for duplicity Azure backend breakage

At work, I just got a Microsoft Azure Cool Blob Storage allocation for doing off-site backups. The Python-based duplicity software is supposed to be able to use Azure Blob storage as a backend. It does this by using the azure-storage Python module provided by Microsoft.

Unfortunately, a recent update of azure-storage broke duplicity. The fix was not too hard to implement: mostly minor changes in class names, and one simplification in querying blob properties. It took me a few hours to make a fix, and I just submitted my changes as a merge request to duplicity. The proposed merge can be found at Launchpad.

UPDATE: Unfortunately, I made a mistake and made my changes against the 0.7.14 release rather than trunk. It looks like there is already a lot of work in trunk to deal with the current azure-storage version, so I withdrew the merge request. I'll work from the 0.8 series branch instead. Currently, it looks like 0.8 works as-is.


Linux daemon using Python daemon with PID file and logging

The python-daemon package (PyPI listing, Pagure repo) is very useful. However, I feel it has suffered a bit from sparse documentation, and from the inclusion of a "runner" example, which is in the process of being deprecated (as of two weeks ago, 2016-10-26).

There are several questions about it on StackOverflow, going back a few years: 2009, 2011, 2012, and 2015. Some refer to the included runner as an example, which is being deprecated.

So, I decided to figure it out myself. I wanted to use the PID lockfile mechanism provided by python-daemon, and also the Python logging module. The inline documentation for python-daemon mentions the files_preserve parameter, a list of file handles which should be held open when the daemon process is forked off. However, there wasn't an explicit example, and one StackOverflow solution for logging under python-daemon mentions that the file handle for logging objects may not be obvious:

  • for a StreamHandler, it's logging.root.handlers[0].stream.fileno()
  • for a SyslogHandler, it's logging.root.handlers[1].socket.fileno()
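For a plain FileHandler, the relevant handle is the handler's stream attribute. A sketch (python-daemon itself is not imported here; a throwaway temp file stands in for the real logfile):

```python
import logging
import tempfile

# A throwaway log file stands in for /var/log/eg_daemon.log.
logf = tempfile.NamedTemporaryFile(suffix='.log', delete=False).name

logger = logging.getLogger('eg_daemon')
fh = logging.FileHandler(logf)
logger.addHandler(fh)

# This is the handle to keep open across daemonization, e.g.
# daemon.DaemonContext(files_preserve=[fh.stream]).
fd = fh.stream.fileno()
print(fd)
```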

After a bunch of experiments, I think I have sorted it out to my own satisfaction. My example code is in GitHub: prehensilecode/python-daemon-example. It also has a SysV init script. 

The daemon itself is straightforward, doing nothing but logging timestamped messages to the logfile. The full code is pasted here:

#!/usr/bin/env python3.5
import sys
import os
import time
import argparse
import logging
import daemon
from daemon import pidfile

debug_p = False


def do_something(logf):
    ### This does the "work" of the daemon

    logger = logging.getLogger('eg_daemon')
    logger.setLevel(logging.DEBUG)

    fh = logging.FileHandler(logf)
    fh.setLevel(logging.DEBUG)

    formatstr = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    formatter = logging.Formatter(formatstr)

    fh.setFormatter(formatter)

    logger.addHandler(fh)

    while True:
        logger.debug("this is a DEBUG message")
        logger.info("this is an INFO message")
        logger.error("this is an ERROR message")
        time.sleep(5)


def start_daemon(pidf, logf):
    ### This launches the daemon in its context

    global debug_p

    if debug_p:
        print("eg_daemon: entered run()")
        print("eg_daemon: pidf = {}    logf = {}".format(pidf, logf))
        print("eg_daemon: about to start daemonization")

    ### XXX pidfile is a context
    with daemon.DaemonContext(
        umask=0o002,
        pidfile=pidfile.TimeoutPIDLockFile(pidf),
        ) as context:
        do_something(logf)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Example daemon in Python")
    parser.add_argument('-p', '--pid-file', default='/var/run/')
    parser.add_argument('-l', '--log-file', default='/var/log/eg_daemon.log')

    args = parser.parse_args()

    start_daemon(pidf=args.pid_file, logf=args.log_file)


scikit-learn with shared CBLAS and BLAS

If you have your own copies of BLAS and CBLAS installed as shared libraries, the default build of scikit-learn may end up not finding the libraries it depends on.

You may, when doing "from sklearn import svm",  get an error like:

from . import libsvm, liblinear
ImportError: /usr/local/blas/lib64/ undefined symbol: cgemv_

To fix it, modify the private _build_utils module:


---    2016-11-08 16:19:49.920389034 -0500
+++ 2016-11-08 15:58:42.456085829 -0500
@@ -27,7 +27,7 @@

     blas_info = get_info('blas_opt', 0)
     if (not blas_info) or atlas_not_found(blas_info):
-        cblas_libs = ['cblas']
+        cblas_libs = ['cblas', 'blas']
         blas_info.pop('libraries', None)
     else:
         cblas_libs = blas_info.pop('libraries', [])


SWIG is great - Python DRMAA2 interface in less than an hour

I have never used SWIG before, surprisingly. I figured creating a Python interface to DRMAA2 would be a good self-tutorial. Turns out to be almost trivial, following the directions here.

My Python (3.5) DRMAA2 interface code is on GitHub. The hardest part, really, was writing the Makefile.

NOTE: no testing at all has been done. This is just a Q&D exercise to use SWIG.


Jupyter (FKA IPython) project gets $6m funding

The Helmsley Charitable Trust, the Alfred P. Sloan Foundation, and the Gordon and Betty Moore Foundation just announced a $6m grant to UC Berkeley and Cal Poly to fund Project Jupyter. Jupyter evolved from the IPython project, abstracting out the language-agnostic parts. It also serves as an interactive shell for Python 3, Julia, R, Haskell, and Ruby. I think the most notable thing it provides is a web-based GUI “notebook” similar to what has been available in Maple and Mathematica for a while. (Maybe Matlab, too: I have not used Matlab much.)

Correction: Jupyter serves as an interactive shell for a lot more than what I listed. Here is the full list.


pylab confusions

There are three pylabs that one may encounter in using Python. Two have been around for a while, and the third just showed up less than a month ago.

The “real” pylab is the procedural interface to matplotlib, i.e. a MATLAB-like command line interface. It imports matplotlib.pyplot and numpy into a single namespace. You can use it from ipython’s prompt by calling the magic function “%pylab”. It is no longer recommended by the matplotlib people. The recommended way is to import with abbreviated namespace names, and use the qualified functions. For example:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2, 100)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
Then there is a proposal by Keir Mierle to improve on the pylab idea: a single package one might install in order to use Python for interactive analysis. This is written up in the SciPy wiki, but does not seem to have been updated since 2012.

And finally, if you are like me, and have not been thinking too hard, and typed “pip install pylab” you get this new package from PyPI, first added on 2015-04-23. It does nothing but pull in several other Python packages, i.e. it serves as a metapackage. You can see the source is basically a dummy, with all the action in the requirements defined in


Python's with statement

Old habits die hard. I learned a long time ago (Python 1.x) this pattern for opening and operating on files:

    try:
        f = open("filename.txt", "r")
        for l in f:
            print l
        f.close()
    except IOError as e:
        print "I/O error({0}): {1}".format(e.errno, e.strerror)

Since Python 2.6, the with statement does this automatically:

    with open("filename.txt", "r") as f:
        for l in f:
            print l

The with statement works with some other classes, too.
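Any object with __enter__ and __exit__ methods participates; threading.Lock is a stdlib example, and contextlib makes writing your own trivial. A quick sketch (Python 3 syntax, my own toy example):

```python
import contextlib

trace = []

@contextlib.contextmanager
def tag(name):
    trace.append("<{}>".format(name))   # runs on entering the with block
    yield
    trace.append("</{}>".format(name))  # runs on exiting the with block

with tag("p"):
    trace.append("hello")

print(trace)  # ['<p>', 'hello', '</p>']
```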

PS Blogger really needs a code block style.


Scripting Bright Cluster Manager

At my new position as Sr. SysAdmin at Drexel's University Research Computing Facility (URCF), we use Bright Cluster Manager. I am new to Bright, and I am finding it very nice, indeed. One of its best features is programmatic access via a Python API. In about half an hour, I figured out enough to modify the node categories of all the nodes in the cluster.

Node categories group nodes which have similar configurations and roles. An example configuration may be a list of remote filesystem mounts, and an example role may be a Grid Engine compute node with 64 job slots. The cluster at URCF has 64-core AMD nodes and 16-core Intel nodes, so I created a category for each of these. Then, I needed to change the node categories from the default to the architecture-specific categories. A short script using the Python API did it for the Intel nodes.


Python tip - converting HH:MM:SS time into more understandable format

The Torque resource manager for clusters prints out amounts of time -- CPU time, or walltime -- in HH:MM:SS format. For small numbers, it's easy enough to understand: 04:00:00 = 4 hours. But for larger numbers, I wanted the time amount specified in days, hours, minutes, and seconds. Here's a quick way to do it using the datetime module. (I'm working with Python 2.6 here, which is what comes with RHEL6.) It uses a list comprehension to split up the HH:MM:SS time string. (BTW, I am using ipython as the interactive Python shell.)

In [4]: import re, datetime

In [5]: def timedeltastr(timestr):
   ...:     dayspat = re.compile(r'\ days?,')
   ...:     t = [int(i) for i in timestr.split(':')]
   ...:     td = datetime.timedelta(hours=t[0], minutes=t[1], seconds=t[2])
   ...:     return dayspat.sub('d', str(td))
In [6]: timestr = '3400:00:00'

In [7]: timedeltastr(timestr)
Out[7]: '141d 16:00:00'


Small Python tip - sorted iterating over dictionary

Python dictionaries are great. However, iterating over dictionaries results in an unsorted order:
In [1]: d = {'a':10, 'b': 20, 'c': 30}

In [2]: for key,value in d.iteritems():
   ...:     print key, value
a 10
c 30
b 20
The fix is to use sorted():

In [3]: for key,value in sorted(d.iteritems()):
   ...:     print key, value
a 10
b 20
c 30


High Performance Python

At PyCon 2012, Ian Ozsvald showed how to write high-performance Python. The key is understanding performance through profiling. In his introductory remarks, he tells how he came to work in Python after years of doing industry AI research in C++. It's the same reason I started using Python extensively, and I've known several other people who adopted Python for generally the same reason:
I was more productive at the end of the first day using Python to parse SAX than I was after 5 years as being senior dev using C++
Anyway, he has a blog post about his talk, with the slides and links to further material. The source is at github: get it by doing
git clone git://
The first case review he gives is converting old Fortran X-ray diffraction code to Python/Cython; optimizing the Python on the first day gave an order-of-magnitude speedup. Further optimization using other tools brought the final speedup over the pure Python/numpy code to 300.

As with all performance tuning, the key is profiling the code to understand exactly where the code spends its time.


Twelve Days of Python

Here's something for learners of programming. The Hello World! blog is running a series called the 12 Days of Python. It uses the pygame module to create some graphics.


Small matplotlib tip

While making plots using matplotlib, I kept getting this error message when trying to write a string to a certain location in the plot:
UserWarning: findfont: Font family ['cmsy10'] not found.
Turns out, the fix is simple; add the following:

matplotlib.rc('text', usetex=True)

matplotlib is a wonderful Python package for doing plotting and analysis. It uses numpy. Used interactively with the pylab module, it feels close to Matlab. If you are a "scientific" user, I highly recommend checking it out.