Showing posts with label ganglia. Show all posts


Building Ganglia Monitor Core 3.6.1 on RHEL 7 - "Installed (but unpackaged) file(s) found" error

Install a bunch of prerequisites, some from EPEL:
* libconfuse
* libart_lgpl-devel

Download source: monitor-core-3.6.1.tar.gz

Expand the tarball: tar xf monitor-core-3.6.1.tar.gz

That creates the directory monitor-core-3.6.1.

Enter that directory and run the bootstrap script: ./

That generates the configure script. 

Run the configure script: ./configure

That generates the SPEC file: ganglia.spec

Make a tar ball with the appropriate name:
    cd ..
    mv monitor-core-3.6.1 ganglia-3.6.1
    tar zcf ganglia-3.6.1.tar.gz ganglia-3.6.1
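
The same repackaging can be done without renaming the directory on disk. This is a minimal sketch using Python's standard tarfile module; the function names are illustrative and not part of Ganglia or rpm. It assumes the point of the rename is only that the tarball's top-level directory be called ganglia-3.6.1.

```python
import tarfile


def repack(src_dir, top_level, out_path):
    """Write src_dir into a gzip'd tarball whose top-level directory is top_level."""
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname renames the root of the tree inside the archive only;
        # the directory on disk keeps its original name.
        tar.add(src_dir, arcname=top_level)
    return out_path


def top_level_dirs(tar_path):
    """Return the set of first path components found in the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return {name.split("/", 1)[0] for name in tar.getnames()}
```

For example, repack("monitor-core-3.6.1", "ganglia-3.6.1", "ganglia-3.6.1.tar.gz") should produce an archive for which top_level_dirs() reports only ganglia-3.6.1.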

Build with rpmbuild -ta ganglia-3.6.1.tar.gz

Will probably get RPM build errors:
    bogus date in %changelog: Thu Mar 28 2008 Brad Nicholes <>
    bogus date in %changelog: Wed Jul 10 2007 Bernard Li <>
    bogus date in %changelog: Wed Jul 3 2007 Brad Nicholes <>
    bogus date in %changelog: Wed Jun 14 2007 Brad Nicholes <>
    bogus date in %changelog: Fri Feb 25 2006 Bernard Li <>
    Installed (but unpackaged) file(s) found:
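
The "bogus date" messages appear to come from changelog entries whose weekday does not match the calendar date, and they look unrelated to the unpackaged-files failure. A small sketch that checks the stamps listed above (the helper name is hypothetical):

```python
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def changelog_weekday_ok(stamp):
    """Return (matches, actual_weekday) for a stamp like 'Thu Mar 28 2008'."""
    claimed, month, day, year = stamp.split()
    # date.weekday() counts from Monday = 0, which matches the DAYS list.
    actual = DAYS[date(int(year), MONTHS.index(month) + 1, int(day)).weekday()]
    return claimed == actual, actual


for stamp in ["Thu Mar 28 2008", "Wed Jul 10 2007", "Wed Jul 3 2007",
              "Wed Jun 14 2007", "Fri Feb 25 2006"]:
    ok, actual = changelog_weekday_ok(stamp)
    print("%s: %s" % (stamp, "ok" if ok else "actually a " + actual))
```

All five stamps come back as mismatches; Mar 28 2008, for instance, was a Friday rather than a Thursday.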

Modify the SPEC file ganglia.spec: add line in "%files gmetad" section:

and in "%files gmond" section:

Make a new tar ball containing the fixed SPEC file (delete the old one first):
    rm ganglia-3.6.1.tar.gz
    tar zcf ganglia-3.6.1.tar.gz ganglia-3.6.1

Then build with: rpmbuild -ta ganglia-3.6.1.tar.gz

RPMs should be in: ~/rpmbuild/RPMS/x86_64


Ganglia module (kludge) to monitor temperature via IPMI

Since I don't have environmental monitoring in my server room, I used ipmitool to read my cluster nodes' on-board sensors to sort of get at the cold aisle ambient temperature. One should be able to see a list of available sensor readings with "ipmitool sdr" or "ipmitool sensor", the latter giving other associated parameters for the metrics, like alarm thresholds.

Since access to /dev/ipmi0 is restricted to root, my kludge was to create a root cron job which runs every N minutes, writing the appropriate value to a log file:

ipmitool -c sdr get "Inlet Temp" | cut -f2 -d, > $LOGFILE

Then, the Ganglia python module reads that file. I followed the module which is distributed with Ganglia, and the associated example.pyconf.

The code is at my github repo.
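
For readers who just want the shape of such a module, here is a minimal sketch. It is not the code from the repo. It follows the usual gmond Python module interface (metric_init, a callback per metric, metric_cleanup). The log file path, the "logfile" parameter name, the metric name, and the group are all assumptions made for illustration.

```python
# Hypothetical default; the post does not say what $LOGFILE points to.
_logfile = "/var/log/inlet_temp.log"


def read_temp(name):
    """Callback invoked by gmond; returns the logged reading, or 0.0 on error."""
    try:
        with open(_logfile) as f:
            return float(f.read().strip())
    except (IOError, OSError, ValueError):
        # Missing file or unparsable content: report 0 rather than crash gmond.
        return 0.0


def metric_init(params):
    """Called once by gmond; params come from the module's .pyconf file."""
    global _logfile
    _logfile = params.get("logfile", _logfile)  # "logfile" is an assumed key
    return [{
        "name": "inlet_temp",          # assumed metric name
        "call_back": read_temp,
        "time_max": 600,
        "value_type": "float",
        "units": "C",
        "slope": "both",
        "format": "%.1f",
        "description": "Inlet temperature logged by the ipmitool cron job",
        "groups": "environment",       # assumed group
    }]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release here."""
    pass


if __name__ == "__main__":
    # Stand-alone check outside gmond.
    for d in metric_init({}):
        print("%s = %s" % (d["name"], d["call_back"](d["name"])))
```

Because the cron job and gmond run independently, the callback only ever sees the most recent value the cron job wrote, so the metric is at most N minutes stale.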


Ganglia fix to handle process names containing underscores

UPDATE: Accepted. Current version here.

This has bugged me for a while: Ganglia's Python module which monitors process CPU and memory usage did not show any data for Grid Engine's qmaster, which has a process name of "sge_qmaster". Turns out, this is because it tries to parse out the process name by assuming it does not have underscores in it. This snippet is from the get_stat(name) function in

if name.startswith('procstat_'):
    fir = name.find('_')
    sec = name.find('_', fir + 1)
    proc = name[fir + 1:sec]
    label = name[sec + 1:]
I just submitted a pull request to change this to something which handles process names with some number of underscores. The snippet to replace the above:

if name.startswith('procstat_'):
    nsp = name.split('_')
    proc = '_'.join(nsp[1:-1])
    label = nsp[-1]
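
To see the difference, the two approaches can be run side by side outside of gmond. This is a stand-alone sketch: the wrapper function names and the sample metric names are made up for the comparison, and only the parsing lines come from the snippets above.

```python
def parse_old(name):
    """Original logic: assumes the process name contains no underscore."""
    fir = name.find('_')
    sec = name.find('_', fir + 1)
    return name[fir + 1:sec], name[sec + 1:]


def parse_new(name):
    """Replacement logic: everything between the prefix and the last field."""
    nsp = name.split('_')
    return '_'.join(nsp[1:-1]), nsp[-1]


for metric in ('procstat_httpd_cpu', 'procstat_sge_qmaster_mem'):
    print(metric, parse_old(metric), parse_new(metric))
```

Both versions agree on procstat_httpd_cpu, but for procstat_sge_qmaster_mem the old logic yields the process "sge" with label "qmaster_mem", while the new one yields "sge_qmaster" and "mem". Note that the new logic in turn assumes the label itself contains no underscore.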



Word to the wise: do not enable the multicpu module. It doesn't get disabled even if you append ".disabled" to the file name. Now, I have 265 CPU metrics.


Using the NVIDIA Python plugin for Ganglia monitoring under Bright Cluster Manager

The github repo for Ganglia gmond Python plugins contains a plugin for monitoring NVIDIA GPUs. This presumes that the NVIDIA Deployment Kit, which contains the NVML (management library), is installed via the normal means into the usual places. If you are using Bright Cluster Manager, you would have used Bright's cuda60/tdk to do the installation. That means that the library is not in one of the standard library directories. To fix it, just modify the /etc/init.d/gmond init script. Near the top, modify the LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/cm/local/apps/cuda/libs/current/lib64
The modifications to Ganglia Web, however, are out of date. I will make another post once I figure out how to modify Ganglia Web to display the NVIDIA metrics.

UPDATE: Well, turns out there seems to be no need to modify the Ganglia Web installation. Under the host view, there is a tab for "gpu metrics" which shows 22 available metrics.
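
A quick way to confirm that the LD_LIBRARY_PATH change is effective is to try loading the library from Python with the same environment that gmond will have. This is a sketch only: the helper name is made up, and it assumes the NVML shared library is named libnvidia-ml.so.1.

```python
import ctypes


def can_load(libname):
    """Return True if the dynamic loader can open libname in this process."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed library name for NVML; adjust if your installation differs.
    print(can_load("libnvidia-ml.so.1"))
```

The loader reads LD_LIBRARY_PATH when the process starts, so the variable has to be exported in the shell before launching Python; setting it from inside the script is too late.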


Writing a new SELinux policy module for a standard init daemon

This is going to be a summary of my experience writing new policy modules for Ganglia gmetad and gmond on RHEL5. Ganglia is a "scalable distributed monitoring system for high-performance computing systems." I downloaded the package source distribution, and built RPMs myself.

In case you are looking to apply this to something else, here are a couple of the underlying assumptions:
  • the service is a standard init-launched daemon
  • each service only has one executable, the daemon program
In the case of gmetad and gmond, the daemon programs are, respectively, /usr/sbin/gmetad and /usr/sbin/gmond.

I have written about creating new SELinux policies before, but I think this is better in that it wraps things up into a module that may be removed or updated more easily than a monolithic policy. Note, however, that rules governing network ports are not bundled into the module. (See below.)

This is going to be an iterative process. Before even starting, one needs to know which files/directories the daemons will write to, and whether they run as a non-root user. If the package one is working with is well-documented, this may be obtained from the documentation. If not, some trial and error will be needed. Also, for most programs, these file/directory locations are configurable.

We use the GUI SELinux Policy Generation tool, system-config-selinux. There is a good article on using this tool by Dan Walsh dating back to 2007.

We will start with gmetad. In the case of gmetad, the default location for the RRD files is /var/lib/ganglia/rrds. So, the policy should allow write access to /var/lib/ganglia.

In the SELinux Policy Generation tool, these are the entries used:
  • Name: gmetad
  • Executable: /usr/sbin/gmetad
  • Standard Init Daemon
  • Incoming network ports, both TCP and UDP: 8651,8652
  • Common Application Traits
    • Application uses syslog to log messages
    • Application uses /tmp to Create/Manipulate temporary files
    • Application uses nsswitch or translates UID's (daemons that run as non root)
  • Add Directory: /var/lib/ganglia
This generates 4 files in whatever directory you specify at the end of the druid: gmetad.fc, gmetad.if, a .sh script, and gmetad.te. If you examine the .sh script, you will see:
make -f /usr/share/selinux/devel/Makefile
/usr/sbin/semodule -i gmetad.pp

/sbin/restorecon -F -R -v /usr/sbin/gmetad
/sbin/restorecon -F -R -v /var/lib/ganglia
/usr/sbin/semanage port -a -t gmetad_port_t -p tcp 8651
/usr/sbin/semanage port -a -t gmetad_port_t -p tcp 8652
/usr/sbin/semanage port -a -t gmetad_port_t -p udp 8651
/usr/sbin/semanage port -a -t gmetad_port_t -p udp 8652
Note that the ports are not bundled into the "compiled" module file, gmetad.pp. The port rules are added "manually". The module merely defines the type gmetad_port_t.

The gmetad.te file is what we will be editing in the iterative steps below. The first line sets a version number, which allows you to update an already-loaded policy using "semodule -u gmetad.pp" after rebuilding.


Make sure the gmetad service is not running. Now, turn off the auditd service, and move the audit log file aside so that the incremental policy changes that are needed are easier to find:
# service gmetad stop
# service auditd stop
# cd /var/log/audit
# mv audit.log audit.log.20130313-1500
Then, start up the audit daemon, followed by gmetad. Wait for a few minutes (or much longer) for gmetad to do its thing, and for auditd to accumulate all or most of the AVC denials that would affect gmetad. Once a sufficient amount of time has passed:
# grep gmetad /var/log/audit/audit.log | audit2allow -R > audit.out
The output should look like:
require {
        type gmetad_t;
        class capability { setuid setgid };
}

#============= gmetad_t ==============
allow gmetad_t self:capability { setuid setgid };

Next, edit gmetad.te, and increment the version number. Append to the end of gmetad.te the contents of audit.out. Then, generate the policy file, and load the updated policy:
# make -f /usr/share/selinux/devel/Makefile
# semodule -u gmetad.pp
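
The edit step (bump the version, append audit.out) is mechanical enough to script. A minimal sketch, assuming the first line of the .te file has the usual policy_module(name, version) form with a dotted numeric version; the function names are illustrative:

```python
import re

# Matches e.g. "policy_module(gmetad,1.0.0)" at the start of a line.
_VERSION = re.compile(r"^(policy_module\(\s*\w+\s*,\s*)(\d+(?:\.\d+)*)(\s*\))", re.M)


def bump_version(te_text):
    """Increment the last numeric component of the policy_module() version."""
    def repl(m):
        parts = m.group(2).split(".")
        parts[-1] = str(int(parts[-1]) + 1)
        return m.group(1) + ".".join(parts) + m.group(3)

    new_text, count = _VERSION.subn(repl, te_text, count=1)
    if count == 0:
        raise ValueError("no policy_module(name, version) line found")
    return new_text


def bump_and_append(te_text, audit_out):
    """Return the .te text with a bumped version and audit_out appended."""
    return bump_version(te_text).rstrip("\n") + "\n\n" + audit_out
```

This only covers the first iteration, where audit.out is appended wholesale; later iterations need the new rules merged into the existing blocks instead.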
Next, shut down gmetad, shut down auditd, move the audit log away, start auditd, and start gmetad. Wait a bit, and look for new denials in the audit log by doing
# grep gmetad /var/log/audit/audit.log | audit2allow -R > audit2.out
To append any new rules, you have to manually pick out the new unique lines from audit2.out and put them in the appropriate sections (the 'require' section, or the block of allows) of gmetad.te. For gmetad.te, I found there wasn't much change between iterations. For gmond, however, there were quite a few, mostly the addition of file getattr permissions. This involved changing many lines like:

allow gmond_t lvm_t:file read;  -->  allow gmond_t lvm_t:file { getattr read };
This iteration may have to include alternating gmond and gmetad since gmetad has to connect to the gmond port, which means something like:
allow gmetad_t gmond_port_t:tcp_socket name_connect;
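
Picking out the new lines by hand gets tedious after a few rounds. The following is a rough sketch of how the allow rules from two audit2allow outputs could be merged automatically, combining permissions for the same source, target, and class. The function names are made up, and it only handles simple one-line allow rules, not the require block or interface macros.

```python
import re

# Matches "allow SRC TGT:CLASS PERM;" and "allow SRC TGT:CLASS { P1 P2 };"
_ALLOW = re.compile(r"^\s*allow\s+(\S+)\s+(\S+?):(\S+)\s+(\{[^}]*\}|\S+?)\s*;")


def parse_allows(text):
    """Map (source, target, class) -> set of permissions for each allow rule."""
    rules = {}
    for line in text.splitlines():
        m = _ALLOW.match(line)
        if not m:
            continue  # skip require blocks, comments, and anything else
        src, tgt, cls, perms = m.groups()
        rules.setdefault((src, tgt, cls), set()).update(perms.strip("{} ").split())
    return rules


def merge_allows(existing_text, new_text):
    """Union the allow rules of two texts and render them as policy lines."""
    merged = parse_allows(existing_text)
    for key, perms in parse_allows(new_text).items():
        merged.setdefault(key, set()).update(perms)
    lines = []
    for (src, tgt, cls), perms in merged.items():
        body = " ".join(sorted(perms))
        if len(perms) > 1:
            body = "{ " + body + " }"
        lines.append("allow %s %s:%s %s;" % (src, tgt, cls, body))
    return lines
```

Given an existing "allow gmond_t lvm_t:file read;" and a new "allow gmond_t lvm_t:file getattr;", this produces the combined "allow gmond_t lvm_t:file { getattr read };" line. Any new types or classes still have to be added to the require section by hand.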

Here at the Wake Forest University HPC facility, we have a combination of cfengine and Puppet to manage machine configurations: cfengine for the RHEL5 nodes, and Puppet for the RHEL6 nodes. The policy .pp file is distributed via cfengine, and a shellcommand is run by cfengine to load/update the module, and additional commands do the file system relabelling and the port rules. Basically, this reproduces the .sh file that the Policy Generation Tool creates.

UPDATE 2013-03-22: If you have a cyclic dependency in your policy modules -- in this case, gmond refers to gmetad, and gmetad refers to gmond -- you will find that you can't load the modules individually. All you have to do is load them all in one command line:
semodule -i gmond.pp gmetad.pp