Linux Follies: big data

Apache Spark is a popular (because it is fast) big data engine. The speed comes from keeping data in memory. This is an update to my older post: it is still Spark in standalone mode, using the nodes assigned by GE as the worker nodes. I have an update for using Spark 2.2.0, with Java 1.8.0.

It is mostly the same, except only one file needs to be modified: sbin/slaves.sh The Parallel Environment (PE) startup script update only adds an environment variable for defining where the worker logs go. (Into a Grid Engine job-specific directory under the job directory.) And it now specifies Java 1.8.0.

As before, the modifications to sbin/slaves.sh handle using the proper spark-env script based on the user's shell. Since that spark-env script is set up by the PE script to generate job-specific conf and log directories, everything job-specific is separated.

Apache Spark is a fast engine for big data. It can use Hadoop infrastructure (like HDFS), and provides its own map-reduce implementation. It can also be run in standalone mode, without Hadoop or the YARN resource manager.

I have been able to get Spark 1.4.1 running, with some integration into an existing Univa Grid Engine cluster. The integration is not "tight" in that the slave processes are still independently launched with ssh. I was unable to get Spark to work with qrsh. So, without tight integration, usage accounting is not exact.

I also had to make some modifications to the Spark standalone shell scripts in order to have job-specific configuration and log directories. Out of the box, Spark's shell scripts do not completely propagate the environment to the slaves. Job-specific configuration and log directories are needed because multiple users may want to run Spark jobs at the same time.

Additionally, I was not able to figure a way to constrain Spark slave instances to subsets of available processor cores. So, Spark jobs require exclusive use of compute nodes.

So, let's start there. Your GE installation needs to have the "exclusive" complex defined:

#name shortcut type relop requestable consumable default urgency
#------------------------------------------------------------------------------------------

exclusive excl BOOL EXCL YES YES 0 1000

The OS on Drexel's Proteus cluster is RHEL 6.4-ish. I use Red Hat's packaging of Oracle Java 1.7.0_85 by default. Running Spark requires the JAVA_HOME environment variable to be set, which I do in the global login script location /etc/profile.d/. I found that using /usr/lib/jvm/java did not work. It needed to be:

JAVA_HOME=/usr/lib/jvm/java-1.7.0-oracle.x86_64

Building Spark 1.4.1 was painless. I used the bundled script to generate a binary distribution tarball:

./make-distribution.sh --name myname --tgz

Untar it into some convenient location.

Next, the sbin/start-slaves.sh and sbin/stop-slaves.sh scripts need to be modified. You can look at my fork at GitHub. As they are, these two scripts just ssh to all the slave nodes to start the slave processes. However, ssh does not pass environment variables, so all the slave processes launch with the default SPARK_HOME. That means all the slave processes read the global Spark config and environment, and log to the global Spark installation log directory.

Because the remote shell is the user shell, we have to figure out the user shell in order to build the command to be executed on the slave hosts. Here is the snippet from sbin/start-slaves.sh:

# Launch the slaves

USERSHELL=$( getent passwd $USER | cut -f7 -d: )

if [ $USERSHELL = "/bin/bash" -o $USERSHELL = "/bin/zsh" -o $USERSHELL = "/bin/ksh" ] ; then

"$sbin/slaves.sh" cd "$SPARK_HOME" \&\& "." "$SPARK_CONF_DIR/spark-env.sh" \&\& "$sbin/start-slave.sh" "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"

elif [ $USERSHELL = "/bin/tcsh" -o $USERSHELL = "/bin/csh" ] ; then

"$sbin/slaves.sh" cd "$SPARK_HOME" \&\& "source" "$SPARK_CONF_DIR/spark-env.csh" \&\& "$sbin/start-slave.sh" "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"

The cluster here has two types of compute nodes: Dell C6145s with 64-core AMD CPUs, and Dell C6220s with 16-core Intel CPUs. So, I created a job class (JC) with two subclasses, and also separate parallel environments (PEs).

The job class is as follows -- all missing lines have the default "{+}UNSPECIFIED" config:

jcname spark

variant_list default intel amd

owner NONE

user_lists NONE

xuser_lists NONE

...

l_hard {+}exclusive=TRUE,h_vmem=4g,m_mem_free=3g, \

[{+}intel=vendor=intel,h_vmem=4g,m_mem_free=3g], \

[{+}amd=vendor=amd,h_vmem=4g,m_mem_free=3g]

...

pe_name {~}spark.intel,[intel=spark.intel],[amd=spark.amd]

...

The spark.intel PE is defined as follows (with the spark.amd PE defined similarly):

pe_name spark.intel

slots 99999

user_lists NONE

xuser_lists NONE

start_proc_args /cm/shared/apps/sge/var/default/common/pescripts/sparkstart.sh

stop_proc_args NONE

allocation_rule 16

control_slaves FALSE

job_is_first_task FALSE

urgency_slots min

accounting_summary FALSE

daemon_forks_slaves FALSE

master_forks_slaves FALSE

The PE start script writes the job-specific environment files, and log4j properties file:

#!/bin/bash

spark_conf_dir=${SGE_O_WORKDIR}/conf.${JOB_ID}

/bin/mkdir -p ${spark_conf_dir}

### for bash-like

sparkenvfile=${spark_conf_dir}/spark-env.sh

echo "#!/usr/bin/env bash" > $sparkenvfile

echo "export JAVA_HOME=/usr/lib/jvm/java-1.7.0-oracle.x86_64" >> $sparkenvfile

echo "export SPARK_CONF_DIR=${spark_conf_dir}" >> $sparkenvfile

echo "export SPARK_MASTER_WEBUI_PORT=8880" >> $sparkenvfile

echo "export SPARK_WORKER_WEBUI_PORT=8881" >> $sparkenvfile

echo "export SPARK_WORKER_INSTANCES=1" >> $sparkenvfile

spark_master_ip=$( cat ${PE_HOSTFILE} | head -1 | cut -f1 -d\ )

echo "export SPARK_MASTER_IP=${spark_master_ip}" >> $sparkenvfile

echo "export SPARK_MASTER_PORT=7077" >> $sparkenvfile

echo "export MASTER_URL=spark://${spark_master_ip}:7077" >> $sparkenvfile

spark_slaves=${SGE_O_WORKDIR}/slaves.${JOB_ID}

echo "export SPARK_SLAVES=${spark_slaves}" >> $sparkenvfile

spark_worker_cores=$( expr ${NSLOTS} / ${NHOSTS} )

echo "export SPARK_WORKER_CORES=${spark_worker_cores}" >> $sparkenvfile

spark_worker_dir=/lustre/scratch/${SGE_O_LOGNAME}/spark/work.${JOB_ID}

echo "export SPARK_WORKER_DIR=${spark_worker_dir}" >> $sparkenvfile

spark_log_dir=${SGE_O_WORKDIR}/logs.${JOB_ID}

echo "export SPARK_LOG_DIR=${spark_log_dir}" >> $sparkenvfile

echo "export SPARK_LOCAL_DIRS=${TMP}" >> $sparkenvfile

chmod +x $sparkenvfile

### for csh-like

sparkenvfile=${spark_conf_dir}/spark-env.csh

echo "#!/usr/bin/env tcsh" > $sparkenvfile

echo "setenv JAVA_HOME /usr/lib/jvm/java-1.7.0-oracle.x86_64" >> $sparkenvfile

echo "setenv SPARK_CONF_DIR ${spark_conf_dir}" >> $sparkenvfile

echo "setenv SPARK_MASTER_WEBUI_PORT 8880" >> $sparkenvfile

echo "setenv SPARK_WORKER_WEBUI_PORT 8881" >> $sparkenvfile

echo "setenv SPARK_WORKER_INSTANCES 1" >> $sparkenvfile

spark_master_ip=$( cat ${PE_HOSTFILE} | head -1 | cut -f1 -d\ )

echo "setenv SPARK_MASTER_IP ${spark_master_ip}" >> $sparkenvfile

echo "setenv SPARK_MASTER_PORT 7077" >> $sparkenvfile

echo "setenv MASTER_URL spark://${spark_master_ip}:7077" >> $sparkenvfile

spark_slaves=${SGE_O_WORKDIR}/slaves.${JOB_ID}

echo "setenv SPARK_SLAVES ${spark_slaves}" >> $sparkenvfile

spark_worker_cores=$( expr ${NSLOTS} / ${NHOSTS} )

echo "setenv SPARK_WORKER_CORES ${spark_worker_cores}" >> $sparkenvfile

spark_worker_dir=/lustre/scratch/${SGE_O_LOGNAME}/spark/work.${JOB_ID}

echo "setenv SPARK_WORKER_DIR ${spark_worker_dir}" >> $sparkenvfile

spark_log_dir=${SGE_O_WORKDIR}/logs.${JOB_ID}

echo "setenv SPARK_LOG_DIR ${spark_log_dir}" >> $sparkenvfile

echo "setenv SPARK_LOCAL_DIRS ${TMP}" >> $sparkenvfile

chmod +x $sparkenvfile

/bin/mkdir -p ${spark_log_dir}

/bin/mkdir -p ${spark_worker_dir}

cat ${PE_HOSTFILE} | cut -f1 -d \ > ${spark_slaves}

### defaults, sp. log directory

echo "spark.eventLog.dir ${spark_log_dir}" > ${spark_conf_dir}/spark-defaults.conf

### log4j defaults

log4j_props=${spark_conf_dir}/log4j.properties

echo "### Suggestion: use "WARN" or "ERROR"; use "INFO" when debugging" > $log4j_props

echo "# Set everything to be logged to the console" >> $log4j_props

echo "log4j.rootCategory=WARN, console" >> $log4j_props

echo "log4j.appender.console=org.apache.log4j.ConsoleAppender" >> $log4j_props

echo "log4j.appender.console.target=System.err" >> $log4j_props

echo "log4j.appender.console.layout=org.apache.log4j.PatternLayout" >> $log4j_props

echo "log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n" >> $log4j_props

echo "# Settings to quiet third party logs that are too verbose" >> $log4j_props

echo "log4j.logger.org.spark-project.jetty=WARN" >> $log4j_props

echo "log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR" >> $log4j_props

echo "log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO" >> $log4j_props

echo "log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO" >> $log4j_props

And then an example job script looks like:

#!/bin/bash

#$ -S /bin/bash

#$ -P myprj

#$ -M myemail@example.com

#$ -m ea

#$ -j y

#$ -cwd

#$ -jc spark.intel

#$ -l exclusive

#$ -pe spark.intel 32

#$ -l vendor=intel

#$ -l h_rt=0:30:00

#$ -l h_vmem=4g

#$ -l m_mem_free=3g

. /etc/profile.d/modules.sh

module load shared

module load proteus

module load gcc

module load sge/univa

module load python/2.7-current

module load apache/spark/1.4.1

###

### Set up environment for Spark

###

export SPARK_CONF_DIR=${SGE_O_WORKDIR}/conf.${JOB_ID}

. ${SPARK_CONF_DIR}/spark-env.sh

###

### The actual work is done below

###

### Start the cluster: master first, then slaves

echo "Starting master on ${SPARK_MASTER_IP} ..."

start-master.sh

echo "Done starting master."

echo "Starting slave..."

start-slaves.sh

echo "Done starting slave."

### the script which does the actual computation is submitted to the

### standalone Spark cluster

echo "Submitting job..."

spark-submit --master $MASTER_URL wordcount.py

echo "Done job."

### Stop the cluster: slaves first, then master

echo "Stopping slaves..."

stop-slaves.sh

echo "Done stopping slaves"

echo "Stopping master..."

stop-master.sh

echo "Done stopping master."

And, that's it. I have not done extensive testing or benchmarking, so I don't know what the performance is like relative to an installation that runs on Hadoop with HDFS.

Linux Follies

2020-08-24

Neocortex at Pittsburgh Supercomputing Center - 40k-core custom CPU

2017-10-04

Apache Spark integration with Grid Engine (update for Spark 2.2.0)

2015-08-06

Apache Spark integration with Grid Engine