How to build a condor cluster

Using Amazon Web Services (AWS) Elastic Compute Cloud (EC2)

UPDATE 2017: much has changed since I wrote this!

1. For most distros condor is now usually packaged and can be installed from repositories.

2. Everything can be run through the 9618 condor port and only that needs to be opened (USE_SHARED_PORT = True).

3. Custom local config files are now more usually placed in /etc/condor/config.d/ and are read in lexicographic order.

4. You may need to make sure the daemons on different nodes are allowed to talk to each other, with (in a local config script):

ALLOW_WRITE = $(ALLOW_WRITE), 192.168.100.*
ALLOW_READ = $(ALLOW_READ), 192.168.100.*
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), 192.168.100.*

Obviously change 192.168.100.* to match your own network here. You will also need to set CONDOR_HOST = <host-ip> for the workers.

5. To avoid hours (!!) of frustration, make sure all outgoing network traffic is allowed (this is the usual default for a security group). Otherwise one can get peculiar startup errors and error messages like "SECMAN:2007:Failed to end classad message". If you do get these, the most likely suspect is a networking problem and not necessarily a condor config error. Also, check sestatus (I usually back off from enforcing and set the SELinux mode to permissive).

The upshot of all this is that you probably don't need the condor config scripts below, rather you can have the shell scripts write a (short!) local config file in /etc/condor/config.d/ and everything should work.
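For example, a minimal sketch of such a worker boot script, assuming a packaged condor install (the IP addresses, file name and service command here are illustrative and will depend on your distribution and network) :

#!/bin/sh
# Sketch: write a short local condor config fragment and start the packaged condor service.
# The quoted EOF means the $(...) condor macros are written literally, not expanded by the shell.
cat > /etc/condor/config.d/50-ec2-pool.conf <<'EOF'
CONDOR_HOST = 192.168.100.10
USE_SHARED_PORT = True
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), 192.168.100.*
ALLOW_READ = $(ALLOW_READ), 192.168.100.*
ALLOW_NEGOTIATOR_SCHEDD = $(ALLOW_NEGOTIATOR_SCHEDD), 192.168.100.*
EOF
service condor start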

AWS EC2 can be accessed by the AWS Management Console in a web browser, by an API toolbox provided by AWS, or by other methods such as bespoke perl scripts. The non-web-browser methods may require proxy settings as detailed in the two sections at the end.

There is a lot of documentation on the internet related to AWS and EC2. To get going one needs to sign up to AWS, and I suggest the first thing to do is follow the "Getting Started with EC2" guide. I personally use a linux machine (either physical or virtual) to access the running instances.

One thing I found confusing initially is that there are several different AWS authentication schemes :

  • AWS account logon, needed for the AWS Management Console,
  • Access keys, needed for certain kinds of programmatic control such as perl,
  • X.509 certificates, needed for other kinds of programmatic control such as the Java API toolbox,
  • Amazon EC2 key pairs, needed to SSH into running instances.

Access to running EC2 instances is typically through SSH port 22.
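For example, with an EC2 key pair whose private key is saved locally as my-keypair.pem (an illustrative name), logging in to a running Amazon linux instance looks like :

ssh -i ~/.ssh/my-keypair.pem ec2-user@<instance-public-DNS-name>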


Customised Amazon Machine Images

EC2 allows one to capture machine images. The easiest method seems to be to use Amazon Elastic Block Storage (EBS), where one can capture an image from an EBS-backed running instance, for example by right-clicking in the Amazon Management Console, or by using the API tools.
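With the API tools, something along the following lines should capture a running EBS-backed instance as a new image (the instance id, image name and description are illustrative placeholders) :

ec2-create-image i-xxxxxxxx --name "my-custom-ami" --description "Amazon linux plus build tools and condor"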

The EBS-backed Amazon linux AMI ami-47cefa33 comes with a large selection of goodies in a yum repository so this is a good place to start (or with a more recent version of the Amazon linux AMI). For example, I start from the basic EBS-backed Amazon linux AMI (launched with the AWS Management Console) and customize it with :

sudo yum install -y make gcc gcc-c++ gcc-gfortran swig perl-ExtUtils-MakeMaker

I then capture the instance as a new AMI. This allows me to launch the AMI and immediately start to compile things with C, C++ and FORTRAN, and to build SWIG interfaces for perl, using the facilities of the perl package ExtUtils::MakeMaker.


Building a condor cluster in the cloud

My approach relies heavily on the ideas in this website. After spending fruitless hours trying to compile condor from source on the Amazon standard AMI, I found that the pre-packaged RHEL 5 versions just work! These are :

condor-7.6.1-x86_rhap_5-stripped.tar.gz
condor-7.6.1-x86_64_rhap_5-stripped.tar.gz

These (or more recent versions) can be downloaded from the condor website, but they are large, typically > 200MB in size. To avoid round-tripping through a local machine one can download these directly into the AMI instance. Downloading from the condor website requires some manual interaction, so curl or similar cannot be used directly, and running a GUI-based web browser in an AMI would take us too far afield. The solution is to use a text-based web browser such as w3m. This can be installed in Amazon linux by

sudo yum install -y w3m

The condor implementation is somewhat similar to that described in the above website, except that we will make the condor host machine an EC2 instance too. The idea is to use a customised AMI for both host and worker nodes, and have the instances initialise themselves appropriately with the use of user data which can be communicated at launch time. The condor host will launch first, then we identify its private name (not the public name). The condor worker nodes can then be launched with the private condor host name communicated in user data to allow the worker nodes to identify the condor host. We use private names as all condor nodes in our cluster will be in the same EC2 domain.
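The private DNS name is displayed in the AWS Management Console, is returned by ec2-describe-instances, and can also be read from the instance metadata service when logged in to the host itself, for example :

curl -s http://169.254.169.254/latest/meta-data/local-hostname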

The first step is to set up an AWS EC2 security group with TCP and UDP open to other members of the group on the condor port 9618 and a suitable range of high port numbers, eg 50000-50100. Also, I allow arbitrary incoming connections on the SSH port 22 to be able to log in. I will launch all instances in this group.
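The same security group can also be set up from the command line. Here is a rough sketch with the (more recent) aws command line interface, which is not used elsewhere in these notes; the group name condor-cluster is an illustrative choice, and the --group-name style assumes the default network setup :

aws ec2 create-security-group --group-name condor-cluster --description "condor pool"
aws ec2 authorize-security-group-ingress --group-name condor-cluster --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name condor-cluster --protocol tcp --port 9618 --source-group condor-cluster
aws ec2 authorize-security-group-ingress --group-name condor-cluster --protocol udp --port 9618 --source-group condor-cluster
aws ec2 authorize-security-group-ingress --group-name condor-cluster --protocol tcp --port 50000-50100 --source-group condor-cluster
aws ec2 authorize-security-group-ingress --group-name condor-cluster --protocol udp --port 50000-50100 --source-group condor-cluster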

The next step is to build a customised AMI. I start with the EBS-backed Amazon linux AMI ami-47cefa33 (for 32 bit) and first customize it with the above yum installs. I then download the appropriate condor .tar.gz (with, eg, w3m: see above) into the ec2-user home directory and install as follows (the commands below are from a 7.4.4 installation; adjust the file and directory names to match the version you actually downloaded) :

sudo useradd condor
tar xvzf condor-7.4.4-linux-x86-rhel5-dynamic.tar.gz
cd condor-7.4.4
sudo ./condor_configure --install --install-dir=/usr/local/condor --local-dir=/var/condor --type=submit,execute,manager
cd ..
rm condor-7.4.4-linux-x86-rhel5-dynamic.tar.gz
rm -rf condor-7.4.4/

The first line creates the condor user, as advised in the condor documentation. The next three lines unpack the .tar.gz and install condor into /usr/local/condor, with a local directory set to /var/condor/. The remaining three lines clean up.

I tidy up the vanilla installation by making a symlink to the base configuration file from a location in /etc where condor will find it, and moving the local configuration file out of the way :

sudo mkdir /etc/condor
sudo ln -s /usr/local/condor/etc/condor_config /etc/condor/
sudo mv /var/condor/condor_config.local /var/condor/condor_config.org

If you try to start condor in this state, it will fail (by design) and complain about the absent local configuration file.

The idea now is to use a couple of small shell scripts invoked during the boot process to put in place the appropriate local condor configuration file and start up the master condor daemon. Create the files condor_config.host, condor_config.worker, boot_as_host, boot_as_worker as described below, and move them into the ec2-user home directory using scp. Log in to the running instance and move these to their final intended locations (with the appropriate ownership and permissions) :

sudo mv condor_config.* /var/condor
sudo chown root:root /var/condor/condor_config.*
sudo mv boot_* /etc
sudo chmod 755 /etc/boot_*
sudo chown root:root  /etc/boot_*
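For the initial copy from the local machine up to the instance, something like this does the job (the key pair name is illustrative) :

scp -i ~/.ssh/my-keypair.pem condor_config.host condor_config.worker boot_as_host boot_as_worker ec2-user@<instance-public-DNS-name>:~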

As a final convenience, add the path to the standard condor commands to the ec2-user startup script :

echo 'export PATH=$PATH:/usr/local/condor/bin/' >> /home/ec2-user/.bashrc

With all this in place, capture the instance as a new AMI. As above, this is easiest to do by right-clicking on the instance in the AWS Management Console. Now, to launch an instance as host, use :

--user-data '#!/etc/boot_as_host'

You should be able to check this has worked correctly by logging in and seeing if the full suite of condor daemons is running, and checking with condor_status and condor_q.
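For example, logged in to the host instance :

ps -ef | grep condor_   # master, collector, negotiator, schedd and startd should all be running
condor_status           # should list the host's own slots
condor_q                # should show an empty queue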

After the host instance has booted, identify the private DNS name of this instance. Then you can launch any number of instances as condor worker nodes, using :

--user-data '#!/etc/boot_as_worker <full-private-host-DNS-name>'

where <full-private-host-DNS-name> should be replaced by the private DNS name of the condor host node. Shortly after the worker nodes have booted you should see them appear in the pool of machines in the condor host (check with condor_status).

The above examples show how to pass user data with the API tool ec2-run-instances, but the same can be done with all the other access methods. With some methods, such as perl, it is necessary to explicitly MIME::Base64::encode the user data.
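For instance, launching the host and then four workers might look something like this (the AMI id, key pair, security group and instance type are illustrative placeholders) :

ec2-run-instances ami-xxxxxxxx -k my-keypair -g condor-cluster -t m1.small --user-data '#!/etc/boot_as_host'
ec2-run-instances ami-xxxxxxxx -n 4 -k my-keypair -g condor-cluster -t m1.small --user-data '#!/etc/boot_as_worker <full-private-host-DNS-name>'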

The way this works is that during the boot process the Amazon linux AMI intercepts user data which begins with #!, and runs it as a script. I have set up the shell scripts boot_as_host and boot_as_worker to do some simple edits on the appropriate condor_config file, and symlink to condor_config.local, before starting the condor master daemon. When booting up as a worker, the script boot_as_worker takes one argument which specifies the host location. The mechanism relies on a double (recursive) invocation of #! but this seems to work provided that we only attempt to pass one argument.

Originally I attempted to do this with rc.local, however the latest version of the Amazon linux AMI boots with upstart rather than Sys V init, and it is not guaranteed that the legacy file rc.local will execute at the right stage. In fact I found there is a race condition with the cloud initialisation step, so one cannot assume rc.local has access to the user data. One way to fix this would be to add extra events to the upstart boot process, but this involves reading lots of documentation. The method I eventually adopted (above) relies on the fact that the cloud initialisation process checks whether the user data is intended to be an executable file. Normally this would indeed be a file, but the user data can be made executable in a single line (as above), with the limitation that only one additional argument can be transmitted (in this case, the host name for the worker nodes).

One could use the boot_as_host and boot_as_worker scripts as client-side user data files, rather than installing them on the server-side AMI. The disadvantage is that one would need to edit the boot_as_worker script on the fly to point to the condor host. This editing could be done programmatically on the client (ie the machine launching the AMIs) by another script, and this would be another way to solve the problem.
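A sketch of that alternative: keep boot_as_worker on the client as a template containing the <host> placeholder, fill it in with the condor host's private DNS name (here assumed to be in the shell variable HOST_PRIVATE_DNS), and pass the whole script as a user data file (the other option names are illustrative placeholders as before) :

sed "s/<host>/$HOST_PRIVATE_DNS/" boot_as_worker.template > /tmp/boot_as_worker
ec2-run-instances ami-xxxxxxxx -n 4 -k my-keypair -g condor-cluster -t m1.small -f /tmp/boot_as_worker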

Here is the configuration file condor_config.host :

CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = EC2 condor pool
###############################################################################
# Pool settings
###############################################################################
# EC2 workers don't have shared filesystems or authentication
UID_DOMAIN = <domain>
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
USE_NFS = False
USE_AFS = False
USE_CKPT_SERVER = False
# The same for all machines with the same condor user
CONDOR_IDS = 500.501
###############################################################################
# Local paths
###############################################################################
RELEASE_DIR = /usr/local/condor
LOCAL_DIR = /var/condor
# LOG and EXECUTE are set automatically by the startup script. They can't be
# changed here.
LOG = $(LOCAL_DIR)/log
EXECUTE = $(LOCAL_DIR)/execute
LOCK = $(LOG)
###############################################################################
# Security settings
###############################################################################
# Allow local host and the central manager to manage the node
ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME), $(CONDOR_HOST)
ALLOW_READ = *.<domain>
ALLOW_WRITE = *.<domain>
###############################################################################
# CPU usage settings
###############################################################################
# Don't count a hyperthreaded CPU as multiple CPUs
COUNT_HYPERTHREAD_CPUS = False
# Leave this commented out. If your instance has more than one CPU (i.e. if
# you use a large instance or something) then condor will advertise one
# slot for each CPU.
#NUM_CPUS = 1
###############################################################################
# Daemon settings
###############################################################################
# Full list on the host node
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
# Don't run java
JAVA =
###############################################################################
# Classads
###############################################################################
# Run everything, all the time
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
WANT_VACATE = False
WANT_SUSPEND = True
SUSPEND_VANILLA = False
WANT_SUSPEND_VANILLA = True
KILL = False
STARTD_EXPRS = START
###############################################################################
# Network settings
###############################################################################
# Use random numbers here so the workers don't all hit the collector at
# the same time. If there are many workers the collector can get overwhelmed.
UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
# Port range for Amazon firewall
LOWPORT=50000
HIGHPORT=50100

and the configuration file condor_config.worker :

CONDOR_HOST = <host>
COLLECTOR_NAME = EC2 condor pool
###############################################################################
# Pool settings
###############################################################################
# EC2 workers don't have shared filesystems or authentication
UID_DOMAIN = <domain>
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
USE_NFS = False
USE_AFS = False
USE_CKPT_SERVER = False
# The same for all machines with the same condor user
CONDOR_IDS = 500.501
###############################################################################
# Local paths
###############################################################################
RELEASE_DIR = /usr/local/condor
LOCAL_DIR = /var/condor
# LOG and EXECUTE are set automatically by the startup script. They can't be
# changed here.
LOG = $(LOCAL_DIR)/log
EXECUTE  = $(LOCAL_DIR)/execute
LOCK = $(LOG)
###############################################################################
# Security settings
###############################################################################
# Allow local host and the central manager to manage the node
ALLOW_ADMINISTRATOR = $(FULL_HOSTNAME), $(CONDOR_HOST)
ALLOW_READ = *.<domain>
ALLOW_WRITE = *.<domain>
###############################################################################
# CPU usage settings
###############################################################################
# Don't count a hyperthreaded CPU as multiple CPUs
COUNT_HYPERTHREAD_CPUS = False
# No need to be nice (except on host)
JOB_RENICE_INCREMENT = 0
# Leave this commented out. If your instance has more than one CPU (i.e. if
# you use a large instance or something) then condor will advertise one
# slot for each CPU.
#NUM_CPUS = 1
###############################################################################
# Daemon settings
###############################################################################
# Only master and startd, other daemons aren't needed on workers
DAEMON_LIST = MASTER, STARTD
# Don't run java
JAVA =
###############################################################################
# Classads
###############################################################################
# Run everything, all the time
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
WANT_VACATE = False
WANT_SUSPEND = True
SUSPEND_VANILLA = False
WANT_SUSPEND_VANILLA = True
KILL = False
STARTD_EXPRS = START
###############################################################################
# Network settings
###############################################################################
# Use random numbers here so the workers don't all hit the collector at
# the same time. If there are many workers the collector can get overwhelmed.
UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
# Port range for Amazon firewall
LOWPORT=50000
HIGHPORT=50100

The two files are very similar but I think there are advantages in maintaining two templates. The entries "JAVA = " override the setting from the base configuration to disable java jobs; one can allow java jobs by commenting these lines out. One should check that the condor user and group id created above correspond to the values in the CONDOR_IDS parameter. The port range should correspond to that set up for the AWS security group. The placeholders <domain> and <host> are replaced automatically at boot time by the shell scripts described next.

Here is the shell script boot_as_host :

#!/bin/sh
domain=`hostname -d`
cd /var/condor
sed -i.bak -e "s/<domain>/$domain/" condor_config.host
ln -sf condor_config.host condor_config.local
/usr/local/condor/sbin/condor_master

and the shell script boot_as_worker :

#!/bin/sh
domain=`hostname -d`
host=$1
cd /var/condor
sed -i.bak -e "s/<host>/$host/" -e "s/<domain>/$domain/" condor_config.worker
ln -sf condor_config.worker condor_config.local
/usr/local/condor/sbin/condor_master

These work as described above, by editing in place the appropriate configuration file, symlinking to this from the local configuration file, and starting up the condor master daemon.


EC2 access using EC2 Java API tools, through a proxy

I recommend that you make sure you can control things through the AWS Management Console first!

In the past I have used a virtual linux machine running under VMWare Workstation; presently I use a physical linux machine. Within this I have installed Sun/Oracle Java and the EC2 Java API tools in the /opt directory. Thus I have set the following environment variables :

export JAVA_HOME=/opt/jdk1.6.0_23/
export EC2_HOME=/opt/ec2-api-tools-1.4.2.2/

(I don't bother adding the tools to my path.)

I then set up an X.509 certificate and associated private key as described in the AWS documentation, and put them in the ~/.ec2/ directory. Thus I need :

export EC2_PRIVATE_KEY=~/.ec2/pk-<code>.pem
export EC2_CERT=~/.ec2/cert-<code>.pem

(obviously you should replace these dummy file names with your own actual full file names).

Next, it may be necessary to set a proxy for the https access requests generated by the API :

export EC2_JVM_ARGS="-Dhttps.proxyHost=<proxy FQDN>"

(this is the only tricky undocumented step). The <proxy FQDN> is the fully qualified domain name of the proxy, eg proxy.company.com, without the http:// gubbins.

With all these in place one can examine the regions :

% /opt/ec2-api-tools-1.4.2.2/bin/ec2-describe-regions
REGION eu-west-1 ec2.eu-west-1.amazonaws.com
REGION us-east-1 ec2.us-east-1.amazonaws.com
REGION ap-northeast-1 ec2.ap-northeast-1.amazonaws.com
REGION us-west-1 ec2.us-west-1.amazonaws.com
REGION ap-southeast-1 ec2.ap-southeast-1.amazonaws.com

Since I am signed up to the eu-west-1 region I finally set :

export EC2_URL=https://ec2.eu-west-1.amazonaws.com

Now we're ready to go...


EC2 access using perl Net::Amazon::EC2, through a proxy

The perl package Net::Amazon::EC2 is slightly out of date but works. However it does not allow for a proxy. To get around this, a small hack can be made to the source code.

First I do

sudo yum install perl-Net-Amazon-EC2 perl-Test-Simple

on my local linux machine which brings in all the required dependencies. I then download the source code for the Net::Amazon::EC2 package from CPAN and unpack it somewhere convenient. The Net::Amazon::EC2 package can be built and installed by the following (this fails if the Test::Simple package is missing) :

perl Makefile.PL
make
sudo make install

Support for a proxy in the Net::Amazon::EC2 package can now be added by editing the file lib/Net/Amazon/EC2.pm and adding the following two marked lines after the indicated existing lines :

has 'SecretAccessKey'   => ( is => 'ro', isa => 'Str', required => 1 ); # EXISTING LINE
has 'https_proxy'       => ( is => 'ro', isa => 'Str', required => 0, default => '' ); # ADDED LINE

and

my $ua  = LWP::UserAgent->new(); # EXISTING LINE
$ua->proxy('https', $self->https_proxy) if ($self->https_proxy ne ''); # ADDED LINE

Then one should be able to recompile and reinstall the Net::Amazon::EC2 package. The default base URL is wrong too, but this can be over-ridden in the constructor call. The example perl script described in the package documentation would be modified to :

my $ec2 = Net::Amazon::EC2->new(
       AWSAccessKeyId => 'PUBLIC_KEY_HERE',
       SecretAccessKey => 'SECRET_KEY_HERE',
       base_url => 'https://ec2.eu-west-1.amazonaws.com', # ADDED LINE
       https_proxy => '<proxy address>' # ADDED LINE
);

The <proxy address> is the full address, eg http://proxy.company.com:8080/. Again, we're now ready to roll...