These are installation and configuration notes I wrote while installing TORQUE, the maintained open source version of PBS, on our system. The TORQUE website is here: http://www.clusterresources.com/pages/products/torque-resource-manager.ph
FIRST OF ALL: torque comes with a make uninstall! So I can safely install in the "usual" place.
The download is here:
http://clusterresources.com/downloads/torque
I downloaded torque-2.3.7.tar.gz. On the master bach00:
./configure
Building components: server=yes mom=yes clients=yes gui=yes drmaa=no pam=no
PBS Machine type: linux
Remote copy: /usr/bin/scp -rpB
PBS home: /var/spool/torque
Default server: bach00.astro.bnl.gov
Unix Domain sockets: yes
Tcl: -L/usr/lib64 -ltcl8.4 -ldl -lpthread -lieee -lm
Tk: -L/usr/lib64 -ltk8.4 -L/usr/lib64 -lX11 -L/usr/lib64 -ltcl8.4 -ldl -lpthread -lieee -lm
Apparently /var/spool/torque is where job output files are written before they
are copied to the user's account later. This is probably OK.
make
make install
gave no errors.
On the master, the manual says:
"Configure the pbs_server daemon by executing the command torque.setup
<USER>, where <USER> is a username that will act as the TORQUE administrator. "
All of this should be done as root, so in the torque-2.3.7 directory I ran the
following:
./torque.setup root
The following builds the mom packages needed for the compute nodes, along with
some other packages.
make packages
Building packages from /root/src/torque-2.3.7/tpackages
rm -rf /root/src/torque-2.3.7/tpackages
mkdir /root/src/torque-2.3.7/tpackages
Building ./torque-package-server-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /usr/local/lib'
Building ./torque-package-mom-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /usr/local/lib'
Building ./torque-package-clients-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /usr/local/lib'
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-devel-linux-x86_64.sh ...
libtool: install: warning: remember to run `libtool --finish /usr/local/lib'
Building ./torque-package-doc-linux-x86_64.sh ...
Done.
The package files are self-extracting packages that can be copied
and executed on your production machines. Use --help for options.
To see what files get installed, run a command like this:
./torque-package-mom-linux-x86_64.sh --listfiles
To install:
./torque-package-mom-linux-x86_64.sh --install
I put these under /home/users/products/local/torque and ran the above command on each node.
Setting up the master involves telling it about the nodes and starting the services.
On the master I edited the nodes file:
vim /var/spool/torque/server_priv/nodes
bach01.astro.bnl.gov np=8
bach02.astro.bnl.gov np=8
bach03.astro.bnl.gov np=8
Here np=8 means the node has 8 processors/cores.
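With more than a few nodes, the nodes file could be generated rather than typed by hand. This is only a minimal sketch: it assumes the hostnames follow the bachNN.astro.bnl.gov pattern above and that every node has the same core count. The function name make_nodes_file is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print one "hostname np=N" line per compute node.
# Usage: make_nodes_file FIRST LAST NP [DOMAIN]
# FIRST and LAST are plain integers without leading zeros (1, not 01).
make_nodes_file() {
    first=$1
    last=$2
    np=$3
    domain=${4:-astro.bnl.gov}
    i=$first
    while [ "$i" -le "$last" ]; do
        # %02d zero-pads the node number: 1 -> bach01
        printf 'bach%02d.%s np=%s\n' "$i" "$domain" "$np"
        i=$((i + 1))
    done
}

# Example (as root on the master):
#   make_nodes_file 1 3 8 > /var/spool/torque/server_priv/nodes
```

Writing to standard output keeps the function harmless to try out; the redirect into server_priv/nodes is left to the caller.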
Then I started the services on bach00, the master:
cp contrib/init.d/pbs_server /etc/init.d/
chkconfig --add pbs_server
cp contrib/init.d/pbs_sched /etc/init.d/
chkconfig --add pbs_sched
/etc/init.d/pbs_server start
/etc/init.d/pbs_sched start
I needed to tell it that the /home/users directory is an NFS mount, so that output
files there are copied with cp rather than scp. This is done in the file
/var/spool/torque/mom_priv/config on each node:
$usecp *:/home/users/ /home/users/
I actually put this file under /global/data/products/torque/mom_priv for copying to each node.
Here is what must be done on each node (this could go in a script):
/global/data/products/torque/torque-package-mom-linux-x86_64.sh --install
cp /global/data/products/torque/mom_priv/config /var/spool/torque/mom_priv/
cp /global/data/products/torque/contrib/init.d/pbs_mom /etc/init.d
chkconfig --add pbs_mom
/etc/init.d/pbs_mom restart
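The per-node steps above could be wrapped in a script along these lines. This is a sketch, not something I ran as written: the names run, setup_mom, SHARE and DRY_RUN are invented here, and it defaults to only printing the commands so it is safe to try.

```shell
#!/bin/sh
# Sketch of the per-node setup, meant to be run as root on each compute node.
# SHARE is the shared directory holding the package and config files.
SHARE=${SHARE:-/global/data/products/torque}

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same five steps as listed above; stops at the first failure.
setup_mom() {
    run "$SHARE/torque-package-mom-linux-x86_64.sh" --install &&
    run cp "$SHARE/mom_priv/config" /var/spool/torque/mom_priv/ &&
    run cp "$SHARE/contrib/init.d/pbs_mom" /etc/init.d/ &&
    run chkconfig --add pbs_mom &&
    run /etc/init.d/pbs_mom restart
}

# Example:
#   setup_mom                # preview only
#   DRY_RUN=0; setup_mom     # really run the commands (as root)
```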
Check the nodes:
pbsnodes
bach01.astro.bnl.gov
     state = free
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux bach01.astro.bnl.gov 2.6.18-128.1.6.el5 #1 SMP Wed Apr 1 09:10:25 EDT 2009 x86_64,sessions=7976,nsessions=1,nusers=1,idletime=1451,totmem=37152104kb,availmem=36119148kb,physmem=32959148kb,ncpus=8,loadave=0.00,netload=1625644302356,state=free,jobs=,varattr=,rectime=1247861716

bach02.astro.bnl.gov
     state = free
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux bach02.astro.bnl.gov 2.6.18-128.1.6.el5 #1 SMP Wed Apr 1 09:10:25 EDT 2009 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=85,totmem=37152104kb,availmem=36162584kb,physmem=32959148kb,ncpus=8,loadave=0.00,netload=1633815987980,state=free,jobs=,varattr=,rectime=1247861716

bach03.astro.bnl.gov
     state = free
     np = 8
     ntype = cluster
     status = opsys=linux,uname=Linux bach03.astro.bnl.gov 2.6.18-128.1.6.el5 #1 SMP Wed Apr 1 09:10:25 EDT 2009 x86_64,sessions=22117,nsessions=1,nusers=1,idletime=5446024,totmem=37152104kb,availmem=35913092kb,physmem=32959148kb,ncpus=8,loadave=0.00,netload=1634810280259,state=free,jobs=,varattr=,rectime=1247861717
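The full pbsnodes output is hard to scan, so a small filter that keeps only the name, state and np of each node can help. This is a sketch that assumes the output layout shown above (a bare hostname line followed by "name = value" attribute lines); summarize_nodes is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: condense pbsnodes output (read from stdin) into
# one "node state np" line per node.
summarize_nodes() {
    awk '
        # Attribute lines look like "state = free" (leading blanks are ignored).
        $2 == "=" && $1 == "state" { state = $3; next }
        $2 == "=" && $1 == "np"    { np = $3; next }
        $2 == "="                  { next }   # other attributes: ntype, status, ...
        NF == 1 {                             # a bare hostname starts a new node
            if (node != "") print node, state, np
            node = $1; state = "?"; np = "?"
        }
        END { if (node != "") print node, state, np }
    '
}

# Example:
#   pbsnodes | summarize_nodes
```

For the three nodes above this should give lines such as "bach01.astro.bnl.gov free 8".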
Tried submitting a job:
> cat test.pbs
#PBS -l nodes=1:ppn=1
sleep 120
qsub test.pbs
It works, although finished jobs stay in the queue as "completed" for a while. This is probably governed by the keep_completed setting.
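A slightly fuller test job makes it easier to see where the job ran and where its output ended up. This is a sketch of what such a script might look like: the job name torque_test and the helper write_test_job are made up, while -N, -j oe and PBS_O_WORKDIR are standard PBS/TORQUE features.

```shell
#!/bin/sh
# Hypothetical helper: write a small test job script to the given path.
write_test_job() {
    cat > "$1" <<'EOF'
#!/bin/sh
#PBS -N torque_test
#PBS -l nodes=1:ppn=1
#PBS -j oe
# PBS_O_WORKDIR is set by TORQUE to the directory qsub was run from.
cd "$PBS_O_WORKDIR"
echo "running on $(hostname)"
sleep 120
EOF
}

# Example:
#   write_test_job test.pbs
#   qsub test.pbs
#   qstat
```

With -j oe the job's stdout and stderr are merged into one output file, so there is only one file to look for after the job finishes.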