SA 256

Original text by D.Shin, revisions by S. Kirklin

A more in-depth, and correspondingly more complicated, guide to system administration on our group clusters.

Job management

PBS/torque - pbs_server/pbs_mom

On the master node, pbs_server must be running to accept jobs, and pbs_mom must be running on every compute node. Do NOT restart or kill a running pbs_server daemon unless it is really necessary: doing so resets all running jobs.

josquin ~ # ps aux|grep pbs_server

root 6728 0.0 0.0 15500 3560 ? Ss Jan02 16:39 pbs_server

If PBS commands such as qstat, qsub, and qalter are not working, use the command above to confirm that pbs_server is not running, then launch pbs_server manually. DO NOT restart the entire PBS service with /etc/init.d/pbs restart.

josquin ~ # /usr/local/sbin/pbs_server

Maui is the job scheduler on our clusters. It starts at boot along with pbs_server. If none of the Maui commands are working, such as showq, diagnose -f (aliased as fs), diagnose -p (aliased as p), and showstart, relaunch maui with:

josquin ~ # /usr/local/maui/sbin/maui
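The check-then-relaunch steps described above for pbs_server and maui could be wrapped in one small helper. This is only a sketch: the function name ensure_daemon and the DRY_RUN switch are made up for illustration, and it assumes pgrep is installed on the master node.

```shell
#!/bin/sh
# ensure_daemon NAME PATH
# Launch PATH only if no process named exactly NAME is already running.
# ensure_daemon and DRY_RUN are hypothetical names, not part of PBS or Maui.
# Set DRY_RUN=1 to print the action instead of launching anything.
ensure_daemon() {
    name=$1
    path=$2
    if pgrep -x "$name" >/dev/null 2>&1; then
        # Never touch a daemon that is already up: restarting resets jobs.
        echo "$name is already running; leaving it alone"
    elif [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would start $path"
    else
        echo "starting $path"
        "$path"
    fi
}

# Example (as root on the master node):
#   ensure_daemon pbs_server /usr/local/sbin/pbs_server
#   ensure_daemon maui /usr/local/maui/sbin/maui
```

Because the helper only ever starts a missing daemon, it avoids the full service restart that this guide warns against.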

Fairshare Scheme

A fairshare scheme is applied on all clusters; its settings are in the maui.cfg file in /usr/local/maui. Jobs are launched in order of priority, which is a weighted combination of several factors, such as fairshare usage and the resources requested.
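To see which fairshare settings are currently active, one option is to filter maui.cfg for Maui's fairshare parameters (names beginning with FS, such as FSPOLICY, FSDEPTH, FSINTERVAL, FSDECAY, and FSWEIGHT, plus any FSTARGET attributes). The sketch below assumes the file location given above; the helper name show_fairshare is made up, and which of these parameters our maui.cfg actually sets is not recorded here.

```shell
# show_fairshare [CONFIG]
# Print the non-comment lines of a Maui config file that set fairshare
# parameters. show_fairshare is a hypothetical helper name; the default
# path is taken from the text above.
show_fairshare() {
    cfg=${1:-/usr/local/maui/maui.cfg}
    # Drop comment lines first, then keep FS* parameters and FSTARGET entries.
    grep -v '^[[:space:]]*#' "$cfg" | grep -E '^[[:space:]]*FS[A-Z]+|FSTARGET'
}

# Example: show_fairshare
```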

NEED TO UPDATE THE FAIRSHARE SCHEME THEN PUT THAT INFORMATION HERE.

Package management

To install, remove, or upgrade a program on a cluster, use the operating system's package manager. Wikipedia has a good summary of the various package management systems on Linux.

Victoria

Its OS is CentOS 5.2, which uses RPM, the most common Linux package management system, with yum as the front end. To search for a package and install it:

victoria ~ # yum search scipy

Loading "fastestmirror" plugin

Loading "priorities" plugin

Loading "downloadonly" plugin

Loading mirror speeds from cached hostfile

* rpmforge: fr2.rpmfind.net

* base: yum.singlehop.com

* updates: mirror.sanctuaryhost.com

* addons: mirror.team-cymru.org

* extras: pubmirrors.reflected.net

rpmforge 100% |=========================| 1.1 kB 00:00

base 100% |=========================| 2.1 kB 00:00

updates 100% |=========================| 1.9 kB 00:00

addons 100% |=========================| 951 B 00:00

extras 100% |=========================| 2.1 kB 00:00

Excluding Packages in global exclude list

Finished

0 packages excluded due to repository priority protections

python-numpy.x86_64 : Fast multidimensional array facility for Python

python-numpy.x86_64 : Fast multidimensional array facility for Python

victoria ~ # yum install python-numpy

Josquin/Byrd/Palestrina

Gentoo is installed on palestrina, josquin, and byrd, and uses Portage for package management. To search the available software and install a package:

palestrina ~ # emerge -s scipy

Searching...

[ Results for search key : scipy ]

[ Applications found : 1 ]

* sci-libs/scipy

Latest version available: 0.7.2-r1

Latest version installed: 0.7.2-r1

Size of files: 13,340 kB

Homepage: http://www.scipy.org/ http://pypi.python.org/pypi/scipy

Description: Scientific algorithms library for Python

License: BSD

palestrina ~ # emerge scipy

Encina

Ubuntu 8.04 server is installed on encina, and it uses APT for package management.
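The Encina section gives no commands; the standard APT counterparts of the yum and emerge examples are apt-cache search and apt-get install. As a quick reference across all three package managers, here is a small sketch. The function name pkg_commands is made up, and the package name passed in is just an example: the same software is often packaged under different names on each distribution.

```shell
# pkg_commands MANAGER PACKAGE
# Print (not run) the search and install commands for one package manager.
# MANAGER is yum (victoria), portage (josquin/byrd/palestrina), or apt (encina).
# pkg_commands is a hypothetical helper name.
pkg_commands() {
    case $1 in
        yum)     printf 'yum search %s\nyum install %s\n' "$2" "$2" ;;
        portage) printf 'emerge -s %s\nemerge %s\n' "$2" "$2" ;;
        apt)     printf 'apt-cache search %s\napt-get install %s\n' "$2" "$2" ;;
        *)       echo "unknown package manager: $1" >&2; return 1 ;;
    esac
}

pkg_commands apt scipy
```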

Accessibility

iptables

iptables is a kernel-level firewall that blocks access to any port that has not been explicitly opened.

On the Wolverton clusters, all ports other than 22 (ssh), 25 (mail), 80 (http), 443 (https), and 3573 (DevMan[2]) are closed. The rule files are /etc/iptables.bak (josquin, byrd, palestrina) and /etc/sysconfig/iptable.save (victoria).

kaien@josquin ~$ sudo /sbin/iptables -L

Password:

Chain INPUT (policy ACCEPT)

target prot opt source destination

ACCEPT all -- anywhere anywhere

ACCEPT all -- anywhere anywhere

ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:smtp

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:http

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:https

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:3573

DROP all -- anywhere anywhere

Chain FORWARD (policy ACCEPT)

target prot opt source destination

DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337

Chain OUTPUT (policy ACCEPT)

target prot opt source destination

DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337
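For orientation, an INPUT chain like the one listed above could be built from commands along the following lines. This is a reconstruction from the listing, not a copy of our rule files: the loopback rule is an assumption about what one of the unqualified "ACCEPT all" lines stands for, and the function name print_input_rules is made up. The function only prints the commands; it does not change the firewall.

```shell
# print_input_rules PORT...
# Print (not run) iptables commands that accept loopback and established
# traffic, accept new TCP connections on the given ports, and drop the rest.
# print_input_rules is a hypothetical name; the "-i lo" rule is an assumption.
print_input_rules() {
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -m state --state NEW -j ACCEPT"
    done
    # The final catch-all rule must come last, since rules match in order.
    echo "iptables -A INPUT -j DROP"
}

print_input_rules 22 25 80 443 3573
```

Rule order matters here: a port accepted after the final DROP rule would never be reached, so a new port has to be inserted above it.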

Fail2ban

Fail2ban bans an IP address once it exceeds a set number of malicious attempts; it works by adding rules to iptables. The configuration file, jail.conf, is in the /etc/fail2ban directory.

kaien@josquin /etc/fail2ban $ sudo /sbin/iptables -L

Chain INPUT (policy ACCEPT)

target prot opt source destination

fail2ban-BadBots tcp -- anywhere anywhere multiport dports http,https

fail2ban-SSH tcp -- anywhere anywhere tcp dpt:ssh

ACCEPT all -- anywhere anywhere

ACCEPT all -- anywhere anywhere

ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:smtp

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:http

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:https

ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:3573

DROP all -- anywhere anywhere

Chain FORWARD (policy ACCEPT)

target prot opt source destination

DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337

Chain OUTPUT (policy ACCEPT)

target prot opt source destination

DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337

Chain fail2ban-BadBots (1 references)

target prot opt source destination

RETURN all -- anywhere anywhere

Chain fail2ban-SSH (1 references)

target prot opt source destination

RETURN all -- anywhere anywhere
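In the listing above both fail2ban chains hold only a RETURN rule, so nothing is banned. When an address is banned, a rule for it appears in the chain. A minimal sketch for pulling those addresses out of the listing, assuming bans show up as DROP or REJECT rules (which depends on the configured ban action); the function name banned_sources is made up.

```shell
# banned_sources
# Read `iptables -L CHAIN -n` output on stdin and print the source address
# (fourth column) of every DROP or REJECT rule.
# banned_sources is a hypothetical helper name.
banned_sources() {
    awk '$1 == "DROP" || $1 == "REJECT" { print $4 }'
}

# Example (as root): iptables -L fail2ban-SSH -n | banned_sources
```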

/etc/hosts.allow, /etc/hosts.deny

Access to the Wolverton clusters is allowed only from the IP addresses listed in /etc/hosts.allow. A group member's IP address can be added to that file to grant access.

#

# hosts.allow This file describes the names of the hosts which are

# allowed to use the local INET services, as decided

# by the '/usr/sbin/tcpd' server.

#

#sshd: *.northwestern.edu: allow

#sshd: phasepusan.metsce.psu.edu: allow

#

# encina

sshd: 129.105.92.49: allow

# byrd

sshd: 165.124.29.202: allow

# victoria

sshd: 165.124.29.204: allow

# morales

sshd: 129.105.12.20: allow

# guerrero

sshd: 129.105.12.19 : allow

# tallis

sshd: 165.124.29.197: allow

# quest

sshd: 165.124.130.5: allow

sshd: 165.124.130.6: allow

sshd: 165.124.130.7: allow

sshd: 165.124.130.8: allow
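Adding a group member's address in the same layout as the excerpt above could be scripted roughly as follows. The function name allow_ssh_from and its arguments are made up for illustration; it must be run as root against the real file, and editing the file by hand works just as well.

```shell
# allow_ssh_from LABEL IP [FILE]
# Append "# LABEL" and "sshd: IP: allow" to a hosts.allow-style file,
# unless an sshd line for that address is already present.
# allow_ssh_from is a hypothetical helper name.
allow_ssh_from() {
    label=$1
    ip=$2
    file=${3:-/etc/hosts.allow}
    # Escape the dots so the address is matched literally.
    pat=$(printf '%s' "$ip" | sed 's/\./\\./g')
    if grep -Eq "^sshd:[[:space:]]*${pat}[[:space:]]*:" "$file" 2>/dev/null; then
        echo "$ip is already listed in $file"
        return 0
    fi
    printf '# %s\nsshd: %s: allow\n' "$label" "$ip" >> "$file"
}

# Example (as root): allow_ssh_from "new member laptop" 192.0.2.7
```

The address in the example comes from a range reserved for documentation and should be replaced with the member's real address.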

Services

Linux provides services such as web and ssh to users. They can be started, stopped, restarted, or queried with:

$ /etc/init.d/service_name [start/stop/restart/status]

Web via apache2 server

$ /etc/init.d/apache2

(josquin/byrd/palestrina)

$ /etc/init.d/httpd

(victoria)
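Since the web server's init script is named apache2 on some machines and httpd on others, a script meant to run on every cluster could try each name in turn. A minimal sketch; the function name init_ctl and the INIT_DIR override are made up for illustration.

```shell
# init_ctl ACTION NAME [NAME...]
# Run ACTION (start, stop, restart, status) with the first of the named
# init scripts that exists and is executable.
# init_ctl and INIT_DIR are hypothetical names; INIT_DIR defaults to /etc/init.d.
init_ctl() {
    action=$1
    shift
    for name in "$@"; do
        script="${INIT_DIR:-/etc/init.d}/$name"
        if [ -x "$script" ]; then
            "$script" "$action"
            return $?
        fi
    done
    echo "no init script found among: $*" >&2
    return 1
}

# Example (as root): init_ctl status apache2 httpd
```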

SSH (Secure shell)

$ /etc/init.d/ssh

Nodewatch

$ /etc/init.d/ssh

Ganglia

$ /etc/init.d/gmond

(nodes)

$ /etc/init.d/gmetad

(master)

Pathscale subscription server

There is only one seat for the PathScale compiler suite, and encina serves as the license server. The license file is /opt/pathscale/lib/3.2/pscsubscription-7104.xml.

$ /etc/init.d/pathsub

(only on encina)