LODZ is currently located at lodz.ucsf.edu. LODZ is a MOSIX cluster, with all jobs running through the "head" node. Currently 56 jobs can be run in parallel across 12 machines (2 dual quad-core 2.27 GHz Xeons with 48 GB of memory, 9 quad-core 2.67 GHz i7s with 12 GB of memory, and 1 quad-core i7 overclocked to 3.8 GHz).
The whole purpose of using LODZ is to run your work in parallel with MOSIX. As such, it is built for jobs with moderate disk I/O. If you need to run something with heavy I/O, please contact us first for workarounds (e.g., putting the files on that node). For example, if you tried to load the same large file in 50 jobs simultaneously, the cluster might choke. It is generally to your advantage to chop your job up into a number of fine partitions. That way it can be split across the nodes, partitions that take longer will not hold up the others, and jobs will not fight each other for memory. If your jobs need lots of memory, you will probably want to submit only as many jobs as there are machines. If they need massive amounts of memory, note that the head node you log in to has the most memory, and such jobs should be run on that node.
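As a sketch of the fine-partition approach, the loop below generates one queued mosrun submission per partition. The script names (sim_part1.R, etc.) and the partition count of 20 are made up for illustration, and the echo makes this a dry run: swap it for eval (or paste the printed commands) to actually submit.

```shell
#!/bin/sh
# Hypothetical sketch: one queued mosrun job per partition.
# sim_partN.R are made-up script names; NPART=20 is an arbitrary choice.
NPART=20
i=1
while [ "$i" -le "$NPART" ]; do
    # Dry run: print the submission command instead of running it.
    echo "nohup mosrun -e -b -q R CMD BATCH sim_part${i}.R &"
    i=$((i + 1))
done
```

Because each partition is its own job, a slow partition only delays itself, and the queue spreads the rest across whichever nodes are free.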
The storage server apollonia.ucsf.edu has 60 TB of space, split into 3 pieces that are accessible from the head node at /data/vault1, /data/vault2, and /data/vault3. You can also transfer data directly to the storage server.
We ask that if you are about to tie up a good portion of LODZ for more than a few hours, you e-mail the other users to let them know. You should first run your code on a small chunk of your dataset and/or a small number of simulation iterations to estimate how long the full run will take.
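One rough way to get that estimate is to time a reduced run and scale up. In this sketch, the sleep stands in for something like `R CMD BATCH foobatch_small.R`, and the 100x factor is an arbitrary example of how much bigger the full job is than the small one.

```shell
#!/bin/sh
# Time a reduced run and extrapolate. `sleep 1` stands in for your small
# test run; SCALE=100 assumes the full job does 100x the work.
SCALE=100
START=$(date +%s)
sleep 1
END=$(date +%s)
SMALL=$((END - START))
echo "small run took ${SMALL}s; full run estimate: $((SMALL * SCALE))s"
```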
Feature 1: Java programs need special treatment; ask for details (they need -E, and the files must be present either on the storage node or on every cluster node; if on every cluster node, you must mount them specially).
Feature 2: The overclocked node had to be separated from the rest of the group because it did bad things to the job scheduler. You must log on to that node directly.
Feature 3: Our current version of MOSIX has a rather poor job scheduler, and it sometimes does nasty things. One example is queuing twenty jobs on the head node only; this typically confuses the scheduler, which decides you have not hit the 56-job limit and runs them all at once.
Use some SSH client. For Linux, this is just

ssh firstname.lastname@example.org

To do it with X forwarding:

ssh email@example.com -Y

and finally, to connect to a specific node, e.g. the overclocked node (you must create an account on it for yourself first):

ssh -oport=33 firstname.lastname@example.org

Getting files back and forth is just typing

sftp://email@example.com

in your file browser (e.g. Konqueror, Dolphin, Nautilus) if you happen to use Linux. For Windows, I would suggest two GPL pieces of software: WinSCP will let you transfer files back and forth to your Windows system, and PuTTY will let you connect to the remote system and run commands on it.
You should submit jobs with

nohup mosrun -e -b -q YOUR-JOB &

So for example, suppose we had the R script file "foobatch.R"; we might run this with

nohup mosrun -e -b -q R CMD BATCH foobatch.R &

or with

nohup mosrun -e -b -q R --vanilla < foobatch.R > foobatch.Rout1 &

The command "nohup" makes your command immune to hangups, so it will keep running if you log out. You must submit jobs with the "-q" option, which uses the queuing system, or your job will be killed. To kill all of your jobs, type

moskillall

To see what you currently have running (this will not work if you ran things with nohup and then logged out), type

mosps

or, to see what everyone is running,

mosps -u all

Sometimes running things with "nohup" and logging out can confuse this a little, but looking at what everyone is running should generally show your jobs even when they do not show up otherwise. In that case, also try "top -u username", which gives some other pointers.
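If the mosps output is still confusing after a nohup-and-logout cycle, counting the processes you own with plain ps (a generic Linux sketch, not a MOSIX command) can confirm whether your jobs are alive at all:

```shell
#!/bin/sh
# Generic Linux check (not MOSIX-specific): count processes you own.
# `id -un` resolves your username; --no-headers drops the ps header line.
ME=$(id -un)
ps -u "$ME" --no-headers | wc -l
```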
If you were running a job that needed to use the head node for memory reasons, you could, for example, run

nohup mosrun -e -3 -q R CMD BATCH foobatch.R &
To check the availability of LODZ, type

mon -t -V

to see how much of LODZ is in use.
You may find the R package "batch" useful for splitting your jobs in parallel and passing command-line arguments to R on the cluster.
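For instance, to fan one script out over several seeds, you might generate one submission per seed. This is a hedged sketch: the "--args seed N" convention assumes your foobatch.R actually reads its command-line arguments (e.g. with the batch package's parseCommandArgs()), and the echo makes it a dry run; replace echo with eval to actually submit.

```shell
#!/bin/sh
# Dry-run sketch: one queued job per seed, each with its own output file.
# Assumes foobatch.R picks `seed` off the command line (e.g. via
# batch::parseCommandArgs()); replace echo with eval to actually submit.
for seed in 1 2 3 4; do
    echo "nohup mosrun -e -b -q R CMD BATCH '--args seed ${seed}' foobatch.R foobatch_seed${seed}.Rout &"
done
```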