This section contains very simple generalities on how to perform basic everyday tasks to interface with the cluster system, such as accessing the system, copying files to the system and back, and general system navigation. Also covered are several ``best practice'' recommendations and general policies for using a cluster system.
This portion of the documentation assumes little knowledge of Linux/Unix or clusters in general. However, neither are these topics fully explored and explained here. If this is your first time with any Linux/Unix system, it is recommended that you seek additional tutorials and documentation to get up to speed. This section, though informative enough to start you working with a cluster system, does not contain the necessary details or fundamentals to work effectively in a Linux/Unix environment.
The preferred method of logging into a cluster is using ssh (secure shell). If using a Windows workstation you will have to use a third party ssh client such as PuTTY.
Here is an example which uses ssh to connect to the cluster ``axiom'' from the workstation ``legio.''
Example:
[jdpoisso@legio ~]$ ssh axiom.ccmb.med.umich.edu
jdpoisso@axiom.ccmb.med.umich.edu's password:
Last login: Thu May 27 12:53:15 2010 from www.bioinformatics.med.umich.edu
[jdpoisso@axiom ~]$
Important: Axiom may not be the name of your cluster, please ask your advisor or technical contact which cluster you should be using.
Note that when connecting to the cluster it may be necessary to use its full name, as shown above. The full name of the cluster ``axiom'' is ``axiom.ccmb.med.umich.edu.'' Please see your technical staff or supervisor to find the full name of your cluster system.
Also notice, when logging into a cluster system, you are asked for a password. Different clusters have different password policies, so some may accept your campus uniqname and password, others may have a different set of usernames and passwords. Again, see your technical staff or supervisor for details regarding your login and password.
The detailed usage of Linux, Windows, or alternative ssh clients is beyond the scope of this document; please reference the appropriate documentation for this software.
The preferred method of copying files to a cluster is using scp (secure copy). From a Linux workstation you may use this command to copy files to and from the cluster system. If using a Windows based system, there are third party utilities, such as WinSCP, that you may use to copy files. The usage of this utility, however, is not covered in this document.
Copying to the cluster -
Here is an example of copying the file foo from workstation legio to the cluster axiom.
Example:
[jdpoisso@legio ~]$ scp foo axiom.ccmb.med.umich.edu:~
jdpoisso@axiom.ccmb.med.umich.edu's password:
foo                               100%   13KB  13.3KB/s   00:00
[jdpoisso@legio ~]$
The file foo has just been copied from workstation legio to the home directory of jdpoisso (represented by the ~) on axiom. Notice that, as with ssh, a password is requested, and then the progress of your file copy is shown on subsequent lines before returning you to your command prompt.
This command, however, assumes you are copying to your home directory on the axiom cluster. What if you wanted to copy the file somewhere different, say to /tmp? In this case you can change the part of the command after the colon, as in this example.
Example:
[jdpoisso@legio ~]$ scp foo axiom.ccmb.med.umich.edu:/tmp
jdpoisso@axiom.ccmb.med.umich.edu's password:
foo                               100%   13KB  13.3KB/s   00:00
[jdpoisso@legio ~]$
So far the examples have only demonstrated the copying of a single file. What if, instead of a single file, you want to copy a whole directory or folder? This is accomplished with an argument to the scp command. For basic purposes, an argument may be defined as any additional data appended to a command, where each argument is separated by white space. So the command:
scp foo axiom.ccmb.med.umich.edu:/tmp
has two arguments to the scp command: ``foo'' and ``axiom.ccmb.med.umich.edu:/tmp''. Most commands in Linux/Unix accept arguments, and tend to expect them in a certain order. The scp command expects the last argument on the command line to be the destination of the file copy, while the preceding argument is generally considered to be the source file. You may also give the scp command an optional argument of the form ``-r'' which, if present, will copy any and all directories found at the source.
Example:
[jdpoisso@legio ~]$ scp -r data axiom.ccmb.med.umich.edu:~
jdpoisso@axiom.ccmb.med.umich.edu's password:
interface.c                       100% 2730    2.7KB/s   00:00
architechture.c                   100% 1214    1.2KB/s   00:00
DSCF1692-1.jpg                    100%   61KB 61.1KB/s   00:00
architechture.h                   100% 7895    7.7KB/s   00:00
main.c                            100%  420    0.4KB/s   00:00
alcove.c                          100%  971    1.0KB/s   00:00
interface.h                       100%  182    0.2KB/s   00:00
alcove.h                          100% 6781    6.6KB/s   00:00
[jdpoisso@legio ~]$
Notice now that multiple files are copied. This is because the source, data, rather than being a single file, was a directory. The -r after the scp was an argument that directed the scp command to copy from the source, all directories and all contents in those directories, to the destination location.
Assuming you've read the previous section on how to copy files to the cluster, the reverse, copying files back from the cluster, is easy considering the information given.
Here is an example copying the file foo from the home directory on axiom to the home directory on legio.
Example:
[jdpoisso@legio ~]$ scp axiom.ccmb.med.umich.edu:~/foo ~
jdpoisso@axiom.ccmb.med.umich.edu's password:
foo                               100%   13KB  13.3KB/s   00:00
[jdpoisso@legio ~]$
Notice that this scp command is little more than a reversal of two arguments. This is because, as mentioned before, the last argument to an scp command is the destination, while the argument that precedes it is the source. Since we want our destination to be the home directory on legio, we simply put ``~'' in as the destination. In Linux/Unix, ``~'' is a shorthand representation for your home directory, and since the scp command is being run on legio, without any directives to the contrary, it will copy the file to the home directory on legio. Also notice that with scp, files on the cluster may be used as a source. The command will ask for your password, log into the cluster to find the file at the location specified, and copy it to the destination if found. You may even still use the ``-r'' argument if the source on the cluster is a directory!
As you should be able to deduce, the scp command is sensitive to location, meaning the computer you are currently logged into. The command will automatically assume, unless there is instruction otherwise, that the source and destination are on the computer you are currently logged into. In the preceding examples, we specified a computer other than the one we are currently logged into by typing its full name followed (without spaces) by a colon. We then write out the source location as we would normally. So ``~/foo'' means the file foo in the home directory. We could instead write ``/tmp/foo'' to mean the file foo in the tmp directory. Using this format you can copy to and from any location (which you have access to) on the cluster system.
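Putting these pieces together, here is a short sketch of some common variations. The cluster name follows the examples above; the file results.txt and the directory project are hypothetical names used for illustration only.

```shell
# Copy /tmp/foo on the cluster back to /tmp on the local workstation.
scp axiom.ccmb.med.umich.edu:/tmp/foo /tmp

# Copy a local file into a (hypothetical) project directory under your
# home directory on the cluster.
scp results.txt axiom.ccmb.med.umich.edu:~/project/

# Copy a directory from the cluster into the current local directory (.).
scp -r axiom.ccmb.med.umich.edu:~/data .
```

In each case the rule is the same: the last argument is the destination, and a colon after a system's full name marks a location on that system.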
So far this document has focused on clearly defined instructions on how to log into the clusters, and a means by which to get your data to and from the cluster system. These instructions, however, do not cover all circumstances, and depending on various factors may not work in one place the way they work in another. Neither has it been explained exactly where you're logging into, and where you are copying files to and from. Understanding some of the underlying structure and details of the system will help you understand the limitations of the system and why things may not always work the same in different places.
Also in this section, if you're short of Linux/Unix knowledge, there is a set of examples for basic commands to help you navigate the system. These examples are not complete, and it is recommended that you pursue other means to improve your knowledge of these systems.
A classic cluster is essentially a number of computers grouped together in a manner that allows them to share infrastructure, such as disk space, and work together by sharing program data while those programs are running. However, this simple definition, though accurate, does not really capture the full capability of a modern cluster system, as it excludes a very important concept. This concept, which has been developed to essentially become the core of clustering in general, is the scheduling system. The functional purpose of the scheduling system is to eliminate the need to know what individual computers are doing.
When presented with multiple computers, you do not know what they are doing without individually checking them. Anything could be running on them, by anybody who has access to them. If you want to run a program, you would have to check each computer to see which, if any, has enough available resources (disk space, processors, memory) to run your program. Not only is it inconvenient to manually check each computer, but if none of them have any available resources, then you will be forced to check again (manually) at a later time.
A scheduling system removes this need. By aggregating data and monitoring its system, a scheduler will keep an accurate and up to date picture of what resources are available and where. Even beyond tracking resources, a scheduler will allow you to submit instructions for running your program, and then run your program on your behalf once the necessary resources are available.
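As a sketch of what this looks like in practice, the commands below assume a PBS/Torque-style scheduler; your cluster may run a different scheduler with different commands, so ask your technical staff which one is installed. The script name myjob.sh is a placeholder.

```shell
# Submit a job script; the scheduler queues it and runs it on your
# behalf once a compute node with the requested resources is free.
qsub myjob.sh

# Check the status of your own queued and running jobs.
qstat -u $USER
```

The important point is not the particular commands, but that you hand your instructions to the scheduler once and it handles the waiting and the placement for you.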
Knowledge of your storage system is an often overlooked, but critical aspect of any cluster system. For most users, it may be enough to see that there is space, and therefore they use it accordingly. Unfortunately this is a mistake that often has a radical impact on the performance of the cluster system, causing it to slow or fail, and by extension slowing and failing everything that runs on that cluster.
The key thing to realize is that there are different kinds of storage, and different storage types are optimized and fit for particular purposes. When it is time for you to use a cluster system, it is important to understand the demands of your programs with regards to storage. Consider whether you have needs for fast storage, local storage, shared storage, long term storage, short term storage, or large amounts of storage.
Not all cluster systems have all types of storage available to them. It's not unusual for a cluster to have only a few kinds of storage, or to be designed primarily around a single type of storage with other types being merely ancillary. For example, the umms-amino cluster is optimized for fast file access and reads, and is built around a fast storage system, while the axiom general cluster (at the time of this writing) has a solution that is engineered for data volume rather than performance. Most cluster systems have local scratch available (storage located on each compute node, i.e. not shared; in most local systems this storage is located at /tmp), however some may not, and rely entirely on a shared storage system.
As any shared space can be saturated by the number of programs running on a cluster system, any need to access or write files to a shared space (like your home directory) can be affected by other jobs accessing that same general location. A busy cluster system may have hundreds, if not thousands, of jobs running on it at any given time. If you place a thousand cars at a traffic intersection, no matter how well the intersection is designed or how many lanes there are, it will still take some time for all those cars to clear the intersection.
Using storage on a cluster system is often a transparent process. The details of what the storage is, where it is, and what its available features are is often not readily apparent. As you change directories on any individual node in the cluster system, you may move seamlessly between different storage systems, each with different characteristics. For example, home directories (more on these later) are often on a shared storage system, but this may not be made apparent to you without querying the system information and/or asking a system administrator. In most cases you will be directed to a preferred location (such as /tmp) from which to run your program, and it will be up to you to take advantage of this location when you submit jobs to the scheduling system.
Note: When you log into a cluster system you typically start out in your home directory. The home directory is usually a shared location which you can use to set up your personal configuration and preferences, as well as test and build programs to run on the cluster. The previously mentioned guidelines for a preferred run location apply mostly to program data, and not the programs themselves. If the programs (jobs) that you plan on submitting to the cluster read specific data files (databases or the like) or write out large data files (building and assembling a large data structure or time step analysis), it is best for this data to be copied or staged in preferred locations according to any guidelines given to you (i.e. like /tmp).
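As a sketch of what such staging might look like inside a job script: the scratch path and file names below are placeholders, and the ``real'' computation is stood in for by wc so the sketch is self-contained.

```shell
#!/bin/sh
set -e
# Stage data into local scratch (/tmp), work there, then copy the
# results back to shared storage (the home directory).
SCRATCH="/tmp/scratch-demo.$$"        # a per-job directory on local scratch
mkdir -p "$SCRATCH"

# Stand-in for staging a real input file from shared storage.
echo "input data" > "$SCRATCH/mydata.txt"

# Stand-in for the real computation: count the lines in the input.
wc -l < "$SCRATCH/mydata.txt" > "$SCRATCH/results.out"

cp "$SCRATCH/results.out" "$HOME/"    # copy results back to shared storage
rm -rf "$SCRATCH"                     # clean up local scratch when done
```

The pattern — stage in, compute locally, copy results back, clean up — keeps the heavy file traffic on the fast local disk instead of the shared storage system.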
Typically when you access a cluster system you are accessing a head node, or gateway node. A head node is set up to be the launching point for jobs running on the cluster. When you are told or asked to log in or access a cluster system, invariably you are being directed to log into the head node.
A head node is often nothing more than a simply configured system that acts as a middle point between the actual cluster and the outside network. When you are on a head node, you are not actually running or working on the cluster proper. However, all the tools and necessary programs are made available to test and then submit your programs to the cluster, as well as view and manage your data. The purpose of this arrangement is to keep your cluster system separate and distinct from other systems and to ``appear'' as a single system rather than an aggregation of many individual systems. This has the effect of improving access, usage, and efficiency by having all interactions filtered and managed through a few designated points. In theory, you should never have any need to access any of the individual compute nodes of a cluster system, instead relying on the head node and the tools it provides to submit jobs to the cluster, monitor them, and view and retrieve their results.
Note: From a best practices standpoint, although the head nodes are often set up to provide a point of access and testing of the programs you want to run on a cluster system, ideally you do not want to run computational programs on the head node itself. Specifically, any programs you want to run on the cluster should not be run on the head node; instead they should be run and tested on the cluster itself using the scheduling system tools on the head node. You should restrict your usage of the head node to programs that let you build and prepare your cluster programs and manage and view your data. In many cluster systems, there are resources (cluster nodes) explicitly reserved for testing purposes.
Although in theory you may have no need to access any of the individual compute nodes, in practice there are many potential reasons you may need to access a given compute node. You may need to monitor the actual data being generated by a program, which may not be readily apparent from the head node, especially if your program is using local scratch. Also, it is generally accepted that computers are fallible things, and in the event of any errors, the scheduling system or your script may leave data behind on a compute node that you need to retrieve.
In cluster systems, access to the compute nodes is restricted to the head node or gateway nodes and other compute nodes. Therefore, access to the compute nodes usually starts from the cluster head node, and accessing any particular compute node is as simple as using ssh to connect to it.
Example:
[jdpoisso@axiom ~]$ ssh compute-0-4
[jdpoisso@compute-0-4 ~]$
Note: You may not be asked for a password when using ssh from the head node to a compute node. In most cluster configurations the compute nodes are configured to ``trust'' the head node, so if you are asked for a password it may represent a problem with the system or your account.
Note: The names of the individual compute nodes vary from cluster to cluster, but most clusters are based on a particular schema, such as greek letters (alpha, beta, gamma, delta, etc...) or numbers (one, two, three, four, etc...). In the axiom and amino clusters, the node names have a reference to their physical locations. So compute-0-4 would be the fourth system in rack zero. Some clusters may also have shorthand names, compute-0-4 could also be called c0-4 on the amino cluster.
Most computer systems have names: axiom, amino, legio, or compute-0-4. When you want to interact with these particular systems you contact them by name. You may notice, though, that not all names work in all places. For example, legio may know axiom and amino by name, but axiom and amino do not know legio and are unable to establish a connection back to that workstation.
It is important to realize that not all systems know the names of other systems (or may know them by different names). Conceptually, you should think of each system as having the computer equivalent of a personal address book, in which are the names and numbers of all the systems that it needs to talk to. Once it calls a system (using ssh or scp), it's like any phone conversation, and information can be passed back and forth on that phone line until the connection is broken. Unfortunately, not all systems have everyone in their address books. Just like in real life, if an address book is too large, it can take a (relatively) long time to find what the system is looking for. This leads to situations where legio might have axiom's number, but axiom may not have legio's number.
Practically, this concept is important to understand when you want to copy data from the cluster to another location: a workstation, a backup system, or another cluster. One system may not know the name of another system on a first name basis. So when you try ``scp data.file legio:~'' on axiom, axiom will return an error saying it can't find legio. In these cases, most often you need to look at a phone book (in computer terms, DNS), and to find someone in the phone book you need their full name. As briefly shown above, the full name of ``axiom'' is ``axiom.ccmb.med.umich.edu''. So if a system doesn't know axiom by first name, it may have more luck with its full name. It's also important to remember that, just like in real life, most computer phone books (DNS) are for local areas. So while a DNS server may have all the entries for everything in its area, and be able to look up most internet names, there may not be a listing for a system located in another area. In real life a phone book for Washtenaw county doesn't have a listing of all the telephone numbers of people in Wayne county, even though the two areas are neighbors.
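A short sketch of how this plays out at the command line: host is a common DNS lookup tool you can use to consult the phone book, and the failing copy can be retried with a full name. The name legio.example.edu below is a hypothetical full name; ask your technical staff for your workstation's real full name.

```shell
# Ask DNS (the "phone book") what it knows about a full name.
host axiom.ccmb.med.umich.edu

# Retry the copy using the workstation's full name rather than its
# short name (legio.example.edu is a placeholder, not a real name).
scp data.file legio.example.edu:~
```

If the full name is not listed in the cluster's DNS either, the copy must be started from the other direction, i.e. run from the workstation with the cluster as the source.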
Commands
The screen command is a very powerful and useful command. As you work with a cluster system you may notice that if you lose your connection, or close the terminal you are working in, whatever you were doing on the cluster is terminated (this is not true of jobs submitted to the scheduling system; they are held separately and are not terminated if you log out of the cluster). So if you were copying files and waiting for the process to finish and accidentally closed your terminal, the file copy is terminated and you will have to restart it. This is why the screen command is so useful. After the ``screen terminal'' has been activated, anything in that terminal stays running even if you log out.
Simple usage of the screen command is easy; however, it may not be readily apparent that it's active.
Example:
[jdpoisso@legio ~]$ screen
<- Screen clears ->
[jdpoisso@legio ~]$
The screen clears itself and it appears as if nothing happened. However, you are now in the ``screen terminal,'' which is functionally equivalent to any other terminal, except that if the window is closed, or you detach from it, this terminal will remain active.
You may go ahead and use this terminal normally. If you're waiting for something to complete you can detach, which leaves the screen terminal active but removes it from view and protects it in case of a connection loss. To detach, press ``<Ctrl-A>-D'' (this is a key sequence: press and hold Ctrl, press A, release both keys, then press D). If done properly you will see this.
Example:
[detached]
[jdpoisso@legio ~]$
To reattach, or to bring the screen terminal back up, invoke ``screen -r'' and this will show your screen terminal exactly where it left off.
Note: The screen terminal may be exited like any other terminal, by using the ``exit'' command. This destroys the screen terminal and exits the screen program.
Note: screen may not be installed by default on your cluster system. Ask your system administrator to install it for you.
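One further convenience worth knowing: screen sessions can be given names, which helps when you have more than one running. A short sketch (the session name ``copyjob'' is arbitrary, chosen here for illustration):

```shell
screen -S copyjob     # start a screen session named "copyjob"
# ... start a long-running command, then press Ctrl-A then D to detach ...
screen -ls            # list your sessions and whether they are attached
screen -r copyjob     # reattach to the named session
```

With unnamed sessions, ``screen -r'' alone is ambiguous once you have several; naming them makes it clear which one you are reattaching to.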