Before you start using the cluster, you should know a bit about how it is organised so you can avoid using the machines incorrectly.
The simplified picture below illustrates how the cluster is organised.
The picture shows a user accessing the cluster through the login1 or login2 hosts. From there, the user can reach the compute nodes through the scheduler. In addition, the storage node provides shared storage resources (/user and /pack), which are mounted on all nodes. The /user folder contains all user home folders; for example, if your user name is michael, your files are accessible throughout the cluster under /user/michael/. The /pack folder contains software shared between all compute nodes, one example being the Open MPI library commonly used for communication in clusters. Finally, there is one more special folder, /scratch. This is a local folder on each compute node and should be used for writing intermediate results before collecting them on the storage node (by copying the data to your home folder).
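As a minimal sketch of this workflow, a job could write its intermediate results to the node-local /scratch folder and copy them to the home folder in a single transfer at the end. The variable names below are illustrative, not cluster conventions, and the paths fall back to temporary directories so the sketch can run outside the cluster:

```shell
# Intended /scratch workflow, sketched with hedged paths: on the cluster
# these would be /scratch and /user/<username>; temporary directories are
# used as a fallback so the sketch runs anywhere.
SCRATCH_DIR="${SCRATCH_DIR:-$(mktemp -d)}"
HOME_DIR="${HOME_DIR:-$(mktemp -d)}"

# 1. Write intermediate results to the fast node-local scratch partition.
for i in 0 1 2; do
    echo "intermediate result $i" > "$SCRATCH_DIR/part-$i.txt"
done

# 2. When the job finishes, collect the results in the home folder on the
#    storage node in one transfer instead of many small network writes.
cp "$SCRATCH_DIR"/part-*.txt "$HOME_DIR/"

ls "$HOME_DIR"
```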
For the MCC cluster we have the following hardware:
- 9 compute nodes, each with 1 TB of memory, 64 cores (4 AMD Opteron 6376 processors, each with 16 cores) and a 1 TB SATA disk
- Three of the nodes will also have a 500 GB SSD installed
- Two frontend machines for login and administration
- One storage machine with 12 TB of partitioned disk storage
Each of the compute nodes has 1 TB of memory and 4 CPUs (AMD Opteron 6376). Each CPU has 16 physical cores running at 2.3 GHz, so in total each node has 1 TB of memory and 64 cores. As with all modern 4-CPU systems, the machines use a NUMA architecture. The picture below shows how the components are connected:
This means that each CPU has fast access to 256 GB of local memory and somewhat slower access to memory that is local to the other CPUs. To get good performance it is therefore paramount that a task primarily uses memory local to the CPU it runs on. It should also be mentioned that PCIe devices are attached to the AMD HyperTransport links but are bound to one of the CPUs; it can be beneficial to use that CPU for low-latency communication.
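A common way to enforce this locality is the numactl tool, which is assumed here to be installed on the compute nodes; the echoed message below stands in for a real task binary:

```shell
# Bind a task's cores AND its memory allocations to NUMA node 0, so it
# only touches the 256 GB that is local to CPU 0. The sh -c 'echo ...'
# is a placeholder for a real task.
if command -v numactl >/dev/null 2>&1; then
    RESULT=$(numactl --cpunodebind=0 --membind=0 sh -c 'echo "task bound to NUMA node 0"')
else
    # numactl is not installed on this machine; report the intent instead.
    RESULT="numactl not available on this machine"
fi
echo "$RESULT"
```

Without the --membind flag, Linux's default first-touch policy usually places pages on the node where the task first writes them, but binding both cores and memory makes the placement explicit.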
All compute nodes have a 40 Gbit/s QDR InfiniBand host channel adapter installed (a high-speed, low-latency network card); besides that, they are interconnected with 1 Gbit/s Ethernet. The 1 Gbit/s Ethernet is used for file transfer to avoid noise in the InfiniBand communication, which would otherwise make benchmarking of InfiniBand communication difficult. The storage node is connected to the switch with a 10 Gbit/s Ethernet link, so that every node can write at 1 Gbit/s simultaneously. However, since the disk storage system only has a sustained write speed of roughly 620 MB/s, it is better to write intermediate results to the /scratch partition, which has a sustained write speed of 160 MB/s. This should be compared to a maximum write speed of 103-104 MB/s (limited by the 1 Gbit/s Ethernet) to the storage system (your home directory). If more than 6 nodes write to the storage system simultaneously, the per-node write speed falls below 103-104 MB/s, dropping to about 68 MB/s when all nodes write simultaneously for a sustained period.
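These figures are consistent with a simple back-of-the-envelope check (a sketch using only the numbers quoted above): with the storage system sustaining about 620 MB/s in total, six concurrent writers each get roughly the 103 MB/s that a 1 Gbit/s link can deliver, so beyond six nodes the disks, not the network, become the bottleneck.

```shell
# Divide the storage system's sustained write speed (~620 MB/s) among
# concurrent writers; awk is used for floating-point arithmetic.
for nodes in 6 9; do
    awk -v n="$nodes" 'BEGIN { printf "%d writers: %.1f MB/s per node\n", n, 620 / n }'
done
# prints:
# 6 writers: 103.3 MB/s per node
# 9 writers: 68.9 MB/s per node
```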
Details for the interested reader:
The MCC hardware is located at two sites. The hardware listed above constitutes the administration, storage, and compute nodes and is located in dc2, where access is only possible together with a person from ITS (e.g., Mads Boye); see the picture below. Besides that, we also have a backup storage server and machines used for various other purposes. These machines are older and need more frequent physical maintenance, so they are located in Forrummet 0.1.84, where we can get access ourselves.
- 3 Dell MD 1000 direct-attached storage systems, one of which we use for production
- 4 Dell PowerEdge 2950 servers with two quad-core CPUs (8 cores in total) and 32 GB of memory
- 1 Dell PowerEdge 1950 server with two dual-core CPUs (4 cores in total) and 4 GB of memory
One of the Dell PowerEdge 2950 servers is used for the backup storage system together with a Dell MD 1000 storage system. The three other PowerEdge 2950 servers are used for various projects. The PowerEdge 1950 is used for monitoring and various other maintenance tasks.