AMD GPU Coolness

Déjà Vu

It may come as no big surprise that, when trying to rig a GPU server using 4 AMD/ATI FirePro3D V8800 cards, I ran into issues very similar to - although not as difficult to work around as - what I experienced earlier this year with GPU hardware from Nvidia.

Background

When using GPUs for computing purposes, there is always a lot of concern about the heat that the GPUs generate, especially when operating machines with several densely packed "Desktop GPUs" in a single case in a data center environment. Desktop cards are usually set up under the assumption that they are fully used only on occasion and should otherwise stay quiet, thus trading the risk of higher wear and increased failure rates through higher temperatures for less noise. In a data center, however, noise is no concern, but proper airflow is. As a result, the GPUs run hotter than they need to. Most workstation cases offer a "Performance" setting for the case fans, yet a similar option for the GPUs is missing.

What is different for AMD GPUs?

The biggest difference when setting up AMD GPUs for GPU computing vs. Nvidia GPUs is that AMD GPUs (still) require a running X server with all GPUs configured, set up so that any client can access it (runlevel 5 won't work). By the same token, that requirement also means that you can configure all GPUs at the same time, and furthermore the vendor-provided driver allows you to operate the X server without any physical display connected. This reduces the number of complications that one needs to deal with in order to set the GPU fans to produce a stronger airflow.
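
A minimal sketch of that one-time setup, assuming the Catalyst driver's aticonfig tool and the stock xhost utility are installed, and that the X server will run on display :0 as shown below:

# generate an /etc/X11/xorg.conf entry for every installed adapter;
# -f overwrites an existing config file
/usr/bin/aticonfig --initial -f --adapter=all
# once the X server is up, allow all local clients to connect to it
# (only needed if accounts other than the X server's owner run GPU jobs)
env DISPLAY=:0 /usr/bin/xhost +local: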

The Remaining Problems

Starting the X Server Without a Console

Just launching the X server with "X :0 &" in /etc/rc.local unfortunately seems to lead to the X server desperately searching for a console to attach itself to, and thus consuming 100% of a CPU core. Not exactly an ideal scenario. However, this can be alleviated with two tricks: explicitly telling the X server to attach itself to an otherwise unused Linux console, and redirecting its input and output from/to /dev/null. The resulting command in /etc/rc.local thus becomes:

/usr/bin/Xorg :0 vt9 < /dev/null > /dev/null 2> /dev/null &

Setting the Fan Speed

The AMD drivers come with a very powerful setup tool, aticonfig, that allows you to tweak many settings. However, setting the fan speed is not a documented feature. A little time searching the web leads to the information that there is an undocumented flag --pplib-cmd that provides the needed functionality by accepting command strings. One complication is that aticonfig --pplib-cmd "set fanspeed 0 85" will happily set the speed of a GPU fan, but only for the first GPU, and the assumption that aticonfig --pplib-cmd "set fanspeed 1 85" will set the fan on the second GPU is incorrect. That would set the speed of a - hypothetical - second fan on the same GPU. Rather, the selection of GPUs has to be done via the DISPLAY environment variable, where a display of :0.0 refers to the first configured GPU, :0.1 to the second, and so on. Our code in /etc/rc.local, to be executed after the X server is launched, thus becomes:

# loop over the X screens, i.e. the four configured GPUs, and set
# each GPU's (first) fan to 85% speed
for s in 0 1 2 3
do
  env DISPLAY=:0.$s /usr/bin/aticonfig --pplib-cmd "set fanspeed 0 85"
done
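
To check the result, the same undocumented pplib interface reportedly also accepts a matching query command; this is an assumption based on the same web sources, not on official documentation:

# query the current speed of the first fan on the first GPU
env DISPLAY=:0.0 /usr/bin/aticonfig --pplib-cmd "get fanspeed 0"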

Making it stick

So far, so good. Sadly, the command from above doesn't work very well. Only some of the GPU fans will be adjusted, and when launching OpenCL programs using the GPUs, sooner or later the entire effect is lost. Same as in the Nvidia case, there is obviously a driver re-initialization at work that resets all previous settings as soon as the last "context" accessing the GPU is removed. So we need to create some GPU context and keep it alive for as long as the machine is running, but without interfering with GPU-accelerated applications. A very simple OpenCL application can help us here (the full source code is attached below). This can be easily lifted from OpenCL example codes that show how to query what devices are available. The key part is the following lines, which open a context for all supported GPU devices and then enter an infinite loop of sleeps:

// open a context spanning all GPU devices of the first platform
cl_context_properties cprops[3] =
    {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
cl::Context context(CL_DEVICE_TYPE_GPU, cprops, NULL, NULL, &err);
cl::vector<cl::Device> devices;
devices = context.getInfo<CL_CONTEXT_DEVICES>();

// keep the context alive indefinitely
while (1) sleep(600);
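
Building the attached source into the ocl_context_hack.x binary used below should only require the OpenCL headers and library; a sketch, assuming they are installed in default locations (otherwise add the matching -I/-L flags for your AMD APP SDK install):

g++ -o ocl_context_hack.x ocl_context_hack.cpp -lOpenCL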

Putting it All Together

The final script code in /etc/rc.local now becomes:
# start the X server on an unused console, detached from stdin/stdout
/usr/bin/Xorg :0 vt9 < /dev/null > /dev/null 2> /dev/null &
sleep 10

# create a persistent GPU context so the fan settings stick
/sbin/ocl_context_hack.x < /dev/null &
sleep 5

# set the (first) fan of each of the four GPUs to 85% speed
for s in 0 1 2 3
do
  env DISPLAY=:0.$s /usr/bin/aticonfig --pplib-cmd "set fanspeed 0 85"
done

The Result

Same as with the Nvidia cards, GPU temperatures are significantly lower on the GPU servers with 4 GPUs. Also in this case, the GPU temperature under full load drops on average by about 20C, from around 75C with the default settings to less than 55C with the fan speed set to 85%. This is quite a pleasing outcome for this little hack.
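
The temperatures themselves can be read back through aticonfig's overdrive interface; the --odgt flag queries the on-die thermal sensor, and like the fan commands it needs the running X server to talk to:

# print the current temperature of every configured GPU
env DISPLAY=:0 /usr/bin/aticonfig --odgt --adapter=all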

Attachment: ocl_context_hack.cpp (2k) - Axel Kohlmeyer, Nov 24, 2011