Deja Vu

It may come as no big surprise that, when trying to rig a GPU server with 4 AMD/ATI FirePro3D V8800 cards, I ran into issues very similar to those I experienced earlier this year with GPU hardware from Nvidia, although this time they were not as difficult to work around.

Background

When using GPUs for computing purposes, there is always a lot of concern about the heat that the GPUs generate, particularly when operating machines with several densely packed "desktop GPUs" in a single case in a data center environment. The cards are usually set up on the assumption that they are fully used only on occasion and otherwise stay quiet, thus trading the risk of higher wear and increased failure rates through higher temperatures for less noise. In a data center, however, noise is no concern, but proper airflow is. As a result, the GPUs run hotter than they need to. Most workstation cases offer a "Performance" setting for the case fans, yet a similar option for the GPUs is missing.

What is different for AMD GPUs?

The biggest difference when setting up AMD GPUs for GPU computing vs. Nvidia GPUs is that AMD GPUs (still) need you to run an X server with all GPUs configured, and to set the X server up so that any client can access it (runlevel 5 won't work). By the same token, that requirement also means that you can configure all GPUs at the same time. Furthermore, the vendor-provided driver allows you to operate the X server without any physical display connected. This reduces the number of complications that one needs to deal with in order to set the GPU fans to produce a stronger airflow.

The Remaining Problems

Starting the X Server Without a Console

Just launching the X server with "

/usr/bin/Xorg :0 vt9 < /dev/null > /dev/null 2> /dev/null &

Setting the Fan Speed

The AMD drivers come with a very powerful setup tool,
Making it stick

So far, so good. Sadly, the command from above doesn't work very well. Only some of the GPU fans are adjusted, and once OpenCL programs are launched on the GPUs, sooner or later the entire effect is lost. As in the Nvidia case, there is evidently a driver re-initialization at work that resets all previous settings as soon as the last "context" accessing the GPU is removed. So we need to create a GPU context and keep it alive for as long as the machine is running, but without interfering with GPU-accelerated applications. A very simple OpenCL application can help us here (the full source code is attached below). It can easily be lifted from OpenCL example codes that show how to query which devices are available. The key lines are the following, which open a context for all supported GPU devices and then enter an infinite loop of sleeps:
Putting it All Together

The final script code in /etc/rc.local now becomes:
The Result

Same as with the Nvidia cards, GPU temperatures are significantly lower on the GPU servers with 4 GPUs. In this case, too, the GPU temperature under full load drops on average by about 20 °C, from around 75 °C with the default settings to less than 55 °C with the fan speed set to 85%. This is quite a pleasing outcome of this little hack.

