xenomai-user threads (Athlon)

Note: In the following, I use "cpu" and "core" pretty much interchangeably.

A particularly interesting configuration supported by Michael Haberler's single codebase is based on Xenomai with realtime threads in userspace. Michael has made good arguments for the desirability of this configuration in his messages to the emc-developers list. I won't repeat them here.

Late on 20121111, I pulled the rtos-integration-preview1 branch from Michael's git repository and built LinuxCNC 2.5 on a 4-core system consisting of an ASUS M4A88T-M motherboard, 3GHz AMD Athlon II X4-640 processor, 4GB RAM, and onboard Radeon 4250 graphics. The operating system was Ubuntu 10.04 LTS running on the 3.2.21-xenomai+ kernel available on Michael's website. Once in the src directory, the build proceeded as "./autogen.sh; ./configure --with-threads=xenomai-user; make; sudo make setuid". It's a joy to be able to make quick work of the build on a 4-core system by adding "-j8" to make.
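Spelled out as a shell session (the checkout directory name here is assumed; the branch and configure flag are as given above):

    cd linuxcnc/src
    ./autogen.sh
    ./configure --with-threads=xenomai-user
    make -j8            # two make jobs per core on the 4-core Athlon
    sudo make setuid    # give the realtime executables the privileges they need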

The build proceeded without incident.

I rebooted into the 3.2.21-xenomai+ kernel with the boot cheat code "isolcpus=2,3" in order to isolate two of the four cores from the Linux scheduler (see the Wiki to understand why this might be beneficial). The BIOS options Cool 'n' Quiet, C1E Support and Advanced Clock Calibration were all disabled in order, I hope, to minimize potentially latency-inducing behavior.
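For anyone wanting to replicate this, one way to add the cheat code on a stock Ubuntu 10.04 install (which uses GRUB 2; adjust to taste if your bootloader differs) is:

    # in /etc/default/grub, append isolcpus=2,3 to the kernel command line:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3"
    # then regenerate the boot menu and reboot:
    sudo update-grub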

I wanted to determine how the performance of this system compared to the performance of a traditional LinuxCNC based on RTAI. In particular, I wanted to know how much jitter was present in realtime-thread timings for the two configurations. As well, I wanted to try to gain an understanding of the effect of spreading realtime processes over the multiple cores.

A latency-test script is provided in the LinuxCNC distribution. In a multiple-core system built on RTAI, the two realtime threads in this script start on the last, typically isolated, core. I have previously posted results to the Latency-Test page of the Wiki. In summary, with the candidate computer and RTAI threads, I was consistently able to obtain max-jitter values of 4.4us and 2.4us on the servo (1ms) thread and the base (25us) thread respectively.
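For anyone who hasn't run it: the script takes optional base and servo periods and defaults to the 25us/1ms combination used here. As I recall the periods accept unit suffixes, but check the script in your own tree:

    latency-test              # defaults: 25us base thread, 1ms servo thread
    latency-test 50us 1ms     # hypothetical alternate base/servo periods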

Since Michael has modified the loadrt HAL command so that one can assign a realtime thread to a specific CPU (or assign a CPU to a thread, whatever), I was able to modify the latency-test script to measure maximum jitters for all 16 combinations of core and assigned thread (a sketch of such an assignment appears after the hog snippet below). In order to get reasonably repeatable results but not take forever to do the tests, I ran timed 2-minute tests for each combination, always with a 1ms servo thread and a 25us base thread. I performed the 16 tests first under "noload" conditions, i.e., with nothing other than the usual system processes running, and then with "CPU hogs" running on the scheduled CPUs 0 and 1. The CPU hog was taken from the Wiki:

    while true ; do echo "nothing" > /dev/null ; done
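To put one hog on each of the scheduled cores, I believe the usual trick is to pin the loop with taskset from util-linux; a minimal sketch, assuming taskset is installed:

    # pin one hog to each scheduled core (CPUs 0 and 1)
    taskset -c 0 bash -c 'while true ; do echo "nothing" > /dev/null ; done' &
    taskset -c 1 bash -c 'while true ; do echo "nothing" > /dev/null ; done' &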

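For reference, the stock latency-test HAL fragment creates the two threads and hooks a timedelta instance to each (this is from memory; check the script in your tree). The CPU-assignment parameters below (cpu1=/cpu2=) are hypothetical stand-ins of my own, not necessarily Michael's actual syntax; see his branch for the real thing:

    # stock-style latency-test HAL, plus hypothetical cpu1=/cpu2= placement
    loadrt threads name1=base period1=25000 fp1=0 cpu1=3 name2=servo period2=1000000 cpu2=2
    loadrt timedelta count=2
    addf timedelta.0 base
    addf timedelta.1 servo
    start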
Here are my first results.

Each box in the tables below represents a test for a specific combination of threads and CPUs and contains the max-jitter values reported for the servo/base threads in that test. The columns are numbered by the CPU assigned the servo thread and the rows are numbered by the CPU assigned the base thread. For example, in Series 1, with the servo thread assigned to CPU 2 and the base thread assigned to CPU 1, the reported max-jitter values were 11.1us for the servo thread and 6.1us for the base thread.

Test series 1. "noload" on scheduled CPUs 0/1, CPUs 2/3 are isolated.
Table entries are max-jitter values reported for a given test with servo/base threads running in the indicated CPUs.
Each test ran 2 minutes.

base-thread |                      servo-thread CPU
    CPU     |       0               1               2               3
------------+---------------------------------------------------------------
     0      |  7.7us/7.5us    14.8us/7.8us    10.1us/9.8us    10.5us/9.2us
     1      | 15.4us/6.7us     4.2us/5.2us    11.1us/6.1us    13.7us/9.4us
     2      | 16.5us/5.5us    14.3us/8.4us     2.5us/5.3us    10.8us/7.5us
     3      | 15.0us/5.7us    14.2us/7.3us    12.3us/8.9us     2.9us/5.5us


Test series 2. "CPU hogs" running on scheduled CPUs 0/1, CPUs 2/3 isolated.
Table entries are max-jitter values reported for a given test with servo/base threads running in the indicated CPUs.
Each test ran 2 minutes.


base-thread |                      servo-thread CPU
    CPU     |       0               1               2               3
------------+---------------------------------------------------------------
     0      |  5.8us/8.0us    12.8us/7.9us     7.8us/8.7us     8.1us/10.0us
     1      | 14.3us/5.4us     3.7us/4.9us     7.3us/7.4us     7.9us/6.4us
     2      | 15.3us/4.6us    15.2us/4.8us     2.5us/4.9us     8.8us/5.8us
     3      | 14.3us/5.6us    13.4us/5.0us     7.4us/7.4us     2.6us/4.7us

At a glance, the most notable feature of both series of tests is the low max-jitter results reported along the major diagonals of the tables, i.e., for the cases where both servo and base threads run on the same CPU. With the exception of case (0,0), the reported numbers are as good as or better than those I reported last year with the same computer and an RTAI system running its realtime threads in kernel space. Since case (0,0) corresponds to the realtime threads running on the same CPU as many system processes, the exception doesn't seem remarkable.

Another notable feature is the minimal benefit of running CPU hogs on the scheduled CPUs, seen by comparing corresponding Series 1 and Series 2 tests. This nostrum was developed to prevent a cache-flushing situation that led to excessive latencies observed in the early days of LinuxCNC development. Although it seemed to have a beneficial effect in earlier versions of Michael's Xenomai work, it seems to offer little benefit now, perhaps a few microseconds' reduction in max jitter.

Looking at CPU utilization with the Linux top command, CPU 0 utilization was always less than 2% user mode, a fraction of a percent system mode, and 95+% idle in the "noload" condition, with the other CPUs essentially 100% idle. With a CPU hog running on a CPU, that CPU's utilization jumped to about 80% user mode and slightly less than 20% system mode. That the reported max-jitter values don't drop dramatically between these two very different load conditions would seem to be an argument against behind-the-scenes power-management techniques like dynamic CPU frequency scaling being a major actor in these measurements.
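One direct way to check whether frequency scaling is even in play is to read the cpufreq state from sysfs; a minimal sketch:

    # show the active scaling governor (if any) for each core
    for c in /sys/devices/system/cpu/cpu[0-3] ; do
        echo -n "$c: "
        cat $c/cpufreq/scaling_governor 2>/dev/null || echo "no cpufreq driver"
    done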

Finally, even the worst reported max-jitter values should be quite acceptable for systems employing software step-generation.

While not reflected in these tables, this build seems more stable than previous ones, with less variability in the reported jitter numbers. Starting other processes such as a Web browser or glxgears occasionally kicked the jitter up, but only minimally, as did the software update agent unexpectedly starting up during a test.

Note: On this system, examining the /proc/xenomai/stat pseudofile (e.g., 'cat /proc/xenomai/stat') during latency-test runs reliably kicks up the max-jitter values, especially the base-thread value, which typically jumps to ~250-350us.

CAVEAT: These tests were 2-minute drills. For completeness, tests should be run for much longer durations. As a practical matter, I'll probably first run just the best and worst cases for 12h to 24h.
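As a sketch of how such a long run might be driven: the GNU coreutils timeout command can bound the duration, with run-one-case here being a hypothetical stand-in for whatever script drives a single thread-placement case:

    # run the best case (say, both threads on CPU 2) for 24 hours
    timeout 24h ./run-one-case --servo-cpu 2 --base-cpu 2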

Kent, 20121112