Experimental Setup

Benchmark Programs. To evaluate PERIOD, we use a set of real-world CVEs and widely used benchmarks, written in C/C++ for the Linux/Pthreads platform, including all 10 programs from the CVE benchmark [2] and 36 programs from SCTBench [1]. The CVE benchmark contains 10 programs with various concurrency bugs, each corresponding to a real-world CVE. SCTBench collects 52 concurrency bugs from previous parallel workloads (PARSEC [3] and SPLASH-2 [4]) and concurrency testing/verification works (CB [5], CHESS [6], CS [7], and Inspect [8]). Note that we exclude 16 programs in SCTBench: 5 of them fail to compile on the LLVM platform, and the bugs in the other 11 programs can be exposed 100% of the time.

Baselines. The main baseline tools are selected based on the categories of CCT techniques. We select 3 systematic CCT techniques: iterative preemption bounding (IPB) [9], iterative delay bounding (IDB) [10], and depth-first search with no schedule bound (DFS) [11], as well as 3 non-systematic CCT techniques: Maple [12], probabilistic concurrency testing (PCT) [21], and a controlled random scheduler that randomly chooses a thread to execute at each step (Random). For comparison purposes, we also include our proposed serialized scheduler (SERIAL) and native execution, where schedule nondeterminism is not controlled (Native). We used existing implementations of the compared tools when available; their websites (PCT, Random, IDB, IPB, DFS, Maple, ConVul, UFO, TSAN, FastTrack, Helgrind+) are linked on this page.

Configuration Parameters. The schedule bounds for the CCT tools are set to check bugs with bug depths of less than 5 and 3, respectively. Each invocation of a CCT tool has a budget of exploring up to 10,000 schedules. For the other, non-CCT tools, we adopt their default configurations. For each compared tool, we run the tests on each program 10 times and collect the results. All experiments were performed on a workstation with an Intel(R) Xeon(R) Silver 4214 processor, running Ubuntu 18.04 with GCC 7.5 and LLVM 10.0.

Results on CVE Benchmarks

The descriptive statistics and detection results on the CVE benchmark are shown in Table 1. The column "schedules to 1st bug" shows the number of schedules explored up to and including the first detection of a bug. The column "schedules" denotes the total number of schedules explored, while "buggy schedules" indicates the number of explored schedules that exhibited the bug. These figures demonstrate how capably and how quickly each tool finds the bugs in the benchmark programs; here we assume that the overhead of controlled scheduling is the same for all techniques. A smaller "schedules to 1st bug" usually indicates that the corresponding technique detects the bug more quickly, and a smaller "schedules" count means that fewer schedules need to be considered by a tool. Note that it is possible for a technique to "get lucky" and find a bug quickly due to its search order. We can also consider the worst-case bug-finding ability through the total number of non-buggy schedules within the bound (the difference between the "schedules" and "buggy schedules" columns).

PERIOD successfully identified all 10 programs as buggy, while the other CCT techniques (i.e., IPB, IDB, DFS, PCT, Random, Maple) identified only 7 buggy programs. Native execution, which relies on the Linux OS's scheduling, identified only 3 buggy programs. Interestingly, PERIOD also reports 5 undocumented bugs in the CVE benchmark. For example, in CVE-2017-6346, PERIOD reports null-pointer-dereference (NPD), use-after-free (UAF), and double-free (DF) bugs. All other systematic CCT techniques fail to report any bug in the program of CVE-2017-6346, and only two non-systematic CCT techniques (i.e., Random and Maple) can identify the NPD or DF. The DF bug in the program of CVE-2016-1972 can be found only by PERIOD within the schedule limit of 10,000. In terms of schedules to 1st bug, PERIOD performs the best on 10 out of 15 bugs among the 7 techniques (i.e., PERIOD, IPB, IDB, DFS, Native, PCT, Random) in Table 1. In terms of schedules, PERIOD also performs the best on 7 out of 10 programs in Table 1.

We also include in Table 1 the results of three representative race detectors (i.e., FastTrack [13], Helgrind+ [14], and TSAN [15]), as well as two concurrency vulnerability detectors (i.e., UFO [16] and ConVul [17]). The three data race detectors each identify at most 2 bugs. ConVul and UFO identify 9 and 3 buggy programs, respectively. In total, PERIOD successfully reports 15 bugs, while the second-best counterpart (i.e., Maple) reports 11 bugs. This indicates that PERIOD is more effective than the other tools on the CVE benchmark.

Results on SCTBench

The descriptive statistics and detection results on SCTBench are shown in Table 2. PERIOD identified 38 bugs in total across the 36 tested benchmark programs. PCT found 33 bugs, performing second-best, while IPB, IDB, DFS, Native, Random, and Maple only found 25, 27, 19, 13, 29, and 18 bugs, respectively. Note that PERIOD was able to find concurrency bugs in all benchmark programs except Misc.SafeStack and RADBench.bug5. Only PERIOD and PCT successfully found the assertion failure in CS.twostage_100_bad, CS.reorder_10_bad, CS.reorder_20_bad, and RADBench.bug2, and only PERIOD successfully detected the buffer overflow (BOF) in CB.pbzip2. We manually inspected the two bugs missed by PERIOD (and also by all other competitors). Misc.SafeStack is an implementation of a lock-free stack [18]. The bug is challenging, requiring at least 5 context switches, and was missed by all techniques in a prior evaluation [11]. We were unable to reproduce RADBench.bug5 in our experiments, even though we followed the instructions in its documentation. Nevertheless, the results show that PERIOD identified more bugs than its competitors in our experiments.

In terms of schedules to 1st bug, PERIOD performs the best on 20 out of 40 bugs in Table 2. When compared to the other systematic testing techniques (i.e., IPB, IDB, DFS) in terms of schedules, PERIOD also performs the best on 30 out of 36 programs in Table 2. Take CS.reorder_x_bad (x = 3, 4, 5, 10, 20) as an example, where x denotes the number of sub-threads created by the main thread: IDB generates 205 schedules for testing CS.reorder_3_bad, 518 schedules for CS.reorder_4_bad, 996 schedules for CS.reorder_5_bad, and 7,406 schedules for CS.reorder_10_bad. This indicates that as the number of threads increases, it becomes more and more difficult for IDB to find the assertion failure in CS.reorder_x_bad. In contrast, PERIOD generates far fewer schedules than IPB, IDB, and DFS, and needs fewer schedules to trigger the assertion failure in CS.reorder_x_bad. Thus, PERIOD finds concurrency bugs more quickly than the other systematic testing techniques and requires fewer schedules to be explored.

Cumulative Plot

Fig. 3 is a cumulative plot showing the number of bugs found (y-axis) after x schedules (x-axis) for each technique. Each line represents one technique and is labeled with the name of the technique and the number of bugs it found within the schedule limit. If a technique has a point at coordinate (x, y), then the technique exposed y concurrency bugs using x schedules. For example, with a schedule limit of 10,000, PERIOD and PCT found 53 and 39 bugs, respectively. We also note that PERIOD exposed more than 40 bugs within 100 schedules, while the other techniques would require 1,000 to 10,000 schedules to do so.

From the analysis of Table 1, Table 2, and Fig. 3, we can conclude that PERIOD significantly outperforms its competitors in terms of concurrency bug-finding ability.

Improvement of Parallel Scheduler

We have evaluated the improvement of our proposed parallel scheduler (PERIOD) over the serialized scheduler (SERIAL). The results that differ between PERIOD and SERIAL are marked in blue in Table 2; their results are the same in Table 1. In total, PERIOD reports 8 more bugs than SERIAL. When testing programs with a large number of threads, SERIAL can miss bugs even at depth 2. This is because the number of schedules generated by SERIAL grows exponentially as the number of threads increases, which often exhausts the given schedule budget. For example, SERIAL generates 30, 384, and 5,040 schedules for testing CS.reorder_3_bad, CS.reorder_4_bad, and CS.reorder_5_bad, respectively. When the number of threads grows to 10 (CS.reorder_10_bad), SERIAL would need to generate a prohibitively large number of schedules.

The parallel scheduler prunes the schedule space to be explored from (n*k)^(n+d-1) to (n*k)^(d+1), where d denotes the bug depth. When n = 2, the parallel scheduler and the serialized scheduler perform the same. The experimental results show that PERIOD requires fewer schedules to find the same bug for the first time, and fewer schedules to finish its exploration. Moreover, as the number of threads increases, the improvement of the parallel scheduler becomes substantial.
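For a rough, hypothetical sense of this pruning, assume (as the formula suggests, though the roles of n and k are not restated in this section) that n is the number of threads and k the number of key points per thread, and take n = 10, k = 4, and bug depth d = 2:

```latex
(nk)^{n+d-1} = 40^{11} \approx 4.2 \times 10^{17}
\qquad \text{vs.} \qquad
(nk)^{d+1} = 40^{3} = 64{,}000
```

The parallel scheduler's bound thus grows with the bug depth rather than with the thread count, which matches the growing advantage observed as threads are added.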

From the analysis of Table 1 and Table 2, we can conclude that our proposed parallel scheduler significantly improves on the serialized scheduler when testing programs with more than 2 threads.

Worst-case Bug-finding Ability Evaluation

The above experiments show that PERIOD is effective and efficient in finding concurrency bugs. Note that it is possible for a technique to "get lucky" and find a bug quickly due to its search order. To avoid this implementation-dependent bias, in the figure below we consider the worst-case bug-finding ability through the total number of non-buggy schedules within the bound (the difference between the schedules and buggy schedules columns). The result shows that PERIOD performs the best on 33 out of all 46 benchmark programs, compared to the other systematic CCT techniques, in terms of the worst-case number of schedules that might have to be explored to find a bug.

Performance Overhead

To evaluate the runtime overhead of the underlying periodical scheduling policy and the additional instrumentation, we use benchmark programs from SCTBench with thread counts varying from 2 to 100 (the number inside a program's name represents the number of sub-threads). Fig. 10 shows the execution slowdown incurred by PERIOD, IPB, IDB, and PCT, compared to native execution. On all benchmark programs, the overhead incurred by PERIOD is lower than that of the other competitors: PERIOD has a slowdown ranging from 9x to 16x, while the others range from 55x to 210x. As PERIOD allows parallelism in periodical execution, the slowdown is mitigated as the number of threads increases.

The low overhead of PERIOD can be attributed to two factors: (1) periodical execution achieves controlled scheduling with non-preemptive synchronizations, which avoids false deadlocks and starvation; and (2) PERIOD allows parallelism in periodical execution instead of serializing executions, which boosts performance.

From the analysis of Fig. 10, we can conclude that the implementation of PERIOD incurs noticeable runtime overhead, but it is still much lower than that of the other tools.

Multi-core Speedup

Fig. 11 compares the execution speedup when using 2, 4, and 8 cores. The speedup is highly variable across targets and configurations. When testing programs with 2 to 10 threads (e.g., twostage_bad, reorder_5_bad), the execution speeds of the 2-core, 4-core, and 8-core configurations are very similar. However, when the tested program has 100 threads (e.g., twostage_100_bad), the 8-core configuration achieves a 37x speedup over a single core, while the 2-core and 4-core configurations only achieve 5x and 22x, respectively. We hypothesize that this increase is due to concurrent thread executions in periodical scheduling.

Impact of the Choice of Period-number Bounds

The table below compares different choices of period-number bounds when applying PERIOD to the CVE benchmark and SCTBench.

The experimental results show that bugs of bug-depth d will be missed if the period-number bound is set smaller than d+1. Moreover, the number of generated schedules can increase significantly as the period-number bound increases.

[bounds-impact: results under different period-number bounds]

New Bugs Found by PERIOD

We have applied PERIOD to a number of real-world programs and discovered 5 previously unknown concurrency bugs (e.g., a UAF in lrzip, a buffer overflow in pbzip2, and an invalid address dereference in ctrace). These bugs had not been reported before, and we informed the maintainers. More details can be found on this page.

Case Studies

To demonstrate the reasons behind PERIOD's superiority, we present two case studies.

Fig. 1 shows an example simplified from CVE-2016-1972.

Two threads T0 and T1 concurrently invoke the function once() and are synchronized with the help of three variables (i.e., lock, done, and waiters). The variable lock, allocated in the main thread (line 22) and released in the child thread (line 17), is used to protect the critical section (lines 10-14). The variable done is set to 1 once the critical section has completed (lines 12-13); threads that arrive after a thread has finished the critical section return directly (lines 6-7). The variable waiters indicates the number of threads waiting to enter the critical section (line 8). The expected behavior is that only the last thread releases the lock (lines 16-18). Note that all statements in once() except for ``return 0`` either access shared memory locations or contain synchronization primitives, and are therefore considered key points.
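Since Fig. 1 is not reproduced here, the following is a minimal C sketch of the once() pattern described above. It is our reconstruction from the description, not the exact Fig. 1 listing, so its line positions differ from the line numbers cited in the text:

```c
/* Reconstruction of the once() example simplified from CVE-2016-1972. */
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t *lock;  /* allocated in main, released by the last waiter */
static int done = 0;           /* set to 1 once the critical section completes */
static int waiters = 0;        /* number of threads waiting for the section */

static void *once(void *arg) {
    if (done)                     /* threads arriving late return directly */
        return 0;
    waiters++;                    /* register as a waiter */
    pthread_mutex_lock(lock);     /* NPD if lock is NULL; UAF if already freed */
    if (!done) {
        /* ... critical section ... */
        done = 1;
    }
    pthread_mutex_unlock(lock);
    if (--waiters == 0) {         /* expected: only the last thread frees */
        free(lock);               /* racing frees here produce the DF */
        lock = NULL;
    }
    return 0;
}

int main(void) {
    lock = malloc(sizeof *lock);  /* lock is allocated in the main thread */
    pthread_mutex_init(lock, NULL);
    pthread_t t0, t1;
    pthread_create(&t0, NULL, once, NULL);
    pthread_create(&t1, NULL, once, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```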

This program demonstrates 3 kinds of concurrency bugs: null-pointer-dereference (NPD), use-after-free (UAF), and double-free (DF). In more detail, if the lock is set to null by T0 at line 18 before T1 uses the lock at line 10, an NPD occurs. A UAF can be triggered at line 10 when thread T0 releases the lock at line 17 and thread T1 then uses the lock at line 10. A DF can be triggered at line 17 when both threads T0 and T1 try to release the lock. All these bugs are hard to trigger, as each requires a specific sequence of operations on the variables done, waiters, and lock.

We ran several existing CCT tools on this example. Maple [12] can detect the NPD but misses the UAF and DF, as it heuristically steers thread scheduling to attempt to force a set of predefined interleaving idioms, which does not include the interleaving idioms required by the UAF and DF in this example. PCT [21] relies on a randomized scheduler to explore the schedule space of the example. The probability of finding the NPD is high, but the probabilities of finding the UAF and DF are very low: about 0.15% and 0%, respectively, in our experiment. IPB [9] and IDB [10] are two representative systematic tools. Both tools can detect the NPD and UAF but miss the DF, whose bug depth is 5 (i.e., it requires 5 context switches). The reason is that they try to iterate the scheduling decision at each key point, going through a large schedule space. Moreover, their bounds do not faithfully reflect the context-switch bounds, so they often require (potentially many) more context switches than the depth of the bugs they try to expose. These results demonstrate the limitations of current CCT tools.

As shown in Fig. 3, with the help of the feedback analyzer, PERIOD generates only 573 schedules and successfully finds the NPD, UAF, and DF bugs, whereas IPB and DFS exceed 10,000 schedules yet still fail to find the DF bug.

The program CS.reorder_10_bad contains an assertion failure that none of the other systematic testing techniques was able to detect. Ideally, exposing it takes only two context switches between T0 and any other thread, or three periods in our terminology, as illustrated in Fig. 4. Similar to SERIAL, the main reason the others fail is that the number of schedules generated by the other systematic testing techniques (e.g., IPB and IDB) grows exponentially as the number of threads increases, which exhausts the given schedule budget. Furthermore, IDB cannot expose this bug through few delays, as doing so would require at least 10 delays. PCT, as a randomized method, may trigger this bug, but with extremely low probability (0.09% in our experiment).
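For reference, the following is a minimal sketch in the style of the CS.reorder family, reconstructed from the pattern this benchmark family is known for rather than copied from the SCTBench source; the split between setter and checker threads is our assumption:

```c
/* Sketch of the reorder pattern (reconstruction; details may differ). */
#include <assert.h>
#include <pthread.h>

static int a = 0, b = 0;

static void *set_thread(void *arg) {
    a = 1;   /* a context switch here exposes a == 1 while b is still 0 */
    b = -1;
    return 0;
}

static void *check_thread(void *arg) {
    /* Fails only when the checker observes the intermediate state
       between a setter's two writes. */
    if (!((a == 0 && b == 0) || (a == 1 && b == -1)))
        assert(0);
    return 0;
}

int main(void) {
    enum { N = 10 };  /* number of setters; the benchmark's setter/checker split may differ */
    pthread_t setters[N], checker;
    for (int i = 0; i < N; i++)
        pthread_create(&setters[i], NULL, set_thread, NULL);
    pthread_create(&checker, NULL, check_thread, NULL);
    for (int i = 0; i < N; i++)
        pthread_join(setters[i], NULL);
    pthread_join(checker, NULL);
    return 0;
}
```

Each extra setter multiplies the interleavings a serialized or bounded enumeration must consider, which is consistent with the exponential growth observed for IDB and SERIAL above.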

Other case studies about our newly detected bugs can be found on this page.

Threats to Validity

We selected a variety of existing benchmarks and real-world programs to show the capabilities of PERIOD and compared it against other CCT techniques. However, our evaluation dataset may still contain sample bias. Further studies on more real-world programs would help to better evaluate PERIOD.

This work also assumes that the inputs to the program are predetermined (for example, by an existing test suite) and do not vary between runs, as is typical in the CCT literature. Adopting test case generation techniques (e.g., fuzzing and symbolic execution) might help mitigate this threat.

Moreover, this work does not handle weak memory models (WMMs) [19] or probabilistic programming models [20]. We are seeking solutions to further improve PERIOD.

References

[1] 2016. SCTBench: a set of C/C++ pthread benchmarks for evaluating concurrency testing techniques. Retrieved August 20, 2021 from https://github.com/mc-imperial/sctbench

[2] 2019. CVE Benchmark. Retrieved August 20, 2021 from https://github.com/mryancai/ConVul

[3] Christian Bienia. 2011. Benchmarking Modern Multiprocessors. Ph.D. Dissertation. Princeton University.

[4] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. ACM SIGARCH Computer Architecture News 23, 2 (1995), 24–36.

[5] Jie Yu and Satish Narayanasamy. 2009. A case for an interleaving constrained shared-memory multi-processor. ACM SIGARCH Computer Architecture News 37, 3 (2009), 325–336.

[6] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gerard Basler, Piramanayagam Arumuga Nainar, and Iulian Neamtiu. 2008. Finding and Reproducing Heisenbugs in Concurrent Programs. In OSDI, Vol. 8. USENIX, USA, 267–280.

[7] Lucas Cordeiro and Bernd Fischer. 2011. Verifying multi-threaded software using SMT-based context-bounded model checking. In 2011 33rd International Conference on Software Engineering (ICSE). IEEE, 331–340.

[8] Yu Yang, Xiaofang Chen, and Ganesh Gopalakrishnan. 2008. Inspect: A runtime model checker for multithreaded C programs. Technical Report. Citeseer.

[9] Madanlal Musuvathi and Shaz Qadeer. 2007. Iterative context bounding for systematic testing of multithreaded programs. ACM Sigplan Notices 42, 6 (2007), 446–455.

[10] Michael Emmi, Shaz Qadeer, and Zvonimir Rakamarić. 2011. Delay-bounded scheduling. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on principles of programming languages. ACM, 411–422.

[11] Paul Thomson, Alastair F Donaldson, and Adam Betts. 2016. Concurrency testing using controlled schedulers: An empirical study. ACM Transactions on Parallel Computing (TOPC) 2, 4 (2016), 1–37.

[12] Jie Yu, Satish Narayanasamy, Cristiano Pereira, and Gilles Pokam. 2012. Maple: A coverage-driven testing tool for multithreaded programs. In Proceedings of the ACM international conference on Object oriented programming systems languages and applications. ACM, Tucson, Arizona, USA, 485–502.

[13] Cormac Flanagan and Stephen N Freund. 2009. FastTrack: efficient and precise dynamic race detection. ACM Sigplan Notices 44, 6 (2009), 121–133.

[14] Ali Jannesari, Kaibin Bao, Victor Pankratius, and Walter F. Tichy. 2009. Helgrind+: An efficient dynamic race detector. In 2009 IEEE International Symposium on Parallel & Distributed Processing. IEEE, Chengdu, China, 1–13.

[15] Konstantin Serebryany and Timur Iskhodzhanov. 2009. ThreadSanitizer: Data race detection in practice. In Proceedings of the workshop on binary instrumentation and applications. ACM, New York, NY, USA, 62–71.

[16] Jeff Huang. 2018. UFO: predictive concurrency use-after-free detection. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, Gothenburg, Sweden, 609–619.

[17] Yan Cai, Biyun Zhu, Ruijie Meng, Hao Yun, Liang He, Purui Su, and Bin Liang. 2019. Detecting concurrency memory corruption vulnerabilities. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, Paris, France, 706–717.

[18] Dmitry Vyukov. 2010. Bug with a context switch bound 5. In Microsoft CHESS Forum.

[19] Mohamed Faouzi Atig, Ahmed Bouajjani, Sebastian Burckhardt, and Madanlal Musuvathi. 2010. On the verification problem for weak memory models. In Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages. 7–18.

[20] András Prékopa. 2003. Probabilistic programming. Handbooks in Operations Research and Management Science 10 (2003), 267–351.

[21] Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, and Santosh Nagarakatte. 2010. A randomized scheduler with probabilistic guarantees of finding bugs. In Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, Pittsburgh, Pennsylvania, USA, 167–178.