Spurious wakeup and infinite wait on condvars

A condition variable (condvar, or simply cv) is a synchronization primitive specified by POSIX.

A simple way to use a condvar is illustrated below:

// condvar consumer, thread A runs the code below

pthread_cond_t condvar = PTHREAD_COND_INITIALIZER;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int predicate = 0;

pthread_mutex_lock( &mutex );
while ( predicate == 0 ) {
    pthread_cond_wait( &condvar, &mutex );
}
pthread_mutex_unlock( &mutex ); // no need for the mutex, as the predicate is not used any more

// condvar signaler, thread B runs the code below

pthread_mutex_lock( &mutex );
predicate = 1;
pthread_cond_signal( &condvar );
pthread_mutex_unlock( &mutex );

A few questions arise from the above code snippet.

1. Why do we need to use a "while" loop instead of an "if"?

Because POSIX requires it, due to spurious wakeups:

http://pubs.opengroup.org/onlinepubs/000095399/functions/pthread_cond_wait.html

quote:

" When using condition variables there is always a Boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return."

2. Why do we have spurious wakeups?

There are a few reasons for spurious wakeups.

2a: spurious wakeup due to multiple awakenings (of more than 1 thread) on a multi-processor system.

This kind of spurious wakeup is briefly explained in the link below:

http://pubs.opengroup.org/onlinepubs/000095399/functions/pthread_cond_signal.html

The following code snippet is quoted from the above link:

pthread_cond_wait(mutex, cond):
    value = cond->value;                       /* 1 */
    pthread_mutex_unlock(mutex);               /* 2 */

pthread_cond_signal(cond):
    pthread_mutex_lock(cond->mutex);           /* 3 */
    cond->value++;                             /* 4 */
    if (cond->waiter) {                        /* 5 */
        sleeper = cond->waiter;                /* 6 */
        cond->waiter = sleeper->next_cond;     /* 7 */
        able_to_run(sleeper);                  /* 8 */
    }
    pthread_mutex_unlock(cond->mutex);         /* 9 */

    pthread_mutex_lock(cond->mutex);           /* 10 */
    if (value == cond->value) {                /* 11 */
        me->next_cond = cond->waiter;
        cond->waiter = me;
        pthread_mutex_unlock(cond->mutex);
        unable_to_run(me);
    } else {
        pthread_mutex_unlock(cond->mutex);     /* 12 */
    }
    pthread_mutex_lock(mutex);                 /* 13 */


Basically, it says that on a multi-processor system, suppose thread A is running on one processor and has just finished step #2, while at the same time thread B starts running pthread_cond_signal() on a different processor and thread C is already blocked in pthread_cond_wait(). Then it is possible that after thread B returns from pthread_cond_signal(), both thread A and thread C will be woken up and return from pthread_cond_wait() one after another. Each of them will lock the mutex (step #13) in turn, and by the time the second thread gets the mutex, the predicate may already have been modified by the first woken thread, and hence may no longer be true.

Note that in the above scenario, thread A and thread B are running on different processors while thread C is waiting on the condvar; there is no context switch among them.

2b: spurious wakeup due to scheduling (even when only one thread is waiting on the condvar) on a single-processor system.

Reconsider the example above: thread A can be context-switched out right after finishing step #2. There are two ways this can happen:

1. Under priority-based scheduling (a real-time operating system), if thread B is trying to get the mutex and has a higher priority, then thread A, once it has finished pthread_mutex_unlock(mutex), will drop its priority back to normal (mutex priority inheritance), and the scheduler will immediately give thread B the mutex and start running thread B, which will call pthread_cond_signal().

2. Under round-robin scheduling, it can happen that thread A, as soon as it finishes pthread_mutex_unlock(mutex), runs out of its time quantum, and the scheduler context-switches it out and starts running thread B.

2c: spurious wakeup due to pthread_cond_wait() interrupted by a signal

pthread_cond_wait() eventually ends up in a system call, and the kernel puts the caller to sleep to wait for another thread to call pthread_cond_signal(). During the wait, if a signal occurs and is delivered to the waiting thread, the waiting thread is allowed to wake up and return from pthread_cond_wait() successfully:

http://pubs.opengroup.org/onlinepubs/000095399/functions/pthread_cond_wait.html

quote:

"If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup."

Typically, if a thread is blocked in a system call and a signal occurs, the call returns with the error code EINTR. Here is the definition of EINTR:

[EINTR]
    Interrupted function call. An asynchronous signal was caught by the process during the
    execution of an interruptible function. If the signal handler performs a normal return, the
    interrupted function call may return this condition (see the Base Definitions volume of
    POSIX.1-2008, <signal.h>).

Returning EINTR is in fact a Unix tradition, in the spirit of "worse is better", and indeed the majority of POSIX system interface APIs are allowed to return EINTR.

http://www.jwz.org/doc/worse-is-better.html

However, it is interesting to note that none of the APIs from the POSIX pthread extension is allowed to return EINTR, which is a deviation from "worse is better". The exact rationale behind this might be worth another topic.
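As an aside, the canonical application-side idiom for dealing with EINTR on an ordinary blocking system call is simply to retry the call. A minimal sketch (the wrapper name read_retry is made up for illustration):

```c
#include <errno.h>
#include <unistd.h>

/* Retry a read() that was interrupted by a signal.  Any other error
   (or a successful read) is passed through to the caller unchanged. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);        /* may fail with errno == EINTR */
    } while (n == -1 && errno == EINTR); /* interrupted: simply retry */
    return n;
}
```

This is exactly the per-call-site complexity that the pthread APIs spare applications from by never returning EINTR.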

2d: spurious wakeup due to pthread_cond_broadcast()

Reconsider the example above: threads A and C can both be woken up by a pthread_cond_broadcast(). After that, both threads take turns calling pthread_mutex_lock(mutex); whoever gets the mutex first gets to run, and could potentially change the predicate to false. The thread that gets the mutex second may then return from pthread_cond_wait(), holding the mutex, only to find that the predicate is not true, and it needs to go back and wait in pthread_cond_wait() again.

2e: spurious wakeup due to bad design.

Consider the following wakeup code:

pthread_mutex_lock( &mutex );
predicate = 1;
pthread_mutex_unlock( &mutex );
pthread_cond_signal( &condvar );

Consider also that thread A is blocked on the condvar, thread B is about to run the above code, and thread C is simply blocked on the mutex and has higher priority than thread A.

After thread B releases the mutex, the scheduler will immediately context-switch to thread C, which gets the mutex, changes the predicate, and unlocks the mutex. Now thread A wakes up from pthread_cond_wait() and locks the mutex, only to find that the predicate is still not true.

The above scenario is arguably a bad design, as thread C should not touch the predicate without being part of the condvar protocol.

In summary: spurious wakeups can happen for the 5 reasons above, and because of spurious wakeups, an application must use a "while" loop to re-check the predicate whenever it returns from pthread_cond_wait().

3. Shall we hold the mutex when calling pthread_cond_signal()?

Recall the earlier coding paradigm:

pthread_mutex_lock( &mutex );
predicate = 1;
pthread_cond_signal( &condvar );
pthread_mutex_unlock( &mutex );

Another possible code snippet is:

pthread_mutex_lock( &mutex );
predicate = 1;
pthread_mutex_unlock( &mutex );
pthread_cond_signal( &condvar );

Which coding paradigm is correct? The first snippet is the one recommended by the standard:

http://pubs.opengroup.org/onlinepubs/000095399/functions/pthread_cond_signal.html

quote:

"The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal()."

The above quote does not really mandate that we must hold the mutex when calling pthread_cond_signal(); it is just a recommendation. It is very interesting to read some of the background behind this wording:

https://groups.google.com/forum/#!topic/comp.programming.threads/wEUgPq541v8

You can jump directly to Dave Butenhof's response.

The argument is another classic example of balancing application complexity against API complexity. Unix has traditionally tried to keep APIs as simple as possible, pushing much of the complexity onto applications (EINTR is a good example). The fact that none of the POSIX pthread extension APIs is allowed to return EINTR is a somewhat self-contradictory evolution, where one can argue either way and the standard has gone both ways.

The issue becomes more complicated when the POSIX realtime extension, with its priority-based scheduling, and the pthread extension fight with each other. Refer to this discussion:

http://www.domaigne.com/blog/computing/condvars-signal-with-mutex-locked-or-not/

The complication caused by the POSIX realtime extension is further compounded by how pthread_cond_signal() is implemented in the condvar, mainly whether it uses Mesa-style or Hoare-style signaling.

Mesa-style versus Hoare-style signaling refers to the following implementation choice: as soon as a wakener calls pthread_cond_signal() and a waiter is woken, should the scheduler context-switch to the waiter, or continue to run the wakener? The former is called Hoare style, the latter Mesa style. A realtime extension with priority scheduling demands a Mesa-style implementation, in which the scheduler rather than the condvar decides who runs next based on priority; otherwise priorities could be violated (for example, a Hoare-style handoff would run the waiter even when the wakener holds the higher priority, perhaps merely inherited before signaling).

One can also argue that a design which depends on either Mesa-style or Hoare-style behavior is not a good design and is not portable.

In my opinion, which coding paradigm is desirable has to be judged in the context of the application. A carefully designed application needs to put the scheduling policy, priorities, realtime requirements, and the condvar waiters and signalers all on the table in order to come up with a solution that is not only logically correct, but also satisfies its timing constraints.

4. Why do we have infinite waits?

Assume thread A runs the following code (this is a simplified GNU implementation of pthread_cond_wait):

101 __pthread_cond_wait (cond, mutex)
102      pthread_cond_t *cond;
103      pthread_mutex_t *mutex;
104 {
107   int err;
117
118   /* Make sure we are alone. */
119   lll_lock (cond->__data.__lock, pshared);
120
121   /* Now we can release the mutex. */
122   err = __pthread_mutex_unlock_usercnt (mutex, 0);
128
129   /* We have one new user of the condvar. */
130   ++cond->__data.__total_seq;
131   ++cond->__data.__futex;
132   cond->__data.__nwaiters += 1 << COND_NWAITERS_SHIFT;
133
134   /* Remember the mutex we are using here. If there is already a
135      different address store this is a bad user bug. Do not store
136      anything for pshared condvars. */
137   if (cond->__data.__mutex != (void *) ~0l)
138     cond->__data.__mutex = mutex;
139
147
148   /* The current values of the wakeup counter. The "woken" counter
149      must exceed this value. */
150   unsigned long long int val;
151   unsigned long long int seq;
152   val = seq = cond->__data.__wakeup_seq;
155
156   do
157     {
158       unsigned int futex_val = cond->__data.__futex;
159       /* Prepare to wait. Release the condvar futex. */
160       lll_unlock (cond->__data.__lock, pshared);
164
187       /* Wait until woken by signal or broadcast. */
188       lll_futex_wait (&cond->__data.__futex, futex_val, pshared);
189
192
193       /* We are going to look at shared data again, so get the lock. */
194       lll_lock (cond->__data.__lock, pshared);
195
199
200       /* Check whether we are eligible for wakeup. */
201       val = cond->__data.__wakeup_seq;
202     }
203   while (val == seq || cond->__data.__woken_seq == val);
204
205   /* Another thread woken up. */
206   ++cond->__data.__woken_seq;
207
208  bc_out:
209
210   cond->__data.__nwaiters -= 1 << COND_NWAITERS_SHIFT;
218
219   /* We are done with the condvar. */
220   lll_unlock (cond->__data.__lock, pshared);
235   return __pthread_mutex_cond_lock (mutex);
236 }

Note that after line #160, the thread can be context-switched out for any of the reasons behind spurious wakeups discussed above.

Now assume the newly running thread B does a pthread_cond_signal() and successfully returns from it. When thread A starts running again at line #188, it may wait at line #188 forever, as nobody will wake it up any more, even though the predicate may have changed to true (a non-zero value). We end up with a system in which the variable "predicate" is not zero, yet thread A waits in pthread_cond_wait() forever. This is the so-called "infinite wait".

Below is the GNU pthread_cond_signal code; you can see that it can run to completion without any context switch.


32 __pthread_cond_signal (cond)
33      pthread_cond_t *cond;
34 {
39
40   /* Make sure we are alone. */
41   lll_lock (cond->__data.__lock, pshared);
42
43   /* Are there any waiters to be woken? */
44   if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
45     {
46       /* Yes. Mark one of them as woken. */
47       ++cond->__data.__wakeup_seq;
48       ++cond->__data.__futex;
49
66       /* Wake one. */
67       if (! __builtin_expect (lll_futex_wake_unlock (&cond->__data.__futex,
68                                                      1, 1,
69                                                      &cond->__data.__lock,
70                                                      pshared), 0))
71         return 0;
72
73       /* Fallback if neither of them work. */
74       lll_futex_wake (&cond->__data.__futex, 1, pshared);
75     }
76
77   /* We are done. */
78   lll_unlock (cond->__data.__lock, pshared);
79
80   return 0;
81 }

An infinite wait can be avoided relatively easily by the pthread implementation. The above GNU implementation arguably still suffers from the problem.

Note that some websites suggest that if pthread_cond_signal() is not called while the mutex is held, it may cause an infinite wait. That is simply a wrong interpretation of the standard, and an implementation with that behavior would be buggy, to say the least:

http://docs.oracle.com/cd/E19455-01/806-5257/6je9h032r/index.html

5. What is the correct way to use condvars then?

Given all the above discussion, how should we use condvars to avoid all these problems? This is more a design issue than a checklist; however, the following points need special attention during the design phase:

1. Be careful with pthread_cond_broadcast(). Using broadcast suggests that your design has several threads waiting on the same condvar. This can lead to the so-called "thundering herd" and "lock convoy" problems, and may indicate an unnecessarily complicated design. On the other hand, a "thundering herd" may be exactly the desired effect in a multiple-consumer case.

2. In multiple-predicate scenarios, we need to decide whether to use one condvar or several.

3. The binding between mutex and condvar is dynamic, meaning we can use different mutexes with the same condvar over time, as long as all the threads involved use the same mutex within each occurrence of a "wait and signal" session. However, this kind of usage may involve pthread_mutex_destroy() and can unnecessarily complicate the design due to various race conditions (pthread_cond_wait() trying to lock a destroyed mutex, etc.).

4. POSIX mutexes, condvars, spinlocks and semaphores all allow a process-shared attribute (PTHREAD_PROCESS_SHARED for mutexes and condvars, a pshared flag for spinlocks and semaphores), which means the same object can be shared among processes. When a condvar is shared among processes, its associated mutex needs to be shared among the same set of processes as well. Although a condvar has no concept of an owner, each mutex can have an owner (the process or thread that holds the mutex). When the owner dies before it can release the mutex, we end up with a "dangling mutex". In such cases, "mutex revival" could be a solution (POSIX robust mutexes, for example, let the next locker recover a mutex whose owner died).

5. Multiple waiters can wait on the same condvar. This can be a natural design choice with multiple consumers as the waiters.

6. Multiple wakeners can signal the same condvar. Race conditions among those wakeners may suggest splitting the wakeners across different condvars, and splitting the waiters across different predicates.

7. A condvar is stateless. If a thread signals a condvar and no other thread is waiting on it, that signal is permanently lost. Fortunately, this is rarely an issue: a waiter that arrives later will probably find the predicate true (because a thread that signals a condvar normally sets the predicate to true before signaling; otherwise, why signal?) and will not wait on the condvar at all. However, the stateless nature of condvars may still trigger race conditions that cause an infinite wait (refer to the earlier section on infinite waits). pthread_cond_timedwait() is a preferred choice to make an application more robust against condvar implementations:

http://pubs.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html