Scalable SeqlockX version 1.08
Author: Amine Moulay Ramdane.
Email: aminer@videotron.ca
Description:
This scalable Seqlock's variant was invented by Amine Moulay Ramdane, it is much faster and more powerful than the the Dmitry Vyukov's distributed reader-writer mutex , cause the Dmitry Vyukov's distributed reader-writer lock will become slower and slower on the writer side with more and more cores because it transfers too many cache-lines between cores on the writer side, this invention has eliminated the weakness of the Dmitry Vyukov's distributed reader-writer mutex, so that the writers throughput has become faster and very fast, and my scalable SeqlockX elminates the weaknesses of the classical Seqlock (sequential lock) that is "livelock", so it avoids livelock when there is more writers, so my scalable SeqlockX is a powerful synchronization mechanism that can replace RCU and that can replace Seqlock and that can replace Dmitry Vyukov's distributed reader-writer mutex. And if you look at my algorithm you will notice that on the reader side it looks like a classical sequential lock, but i have added a variable called fcount2^.fcount2 that allows the readers to not livelock when there is more writers ,this scalable SeqlockX is faster and more scalable than the Dmitry Vyukov's distributed reader-writer mutex and it beats the classical Seqlock because it avoids livelock when there is many writers.
Please look at the "test.pas" example that i have included and you will notice that the reader side have to be used in a transactional mode with a repeat-until loop, like this:
repeat
t.rlock(id1);
until t.runlock(id1);
it is like a transactional mode of a Seqlock and id1 must be of type "long" that must be imported from the SeqlockX unit. The writer side is easy to program , please look at the "test.pas" example and you will understand how to use my scalable SeqlockX .
My scalable SeqlockX can be used from accross processes and threads.
My new invention called scalable SeqlockX beats the following algorithm, it is more scalable and faster than the following algorithm of Dmitry Vyukov called Distributed reader-writer mutex:
And my new invention called scalable SeqlockX beats also the classical Seqlock (i have just explained to you why ...), read here about it:
http://en.wikipedia.org/wiki/Seqlock
And this new invention called scalable SeqlockX can replace RCU cause it is a powerful synchronization mechanism.
I will explain my new algorithm of a scalable SeqlockX that is very powerful...
I have arrived at my new algorithm by also reading many PhD papers and by looking first at the weakness of the following algorithm of Dmitry Vyukov's Reader-writer mutex:
look at the writer side of the Dmitry Vykov's algorithm , it is doing with every rwlock in each corresponding core:
for (i = 0; i != mtx->proc_count; i += 1)
pthread_rwlock_wrlock(&mtx-cell[i].mtx);
but this render this algorithm inefficient for the writer side , cause every pthread_rwlock_wrlock() call transfer many cache-lines between cores, so this will generate too much cache line transfers between cores when each writer executes distr_rw_mutex_wrlock() and this will become worse and worse when you add more and more cores, i mean this will transfer more and more cache-lines between cores with more and more cores... so this will render the writer side throughput slow and this is the weakness of this algorithm...
So this is why i have used a sequential lock mechanism in the write side of my new algorithm so that to minimize efficiently the cache-lines transfers, so if you look at my algorithm, it adds the following to the classical Seqlock:
On the writer side if the result of TSeqlockX.RUnlock() method is "false", that means the reader transaction has failed, so i am putting 1 into the FCount2^.FCount2 variable and if it's true, that means the reader transaction has succeeded, i am putting 0 into the FCount2^.FCount2 variable, and inside the writer side inside the TSeqlockX.WLock() method, i am doing this:
while FCount2^.FCount2<>0 do sleep(0);
So if FCount2^.FCount2 is equal to its initial value it will exit the while loop, and if FCount2^.FCount2 is equal to 1 it will wait for TSeqlockX.RUnlock() to return true, that means it will wait for a reader transaction to succeed, this part of the algorithm allows my algorithm to avoid livelock when there is many writers, and the rest of the algorithm look like a classical seqlock. That's all.
About the sequential consistency of my scalable distributed sequential lock, my algorithm works on x86 architecture and it compiles on Delphi and FreePascal compilers. I think my algorithm is correct cause look at the source code of the WLock() method, since i am using a Ticket spinlock with a proportional backoff on the writer side, the Ticket spinlock is using a "lock add" assembler instruction to increment a counter in the Enter() method of the ticket spinlock , and this "lock add" assembler instruction is a barrier for stores and loads on x86, and the loads of Fcount2^.fcount2 is not reordered with the previous atomic lock of the ticket spinlock, and the stores of FCount5^.fcount5 are not reordered with olher stores and older loads , so the WLock() method is sequentially consistent and correct, now look at the WUnlock() , we don't need an "sfence" cause stores and loads on WUnlock() are not reordered with older stores or older loads on x86 etc. so WUnlock() method is sequentially consistent and correct, now look at the RLock() method, it's also sequentially consistent., so all in all i think my algorithm is sequentially consistent and correct on x86 and on the Delphi and FreePascal compilers. So i think you can be confident cause i have reasoned correctly and i think my scalable SeqlockX algorithm is correct and it is a powerful synchronization mechanism that can replace RCU and that can replace classical Seqlock cause it beats Seqlock.
You have to know also that the first parameter of the constructor is the name that identifies the scalable SeqlockX object accross processes bounderies.
Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/
Operating Systems: Windows, Mac OSX , Linux on x86.
Required FPC switches: -O3 -Sd -dFPC -dFreePascal
-Sd for delphi mode....
Required Delphi switches: -$H+ -DDelphi
{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems
{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems