Thursday, February 8, 2018

Can't protect a pointer without a store-load fence

This post is about memory reclamation algorithms in shared memory concurrency. That's things like Hazard Pointers, Drop the Anchor, epoch-based reclamation (User-space RCUs) and of course, Hazard Eras. There's an insight about memory reclamation, I had a long time ago, that I want to share. This insight probably isn't new, but it was to me when I thought about it  ;)

Let's start from the beginning:

Many years ago, Leslie Lamport made a short paper where he showed that to have mutual exclusion on shared memory system, you need at least one store-load fence. Not sure this is the right reference but it's close enough to this topic:
In other words, for two threads to have mutually exclusive access to a certain data, this data must be protected with an algorithm (a lock) and to implement this algorithm you will need at least one store and one load, whose visibility must not be re-ordered to the other thread.
It's probably easier to think about this in the context of Peterson's algorithm, or Dekker's, or one of our 10 algorithms which do the same kind of thing:

If you get in the context of these algorithms, you can think about the store as being a way of signaling the intent to acquire the lock, and the load as a check that the other thread doesn't already have the lock. If the effects or visibility of these two operations could be changed (by the compiler or CPU or cache-coherence system) then the algorithm would no longer be correct.
Placing a store-load fence between the store and the load prevents re-ordering from happening, making the algorithm correct, at the cost of adding synchronization.

Another analogy, is to think about a store as a mechanism to send a message to a certain centralized location, and the load as a mechanism to read the message at that centralized location. This is easier for us humans to reason about.
In the end it's all physics, transmitting information takes time, and it involves waiting which by definition is blocking or at least involves some kind of synchronization, in this case, the store-load fence.

As an aside, there are no explicit store-load fences on the C11/C++ memory model, therefore we must use a seq-cst store followed by a seq-cst load, or a relaxed store and relaxed load but with an atomic_thread_fence(memory_order_seq_cst) in the middle.

Waaaaiiit, you said you were going to talk about memory reclamation, but you're talking about mutual exclusion?!? You tricked us!!!

Not exactly, you see, the two-thread mutual exclusion problem and the memory reclamation problem have a lot in common.
A lock algorithm allows two threads to access an object, one a time. Each thread has the guarantee that when it is accessing the object, the other thread is not accessing the object.
A memory reclamation algorithm allows two threads (the reader and the reclaimer) to access an object, with the reclaimer having the guarantee that at a certain point in time the reader will never access the object again.
Notice the difference?

For mutual exclusion a thread needs to know if the other thread is not currently accessing the object, while for memory reclamation, a thread (the reclaimer) needs to know if the other thread (the reader) is not currently accessing the object nor will it ever try to access it again.

In summary, memory reclamation can be seen as mutual exclusion with a stronger requirement.
Which means, that if to do mutual exclusion we need a store-load fence, then we also need (at least) a store-load fence for doing memory reclamation.
To be more precise, a reader needs one store-load fence to signal its intent to protect an object to the reclaimer.

Now, I never proved this formally, this is just a sequence of logical deductions. So let me substantiate this bold statement with some examples.

Start by looking at figure 3 of our Hazard Eras paper:

Looking at the left side, to protect a pointer in Hazard Eras we do a seq-cst store of this pointer on a shared array, and then we check that the pointer is still valid. There is an implicit store-load fence placed in by the compiler when we use seq-cst atomic store followed by a load.
In Hazard Pointers, at least one store-load fence is needed for every object needing protection.

Looking at the right side, to protect a pointer with URCU or Epoch-based reclamation, we do a seq-cst store of an epoch and then load whatever pointer we want to use. Again here there is an implicit store-load fence so that the store of the epoch will not be re-ordered with any of the loads of the object that will be used in the read-side critical section.
In Epoch-based reclamation, a single store-load fence is needed to protect all objects, now and until the read-side critical section ends.

Looking at the middle code, to protect a pointer with Hazard Eras, we do a seq-cst store of an era, load whatever pointers we need, and then check that the era has not changed. Again here, there is an implicit store-load fence between the store of the era and the subsequent loads. In fact, there is also a constraint that the load of the pointer(s) and the following load of the era can not be re-ordered, but that's another story.

Do you see the "pattern" now?
All three classes of algorithms require one store-load fence.
The difference between the three is that:
- Hazard Pointers does one fence to protect one object;
- Epoch-based does one fence to protect all objects (until the end of the read-side critical section);
- Hazard Eras does one fence to protect all currently live objects;
and of course, Epoch-based reclamation is blocking, while Hazard Pointers and Hazard Eras are lock-free.

Having said this, there are some apparent exceptions to this, but I would argue that the synchronization of the store-load fence is always there, it just might be hidden from sight.
For example, we can use hardware transactional memory to do memory reclamation, and no explicit store-load fence is needed... because the hardware does the corresponding synchronization internally.
It's also possible to use optimistic memory reclamation, but then we're relaxing the premises of the memory reclamation problem I stated above and that may not always be possible.
There are also other approaches that make use of services provided by the Operating System, just like there are mutual exclusion locks that are OS provided. Ultimately, they will need synchronization or a store-load fence, maybe it's the kernel that's doing it instead of the user code explicitly, but the synchronization is there somewhere, hidden behind all the OS/Kernel layers.

The final insight is the following: If it does correct and safe memory reclamation in a shared memory concurrency system, it needs at least one store-load fence, or an equivalent form of synchronization!

Saturday, January 27, 2018

Optimization to Michael-Scott Persistent Lock-Free Queue

After making the previous post, I went back to the code of the Michael-Scott persistent queue and figured out there is one psync and one pwb calls that can be removed in the enqueue() method, namely, the ones before the return statement.

This is possible because, for enqueue() to return, it means the CAS on tail has been done, and although the value of tail may not be persisted, the CAS on tail guarantees that the value of the node->next is persisted. This means that if a crash occurs (in the scenario shown on the previous post) and a_is_persisted is persisted and the change to tail occurring on the enqueue is not, it's still ok because the node->next is persisted, which will allow the recover() of the queue to advance the tail and persist it as well.
Ok I know, that sentence was long and confusing, but if I were to make a formal proof, it would be waaayyy more confusing  ;)

Unfortunately, for the dequeue(), no such trick is possible on the head, therefore, we really do need the PWB(&head) and PSYNC() before returning from dequeue().

The source code on github has been updated:

The performance increase is barely noticeable so I won't even bother to show the new plot.

This is  nice illustration of the similarities between the C++/C11/Java sequentially-consistent atomics, and the pwb/pfence/psync model described here. Sure, we can do the algorithms with everything seq-cst, and everything full persistence fences, but to get better performance, we need to understand the algorithm to reduce the number of fences to a minimum, regardless of whether those fences are for synchronization (concurrency) or for persistency.

Wednesday, January 24, 2018

A Lock-Free Persistent Queue

Lock-free queues have been discovered many years ago, with the best-known and likely the simplest of all, being the one by Maged Michael and Michael Scott back in 1996

Nowadays, the trendy stuff is in mixing concurrency with persistence, and when I talk about persistence, I mean Non-Volatile Memory, or Storage-Class Memory, or NVDIMMs.

The simplest way to have a persistent queue is to take a transactional persistency engine, or PTM as I like to call it (Persistent Transactional Memory) and wrap a sequential queue implementation in it. An example is to use PMDK, which this CppCon presentation covers in some detail:

However, PMDK and the other log-based techniques for consistent persistence are all lock-based. What if you want a lock-free data structure, something as simple as a queue?
Well then, you would be stuck, or at least until now you would be stuck  :)

A few top names got together recently and decided to make a persistent lock-free queue, likely based on the MS queue. I haven't read the paper because it isn't out yet, there is only a brief announcement online:
Their paper has been accepted to PPoPP 2018 so hopefully in a month it will be available somewhere.
Now these names are well known in Concurrency, I mean, we're talking about Maurice Herlihy (invented wait-free and linearizability) and Erez Petrank (him and Alex Kogan made the first MPMC wait-free queue), if these guys have decided to team up to make a queue, it's something worth noticing!

Although no lock-free persistent queue existed so far, since 2016 there was a kind of recipe provided by these other guys:
Notice that one of the authors is none other than Michael Scott, one of the authors of the original lock-free queue.
Their recipe to transform lock-free algorithms in persistent lock-free algorithms is simple and elegant, even if overkill for most usages.
On x86, you can think of a pwb as being a CLFLUSHOPT instruction, and the pfence or psync being an SFENCE instruction.
  • Add a pfence before every store-release and a pwb after;
  • After every load-acquire add a pwb and then a pfence;
  • Before and after every CAS/fetch_add/exchange add a pfence;
And that's it, it's so simple that even a compiler could do it for you... and probably one day a compiler will do it for you!

... or maybe it isn't so simple. Things are never as easy as they seem once you start to actual implement stuff.
You see, this automatic transformation works well in most lock-free cases (there are some for which it doesn't, but that's a subject for another post), but a lock-free queue is not just lock-free code, there is also the code in the constructor and the destructor, and that is sequential code.

Implicitly, the MS queue needs a head and tail (persistent) variables and they must be initialized to point to the same sentinel node.
We can't really expect this to happen magically in some atomic way. A failure may occur anywhere during the initialization, leaving these two variables in an inconsistent state, therefore, the initialization and de-initialization must follow a particular sequence, an algorithm.
In fact, this algorithm is quite complex, and I would say is the trickiest part of getting a correct persistent lock-free queue implementation.
Hopefully, that is what Maurice, Erez, Virendra and Michal will show in their paper next month at PPoPP 2018, but until then, here is my take on it, available on github:

Andreia and I discussed a bit about this and I've done a very preliminary implementation of a persistent lock-free queue.
First, we followed the transformation rules with pwb/pfence/psync, but they add too many fences. Andreia is awesome at reducing algorithms to their bare essentials and I gave some contribution as well, the end result being we got rid of most pfences.
This algorithm was designed such that on enqueue(), a successful CAS on ltail->next implies that the pwbs for newNode->item, newNode->next and tail have been done, and a successful CAS on tail means that the pwb on ltail->next has been done. This kind of happens-before implicit guarantee means the queue is always in (at worst) a semi-consistent state, which the next operation can safely recover from, without the need for an explicit recovery method.

Here is the code for enqueue(), and yes, it's just the MS algorithm plus some strategically placed persistence fences and pwbs  ;)
void enqueue(T* item, const int tid) {
    if (item == nullptr) throw std::invalid_argument("item can not be nullptr");
    Node* newNode = new Node(item);   // TODO: replace this with NVM allocator
    PWB(&newNode->next); // Just in case 'item' and 'next' are not on the same cache line
    while (true) {
        Node* ltail = hp->protectPtr(kHpTail, tail, tid);
        if (ltail == tail.load()) {
            Node* lnext = ltail->next.load();
            if (lnext == nullptr) {
                if (ltail->casNext(nullptr, newNode)) {
                    casTail(ltail, newNode);

            } else {
                casTail(ltail, lnext);

And here is the code for dequeue()
T* dequeue(const int tid) {
    Node* node = hp->protect(kHpHead, head, tid);
    while (node != tail.load()) {
        Node* lnext = hp->protect(kHpNext, node->next, tid);

        if (casHead(node, lnext)) {

            T* item = lnext->item;  

            hp->retire(node, tid); 
            return item;
        node = hp->protect(kHpHead, head, tid);
    return nullptr;                  // Queue is empty

The bold fonts show the added fences needed to guarantee persistency. In fact, this queue is not just durable, it is Durable Linearizable, which is (in my view) the easiest model to reason about for durability, as important to Persistence as Linearizability is important to Concurrency.

The reason the pfences were taken out is because we're assuming that CAS has persistent semantics similar to pfence that doesn't act on the load/store of the CAS itself, only on the other loads and stores. In other words, it's as if a CAS is equivalent to a:
  CAS()     // concurrent

The reason we assume this, is because on x86, LOCK instructions and read-modify-write instructions like CAS, ensure order for CLFLUSHOPT and CLWB (pwbs). For more details see Intel's manual for CLFLUSHOPT:

As for the pwb and psync before returning, they're not always needed but it helps to reason about in terms of composability.
The only way to observe effects from this queue is to call enqueue() or dequeue(), therefore, the next call to the same method will flush the cache line and persist it. However, if you want to do something like:
   a_is_persisted = true;

then the only way to guarantee correct ordering of a_is_persisted with the element 'a' actually being in the queue and persistent, is to have the pwb and a psync (or pfence) before returning from enqueue()/dequeue().

Adding the persistence fences and then reducing them to a minimum was the easy part. The though part is gluing that together with the constructor and destructor that recovers after failure.
Here is what the constructor and destructor look like:
PMichaelScottQueue() {
    recover();  // re-use the same code as the recovery method

~PMichaelScottQueue() {
    destructorInProgress = true;
    recover();  // Re-use the same code as in the recovery method

Simple huh?... not so fast, now we need to show the recover() method:
void recover() {
    if (destructorInProgress) {
        if (head.load(std::memory_order_relaxed) != nullptr) {
            while (dequeue(0) != nullptr); // Drain the queue
  , std::memory_order_relaxed);
            delete head.load(std::memory_order_relaxed);  // Delete the last node    // TODO: replace this with NVM deallocator
    hp = new HazardPointers<Node>{2, maxThreads};
    // If both head is null then a failure occurred during constructor
    if (head.load(std::memory_order_relaxed) == nullptr) {
        Node* sentinelNode = new Node(nullptr);    // TODO: replace this with NVM allocator, std::memory_order_relaxed);
    // If tail is null, then fix it by setting it to head
    if (tail.load(std::memory_order_relaxed) == nullptr) {, std::memory_order_relaxed);
    // Advance the tail if needed
    Node* ltail = tail.load(std::memory_order_relaxed);
    Node* lnext = ltail->next.load(std::memory_order_relaxed);
    if (lnext != nullptr) {, std::memory_order_relaxed);

Yep, now things are starting to get complicated, which I'm not a big fan of, but it's as good as I can get it  :(
Hopefully the approach which will be shown at PPoPP will be simpler than this.

About the constructor:
As long as the allocator returns a zeroed-out memory region, the 'head' and 'tail' will be nullptr even if there is a crash immediately at the start of the call to the constructor. If the allocator can't guarantee that the data is zero-ed out, then there is no 100% safe way to distinguish between a proper initialization and trash immediately after allocating the queue.
A crash occurring after the 'head' and 'tail' are made persistent (with nullptr) is always recoverable, although there are a few different cases:
- If the head is nullptr then the sentinel node must be allocated;
- If the head is non-null but tail is null, then the sentinel was allocated and assigned to head but not to tail;
- If both head and tail are non-null and tail->next is non-null then tail is not pointing to the last node and we need to advance tail;

About the destructor:
The destructor must first drain the queue to avoid leaking as much as possible. Then, it needs to de-allocate the last node and zero-out the head to make sure then in the event of a failure, the recovery operation will not try to "recover" the queue, therefore, we have a persistent variable named 'destructorInProgress' which is set before starting the destruction operation.
After destructorInProgress has been set to true and ordered with a pfence, we can clear head, and only then can we de-allocate the last node.

How fast is it?
Well, on DRAM emulating NVDIMMs, implementing pwb/pfence/psync as CLFLUSHOPT/SFENCE/SFENCE, we get some not too bad results when compared with the regular MS queue:

For the most attentive of you, this queue seems to have slightly better performance than the one shown in this brief announcement:
however, our implementation has integrated memory reclamation which may be acting as a kind of back-off... or maybe we have a smaller number of fences. Whatever the reason, when the other queue is out in a month I'll try to re-run the benchmark to compare them  ;)
Btw, our synthetic benchmark is very simplistic: do an enqueue, followed by a dequeue, and then repeat 10 million times. I'm not a big fan of this benchmark but it's what everybody uses in academic papers to measure queue performance, so that's what we use too.

One last note, all allocation and de-allocation operations in this queue are prone to leaking, if the failure occurs immediately before a de-allocation or immediately after an allocation. There is no way around this problem without transactions, and seen as we're trying to get lock-free progress, the transactional mechanism would have to be also lock-free, and there is no lock-free PTM published (yet).
I'm not the only one complaining about this. Paul McKenney points this out as being one of the fallacies in the "lock-free data structures are resilient" argument, typically touted as one of the advantages for lock-free (by myself included). Maurice Herlihy has some interesting stuff to say around that topic as well:

We'll talk more about transactions in a future post, for today that's all. In the meantime, have fun with the code:

Wednesday, December 20, 2017

Software Transactional Memory is Simple

Still super busy so I'll just post a light talk about an STM that is being used in production at AppNexus. It's not everyday you see production-use of STMs.

Tuesday, December 5, 2017

Maged Michael on Non-Blocking Synchronization and Memory Management

I've been busy with a lot of stuff, so here's just a quick post with a video from Maged Michael explaining non-blocking memory reclamation (and a bit about allocation as well).

Monday, October 9, 2017

Is Parallel Programming still hard?

CppCon 2017 is over and some of the talks are starting to trickle in.
One that is definitely worth watching is the one by Michael Wong, Paul McKenney and Maged Michael. If you don't know who these guys are then go and read this post.

Here are the two videos:

My favorite quote from the talk:
"It is possible to split atoms, but doing that spoils them for electronics use".

Nice intro to false sharing, store buffers and caches. All of those are hardware details that (unfortunately) are important in multi-threaded programming, and are rarely talked about.
About the Reader-Writer Lock slides (a subject I've dug deep into with Andreia), there is a much simpler way than using a TLS RW-Lock, namely the scalable reader-writer locks. More info can be seen on this post:

Let's see what the next talks will be about...

Wednesday, September 6, 2017

Dawn of the Universal Constructs

Historical Introduction

Some time ago I saw this nice presentation by Alexei Alexandrescu about Generic Locking.
He starts the talk by giving an introduction to how the "Concurrency Problem" has been attacked.

And what is the "Concurrency Problem" you might ask?
Well, remember this little issue about Moore's Law (on CPU clock speed) having stopped back in 2004 and that nowadays to get more performance you need to use multiple cores, which means using some form of concurrency? Yeah, that's the "Concurrency Problem"!
Sure, if you can write your application or algorithms in a map-reduce form, then you can parallelize it, but not all algorithms are inherently parallel to profit from this approach, and those are the ones we need to worry about.

As Alexei says in his presentation, in the beginning there were mutual exclusion locks, and those were pretty much the only way to deal with concurrency.
Then came reader-writer locks, lock-free and wait-free data structures, CSP, Actor Models, Software Transactional Memory, Hardware Transactional Memory and many others.
For a while, lock-free and wait-free data structures seemed to be the thing, but the fact that there wasn't a memory model to program them in, and that it was very hard to verify their correctness, and very hard to do efficient lock-free memory reclamation (wait-free is even harder), it caused people to try the other approaches. Nowadays, there is no clear winner, and maybe there never will be. Actor models have been gaining a lot of followers, with the whole Rx/Reactive movement based on it... I'm not sure they all realize that at the base of the actor model is the message-passing queue between the actors which is itself based on a lock-free queue.

Anyways, fast-forwarding to 2017, one of the upcoming CppCon talks which will definitly be worth watching, is the one by Paul McKennney, Maged Michael and Michael Wong:
As they mention in the abstract, there was a time when writing a simple loop seemed as perilous then as writing a lock-free data structure seems now.
I don't know what their talk will be about, but I do agree that most of the dangers in concurrent programming can be avoided by using better programming tools, and on that note, I would dare go so far as to say that we're entering "The Dawn of the Universal Constructs"!

What is a Universal Construct

Universal Constructs aren't a new thing. The first Wait-Free Universal Construct (WFUC) was invented by Maurice Herlihy back in 1993:
WFUCs are a construction that encapsulate an object (or data structure) which was written for single-threaded usage, and provide a way to call its methods with wait-free progress.
It's kind of like wrapping all the methods in an object with a call to lock()/unlock() of a mutual exclusion lock, except that a lock is blocking while WFUC are, well, wait-free!

In summary, a Wait-Free Universal Construct lets you write your favorite data structure as if it was "single-threaded code" and then wraps its methods to provide wait-free progress.
If you saw Alexei's presentation at the beginning of the post, you'll know that he has a cool way (in C++) of automatically exporting the methods of an object/data-structure wrapped with a read-lock or a write-lock depending on whether the methods are mutative or read-only. The same kind of approach can be done with WFUCs, which means that they are truly generic.

How come I never heard of this before?

Well, you kind of have heard about this (if you read this blog).
Left-Right for example, is a Universal Construct that provides wait-free progress for readers and blocking (starvation-free) progress for writers:
It's fast, it reclaims memory, it's "universal", but it's not completely wait-free, i.e. it's not wait-free for mutations.

There are truly WFUCs in the literature and we've even added (wait-free) memory reclamation to a few of them. The most famous being P-Sim by Panagiota Fatourou and Nikolaos Kallimanis:

But even the best of them still works on the COW principle, in the sense that it requires a full copy of the data structure for every mutation.
Yes, that's right, if we have a map with 1M keys/values, and we want to insert a new key (or remove an existing one), we have to make a full copy of the 1M nodes of the data structure.

Waaaaiiiit wwwhhhaaat? Did you just say that this thingy has to copy 1 million nodes every time I want to insert a new key/value in my hashmap?

Yes, I'm afraid that's how it works. It's not as bad as it looks if the data structure is small, but as soon as you start going to anything bigger than a few thousand nodes in the data structure, throughput is going to go down the drain. And cache locality is a big issue with COW.

What's on the horizon

I happen to have insider knowledge that it is possible to make a wait-free universal construct that is fast enough to be used in production, because it doesn't have to copy the entire data structure (except on very rare occasions). This new way of doing WFUCs is fast enough to be used in real-life applications, in fact, it's capable of beating hand-written lock-free and wait-free data structures.
Yes, it's that good. We call it CX and the paper is almost ready.
... did I say it does wait-free memory reclamation as well?  ;)

As soon as other researchers see what CX is capable of, I'm sure they will have their own ideas and improve upon it, thus bringing about a new age of production-capable WFUCs.

Going back to the Concurrency Problem, how does this change the way we program multi-threaded applications?

Based on my experience as a developer, software engineers rarely use an off-the-shelf data structure as is. We always add a little twist to it.
I don't know whether this happens due to us having big egos and wanting to give our little personal touch to everything we do (I'm generalizing here), or if it's simply because we're tinkerers and we like to see what's under the hood and tweak it to our own desire. the truth is, if we could use it as is then there wouldn't be a need to write all the software people write, there would be just one search engine, just one banking software, just one customer management software, etc.

Whatever the reason may be, the consequence is, that we always make changes to the data structures, and the problem with that is that lock-free and wait-free data structures will fall apart if they're touched. They have sequences of instructions that have to be done in a particular way, they have optimizations for relaxed memory orderings and subtle details, and memory reclamation, etc, etc. The temptation to touch and tinker a lock-free data structure is just too much, and we end up getting burned and then we say that "lock-free data structures suck", which is not true.
The algorithms behind these data structures are very subtle, and our human nature is to tinker with it, which is incompatible.

This is where Wait-Free Universal Constructs come in.
With an efficient WFUC, any software engineer without any knowledge of lock-free data structures (or atomics or the memory model) can just take a "single-threaded data structure", modify it to satisfy its needs (or ego) and then wrap it with WFUC and BOOOOM, out comes a fast and scalable implementation of its customized data structure, with wait-free progress.
Even more, another software engineer working on the code can later fix bugs or add new features to the data structure without breaking its correct functionality. Try doing that with a lock-free data structure, it's hard enough to get it right the first time, to add new functionality later is just too nuts.
As a software engineer, what more could I ask for?

No wonder that WFUCs are seen by researchers as the Holy Grail of concurrency, it's because they make it so easy to solve the Concurrency Problem in production.
Yes, I'm a fan boy of Universal Constructs, but the hype is called for and needed. This stuff doesn't solve all the concurrency problems, but it solves a a lot of them, enough to become a standard way of dealing with multi-threaded code, or at least for writing multi-threaded data structures.

Whatever happens, Universal Constructs are here to stay as one of the easiest tools to use when writing multi-threaded applications, and we all know that when it comes to multi-threaded programs, any help we can get is not enough  ;)