Fraser and Harris’ Multiple Compare and Swap

Keir Fraser and Tim Harris’ Multiple Compare and Swap packs a lot of shared memory programming techniques in a page or two of code. It’s a quick way to get a jump start on lock-free programming without trudging through a stack of papers.

That and its just a neat piece of code.

What’s an MCAS?

Compare and swap compares the value at an address to a expected value, then copies in a new value if the value at the address matches the expected value. And it does this atomically.

atomically CAS(word *address, word expected_value, word new_value){
    if(*address == expected) *address = new_value;
}

A multiple compare and swap compares all the address values to all of the expected values, and if all of them match, it copies in all of the new values.

atomically MCAS(word *addresses[],
  word expected_values[], word new_values[], int N) {
    // Compare
    for (int i = 0; i < N ; i++ ){
        if(*addresses[i] != expected_values[i]) return FALSE;
    }
    // Swap
    for(int i = 0; i < N; i++){
        *addresses[i] = new_values[i];
    }
    return TRUE;
}

Many multicore processors provide compare and swap, but multiple compare and swap is a lot easier to use when programming. To use it, MCAS will be built using compare and swap, and regular reads and writes.

This MCAS depends on being able to tag values. In the original, it’s assumed that all the values are word-aligned pointers, so the last two bits can be used as tags. The extra pointer syntax is distracting though so I’ve removed it and made the MCAS look like it operates on any value.

Lock-freedom

The term lock-free is confusing because its a technical definition that means more than you would guess. Roughly, a lock-free implementation has to guarantee that some operation must complete after some finite number of steps is taken by the system.

A regular data structure with a big lock around it doesn’t satisfy this definition. If one thread starts an operation, acquires the lock, then goes to sleep, then another thread starts an operation on the same data structure, then this second thread can take any number of steps without any operation getting completed 1.

Here are two strategies you can use to make a lock-free data structure.

1. Do most of the work of the operation where no other thread can interfere, then use one atomic operation to “commit” it2.

2. Do the work of the operation in a way that allows any other thread to help complete it if the initial thread happens to sleep.

The MCAS uses strategy 2, using descriptors to allow helping behaviour.

Using Descriptors

Descriptors describe the state of the operation in progress in a way that allow other threads to complete the operation.

Say you were working in a bank and had to make changes to an account, but another clerk had half completed some changes to the account, then went on vacation. If enough details had been left behind you might be able to complete those changes on your own, and then get your own work done.

A descriptor is a data structure that holds the details that are needed.

The MCAS implementation has two phases for its descriptor to manage:

1. Compare: Acquire all the locations needed for the MCAS in memory address order to avoid deadlock, comparing the values in the locations to the expected values, failing if a value in a location does not match its expected value.

2. Release: If all locations were acquired with successful compare, then release the locations, swapping in the required new values. If the compare stage failed, release to locations, swapping in the expected values, which are equal to the original values in the acquired locations.

The descriptor needs to remember what locations are acquired and what status the operation is currently in.

typedef struct{
    word status of {UNDECIDED =0, SUCCESSFUL=1, FAILED=2};
    int N;
    word *addresses[MAX_N];
    word expected_values[MAX_N], new_values[MAX_N];
} mcas_descriptor;

The locations acquired by the descriptor are tracked by swapping the descriptor’s address into the desired locations. This also lets other threads get the descriptor in order to help complete the operation.

Here’s an “artist’s rendering” of the process.

Now the MCAS just needs to create the descriptor and call a function to complete the descriptor’s operation.

The MCAS arguments are sorted so that the acquires will happen in address order.

bool MCAS(word *addresses[],
  word expected_values[], word new_values[], int N){
    mcas_descriptor *d = new mcas_descriptor(addresses,
        expected_values, new_values, N);
    d->status = UNDECIDED;
    sort_addresses(d);
    return MCASComplete(d);
}

Now we build MCASComplete(), the main descriptor completing function.

An atomic MCAS that is not lock-free

We’re first going to build MCASComplete() so that MCAS is atomic, then make it lock-free in the next section.

I’ve expanded out the original code in the paper and made it more pseudocode than C to highlight some tricky points. The acquire phase tries to acquire the locations in order, shifting to the release phase when a compare has failed, or all the acquisitions have been completed successfully.

bool MCASComplete(mcas_descriptor *d){
    for(int i:= 0; i < d->N; i++){
        while(TRUE){
            switch(MCASAcquire(d,i)){
                 case ACQUIRE_SUCCESS:
                     break;
                 case ACQUIRE_RETRY:
                     continue;
                 case ACQUIRE_FAILED:
                     return MCASRelease(d, FAILED)
                 // Can't really do this, but you get the point.
                 case (ACQUIRE_BLOCKED, v):
                     //Another MCAS, with descriptor *v is at this address
            }
        }
    }
    return MCASRelease(d, SUCCESSFUL)
}

MCASAcquire() attempts to acquire a location and figures out if that attempt succeeded or failed.

acquire_result MCASAcquire(mcas_descriptor *d, int i){
    CAS(d->addresses[i], d->expected_values[i]  ,d);
    v := *d->addresses[i];

    // v has to be read anyway, chance to try again
    // Also may compensate for CAS that fails because of an interrupt
    if( v == d->expected_values[i] ) return ACQUIRE_RETRY;
    // Location acquired, descriptor installed
    if( v == d ) return ACQUIRE_SUCCESS;
    // v not another MCAS so it is a value and not the expected one
    if(not IsMCASDesc(v) ) return ACQUIRE_FAIL;
    return (ACQUIRE_BLOCKED,v);
}

IsMCASDesc() handles checking the tags of the values. It is crucial that no expected or new value is ever equal to a descriptor address since these are used for acquiring locations.

Once we’re done acquiring, we have to release.
The release phase needs to either replace the original values, or new values into the acquired addresses, depending on how the operation went.

bool MCASRelease(mcas_descriptor *d, word desired){
    success := desired;
    if(success == SUCCESSFUL){
        for(int i := 0; i < d ; i++ )
            CAS(d->address[i], d, d->new[i]);
        return TRUE;
    } else {
        // We only acquired locations that matched the expected value,
        // so to undo, just replace those expected values back.
        for(int i := 0; i < d ; i++ )
            CAS(d->address[i], d, d->expected[i]);
        return FALSE;
    }
}

We’re almost done! We have a complete atomic MCAS, but we still have to make it lock-free. Take another look through the code and run it through your head to see how it works in a single thread case.

For multiple threads, notice how the CAS operations act as locks, meaning operations block when they run into each other.

To remove the blocking, we have to use the helping ability that we’ve created the descriptors for.

Recursive Helping

To allow helping, we have to make it safe for multiple MCASComplete() calls to run at the same time. There are a lot of subtleties here, but you can get the basic idea by thinking about this:

  1. Assume that the expected values match the values at the addresses and that only MCASComplete() calls are running on the same descriptor3.
    • The acquire phases can be safely run at the same time.
    • The release phases can be safely run at the same time.
  2. When acquire and release phases mix, bad things can happen. What we need is a way to “turn off” all acquires once any release starts.

The first step to this is to set a flag in the descriptor once a release starts. CAS is used to do this so that we can guarantee that it is set exactly once.

bool MCASRelease(mcas_descriptor *d, word desired){
    //Try to commit status to the descriptor
    CAS(&d->status, UNDECIDED, desired);
    success := d->status;
    if(success == SUCCESSFUL){
        for(int i := 0; i < d ; i++ )
            CAS(d->address[i], d, d->new[i]);
        return TRUE;
    } else {
        // Only acquired locations that matched the expected value,
        // so to undo, just replace those expected values back.
        for(int i := 0; i < d ; i++ )
            CAS(d->address[i], d, d->expected[i]);
        return FALSE;
    }
}

To make this flag disable the acquire phases, we use a variant of CAS called Conditional CAS (CCAS).

atomically CCAS(word *address, word expectedv, word newv, word *condition){
    if( (*address == expectedv) and (*condition == 0) ) *address := newv;
    return x;
}

The details of how to build a CCAS are in the paper and thesis. The CCAS makes the acquire depend on the status of the descriptor. Changing the status of the descriptor to a non-zero value, SUCCESSFUL or FAILED, disables the acquire phase.

acquire_result MCASAcquire(mcas_descriptor *d, int i){
    CCAS(d->address[i], d->expected[i], d, &d->status);
    v := *d->a[i];
    if((v==e) and (d->status = UNDECIDED)) return ACQUIRE_RETRY;
    if(v == d) return ACQUIRE_SUCCESS;
    if(not IsMCASDesc(v) ) return ACQUIRE_FAIL;
    return (ACQUIRE_BLOCKED,v);
}

It is now safe to run any number of MCASComplete() calls at the same time, any time, on the same descriptor. The remaining problem is this bit of code in MCASComplete:

    case (ACQUIRE_BLOCKED, v):
        //Another MCAS, with descriptor *v is here

When the acquire hits another MCAS that has already acquired that location, we just continue the loop and retry again. If the thread that started that blocking MCAS has gone to sleep, our thread will endlessly loop and retry.

To get lock-freedom, we guarantee that anytime we encounter such an MCAS, we will help it.

case (ACQUIRE_BLOCKED, v):
    //Another MCAS, with descriptor *v is here
    MCASComplete(v)

The problem is that helping is expensive.

If 8 processors try to MCAS the same location at once, one will win the first spot, then it is possible that all the other 7 duplicate all the work trying to help. Each redundant help run potentially uses 2N expensive CCAS instructions.

This is worse than an efficient locking implementation, where the other threads sleep or perform very cheap operations until it is their turn. So lock-freedom does not always mean efficient in the way that real-time does not always mean efficient.

A common solution to contention problems is exponential backoff. It would be a pretty neat project to benchmark the basic helping algorithm vs. exponential backoff vs. no helping.

case (ACQUIRE_BLOCKED, v):
    //Another MCAS, with descriptor *v is here
    if(sleep_on_block){
        sleep(backoff);
        backoff = backoff * 2;
        sleep_on_block := FALSE;
    } else {
        MCASComplete(v);
        sleep_on_block := TRUE;
    }

And we’re done! An atomic, lock-free MCAS that can be implemented on many different architectures.

Where to from here

We are store descriptors in data locations, which should never be returned as data. This means that we also need a special MCASRead() to read values from MCAS-able locations. Since CCAS also uses its own descriptors in its implementation, it also needs a corresponding CCASRead() operation.

We assumed that descriptors are never reused. This makes sure that when we turn off the acquire phase, it is permanently turned off, and when a location is released, there is no way to put that descriptor back into that location, so other release phases will never change that location.

Fraser’s implementation uses reference counting to reclaim descriptors, which has to be carefully done to work concurrently.

The pseudocode has to be translated into a mix of C and assembly to work properly. The Practical lock-free data structures page has Fraser and Harris’ source code.

Using descriptors with acquire and release phases is a powerful technique that has a lot of applications. Fraser and Harris’ paper Concurrent Programming without Locks and Fraser’s thesis Practical Lock Freedom show you how this technique can be used to make data structures, and software transactional memories.

Mellor-Crummey and Scott’s paper Algorithms for scalable synchronization on shared-memory multiprocessors explores a lot of powerful techniques for synchronizing threads. A carefully done, practical introduction to high performance shared memory programming.

Thanks for reading this not-so-short intro, I hope this gets people thinking of more ways to program concurrently.

Footnotes

  1. This is where it matters how you count “steps”. If you say that a thread that is waiting for a lock cannot take any steps, then you break the spirit of the definition.
  2. The classic example of this is from Herlihy’s work. Have a purely functional data structure and a pointer to it. To perform an operation, grab the pointer, do the functional update, then CAS the new pointer in. Lock-free but can be very inefficient.
  3. Then think about what can go wrong with acquires running at the same time when another MCAS comes in.

Comments are closed.