ES proposal: Shared memory and atomics


The ECMAScript proposal “Shared memory and atomics” by Lars T. Hansen has reached stage 4 this week and will be part of ECMAScript 2017. It introduces a new constructor SharedArrayBuffer and a namespace object Atomics with helper functions. This blog post explains the details.






Parallelism vs. concurrency


Before we begin, let’s clarify two terms that are similar, yet distinct: “parallelism” and “concurrency”. Many definitions for them exist; I’m using them as follows:



  • Parallelism (parallel vs. serial): execute multiple tasks simultaneously

  • Concurrency (concurrent vs. sequential): execute several tasks during overlapping periods of time (and not one after another).


Both are closely related, but not the same:



  • Parallelism without concurrency: single instruction, multiple data (SIMD). Multiple computations happen in parallel, but only a single task (instruction) is executed at any given moment.

  • Concurrency without parallelism: multitasking via time-sharing on a single-core CPU.


However, it is difficult to use these terms precisely, which is why interchanging them is usually not a problem.


Models of parallelism

Two models of parallelism are:



  • Data parallelism: The same piece of code is executed several times in parallel. The instances operate on different elements of the same dataset. For example: MapReduce is a data-parallel programming model.



  • Task parallelism: Different pieces of code are executed in parallel. Examples: web workers and the Unix model of spawning processes.





A history of JS parallelism


  • JavaScript started as being executed in a single thread. Some tasks could be performed asynchronously: browsers usually ran those tasks in separate threads and later fed their results back into the single thread, via callbacks.



  • Web workers brought task parallelism to JavaScript: They are relatively heavweight processes. Each worker has its own global environment. By default, nothing is shared. Communication between workers (or between workers and the main thread) evolved:



    • At first, you could only send and receive strings.

    • Then, structured cloning was introduced: copies of data could be sent and received. Structured cloning works for most data (JSON data, Typed Arrays, regular expressions, Blob objects, ImageData objects, etc.). It can even handle cyclic references between objects correctly. However, error objects, function objects and DOM nodes cannot be cloned.

    • Transferables move data between workers: the sending party loses access as the receiving party gains access to data.



  • Computing on GPUs (which tend to do data parallelism well) via WebGL: It’s a bit of a hack and works as follows.



    • Input: your data, converted into an image (pixel by pixel).

    • Processing: OpenGL pixel shaders can perform arbitrary computations on GPUs. Your pixel shader transforms the input image.

    • Output: again an image that you can convert back to your kind of data.



  • SIMD (low-level data parallelism): is supported via the ECMAScript proposal SIMD.js. It allows you to perform operations (such as addition and square root) on several integers or floats at the same time.



  • PJS (codenamed River Trail): the plan of this ultimately abandoned project was to bring high-level data parallelism (think map-reduce via pure functions) to JavaScript. However, there was not enough interest from developers and engine implementers. Without implementations, one could not experiment with this API, because it can’t be polyfilled. On 2015-01-05, Lars T. Hansen announced that an experimental implementation was going to be removed from Firefox.




The next step: SharedArrayBuffer

What’s next? For low-level parallelism, the direction is quite clear: support SIMD and GPUs as well as possible. However, for high-level parallelism, things are much less clear, especially after the failure of PJS.


What is needed is a way to try out many approaches, to find out how to best bring high-level parallelism to JavaScript. Following the principles of the extensible web manifesto, the proposal “shared memory and atomics” (a.k.a. “Shared Array Buffers”) does so by providing low-level primitives that can be used to implement higher-level constructs.



Shared Array Buffers

Shared Array Buffers are a primitive building block for higher-level concurrency abstractions. They allow you to share the bytes of a SharedArrayBuffer object between multiple workers and the main thread (the buffer is shared, to access the bytes, wrap it in a Typed Array). This kind of sharing has two benefits:



  • You can share data between workers more quickly.

  • Coordination between workers becomes simpler and faster (compared to postMessage()).


Creating and sending a Shared Array Buffer


// main.js

const worker = new Worker('worker.js');

// To be shared
const sharedBuffer = new SharedArrayBuffer( // (A)
10 * Int32Array.BYTES_PER_ELEMENT); // 10 elements

// Share sharedBuffer with the worker
worker.postMessage({sharedBuffer}); // clone

// Local only
const sharedArray = new Int32Array(sharedBuffer); // (B)

You create a Shared Array Buffer the same way you create a normal Array Buffer: by invoking the constructor and specifying the size of the buffer in bytes (line A). What you share with workers is the buffer. For your own, local, use, you normally wrap Shared Array Buffers in Typed Arrays (line B).


Warning: Cloning a Shared Array Buffer is the correct way of sharing it, but some engines still implement an older version of the API and require you to transfer it:



worker.postMessage({sharedBuffer}, [sharedBuffer]); // transfer (deprecated)

In the final version of the API, transferring a Shared Array Buffer means that you lose access to it.


Receiving a Shared Array Buffer

The implementation of the worker looks as follows.



// worker.js

self.addEventListener('message', function (event) {
const {sharedBuffer} = event.data;
const sharedArray = new Int32Array(sharedBuffer); // (A)

// ···
});

We first extract the Shared Array Buffer that was sent to us and then wrap it in a Typed Array (line A), so that we can use it locally.


Accessing Shared Array Buffers

If you access shared memory like you access normal memory, you are facing two problems.


Problem 1: you may read intermediate results

Consider the following code where main.js and worker.js were set up like shown previously.



// Initialization before sharing the Array
sharedArray[0] = 1;

// main.js
sharedArray[0] = 2;

// worker.js
while (sharedArray[0] === 1) ; // (A)
console.log(sharedArray[0]); // (B)

Propagating shared state takes time, which is why the values we read in lines A and B may be neither 1 nor 2.


Problem 2: writes may be reordered

Let’s assume reading intermediate results is not an issue. Then you face the problem that readers (such as worker.js in the following example) cannot rely on writes (by main.js) happening in a deterministic order.



// Initialization before sharing the Array
sharedArray[0] = 1;

// main.js
sharedArray[0] = 2;
sharedArray[0] = 3;

// worker.js
let v;
while ((v = sharedArray[0]) === 1);
console.log(v); // either 2 or 3

worker.js cannot rely on main.js writing 2 first and then 3. The reason for that is that within the same thread, compilers (and even CPUs) may reorder writes if no (local) read depends on them.


How do we fix these two problems?


Solution: Atomics

You can’t use normal operations to read and write data to Shared Array Buffers, you need to use the functions provided via the namespace object Atomics. For example, if we rewrite the first of the previous two code fragments, we get:



// Initialization before sharing the Array
Atomics.store(sharedArray, 0, 1);

// main.js
Atomics.store(sharedArray, 0, 2);

// worker.js
while (Atomics.load(sharedArray, 0) === 1) ;
console.log(Atomics.load(sharedArray, 0)); // 2

Atomics ensures two things:



  • Its operations are like database transactions in that they happen atomically: all of the writing is done in a single step; you can’t observe intermediate states.



  • Additionally, the order of writes is fixed; they will never be reordered.




Shared Array Buffers and the run-to-completion semantics of JavaScript

JavaScript has so-called run-to-completion semantics: every function can rely on not being interrupted by another thread until it is finished. Functions become transactions and can perform complete algorithms without anyone seeing the data they operate on in an intermediate state.


Shared Array Buffers break run to completion (RTC): data a function is working on can be changed by another thread during the runtime of the function. However, code has complete control over whether or not this violation of RTC happens: if it doesn’t use Shared Array Buffers, it is safe.


This is loosely similar to how async functions violate RTC. There, you opt into a blocking operation via the keyword await.


Shared Array Buffers and asm.js and WebAssembly

Shared Array Buffers enable emscripten to compile pthreads to asm.js. Quoting an emscripten documentation page:



[Shared Array Buffers allow] Emscripten applications to share the main memory heap between web workers. This along with primitives for low level atomics and futex support enables Emscripten to implement support for the Pthreads (POSIX threads) API.



That is, you can compile multithreaded C and C++ code to asm.js.


Discussion on how to best bring multi-threading to WebAssembly is ongoing. Given that web workers are relatively heavyweight, it is possible that WebAssembly will introduce lightweight threads. You can also see that threads are on the roadmap for WebAssembly’s future.


Sharing data other than integers

At the moment, only Arrays of integers (up to 32 bits long) can be shared. That means that the only way of sharing other kinds of data is by encoding them as integers. Tools that may help include:



  • TextEncoder and TextDecoder: The former converts strings to instances of Uint8Array. The latter does the opposite.

  • stringview.js: a library that handles strings as arrays of characters. Uses Array Buffers.



  • FlatJS: enhances JavaScript with ways of storing complex data structures (structs, classes and arrays) in flat memory (ArrayBuffer and SharedArrayBuffer). JavaScript+FlatJS is compiled to plain JavaScript. JavaScript dialects (TypeScript etc.) are supported.



  • TurboScript: is a JavaScript dialect for fast parallel programming. It compiles to asm.js and WebAssembly.




Eventually, there will probably be additional – higher-level – mechanisms for sharing data. And experiments will continue to figure out what these mechanisms should look like.


How much faster is code that uses Shared Array Buffers?

Lars T. Hansen has written two implementations of the Mandelbrot algorithm (as documented in his article “A Taste of JavaScript’s New Parallel Primitives” where you can try them out online): A serial version and a parallel version that uses multiple web workers. For up to 4 web workers (and therefore processor cores), speed-up improves almost linearly, from 6.9 frames per seconds (1 web worker) to 25.4 frames per seconds (4 web workers). More web workers bring additional performance improvements, but more modest ones.


Hansen notes that the speed-ups are impressive, but going parallel comes at the cost of the code being more complex.



Example

Let’s look at a more comprehensive example. Its code is available on GitHub, in the repository shared-array-buffer-demo. And you can run it online.


Using a shared lock

In the main thread, we set up shared memory so that it encodes a closed lock and send it to a worker (line A). Once the user clicks, we open the lock (line B).



// main.js

// Set up the shared memory
const sharedBuffer = new SharedArrayBuffer(
1 * Int32Array.BYTES_PER_ELEMENT);
const sharedArray = new Int32Array(sharedBuffer);

// Set up the lock
Lock.initialize(sharedArray, 0);
const lock = new Lock(sharedArray, 0);
lock.lock(); // writes to sharedBuffer

worker.postMessage({sharedBuffer}); // (A)

document.getElementById('unlock').addEventListener(
'click', event => {
event.preventDefault();
lock.unlock(); // (B)
});

In the worker, we set up a local version of the lock (whose state is shared with the main thread via a Shared Array Buffer). In line B, we wait until the lock is unlocked. In lines A and C, we send text to the main thread, which displays it on the page for us (how it does that is not shown in the previous code fragment). That is, we are using self.postMessage() much like console.log() in these two lines.



// worker.js

self.addEventListener('message', function (event) {
const {sharedBuffer} = event.data;
const lock = new Lock(new Int32Array(sharedBuffer), 0);

self.postMessage('Waiting for lock...'); // (A)
lock.lock(); // (B) blocks!
self.postMessage('Unlocked'); // (C)
});

It is noteworthy that waiting for the lock in line B stops the complete worker. That is real blocking, which hasn’t existed in JavaScript until now (await in async functions is an approximation).


Implementing a shared lock

Next, we’ll look at an ES6-ified version of a Lock implementation by Lars T. Hansen that is based on SharedArrayBuffer.


In this section, we’ll need (among others) the following Atomics function:



  • Atomics.compareExchange(ta : TypedArray<T>, index, expectedValue, replacementValue) : T
    If the current element of ta at index is expectedValue, replace it with replacementValue. Return the previous (or unchanged) element at index.


The implementation starts with a few constants and the constructor:



const UNLOCKED = 0;
const LOCKED_NO_WAITERS = 1;
const LOCKED_POSSIBLE_WAITERS = 2;

// Number of shared Int32 locations needed by the lock.
const NUMINTS = 1;

class Lock {

/**
* @param iab an Int32Array wrapping a SharedArrayBuffer
* @param ibase an index inside iab, leaving enough room for NUMINTS
*/
constructor(iab, ibase) {
// OMITTED: check parameters
this.iab = iab;
this.ibase = ibase;
}

The constructor mainly stores its parameters in instance properties.


The method for locking looks as follows.



/**
* Acquire the lock, or block until we can. Locking is not recursive:
* you must not hold the lock when calling this.
*/
lock() {
const iab = this.iab;
const stateIdx = this.ibase;
var c;
if ((c = Atomics.compareExchange(iab, stateIdx, // (A)
UNLOCKED, LOCKED_NO_WAITERS)) !== UNLOCKED) {
do {
if (c === LOCKED_POSSIBLE_WAITERS // (B)
|| Atomics.compareExchange(iab, stateIdx,
LOCKED_NO_WAITERS, LOCKED_POSSIBLE_WAITERS) !== UNLOCKED) {
Atomics.wait(iab, stateIdx, // (C)
LOCKED_POSSIBLE_WAITERS, Number.POSITIVE_INFINITY);
}
} while ((c = Atomics.compareExchange(iab, stateIdx,
UNLOCKED, LOCKED_POSSIBLE_WAITERS)) !== UNLOCKED);
}
}

In line A, we change the lock to LOCKED_NO_WAITERS if its current value is UNLOCKED. We only enter the then-block if the lock is already locked (in which case compareExchange() did not change anything).


In line B (inside a do-while loop), we check if the lock is locked with waiters or not unlocked. Given that we are about to wait, the compareExchange() also switches to LOCKED_POSSIBLE_WAITERS if the current value is LOCKED_NO_WAITERS.


In line C, we wait if the lock value is LOCKED_POSSIBLE_WAITERS. The last parameter, Number.POSITIVE_INFINITY, means that waiting never times out.


After waking up, we continue the loop if we are not unlocked. compareExchange() also switches to LOCKED_POSSIBLE_WAITERS if the lock is UNLOCKED. We use LOCKED_POSSIBLE_WAITERS and not LOCKED_NO_WAITERS, because we need to restore this value after unlock() temporarily set it to UNLOCKED and woke us up.


The method for unlocking looks as follows.




/**
* Unlock a lock that is held. Anyone can unlock a lock that
* is held; nobody can unlock a lock that is not held.
*/
unlock() {
const iab = this.iab;
const stateIdx = this.ibase;
var v0 = Atomics.sub(iab, stateIdx, 1); // A

// Wake up a waiter if there are any
if (v0 !== LOCKED_NO_WAITERS) {
Atomics.store(iab, stateIdx, UNLOCKED);
Atomics.wake(iab, stateIdx, 1);
}
}

// ···
}

In line A, v0 gets the value that iab[stateIdx] had before 1 was subtracted from it. The subtraction means that we go (e.g.) from LOCKED_NO_WAITERS to UNLOCKED and from LOCKED_POSSIBLE_WAITERS to LOCKED.


If the value was previously LOCKED_NO_WAITERS then it is now UNLOCKED and everything is fine (there is no one to wake up).


Otherwise, the value was either LOCKED_POSSIBLE_WAITERS or UNLOCKED. In the former case, we are now unlocked and must wake up someone (who will usually lock again). In the latter case, we must fix the illegal value created by subtraction and the wake() simply does nothing.


Conclusion for the example

This gives you a rough idea how locks based on SharedArrayBuffer work. Keep in mind that multithreaded code is notoriously difficult to write, because things can change at any time. Case in point: lock.js is based on a paper documenting a futex implementation for the Linux kernel. And the title of that paper is “Futexes are tricky” (PDF).


If you want to go deeper into parallel programming with Shared Array Buffers, take a look at synchronic.js and the document it is based on (PDF).



The API for shared memory and atomics

SharedArrayBuffer

Constructor:



  • new SharedArrayBuffer(length)
    Create a buffer for length bytes.


Static property:



  • get SharedArrayBuffer[Symbol.species]
    Returns this by default. Override to control what slice() returns.


Instance properties:



  • get SharedArrayBuffer.prototype.byteLength()
    Returns the length of the buffer in bytes.



  • SharedArrayBuffer.prototype.slice(start, end)
    Create a new instance of this.constructor[Symbol.species] and fill it with the bytes at the indices from (including) start to (excluding) end.




Atomics

The main operand of Atomics functions must be an instance of Int8Array, Uint8Array, Int16Array, Uint16Array, Int32Array or Uint32Array. It must wrap a SharedArrayBuffer.


All functions perform their operations atomically. The ordering of store operations is fixed and can’t be reordered by compilers or CPUs.


Loading and storing

  • Atomics.load(ta : TypedArray<T>, index) : T
    Read and return the element of ta at index.



  • Atomics.store(ta : TypedArray<T>, index, value : T) : T
    Write value to ta at index and return value.



  • Atomics.exchange(ta : TypedArray<T>, index, value : T) : T
    Set the element of ta at index to value and return the previous value at that index.






  • Atomics.compareExchange(ta : TypedArray<T>, index, expectedValue, replacementValue) : T
    If the current element of ta at index is expectedValue, replace it with replacementValue. Return the previous (or unchanged) element at index.


Simple modification of Typed Array elements

Each of the following functions changes a Typed Array element at a given index: It applies an operator to the element and a parameter and writes the result back to the element. It returns the original value of the element.



  • Atomics.add(ta : TypedArray<T>, index, value) : T
    Perform ta[index] += value and return the original value of ta[index].



  • Atomics.sub(ta : TypedArray<T>, index, value) : T
    Perform ta[index] -= value and return the original value of ta[index].



  • Atomics.and(ta : TypedArray<T>, index, value) : T
    Perform ta[index] &= value and return the original value of ta[index].



  • Atomics.or(ta : TypedArray<T>, index, value) : T
    Perform ta[index] |= value and return the original value of ta[index].



  • Atomics.xor(ta : TypedArray<T>, index, value) : T
    Perform ta[index] ^= value and return the original value of ta[index].




Waiting and waking

Waiting and waking requires the parameter ta to be an instance of Int32Array.



  • Atomics.wait(ta: Int32Array, index, value, timeout=Number.POSITIVE_INFINITY) : ('not-equal' | 'ok' | 'timed-out')
    If the current value at ta[index] is not value, return 'not-equal'. Otherwise go to sleep until we are woken up via Atomics.wake() or until sleeping times out. In the former case, return 'ok'. In the latter case, return 'timed-out'. timeout is specified in milliseconds. Mnemonic for what this function does: “wait if ta[index] is value”.



  • Atomics.wake(ta : Int32Array, index, count)
    Wake up count number of workers who are waiting at index of ta.




Miscellaneous

  • Atomics.isLockFree(size)
    This function lets you ask the JavaScript engine if operands with the given size (in bytes) can be manipulated without locking. That can inform algorithms whether they want to rely on built-in primitives (compareExchange() etc.) or use their own locking. Atomics.isLockFree(4) always returns true, because that’s what all currently relevant supports.



FAQ

What browsers support Shared Array Buffers?

At the moment, I’m aware of:



  • Firefox (50.1.0+): go to about:config and set javascript.options.shared_memory to true

  • Safari Technology Preview (Release 21+): enabled by default.

  • Chrome Canary (58.0+): There are two ways to switch it on.

    • Via chrome://flags/ (“Experimental enabled SharedArrayBuffer support in JavaScript”)

    • --js-flags=--harmony-sharedarraybuffer --enable-blink-feature=SharedArrayBuffer





Further reading

More information on Shared Array Buffers and supporting technologies:



Other JavaScript technologies related to parallelism:



Background on parallelism:



  • Concurrency is not parallelism” by Rob Pike [Pike uses the terms “concurrency” and “parallelism” slightly differently than I do in this blog post, providing an interesting complementary view]



Comments

Popular posts from this blog

Steve Lopez and the Importance of Newspapers

Ideas for fixing unconnected computing

Omar to kill me