Pessimal atomics in Ogre
Posted: Wed Nov 16, 2016 8:48 am
Some of the concurrency primitives in Ogre, especially the atomic operations (i.e., AtomicScalar<T>), are implemented in some pessimal ways.
When using atomic operations for things like thread-safe reference counts, as they are in Ogre's SharedPtr, we need two sets of guarantees. The obvious one is atomicity, i.e., changes from one thread must be seen "all at once" in other threads. Less obviously, atomic operations have to establish ordering guarantees for the reads and writes around them. Both the compiler and the actual CPU hardware can - and will - reorder instructions for performance reasons. One of the atomic operations' jobs is to insert memory barriers as needed to prevent reorderings that could make the program run incorrectly.
C++11 (and C11, Rust, and almost every other modern systems programming language) defines a memory model for reasoning about these problems. The model provides the following orderings:
- "Sequentially consistent" operations ensure all threads in the program "see" a single, consistent order of reads and writes. They are the easiest to reason about, but can also be the slowest.
- "Relaxed" operations are on the other side of the spectrum - they provide atomicity, but no ordering.
- In between these two extremes are "acquire" and "release" operations. Storing a value in thread A with release semantics and loading it in thread B with acquire semantics causes thread B to "see" all writes performed by A before the store.
All of Ogre's atomic operations (provided they don't fall back to using a mutex) are sequentially consistent. No performance is sacrificed on x86 or x86_64, since they're architectures with fairly strong ordering. Both a relaxed addition and a sequentially consistent addition translate into a single instruction:
Code:
lock inc DWORD PTR [rdi]
and, similarly, sequentially consistent subtraction and acquire-release subtractions turn into the same thing:
Code:
mov eax,0xffffffff
lock xadd DWORD PTR [rdi],eax
The story totally changes on other architectures. Take ARM, which you get on almost any Android or iOS system. ARM is weakly ordered: unlike x86, its reads and writes provide no implicit ordering guarantees, so atomic operations need to start including memory barriers ("dmb" instructions on most ARM platforms). Here, a sequentially consistent addition looks like this:
Code:
dmb ; Memory barrier
loop:
ldrex r3, [r0] ; Load the value
adds r3, #1 ; Add one
strex r2, r3, [r0] ; Store the value if no other thread tried to change it.
cmp r2, #0
bne.n loop ; Try again if another thread did change it.
dmb
and a relaxed addition looks like this:
Code:
loop:
ldrex r3, [r0] ; Load the value
adds r3, #1 ; Add one
strex r2, r3, [r0] ; Store the value if no other thread tried to change it.
cmp r2, #0
bne.n loop ; Try again if another thread did change it.
Note that we can skip the memory barriers for relaxed operations. This can be a huge win for performance, since a memory barrier tells the core to stop whatever it's doing until it has synchronized all of its memory operations with the other cores and their caches. This can take hundreds of clock cycles. And, since Ogre (especially pre-2.0) uses lots of shared pointers, this could happen alarmingly often.
(You'll also note that ARM can't atomically read, modify, and write a value in a single instruction: the ldrex/strex pair is a load-linked/store-conditional, so the code has to loop until no other thread modified the value in between. This means that CAS loops, like the ones used to implement AtomicScalar<T>::operator+=, can be implemented more efficiently with a "weak" CAS, one that is allowed to fail spuriously.)
In summary, the atomic operations in Ogre need some rework, especially if it's going to take its cross-platform support seriously. The best solution would be to use the atomic types built into C++11, as current compilers are quite good at handling the nuances shown above. Barring that option, the next best thing would likely be Boost's Atomic. I know there are some concerns over Boost's performance, but it's not one monolithic library, and Boost.Atomic compiles down to the exact same assembly seen above. If neither of those options is acceptable, inline assembly for all the orderings is probably needed, and doing so requires a thorough understanding of the nuances of all the supported architectures.
As a final note, the actual scalars in AtomicScalar are marked 'volatile'. This is not helpful here, as explained by this LWN article.