When using atomic operations for things like thread-safe reference counts, as Ogre's SharedPtr does, we need two sets of guarantees. The obvious one is atomicity: changes made in one thread must be seen "all at once" by other threads. Less obviously, atomic operations have to establish ordering guarantees for the reads and writes around them. Both the compiler and the actual CPU hardware can - and will - reorder instructions for performance reasons. One of an atomic operation's jobs is to insert memory barriers as needed to prevent reorderings that could make the program run incorrectly.
C++11 (and C11, Rust, and almost every other modern systems programming language) has adopted a memory model for discussing these problems. The model provides the following orderings:
- "Sequentially consistent" operations ensure all threads in the program "see" a single, consistent order of reads and writes. They are the easiest to reason about, but can also be the slowest.
- "Relaxed" operations are on the other side of the spectrum - they provide atomicity, but no ordering.
- In between these two extremes are "acquire" and "release" operations. Storing a value in thread A with release semantics and loading it in thread B with acquire semantics causes thread B to "see" all writes performed by A before the store.
All of Ogre's atomic operations (provided they don't fall back to using a mutex) are sequentially consistent. No performance is sacrificed on x86 or x86_64, since they're architectures with fairly strong ordering. Both a relaxed addition and a sequentially consistent addition translate into a single instruction:
Code:
lock inc DWORD PTR [rdi]
and, similarly, sequentially consistent subtraction and acquire-release subtractions turn into the same thing:
Code:
mov eax,0xffffffff
lock xadd DWORD PTR [rdi],eax
On ARM, however, the orderings genuinely differ. A sequentially consistent increment needs memory barriers on both sides:
Code:
dmb ; Memory barrier
loop:
ldrex r3, [r0] ; Load the value
adds r3, #1 ; Add one
strex r2, r3, [r0] ; Store the value if no other thread tried to change it.
cmp r2, #0
bne.n loop ; Try again if another thread did change it.
dmb
A relaxed increment, on the other hand, compiles to the same loop without the barriers:
Code:
loop:
ldrex r3, [r0] ; Load the value
adds r3, #1 ; Add one
strex r2, r3, [r0] ; Store the value if no other thread tried to change it.
cmp r2, #0
bne.n loop ; Try again if another thread did change it.
(You'll also note that ARM can't read, modify, and write a value in a single atomic instruction; the ldrex/strex pair has to loop until no other thread modified the value in between. This is why C++11 offers a "weak" compare-and-swap that is allowed to fail spuriously: a CAS loop, like the ones used to implement AtomicScalar<T>::operator+=, already retries on failure anyway, so using a weak CAS avoids hiding a second, nested loop inside each iteration.)
In summary, the atomic operations in Ogre need some rework, especially if the project is going to take its cross-platform support seriously. The best solution would be to use the atomic types built into C++11, as current compilers are quite good at handling the nuances shown above. Barring that option, the next best thing would likely be Boost.Atomic. I know there are some concerns over Boost's performance, but it's not one monolithic library, and Boost.Atomic compiles down to the exact same assembly seen above. If neither of those options is acceptable, inline assembly for all the orderings is probably needed, and writing that requires a thorough understanding of the nuances of every supported architecture.
As a final note, the actual scalars in AtomicScalar are marked 'volatile'. That doesn't help here, as explained by this LWN article.