Pessimal atomics in Ogre

mrkline
Gnoblar
Posts: 2
Joined: Wed Nov 16, 2016 6:54 am

Pessimal atomics in Ogre

Post by mrkline »

Some of the concurrency primitives in Ogre, especially the atomic operations (i.e., AtomicScalar<T>), are implemented in some pessimal ways.

When using atomic operations for things like thread-safe reference counts, as they are in Ogre's SharedPtr, we need two sets of guarantees. The obvious one is atomicity, i.e., changes from one thread must be seen "all at once" in other threads. Less obviously, atomic operations have to establish ordering guarantees for the reads and writes around them. Both the compiler and the actual CPU hardware can - and will - reorder instructions for performance reasons. One of the atomic operations' jobs is to insert memory barriers as needed to prevent reorderings that could make the program run incorrectly.

C++11 (like C11, Rust, and almost every other systems programming language) has adopted a memory model for reasoning about these problems. The model provides the following orderings:
  • "Sequentially consistent" operations ensure all threads in the program "see" a single, consistent order of reads and writes. They are the easiest to reason about, but can also be the slowest.
  • "Relaxed" operations are on the other side of the spectrum - they provide atomicity, but no ordering.
  • In between these two extremes are "acquire" and "release" operations. Storing a value in thread A with release semantics and loading it in thread B with acquire semantics causes thread B to "see" all writes performed by A before the store (see the sketch just below).
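
For concreteness, here's a minimal sketch of that acquire/release pairing in C++11 (the names are mine, not from Ogre):

Code: Select all

#include <atomic>
#include <cassert>

int payload = 0;
std::atomic<bool> ready(false);

void producer() {
    payload = 42;                                 // plain, non-atomic write
    ready.store(true, std::memory_order_release); // publishes the write above
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {  // pairs with the release store
        assert(payload == 42);                    // guaranteed to see the payload
    }
}
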
Getting back to why this matters for Ogre, let's discuss how these should be used for an atomic reference count. Adding to the count can be relaxed, since no actions need to be taken based on the resulting value. But when we decrement the count, we need acquire-release ordering. Without it, the reference count could hit zero and we could delete the object before preceding writes to that object occur! Herb Sutter explains it in more depth during his "<atomic> weapons" talk. (The entire talk is worth watching, but this particular example can be found around the 1:20:00 mark of Part 2.)
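
In C++11 terms, a reference count along those lines looks something like this (a hypothetical sketch, not Ogre's actual SharedPtr code):

Code: Select all

#include <atomic>

struct RefCounted {
    std::atomic<int> refs{1};
};

void addRef(RefCounted* p) {
    // Adding a reference can be relaxed: we only need atomicity,
    // since no action is taken based on the resulting value.
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

void release(RefCounted* p) {
    // Acquire-release: the thread that drops the count to zero must see
    // every write made to the object before each earlier release().
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete p;
    }
}
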

All of Ogre's atomic operations (provided they don't fall back to using a mutex) are sequentially consistent. No performance is sacrificed on x86 or x86_64, since they're architectures with fairly strong ordering. Both a relaxed addition and a sequentially consistent addition translate into a single instruction:

Code: Select all

lock inc DWORD PTR [rdi]

and, similarly, sequentially consistent and acquire-release subtractions turn into the same thing:

Code: Select all

mov  eax,0xffffffff
lock xadd DWORD PTR [rdi],eax
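
For reference, both x86 snippets above come from source along these lines (a sketch; the exact instruction selection, e.g. lock inc versus lock add, varies by compiler and flags):

Code: Select all

#include <atomic>

void increment(std::atomic<unsigned>& v) {
    // Relaxed or seq_cst: the same single locked instruction on x86.
    v.fetch_add(1, std::memory_order_relaxed);
}

unsigned decrement(std::atomic<unsigned>& v) {
    // Acquire-release or seq_cst: both become lock xadd on x86
    // when the result is used.
    return v.fetch_sub(1, std::memory_order_acq_rel) - 1;
}
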
The story totally changes on other architectures. Take ARM, which is what you'll find in almost any Android or iOS device. ARM is weakly ordered. Unlike x86, its reads and writes provide no implicit ordering guarantees, so atomic operations need to start including memory barriers ("dmb" instructions on most ARM platforms). Here, a sequentially consistent addition looks like this:

Code: Select all

dmb                ; Memory barrier
loop:
ldrex r3, [r0]     ; Load the value
adds  r3, #1       ; Add one
strex r2, r3, [r0] ; Store the value if no other thread tried to change it.
cmp   r2, #0
bne.n loop         ; Try again if another thread did change it.
dmb
and a relaxed addition looks like this:

Code: Select all

loop:
ldrex r3, [r0]     ; Load the value
adds  r3, #1       ; Add one
strex r2, r3, [r0] ; Store the value if no other thread tried to change it.
cmp   r2, #0
bne.n loop         ; Try again if another thread did change it.
Note that we can skip the memory barriers for relaxed operations. This can be a huge win for performance, since a memory barrier tells the core to stop whatever it's doing until it's synchronized all of its memory operations with all the other cores and their caches. This can take hundreds of clock cycles. And, since Ogre (especially pre-2.0) uses lots of shared pointers, this could happen alarmingly often.

(You'll also note that ARM can't atomically read, modify, and write a value, so it has to loop until no other thread is trying to modify the value. This means that CAS loops, like the ones used to implement AtomicScalar<T>::operator+=, can be implemented more efficiently with a "weak" CAS.)
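
In C++11, that maps onto compare_exchange_weak, which is allowed to fail spuriously and can therefore compile down to a single ldrex/strex pair. Here is a sketch of an operator+=-style loop (the memory orders are my choice, not Ogre's):

Code: Select all

#include <atomic>

// A CAS-loop addition in the style of AtomicScalar<T>::operator+=.
template <typename T>
T addAndFetch(std::atomic<T>& value, T amount) {
    T old = value.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, which lets it compile to a
    // bare ldrex/strex pair on ARM instead of a nested retry loop.
    while (!value.compare_exchange_weak(old, old + amount,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
        // On failure, 'old' was reloaded with the current value; just retry.
    }
    return old + amount;
}
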

In summary, the atomic operations in Ogre need some rework, especially if Ogre is going to take its cross-platform support seriously. The best solution would be to use the atomic types built into C++11, as current compilers are quite good at handling the nuances shown above. Barring that option, the next best thing would likely be Boost.Atomic. I know there are some concerns over Boost's performance, but Boost isn't one monolithic library, and Boost.Atomic compiles down to the exact same assembly seen above. If neither of those options is acceptable, inline assembly for all the orderings is probably needed, and writing it requires a thorough understanding of the nuances of all the supported architectures.
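
To give a flavor of the first option, a thin AtomicScalar-style wrapper over std::atomic might look like this (a rough sketch, not a drop-in patch; the orderings follow the reference-count reasoning above):

Code: Select all

#include <atomic>

// Sketch of an AtomicScalar<T> backed by std::atomic<T>.
template <typename T>
class AtomicScalar {
public:
    explicit AtomicScalar(T initial) : mValue(initial) {}

    T get() const { return mValue.load(std::memory_order_acquire); }
    void set(T v) { mValue.store(v, std::memory_order_release); }

    // Relaxed is enough for a pure count bump (see the refcount discussion).
    T operator++() { return mValue.fetch_add(1, std::memory_order_relaxed) + 1; }

    // Acquire-release so a decrement-to-zero can safely free the object.
    T operator--() { return mValue.fetch_sub(1, std::memory_order_acq_rel) - 1; }

private:
    std::atomic<T> mValue;
};
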

As a final note, the actual scalars in AtomicScalar are marked 'volatile'. That doesn't help here, as explained by this LWN article.
paroj
OGRE Team Member
Posts: 1993
Joined: Sun Mar 30, 2014 2:51 pm

Re: Pessimal atomics in Ogre

Post by paroj »

I think it would be sufficient to change __ATOMIC_SEQ_CST to __ATOMIC_ACQ_REL for the __atomic_* functions to get the described behavior.
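
Something along these lines, e.g. (a sketch against GCC's __atomic builtins; per your post, the increment could even be __ATOMIC_RELAXED):

Code: Select all

// 'mField' stands in for AtomicScalar's member.
long increment(long* mField) {
    // was: __atomic_add_fetch(mField, 1, __ATOMIC_SEQ_CST);
    return __atomic_add_fetch(mField, 1, __ATOMIC_ACQ_REL);
}

long decrement(long* mField) {
    // was: __atomic_sub_fetch(mField, 1, __ATOMIC_SEQ_CST);
    return __atomic_sub_fetch(mField, 1, __ATOMIC_ACQ_REL);
}
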
mrkline
Gnoblar
Posts: 2
Joined: Wed Nov 16, 2016 6:54 am

Re: Pessimal atomics in Ogre

Post by mrkline »

That sounds like a step in the right direction, but if I'm reading things right, doesn't that only apply to GCC builds?
dark_sylinc
OGRE Team Member
Posts: 5292
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina

Re: Pessimal atomics in Ogre

Post by dark_sylinc »

Please note these AtomicScalars are only used when Ogre is compiled with OGRE_THREAD_SUPPORT. When it comes to post-2.0 Ogre code, this threading model will be fading away because it sucks (for all the reasons you're mentioning here, and many, many more).

I recently accepted a pull request that added support for C++11's std::thread, since it was an external contribution and it is optional to use. So it's not that we're closed to improvements; we're just not actively working on them either.

It is true that when it comes to SharedPtr, atomics will be needed. This came up when planning the texture refactor (TexturePtrs in two threads are a recipe for disaster). My solution is to either have the texture refactor get rid of SharedPtr usage for textures (I don't like the use of shared_ptrs for resources that already have explicit ownership patterns; there is no need) or to work around the problem via void* casts, so the reference counts won't increase/decrease but we can still pass a TexturePtr from thread A to B, then back from B to A.
mrkline wrote:As a final note: It's also worth noting that the actual scalars in AtomicScalar are marked 'volatile'. This is not helpful here, as explained by this LWN article.
I haven't analyzed whether the use of volatile in AtomicScalar is intentional or a failure to understand how volatile works, but please note that volatile is not always incorrect when dealing with concurrent code.
Just as you need hardware barriers to order memory accesses, you need compiler barriers to ensure order in the compiler's generated code. For example, a compiler is free to compile this code:

Code: Select all

*somePtr = 0;
++mValue;
as:

Code: Select all

inc mValue
mov [somePtr], 0
because it doesn't change the observable behavior of the program when run from a single thread. The proper way to fix that would be like this:

Code: Select all

*somePtr = 0;
memoryBarrier();   // hardware fence: orders the accesses at the CPU level
compilerBarrier(); // keeps the compiler from reordering across this point
++mValue;
But compiler barriers are not always available, and thus volatile gets the job done.
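
For what it's worth, where a compiler barrier is available it can usually be spelled like this (a sketch; the macro name is mine):

Code: Select all

#if defined(_MSC_VER)
    #include <intrin.h>
    // MSVC: compiler-only barrier; no fence instruction is emitted.
    #define COMPILER_BARRIER() _ReadWriteBarrier()
#else
    // GCC/Clang: an empty asm with a "memory" clobber stops the compiler
    // from reordering memory accesses across it, but emits no instructions.
    #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif
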

There may also be cases where:

Code: Select all

mValue = 1;
//... do something
mValue = 1;
may be optimized into setting mValue = 1; only once.

When OS synchronization primitives are involved, volatile isn't needed, as your article says: the OS sync primitive will issue a memory barrier, and because the compiler cannot assume that what happens inside the sync primitive won't affect the state of the variables, it acts as a compiler barrier as well.
However, when doing lockless programming, OS sync primitives are often avoided, and one needs to be careful.