I just found out about this thread and have been reading the posts.
As you all may know, the situation has changed dramatically.
Ogre 2.0 now uses our own SIMD library and an SoA arrangement. We can't reuse other libs because our SoA memory arrangement is, to my knowledge, different from any other approach seen in the industry (XXXXYYYYZZZZ instead of XYZ_XYZ_XYZ_ or three separate streams of X, Y & Z).
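For reference, here's a rough sketch of what that layout looks like in memory (illustrative names, assuming 4-wide SSE; these are not Ogre's actual declarations):
Code: Select all
//SoA: each block packs 4 vectors, component by component (XXXXYYYYZZZZ)
struct ArrayVector3Block
{
    float x[4]; //XXXX
    float y[4]; //YYYY
    float z[4]; //ZZZZ
};

//vs. the traditional AoS layout: XYZ_XYZ_XYZ_
struct Vector3AoS { float x, y, z; };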
I too have been wondering whether it would be worthwhile to SIMDify Vector3 & Quaternion; and the more I analyze it, the more it looks like a waste of time:
What used to be our major hotspots have been fixed (using our ArrayMath library). The remaining usage of Vector3, Quaternion & Matrix4 is too scattered to see any meaningful improvement.
Furthermore, SIMD math libraries need a usage pattern/philosophy. Often it's faster to do myVector += SimdVector3::UNIT_Y * 3.0f; than myVector.y += 3.0f. But Ogre wasn't designed with this in mind, and has direct member access all over its code base.
For example, the following bit of code is excruciatingly slow (I'm assuming SSE2):
Code: Select all
myVector = a + b;        //SIMD add: stays in an xmm register
myVector.y += 3.0f;      //scalar access: forces a store/load round trip
myVector = myVector * c; //SIMD multiply
What's the problem? xyzw (w is unused) is stored in an xmm register using movaps; same with 'a' & 'b'. The first line translates to a few movaps, then an addps instruction.
When myVector.y is accessed, we need to extract the Y component. If we're extremely smart (with some help from the compiler), this may get translated to a shufps instruction (at the cost of added register pressure). Otherwise, the compiler will translate the code to:
Code: Select all
movaps [tmpMemory], xmm0  //spill the full 128-bit vector to memory
movss xmm0, [tmpMemory+4] //reload only the 32-bit y component
Accessing memory that you just saved is handled by "store to load forwarding". Basically, the CPU knows this memory transfer hasn't actually happened yet (let's remember CPUs are pipelined), so it looks in its pipeline and takes the value from there, rather than getting it from the cache or main RAM. It's a huge performance optimization... that works when you store and load the same amount of memory.
In this case, we're storing with a 128-bit memory move and reading back with a 32-bit load. As Fabian Giesen said in his blog, the latest Intel architectures are able to gracefully handle this situation (a narrower load from within a wider store), so let's assume there is no performance hit, even though that's barely true.
But then we do the addition, and right after that we perform a multiplication in SIMD form again (myVector * c); so the assembly will look like this:
Code: Select all
movss xmm1, [three]       //load the constant 3.0f (addss can't take an immediate)
addss xmm0, xmm1          //.y += 3
movss [tmpMemory+4], xmm0 //store the 32-bit result back
movaps xmm0, [tmpMemory]  //reload the full 128-bit vector
mulps xmm0, xmm2          //assuming xmm2 contains c
There's no chance store to load forwarding will work in this scenario: we're doing a 128-bit load from memory that was just written by a 32-bit store. The pipeline will stall. Any performance benefit you hoped to gain from using SIMD went down the drain.
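For contrast, here's a sketch of the SIMD-friendly version of the same snippet (SimdVector3 & UNIT_Y being the hypothetical names from above, not Ogre's actual API). Everything stays in xmm registers; there's no scalar store/load round trip:
Code: Select all
SimdVector3 myVector = a + b;           //addps
myVector += SimdVector3::UNIT_Y * 3.0f; //addps with (0, 3, 0, 0)
myVector = myVector * c;                //mulps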
This is the reason you'll see some SIMD math libraries declare their __m128 members as protected rather than public. So whenever you want to access x, y or z, you have to call getScalar().x() or similar. And if you think that looks ugly, it is ugly.
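A minimal sketch of that pattern (a hypothetical class assuming SSE intrinsics; not any particular library's API), where scalar access is deliberately made explicit and visible at the call site:
Code: Select all
#include <xmmintrin.h>

class SimdVector3
{
protected:
    __m128 m; //layout: x, y, z, unused w

public:
    explicit SimdVector3( __m128 v ) : m( v ) {}

    SimdVector3 operator + ( const SimdVector3 &b ) const
    {
        return SimdVector3( _mm_add_ps( m, b.m ) );
    }

    //The ugly part: component access goes through an explicit getter,
    //so the xmm -> scalar transfer can't hide in innocent-looking code.
    float getY() const
    {
        float tmp[4];
        _mm_storeu_ps( tmp, m );
        return tmp[1];
    }
};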
So, when refactoring Vector3/Quaternion to use SIMD, one has to take stuff like this into account (we would need to refactor their usage all over Ogre's code base too). And IMHO it's not worth it.
Still, it may be nice to have another SIMD implementation so we can start deprecating the old one. And for Vector3, I did make a couple of modifications so that it uses maxss & minss instructions whenever possible.
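As an illustration (a sketch assuming SSE; the function names are mine, not Vector3's), this is the kind of branchless scalar min/max that maps to minss/maxss:
Code: Select all
#include <xmmintrin.h>

//Each compiles to a single minss/maxss instead of a compare & branch.
inline float minScalar( float a, float b )
{
    return _mm_cvtss_f32( _mm_min_ss( _mm_set_ss( a ), _mm_set_ss( b ) ) );
}
inline float maxScalar( float a, float b )
{
    return _mm_cvtss_f32( _mm_max_ss( _mm_set_ss( a ), _mm_set_ss( b ) ) );
}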
There's one big exception that I think may be worth the trouble: Matrix4.
The RenderQueue (or the AutoParams class) performs a lot of matrix concatenations to send data to the vertex & pixel shaders (world-view matrix, world-view-projection matrix, etc.). The more Entities you have, the more concatenations. Do you use shadows? Then even more concatenations.
Matrices rarely have their individual components accessed directly (and when they do, it's often unavoidable), which makes them a prime candidate for SIMDification (I made up that word).
Furthermore, Matrix4 concatenations almost never operate against themselves (i.e. mat = mat * mat), which is the reason I put RESTRICT_ALIAS in ArrayMatrix4's concatenation code (plus an assert to check this never happens in debug mode).
If you've ever written matrix concatenation code in assembly, then you'll know there's a lot to gain from knowing that the pointers in memory aren't related (otherwise you're forced to copy the entire matrix into a temporary memory region).
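A minimal sketch of the idea in plain C++ (not ArrayMatrix4's actual code; RESTRICT_ALIAS is assumed to expand to the compiler's __restrict keyword):
Code: Select all
#include <cassert>

//With the no-alias guarantee, the compiler can keep rows of 'a' & 'b'
//in registers; without it, every store to 'out' could potentially
//invalidate them and force a reload.
void concatenate( float * __restrict out,
                  const float * __restrict a,
                  const float * __restrict b )
{
    assert( out != a && out != b ); //the debug-mode check mentioned above

    for( int i = 0; i < 4; ++i )
        for( int j = 0; j < 4; ++j )
        {
            out[i * 4 + j] = a[i * 4 + 0] * b[0 * 4 + j] +
                             a[i * 4 + 1] * b[1 * 4 + j] +
                             a[i * 4 + 2] * b[2 * 4 + j] +
                             a[i * 4 + 3] * b[3 * 4 + j];
        }
}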
I think making Matrix4 SIMD & RESTRICT_ALIAS is worth a shot. Furthermore, we can then investigate using 4 movntps instructions to move Matrices, which is very fast. I already do that to move the Matrices from the Instancing implementations to GPU buffers and shaved off a few milliseconds; though I may need to revisit this later, since D3D9 does not guarantee the memory is 16-byte aligned, while D3D11 does, and GL 2.1 does too if GL_ARB_map_buffer_alignment is present.
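For reference, a sketch of that kind of copy (assuming both pointers are 16-byte aligned; _mm_stream_ps emits movntps):
Code: Select all
#include <xmmintrin.h>

//Non-temporal stores bypass the cache, which pays off when the
//destination (e.g. a mapped GPU buffer) won't be read back by the CPU.
void streamMatrix4( float * __restrict dst, const float * __restrict src )
{
    _mm_stream_ps( dst +  0, _mm_load_ps( src +  0 ) );
    _mm_stream_ps( dst +  4, _mm_load_ps( src +  4 ) );
    _mm_stream_ps( dst +  8, _mm_load_ps( src +  8 ) );
    _mm_stream_ps( dst + 12, _mm_load_ps( src + 12 ) );
}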
While I've been talking from an x86/x64 perspective, all of this is still very important on other platforms (e.g. ARM, PPC), since they usually don't even have store to load forwarding; they just stall.