*EDIT* The original version works fine and this change should not be made. On MSVC it actually made the code very slightly slower when compiled with the standard Ogre settings. Sorry for wasting people's time... The differences were seen in our code (and were improved by the recommended changes), but they don't apply here because we compile Ogre differently from the standard settings. With the standard settings, the compiler was better able to optimize the instruction order, so now I have to go figure out why our build is messed up.
I noticed this in the Quaternion operator* code while doing a profiling run with AQTime.
Vector3 uv, uuv;
Vector3 qvec(x, y, z);
uv = qvec.crossProduct(v);
uuv = qvec.crossProduct(uv);
uv *= (2.0f * w);
uuv *= 2.0f;
return v + uv + uuv;
While in a particularly optimization-happy mood, I rewrote it as the following and saw a reduced number of calls in the disassembly.
Vector3 qVec(x, y, z);
Vector3 uv( 2.0 * w * qVec.crossProduct(v));
Vector3 uuv( 2.0 * qVec.crossProduct(uv));
return v + uv + uuv;
In particular, this chunk of asm (generated by an MSVC 9.0 release build) was removed; it comes directly from declaring the vectors before initializing them.
Vector3 uv, uuv;
lea ecx,[uv]
call Vector3::Vector3 (611D00h)
lea ecx,[uuv]
call Vector3::Vector3 (611D00h)
I have not done a full timing work-up on the differences, but just seeing the reduction in the disassembly makes me think it could be a 2%-5% improvement in this routine, at least on Windows. Since this operator is used all over the place (at least by me!), it seems worth investigating. The overall code change is minimal, but it might save some cycles. If someone has a good way of profiling this in Ogre across multiple platforms, I would be very curious to see the results.
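For anyone skimming, the underlying issue is just default-construct-then-assign versus direct initialization. Here is a minimal, self-contained sketch of the difference (Vec3 is a hypothetical stand-in, not the real Ogre::Vector3):
#include <cstdio>
// A stand-in for Ogre::Vector3, just so the constructor calls are visible.
struct Vec3
{
    float x, y, z;
    Vec3() : x(0), y(0), z(0) { std::puts("default ctor"); }
    Vec3(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
    Vec3 cross(const Vec3& o) const
    {
        return Vec3(y * o.z - z * o.y, z * o.x - x * o.z, x * o.y - y * o.x);
    }
};
int main()
{
    Vec3 q(0.0f, 1.0f, 0.0f), v(1.0f, 0.0f, 0.0f);
    Vec3 uv;                 // prints "default ctor" - the call the asm above shows
    uv = q.cross(v);         // the value is then copy-assigned over it
    Vec3 uv2(q.cross(v));    // constructed directly from the result, no default ctor
    return 0;
}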
Quaternion operator* optimization
-
- Gnoblar
- Posts: 7
- Joined: Wed Jan 09, 2008 4:46 pm
Quaternion operator* optimization
Last edited by nitsuj33 on Mon Jan 10, 2011 5:45 am, edited 1 time in total.
-
- Gnoblar
- Posts: 7
- Joined: Wed Jan 09, 2008 4:46 pm
Re: Quaternion operator* optimization
After thinking about it some more, the math is not exactly the same: in my rewrite the 2.0 * w factor gets folded into uv before uuv is computed, so uuv ends up scaled by an extra 2w. I've fixed it up a bit.
Here's the new version, which should do exactly the same calculations as the original while skipping the default constructions and an extra assignment.
Vector3 qVec(x, y, z);
Vector3 uv( qVec.crossProduct(v));
Vector3 uuv( 2.0f * qVec.crossProduct(uv));
uv *= 2.0f * w;
return v + uv + uuv;
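For reference, both the original code and this version compute the standard expansion of the sandwich product q * v * q^-1 (assuming a unit quaternion q = (w, x, y, z) with vector part qVec = (x, y, z)):
v' = v + 2w\,(\vec{q} \times v) + 2\,\bigl(\vec{q} \times (\vec{q} \times v)\bigr), \quad \vec{q} = (x, y, z)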
-
- Gnoblar
- Posts: 7
- Joined: Wed Jan 09, 2008 4:46 pm
Re: Quaternion operator* optimization
It seems I'm only talking to myself here, but there are more areas ripe for optimization in ColourValue (at least as of 1.7.0; I haven't grabbed the latest in quite some time).
Experimentation has shown that code like this will actually compile to a divide instead of a friendly multiplication by the inverse.
void ColourValue::setAsABGR(const ABGR val)
...
// Red
r = ((val32 >> 24 ) & 0xFF) / 255.0f;
// Green
g = ((val32 >> 16 ) & 0xFF) / 255.0f;
// Blue
b = ((val32 >> 8 ) & 0xFF) / 255.0f;
// Alpha
a = (val32 & 0xFF) / 255.0f;
...
Simply rewriting it as * (1.0f / 255.0f) forces it to an fmul instead of the slower fdiv.
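For concreteness, here is roughly what the rewritten body could look like (a sketch, not a patch against the actual source; the inv255 name is my own, and the val32 assignment is assumed from the elided part above):
void ColourValue::setAsABGR(const ABGR val)
{
    uint32 val32 = val;
    // Precompute the reciprocal once; 1.0f / 255.0f is folded to a constant
    // at compile time, so each channel becomes an fmul instead of an fdiv.
    const float inv255 = 1.0f / 255.0f;
    // Red
    r = ((val32 >> 24) & 0xFF) * inv255;
    // Green
    g = ((val32 >> 16) & 0xFF) * inv255;
    // Blue
    b = ((val32 >> 8) & 0xFF) * inv255;
    // Alpha
    a = (val32 & 0xFF) * inv255;
}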
Once again, I am working only on MSVC, so I can't speak for gcc, but MSVC will not optimize a requested divide into a multiply unless it can guarantee identical floating-point results. A divide by 2.0 generally seems to become an fmul by 0.5, but a divide by 255.0 did not, at least for me.
Heheh, maybe it's just Microsoft's compiler, but I assume it has much more to do with the guarantees made to the programmer by the C++ standard.
-
- OGRE Retired Team Member
- Posts: 2903
- Joined: Thu Jan 18, 2007 2:48 pm
- x 58
Re: Quaternion operator* optimization
All those are micro-optimisations, though. While I have nothing against the former (quaternion), I do find that a divide by 255.0 is slightly easier on the eyes. Well, it's not a big deal, but I'd be interested in actual performance differences.
-
- Gnoblar
- Posts: 7
- Joined: Wed Jan 09, 2008 4:46 pm
Re: Quaternion operator* optimization
Yeah. These won't really show up as noticeable differences in framerate, but I'm guessing the routines themselves would show a significant difference, at least for the color ones. If http://www.phatcode.net/res/224/files/h ... 63-02.html still applies, we could see quite a significant difference in those routines. The last numbers I saw for fmul vs fdiv on a recent processor put fdiv at roughly 11 times the latency.
You could always do something like const float kInverseMultiplier = 1.0f / 255.0f; to make it a little clearer and less ugly than * (1.0f / 255.0f). A better name would be good too.
Also, most of the math division operators already calculate the inverse first, so maybe it would add some symmetry there.
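For what it's worth, by "calculate the inverse first" I mean something like this (a sketch of how the vector divide-by-scalar operators are commonly written, not necessarily the exact Ogre code):
inline Vector3 operator/(const Vector3& v, float scalar)
{
    float inv = 1.0f / scalar;                        // one divide...
    return Vector3(v.x * inv, v.y * inv, v.z * inv);  // ...then three multiplies
}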
If I get time soon I will run some tests on these to see if there is any real speedup, but I figured someone out there would be better equipped to try it out cross-platform.
-
- Gnoblar
- Posts: 7
- Joined: Wed Jan 09, 2008 4:46 pm
Re: Quaternion operator* optimization
Awesome. I was wrong on both counts. The way we are compiling Ogre in our project actually seems to mess this up, so the original methods work just fine. In fact, the quaternion version that does not default-construct is slower for some odd reason. Oh well.
Sorry for the incorrect suggestion; now I have to figure out why our version is compiling so oddly... We are using double precision, so maybe that has something to do with it...
In the 1.7.2 code, the color statements correctly compile to an fmul. I'm not sure about gcc, so it could still be a nice change there.