I see.
So, just to confirm, this bug does not happen when Ogre is compiled with -DOGRE_SIMD_NEON=0 ?
David (masterfalcon) is the one who wrote the NEON port. SSE2 supports _mm_sqrt_ps directly. It appears there is no NEON equivalent so David had to use x * vrsqrteq_f32 (which calculates x / sqrt( x )) and applies two Netwon-Raphson iterations to improve accuracy, getting it closer to a real sqrt.
On SSE2, I am the one who wrote InvSqrt4 and InvSqrtNonZero4 using newton raphson iterations, and the former indeed accounts for 0 inputs. It seems David grabbed some code from ARM on how to implement sqrt using vrsqrteq that didn't account for that.
Indeed, vrsqrteq_f32 does not work for x = 0. And indeed the solution is to detect and nuke the NaN.
We're not the first ones to fall for that. It seems they use vtstq_u32( x, x ) which stands for x & x ? 0000 : 1111 as an equivalent for vceqq_f32(v, vdupq_n_f32(0.0f) ) which is basically x == 0 ? 0000 : 1111, claiming it's faster.
What I do not know is if it's safe to nuke the NaN at fStep0 or if it should be done at the end (I don't know if vrsqrtsq_f32( 0, 0 ) results in valid output, or if it goes NaN or returns garbage) Stopping the NaN as soon as possible should lead to better performance (since NaN tend to have the habit of slowing down calculations) but if it's not possible, the NaN would have to be nuked at the end.
I'm saying all this because I don't have the time to perform the experiments, but you should be able to; since you'd be literally adding one line of code:
Test this code:
Code: Select all
static inline ArrayReal Sqrt( ArrayReal f )
{
//Netwon-Raphson, 2 iterations.
ArrayReal fStep0 = vrsqrteq_f32( f );
fStep0 = vreinterpretq_f32_u32( vandq_u32( vtstq_u32( f, f ), vreinterpretq_u32_f32( fStep0 ) ) ); //Nuke NaN when f == 0
// step fStep0 = 1 / sqrt(x)
const ArrayReal fStepParm0 = vmulq_f32( f, fStep0 );
const ArrayReal fStepResult0= vrsqrtsq_f32( fStepParm0, fStep0 );
// step fStep1 = 1 / sqrt(x)
const ArrayReal fStep1 = vmulq_f32( fStep0, fStepResult0 );
const ArrayReal fStepParm1 = vmulq_f32( f, fStep1 );
const ArrayReal fStepResult1= vrsqrtsq_f32( fStepParm1, fStep1 );
// take the res. fStep2 = 1 / sqrt(x)
const ArrayReal fStep2 = vmulq_f32( fStep1, fStepResult1 );
// mul by x to get sqrt, not rsqrt
return vmulq_f32( f, fStep2 );
}
Otherwise test this code:
Code: Select all
static inline ArrayReal Sqrt( ArrayReal f )
{
//Netwon-Raphson, 2 iterations.
const ArrayReal fStep0 = vrsqrteq_f32( f );
// step fStep0 = 1 / sqrt(x)
const ArrayReal fStepParm0 = vmulq_f32( f, fStep0 );
const ArrayReal fStepResult0= vrsqrtsq_f32( fStepParm0, fStep0 );
// step fStep1 = 1 / sqrt(x)
const ArrayReal fStep1 = vmulq_f32( fStep0, fStepResult0 );
const ArrayReal fStepParm1 = vmulq_f32( f, fStep1 );
const ArrayReal fStepResult1= vrsqrtsq_f32( fStepParm1, fStep1 );
// take the res. fStep2 = 1 / sqrt(x)
const ArrayReal fStep2 = vmulq_f32( fStep1, fStepResult1 );
fStep2 = vreinterpretq_f32_u32( vandq_u32( vtstq_u32( f, f ), vreinterpretq_u32_f32( fStep2 ) ) ); //Nuke NaN when f == 0
// mul by x to get sqrt, not rsqrt
return vmulq_f32( f, fStep2 );
}
Test it by doing MathlibeNEON::Sqrt( 0 ) rather than waiting for the whole thing skeleton to generate a NaN.
My only worry however, is if this routine will still generate NaNs when x is very close to 0, but not exactly 0 (the routine would have to be modified so that when vcneqq_f32( fStep2, fStep2 ) returns false i.e. fStep2 is a NaN, and f >= 0, then we should nuke the NaN, otherwise leave the NaN, as f < 0 should generate NaNs).
Update: According to David,
I am the one who wrote that routine. Gosh I remember the commit, but I don't remember
that.