NaNs coming out of ArrayQuaternion::FromOrthoDet1RotationMatrix

claytongarrett · Post by **claytongarrett** » Tue Dec 05, 2017 8:46 am

I have a mesh with a skeleton, and a TagPoint attached to a bone in that skeleton. Periodically, when updateAllTransformsBoneToTag is called, I wind up with NaNs in the mDerivedOrientation of the TagPoint. I've tracked this down to this line in ArrayQuaternion::FromOrthoDet1RotationMatrix called from within ArrayMatrixAf4x3::decomposition:

Code: Select all

mChunkBase[1] = vmulq_f32( MathlibNEON::Sqrt( tmp ), MathlibNEON::HALF );

Specifically, very small numbers go into MathlibNEON::Sqrt, and become NaN when they come out of vrsqrteq_f32( f ) because of a divide by zero.

I noticed in one other place whe vrsqrteq_f32 is used, its followed it up with a line labeled "//Nuke this NaN" -- maybe something like that's missing from MathlibNEON::Sqrt?

Thanks!

Post by **dark_sylinc** » Tue Dec 05, 2017 7:48 pm

I see.

So, just to confirm, this bug does not happen when Ogre is compiled with -DOGRE_SIMD_NEON=0 ?

David (masterfalcon) is the one who wrote the NEON port. SSE2 supports _mm_sqrt_ps directly. It appears there is no NEON equivalent so David had to use x * vrsqrteq_f32 (which calculates x / sqrt( x )) and applies two Netwon-Raphson iterations to improve accuracy, getting it closer to a real sqrt.

On SSE2, I am the one who wrote InvSqrt4 and InvSqrtNonZero4 using newton raphson iterations, and the former indeed accounts for 0 inputs. It seems David grabbed some code from ARM on how to implement sqrt using vrsqrteq that didn't account for that.

Indeed, vrsqrteq_f32 does not work for x = 0. And indeed the solution is to detect and nuke the NaN. We're not the first ones to fall for that. It seems they use vtstq_u32( x, x ) which stands for x & x ? 0000 : 1111 as an equivalent for vceqq_f32(v, vdupq_n_f32(0.0f) ) which is basically x == 0 ? 0000 : 1111, claiming it's faster.

What I do not know is if it's safe to nuke the NaN at fStep0 or if it should be done at the end (I don't know if vrsqrtsq_f32( 0, 0 ) results in valid output, or if it goes NaN or returns garbage) Stopping the NaN as soon as possible should lead to better performance (since NaN tend to have the habit of slowing down calculations) but if it's not possible, the NaN would have to be nuked at the end.

I'm saying all this because I don't have the time to perform the experiments, but you should be able to; since you'd be literally adding one line of code:

Test this code:

Code: Select all

static inline ArrayReal Sqrt( ArrayReal f )
{
	//Netwon-Raphson, 2 iterations.
	ArrayReal fStep0 = vrsqrteq_f32( f );
	fStep0 = vreinterpretq_f32_u32( vandq_u32( vtstq_u32( f, f ), vreinterpretq_u32_f32( fStep0 ) ) ); //Nuke NaN when f == 0
	// step fStep0 = 1 / sqrt(x)
	const ArrayReal fStepParm0  = vmulq_f32( f, fStep0 );
	const ArrayReal fStepResult0= vrsqrtsq_f32( fStepParm0, fStep0 );
	// step fStep1 = 1 / sqrt(x)
	const ArrayReal fStep1      = vmulq_f32( fStep0, fStepResult0 );
	const ArrayReal fStepParm1  = vmulq_f32( f, fStep1 );
	const ArrayReal fStepResult1= vrsqrtsq_f32( fStepParm1, fStep1 );
	// take the res. fStep2 = 1 / sqrt(x)
	const ArrayReal fStep2      = vmulq_f32( fStep1, fStepResult1 );
	// mul by x to get sqrt, not rsqrt
	return vmulq_f32( f, fStep2 );
}

Otherwise test this code:

Code: Select all

static inline ArrayReal Sqrt( ArrayReal f )
{
	//Netwon-Raphson, 2 iterations.
	const ArrayReal fStep0 = vrsqrteq_f32( f );
	// step fStep0 = 1 / sqrt(x)
	const ArrayReal fStepParm0  = vmulq_f32( f, fStep0 );
	const ArrayReal fStepResult0= vrsqrtsq_f32( fStepParm0, fStep0 );
	// step fStep1 = 1 / sqrt(x)
	const ArrayReal fStep1      = vmulq_f32( fStep0, fStepResult0 );
	const ArrayReal fStepParm1  = vmulq_f32( f, fStep1 );
	const ArrayReal fStepResult1= vrsqrtsq_f32( fStepParm1, fStep1 );
	// take the res. fStep2 = 1 / sqrt(x)
	const ArrayReal fStep2      = vmulq_f32( fStep1, fStepResult1 );
	fStep2 = vreinterpretq_f32_u32( vandq_u32( vtstq_u32( f, f ), vreinterpretq_u32_f32( fStep2 ) ) ); //Nuke NaN when f == 0
	// mul by x to get sqrt, not rsqrt
	return vmulq_f32( f, fStep2 );
}

Test it by doing MathlibeNEON::Sqrt( 0 ) rather than waiting for the whole thing skeleton to generate a NaN.

My only worry however, is if this routine will still generate NaNs when x is very close to 0, but not exactly 0 (the routine would have to be modified so that when vcneqq_f32( fStep2, fStep2 ) returns false i.e. fStep2 is a NaN, and f >= 0, then we should nuke the NaN, otherwise leave the NaN, as f < 0 should generate NaNs).

Update: According to David, I am the one who wrote that routine. Gosh I remember the commit, but I don't remember that.

Post by **dark_sylinc** » Tue Dec 05, 2017 9:20 pm

David just tested the first snippet and he told me it works.
Please let me know if the first snippet fixes your problem! If so, I'll push the change to the repo.

Thanks

claytongarrett · Post by **claytongarrett** » Wed Dec 06, 2017 12:40 am

Awesome! Yes, that solved my issue. It's working as expected now. Thank you very much for the quick turnaround.

AtTheWall · Post by **AtTheWall** » Fri Jan 12, 2018 3:59 am

Hello again!

We've run into this issue again, this time with InvSqrt4 producing NaNs for small numbers. It specifically comes up when we have something calculating its world-space position (eg particle systems with the local space option turned off).

Ogre Forums

NaNs coming out of ArrayQuaternion::FromOrthoDet1RotationMatrix

NaNs coming out of ArrayQuaternion::FromOrthoDet1RotationMatrix

Re: NaNs coming out of ArrayQuaternion::FromOrthoDet1RotationMatrix

Re: NaNs coming out of ArrayQuaternion::FromOrthoDet1RotationMatrix

Re: NaNs coming out of ArrayQuaternion::FromOrthoDet1RotationMatrix

Re: NaNs coming out of ArrayQuaternion::FromOrthoDet1RotationMatrix