Ogre Forums

Posted: **Fri Apr 23, 2010 6:27 am**

as i understand it, normalizing requires square root, which limit its usefulness in per-frame uses.

normalizing can also be done with trig function, which can use trig tables for speed.

ie (i doubt you need this example, but still) :
x: 13, y:7
angle : ATan ( y / x ) = ~28
new y : sin(28) = ~0.46
new x : cos(28) = ~0.88
or calculate vector length ( x/cos(a) or y /sin(a) )

understandingly its significantly faster when using trig tables, and can provide a per-frame tool when accuracy is not critical..

Posted: **Fri Apr 23, 2010 8:34 pm**

Using lookup tables on modern hardware is a bad idea. Memory is slow, very very slow.

By all means try it out with a profiler, but I expect calculating the square root on the cpu is going to be faster than a lookup table

Posted: **Sat Apr 24, 2010 7:06 am**

Pretty much every modern compiler is going to use the VPU (SSE/AVX on CPU) for all math operations, and "rsq" (reciprocal square root) is a common operation supported directly in SSE.

When accuracy is not critical, I'd take the few clocks of latency for a RSQRTSS instruction over the untold number of clocks to access a random memory location. In general, most re-normalizations in 3D graphics are not long-lived anyway. As always, the solution that is right for you is the one that works best in your particular situation -- have you profiled both approaches to see which one performs better for you?

Posted: **Sat Apr 24, 2010 7:21 am**

FWIW:

- The accuracy of RSQRTSS according to the IA instruction set reference matches what I've seen documented elsewhere (in the GCC bugs, for example), namely you get 12 bits of accuracy (the relative error is < 1.5 * 2 ^ -12))

As you can see, for the following code

Code: Select all

int _tmain(int argc, _TCHAR* argv[])
{
	float f[3];

	srand(12345);
	f[0] = (float)rand() / (float)RAND_MAX * 3;
	f[1] = (float)rand() / (float)RAND_MAX * 3;
	f[2] = (float)rand() / (float)RAND_MAX * 3;

	float r[3];
	r[0] = 1 / sqrt(f[0]);
	r[1] = 1 / sqrt(f[1]);
	r[2] = 1 / sqrt(f[2]);

	printf("r: (%f,%f,%f)\n", r[0], r[1], r[2]);

	return 0;
}

the Intel compiler produces the following abridged assembly

Code: Select all

;;; 
;;; 	float r[3];
;;; 	r[0] = 1 / sqrt(f[0]);

        movss     xmm0, DWORD PTR [128+esp]                     ;18.2
$LN13:

;;; 	r[1] = 1 / sqrt(f[1]);

        movss     xmm3, DWORD PTR [132+esp]                     ;19.2
$LN15:
        cvtsi2ss  xmm6, eax                                     ;15.2
        divss     xmm6, DWORD PTR [_2il0floatpacket.20]         ;15.2
        mulss     xmm6, DWORD PTR [_2il0floatpacket.21]         ;15.2
$LN17:
        rsqrtss   xmm1, xmm0                                    ;18.2
        mulss     xmm0, xmm1                                    ;18.2
        mulss     xmm0, xmm1                                    ;18.2
        subss     xmm0, DWORD PTR [_2il0floatpacket.21]         ;18.2
$LN19:

;;; 	r[2] = 1 / sqrt(f[2]);
;;; 
;;; 	printf("r: (%f,%f,%f)\n", r[0], r[1], r[2]);

        mov       DWORD PTR [esp], OFFSET FLAT: ??_C@_0P@A@r?3?5?$CI?$CFf?0?$CFf?0?$CFf?$CJ?6?$AA@ ;22.2
$LN21:
        mulss     xmm1, xmm0                                    ;18.2
        mulss     xmm1, DWORD PTR [_2il0floatpacket.22]         ;18.2
$LN23:
        cvtss2sd  xmm2, xmm1                                    ;22.2
        movsd     QWORD PTR [4+esp], xmm2                       ;22.2
$LN25:
        rsqrtss   xmm4, xmm3                                    ;19.2
        mulss     xmm3, xmm4                                    ;19.2
        mulss     xmm3, xmm4                                    ;19.2
        subss     xmm3, DWORD PTR [_2il0floatpacket.21]         ;19.2
$LN27:
        rsqrtss   xmm7, xmm6                                    ;20.2
        mulss     xmm6, xmm7                                    ;20.2

So in short, I wouldn't worry about it. You already said this is for cases where accuracy is less important, and for cases where it's very important (HPC for example) there's always Newton-Raphson (slower but more accurate).

Posted: **Sat Apr 24, 2010 8:31 am**

well, i did a small experiment with gravity that required a square root each frame.
when i had one object fps was fine.
adding even one more made a significant slowdown in fps when second object was visible on screen.
i dropped the gravity idea for a few reasons; hard to plot, somewhat unpredictable at times, and the orbit itself doesn't 'stay' (even if locked into endless orbit, the ellipse will rotate around the focal point..) - reality makes it hard

i decided to go on a "fake" gravity on elliptical course.
so square root isn't the devil they say it is ?

Posted: **Sat Apr 24, 2010 9:22 am**

If you are normalizing you can multiply by the inverse square root, so then there's the fast inverse square root function made famous by Quake 3 Arena.

http://en.wikipedia.org/wiki/Fast_inverse_square_root

Code: Select all

float InvSqrt(float x)
{
	union {
		float f;
		int i;
	} tmp;
	tmp.f = x;
	tmp.i = 0x5f3759df - (tmp.i >> 1);
	float y = tmp.f;
	return y * (1.5f - 0.5f * x * y * y);
}

It's a sped up version of the Newton-Raphson method mentioned above that makes use of some weird float-to-integer demonology. I mention it for historical interest, since this page, http://assemblyrequired.crashworks.org/ ... uare-root/, shows that this is still significantly slower and less accurate than RSQRTSS... though it still does impressively well compared to other methods, if you can live with the approximation.

Posted: **Sat Apr 24, 2010 1:48 pm**

just as a suffix, there is a very weird thing going on -

the slowdown i talked about? well its only occurs when i have exactly 3 objects on screen. i a few just to see if there is farther slowdown, but the reverse happened! 4,5,6,7,8.... it just stabilizes. cutting down back to 3, the slow down returns!
odd...

Posted: **Sat Apr 24, 2010 6:51 pm**

I'd guess it isn't the square root then. Use a profiler to figure out where your application is wasting time. Just guessing won't get you anywhere.

Posted: **Fri May 07, 2010 11:09 pm**

JaJDoo wrote:the slowdown i talked about? well its only occurs when i have exactly 3 objects on screen.

The number 3 was removed from the OgreCore as an optimization - the team found that having less than 10 digits can increase performance dramatically. Unfortunately, any reference to the number 3 will cause serious instability and performance problems in the core ogre library, so try to avoid it if at all possible.

Posted: **Sat May 08, 2010 3:48 am**

Jabberwocky wrote: The number 3

5, sir.

Posted: **Sat May 08, 2010 4:59 am**

I know this sounds crazy, but maybe we could get rid of all numbers except for 0 and 1.
... nah, that'd never work.

Posted: **Sat May 08, 2010 5:17 am**

That really made my day! =)

Ogre Forums

normalizing vector without sqaure root?

normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?

Re: normalizing vector without sqaure root?