Experiment (Lookup vs Brute Force Calculate)

stealth977
Gnoll
Posts: 638
Joined: Mon Dec 15, 2008 6:14 pm
Location: Istanbul, Turkey
x 42

Experiment (Lookup vs Brute Force Calculate)

Post by stealth977 »

For the last couple of months, since my research on fast lookup arrays/hashmaps, I have wanted to run an experiment on the interesting things I learned from that research. Today I put together a very basic/small experiment, and I wanted to share it with you guys.

Usually we use lookup tables (precomputed results of expensive calculations) so that we can fetch values quickly instead of doing the calculation again, right? You may be surprised to learn that this approach can actually make your functions slower...

Code:

#include <cstdlib>
#include <iostream>

using namespace std;

// 64K entries * 4 bytes = 256 KB, large enough that random indexing misses cache
unsigned int looktable[65536];

// volatile sink so the compiler cannot optimize the loop bodies away
volatile unsigned int sink;

void lookup(int count)
{
    unsigned int rndloc;
    for (int i = 0; i < count; i++)
    {
        rndloc = rand() & 0xFFFF;                      // random index into the table
        sink = looktable[rndloc] + 2;                  // one memory lookup + one add
    }
}

void calculate(int count)
{
    unsigned int rndloc;
    for (int i = 0; i < count; i++)
    {
        rndloc = rand() & 0x7FFF;                      // smaller range so the square fits
        sink = (rndloc * rndloc) + (rndloc * 5) + 2;   // pure arithmetic, no memory access
    }
}

int main()
{
    cout << "Press ENTER to run LOOKUP Test...";
    cin.get();
    lookup(1000000000);
    cout << "Press ENTER to run CALCULATE Test...";
    cin.get();
    calculate(1000000000);
    cout << "Done...";
    cin.get();

    return 0;
}
Here we have two functions (the result is stored into a volatile so the compiler can't optimize the loops away):

- one reads a random location in a lookup table and adds 2 to it, so it's just a lookup + an add
- the other calculates the value instead of looking it up, and this calculation is 2 multiplications + 2 additions. Quite expensive compared to a lookup, right? WRONG!!

I haven't dug deeper into the subject yet, but I will soon. Still, the results show that the BRUTE FORCE CALCULATION is 10% faster on my machine, and it will keep getting faster in direct proportion to CPU power, while lookup performance will stay the same.

Also keep in mind that both functions call the fairly costly rand() function. With the cost of rand() stripped out, brute force calculation's performance would probably be orders of magnitude better than the lookup's; or you could get the same performance while doing considerably more complicated calculations...

So, why?

Very easy answer: CPUs have become so much more powerful that they can now run multiple instructions per clock, instruction cycle counts have decreased, and so on. But accessing memory is still about as slow as it was, let's say, 5 years ago. Even worse: engineers have made sequential memory access faster than in previous years, but a cache miss (caused by random access) is now much more expensive than before.

So with lookup tables you access memory randomly (usually via an arbitrary index; that's why we use lookup tables in the first place), and the cost of a cache miss equals hundreds of instructions. If your calculation can be done in, let's say, 100 instructions, the brute force calculation method may yield better results than your lookup table...
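
If you want to separate the memory effect from the rand() cost, a minimal sketch along these lines works (sizes and names are my own; the random indices are precomputed with a simple LCG so that only the access pattern is timed):

Code:

#include <ctime>
#include <iostream>

using namespace std;

int main()
{
    const int N = 1 << 24;                     // 16M ints = 64 MB, far larger than any cache
    unsigned int* table = new unsigned int[N];
    unsigned int* order = new unsigned int[N];
    unsigned int seed = 12345;
    for (int i = 0; i < N; i++)
    {
        table[i] = i;
        seed = seed * 1664525u + 1013904223u;  // simple LCG, keeps rand() out of the timing
        order[i] = seed % N;                   // precomputed random indices
    }

    volatile unsigned int sink = 0;            // keeps the loops from being optimized away

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        sink += table[i];                      // sequential: the prefetcher hides the latency
    clock_t t1 = clock();
    for (int i = 0; i < N; i++)
        sink += table[order[i]];               // random: roughly one cache miss per access
    clock_t t2 = clock();

    cout << "sequential: " << (t1 - t0) << " ticks" << endl;
    cout << "random:     " << (t2 - t1) << " ticks" << endl;

    delete[] table;
    delete[] order;
    return 0;
}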

Just my 2 cents...
Ismail TARIM
Ogitor - Ogre Scene Editor
WWW:http://www.ogitor.org
Repository: https://bitbucket.org/ogitor
tuan kuranes
OGRE Retired Moderator
Posts: 2653
Joined: Wed Sep 24, 2003 8:07 am
Location: Haute Garonne, France
x 4

Re: Experiment (Lookup vs Brute Force Calculate)

Post by tuan kuranes »

Indeed. That's why you want to take care to avoid cache misses in any array use/size/access pattern.

You might want to read these with great interest:

Analysis of Processor Cache Effects:
http://igoro.com/archive/gallery-of-pro ... e-effects/

An advanced analysis of the impact of cache 'slowness' on code, and of how to organise data in arrays (and algorithms) accordingly:
Pitfalls of Object-Oriented Design
http://research.scee.net/files/presenta ... CAP_09.pdf
_tommo_
Gnoll
Posts: 677
Joined: Tue Sep 19, 2006 6:09 pm
x 5

Re: Experiment (Lookup vs Brute Force Calculate)

Post by _tommo_ »

This is the general trend of the hardware: memory is becoming the processing bottleneck on nearly all architectures...

For example on CUDA, where the hardware is much more streamlined and compute-oriented, a lookup table is always many orders of magnitude slower than anything you can actually compute, with one humble float read costing as much as 4800 FLOPs in the worst case (incoherent read vs. coherent FMAD).

And yes, this will become even more true in the future...
OverMindGames Blog
IndieVault.it: The new Italian portal for Game Dev & Indie Games
Zeal
Ogre Magi
Posts: 1260
Joined: Mon Aug 07, 2006 6:16 am
Location: Colorado Springs, CO USA

Re: Experiment (Lookup vs Brute Force Calculate)

Post by Zeal »

Interesting, but what about the 'latency hiding' factor? Correct me if I am wrong, but modern GPUs have the ability to hide 'lookup latency' by suspending a calculation, which frees the core for some other task until the requested value is returned from memory.

Doesn't that factor into all of this?
_tommo_
Gnoll
Posts: 677
Joined: Tue Sep 19, 2006 6:09 pm
x 5

Re: Experiment (Lookup vs Brute Force Calculate)

Post by _tommo_ »

Yeah, this factors in a lot. The situation I pointed out was "worst case vs. optimal case": a completely scattered and chained texture read, versus code made of FMADs that are lucky enough to get executed in batches of 4 per clock cycle.
Obviously this is not always true, but you get the idea... on modern architectures, more and more often the performance bottleneck is memory throughput.
OverMindGames Blog
IndieVault.it: The new Italian portal for Game Dev & Indie Games
Zeal
Ogre Magi
Posts: 1260
Joined: Mon Aug 07, 2006 6:16 am
Location: Colorado Springs, CO USA

Re: Experiment (Lookup vs Brute Force Calculate)

Post by Zeal »

_tommo_ wrote: on modern architectures, more and more often the performance bottleneck is memory throughput.
So the real question is: how quickly can we expect to see future hardware compensate for this bottleneck? The new 'KraZy Fat' memory bandwidth stats on these new video cards are already pretty scary...

*Of course I am talking about video cards here, not system memory. But even that is evolving pretty fast, no? It seems like it's going to have to, if it plans to keep up with Larrabee (that thing is still coming, right?).
mkultra333
Gold Sponsor
Posts: 1894
Joined: Sun Mar 08, 2009 5:25 am
x 114

Re: Experiment (Lookup vs Brute Force Calculate)

Post by mkultra333 »

Interesting.

I have a project that's quite old but that I use every day, and it uses a couple of lookup tables a lot. It's also quite slow. The basic structure was written long before I was even aware of caches and such.

The program is for generating texture maps, and the main lookup table it uses is defined as

Code:

const int BIGIMX  = 1024;          // texture dimension
const int BIGIMX2 = 2 * BIGIMX;    // doubled, to absorb out-of-range coords

int IMAGECOORD[2][BIGIMX2];

// Wrap around for smoothing coords, to avoid checks for -1 and BIGIMX
for (int nLoop = 0; nLoop < BIGIMX2; nLoop++)
{
    IMAGECOORD[1][nLoop] = nLoop % BIGIMX;
    IMAGECOORD[0][nLoop] = IMAGECOORD[1][nLoop];
}
BIGIMX is 1024 and BIGIMX2 is twice that size, 2048. I'm guessing that makes this lookup table about 16 KB in size. When the coords get processed, they can end up negative or higher than 1024, although not by more than twice the size of the texture. This was a pain to constantly check for in code, since it was only a problem at the boundaries.

So throughout the code, instead of doing something like ProcessPixel(PosX, PosY) I'll do something like ProcessPixel(IMAGECOORD[1][PosX], IMAGECOORD[1][PosY]). A lot of this code tends to be focused on a particular area of a texture, like scanning through it, but often with perturbations that might throw the coords around a bit.

I was curious to see if replacing all these lookups with a function call would speed things up, so I tried a test. Unfortunately it made virtually no difference. In fact the old code came out slightly faster, but not by any significant amount.
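
The replacement was something along these lines (wrapCoord is just an illustrative name, and the exact code differed):

Code:

// Computed wrap instead of a table read. Handles coords in
// (-BIGIMX, 2*BIGIMX), the same range the lookup table covers.
inline int wrapCoord(int c)
{
    if (c < 0)        return c + BIGIMX;
    if (c >= BIGIMX)  return c - BIGIMX;
    return c;
}

// ProcessPixel(IMAGECOORD[1][PosX], IMAGECOORD[1][PosY])
// then becomes ProcessPixel(wrapCoord(PosX), wrapCoord(PosY)).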

Wondering why the old code with lookups was just as good as the new code using functions: perhaps it's because the lookup table in question is moderately small. Perhaps so much of the table stays resident in the cache, and so little other data is needed in the cache at the same time, that reading values from it doesn't cause any slowdown.

So maybe small lookup tables are still faster than computing the values, if the entire table fits in the cache and doesn't crowd out other stuff?
"In theory there is no difference between practice and theory. In practice, there is." - Psychology Textbook.
xavier
OGRE Retired Moderator
Posts: 9481
Joined: Fri Feb 18, 2005 2:03 am
Location: Dublin, CA, US
x 22

Re: Experiment (Lookup vs Brute Force Calculate)

Post by xavier »

Zeal wrote: Larrabee (that thing is still coming, right?).
Yes. Honestly, I cannot say more than that; Intel has gone in the opposite direction on publicity regarding their Larrabee plans (relative to the early hype they let build up, only to fall short). That said, Larrabee is more than just alive and kicking at Intel. ;)
Do you need help? What have you tried?


Angels can fly because they take themselves lightly.
xavier
OGRE Retired Moderator
Posts: 9481
Joined: Fri Feb 18, 2005 2:03 am
Location: Dublin, CA, US
x 22

Re: Experiment (Lookup vs Brute Force Calculate)

Post by xavier »

mkultra333 wrote: So maybe small lookup tables are still faster than computing the values, if the entire table fits in the cache and doesn't crowd out other stuff?

It depends. Bandwidth and latency are two very different animals, and texture sampling is by far the slowest thing on the GPU (and the main reason they had to implement what amounts to fibering to hide the latency). It still takes tons of time to load the texture, sample it and give the shader the results. Of course, if you aren't thrashing the texture sampler a lot with the scene (CPU scene sorting to minimize texture swaps and making sure that you aren't moving the texture around between units a lot during a draw call) then the bulk of the latency is amortized much better.

As always, YMMV and it's a case-by-case optimization. ;)
Do you need help? What have you tried?


Angels can fly because they take themselves lightly.
mkultra333
Gold Sponsor
Posts: 1894
Joined: Sun Mar 08, 2009 5:25 am
x 114

Re: Experiment (Lookup vs Brute Force Calculate)

Post by mkultra333 »

I should probably have pointed out that my code above is all strictly CPU stuff; nothing gets done on the graphics card.

Early on, when I was learning shaders from the Nvidia CG tutorials, they talked about using a normalization cubemap and how it could speed things up. I implemented that, but it looked pretty rough. Later I learnt that not only were the results inferior to calculating it on the GPU, it was also vastly slower, because texture lookups are slow relative to just calling the normalize function.
"In theory there is no difference between practice and theory. In practice, there is." - Psychology Textbook.
mkultra333
Gold Sponsor
Posts: 1894
Joined: Sun Mar 08, 2009 5:25 am
x 114

Re: Experiment (Lookup vs Brute Force Calculate)

Post by mkultra333 »

On a side note,
xavier wrote: (CPU scene sorting to minimize texture swaps and making sure that you aren't moving the texture around between units a lot during a draw call)
I wasn't aware of that one, I'll have to check my shaders.
"In theory there is no difference between practice and theory. In practice, there is." - Psychology Textbook.
jjp
Silver Sponsor
Posts: 597
Joined: Sun Jan 07, 2007 11:55 pm
Location: Cologne, Germany

Re: Experiment (Lookup vs Brute Force Calculate)

Post by jjp »

mkultra333 wrote: Early on, when I was learning shaders from the Nvidia CG tutorials, they talked about using a normalization cubemap and how it could speed things up.
Well, the Nvidia CG tutorial is from the time when NV30 was their top GPU. On that architecture a texture lookup often is the better option :)

On topic: unless your compiler does no optimization at all, you trade the lookup for 1 MUL and 1 ADD (rndloc * (rndloc + 5)). I would have thought the nature of the memory hierarchy would be common knowledge by now?
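
(In other words, any optimizing compiler can be expected to factor the polynomial, something like:)

Code:

unsigned int asWritten(unsigned int rndloc)
{
    return (rndloc * rndloc) + (rndloc * 5) + 2;   // 2 MULs + 2 ADDs as written
}

unsigned int asOptimized(unsigned int rndloc)
{
    // Same value, 1 MUL + 2 ADDs; since the lookup path also does the "+ 2",
    // the real trade is one memory lookup vs. 1 MUL + 1 ADD.
    return rndloc * (rndloc + 5) + 2;
}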
Enough is never enough.
stealth977
Gnoll
Posts: 638
Joined: Mon Dec 15, 2008 6:14 pm
Location: Istanbul, Turkey
x 42

Re: Experiment (Lookup vs Brute Force Calculate)

Post by stealth977 »

mkultra:

Your code is an exception to my experiment. As I stated in the post, SEQUENTIAL MEMORY ACCESS is an EXCEPTION to the case. Memory architects usually design in FAVOR OF SEQUENTIAL ACCESS, so lookups that take advantage of it are still pretty fast. My scenario uses RANDOM ACCESS LOOKUPS, the kind of lookup you use to turn an arbitrary index into a value, not sequential indexes. With sequential indexes, since the cache line already holds up to the next 16 entries (assuming each holds 32-bit data on a 64-byte cache line), only the 1st access costs CPU time; the other 15 are almost free. But with random access, in the worst case every lookup pays the full miss penalty...
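
A minimal sketch of that cache-line effect (the 64-byte line and the array size are assumptions about a typical machine):

Code:

#include <ctime>
#include <iostream>

using namespace std;

int main()
{
    const int N = 1 << 24;                     // 64 MB of ints, far larger than any cache
    int* data = new int[N];
    for (int i = 0; i < N; i++) data[i] = 1;

    volatile int sink = 0;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)      sink += data[i];   // touches all 16 ints per 64-byte line
    clock_t t1 = clock();
    for (int i = 0; i < N; i += 16)  sink += data[i];   // touches 1 int per line
    clock_t t2 = clock();

    // If memory (not arithmetic) dominates, both loops pull the same number
    // of cache lines, so the stride-16 loop is nowhere near 16x faster.
    cout << "stride 1:  " << (t1 - t0) << " ticks" << endl;
    cout << "stride 16: " << (t2 - t1) << " ticks" << endl;

    delete[] data;
    return 0;
}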
Ismail TARIM
Ogitor - Ogre Scene Editor
WWW:http://www.ogitor.org
Repository: https://bitbucket.org/ogitor
stealth977
Gnoll
Posts: 638
Joined: Mon Dec 15, 2008 6:14 pm
Location: Istanbul, Turkey
x 42

Re: Experiment (Lookup vs Brute Force Calculate)

Post by stealth977 »

jjp wrote: On topic: unless your compiler does no optimization at all, you trade the lookup for 1 MUL and 1 ADD (rndloc * (rndloc + 5)). I would have thought the nature of the memory hierarchy would be common knowledge by now?
Yup, it's common knowledge; we all know that random memory access has quite a lot of pitfalls. BUT since we don't know HOW MUCH, most of us overlook the issue. Even when we try to optimize some code, we just try to make it use the minimum number of instructions, moving common calculations into lookups etc., which in turn actually slows our code down.

It gets worse if you are an old-school programmer like me (or even older, maybe; I have only been coding since 1984). We used very old architectures, architectures on which optimization meant the total opposite of what it means now, so what we learnt about optimization THEN is a huge barrier in front of us NOW.

Even worse, ready-to-use common libraries like the STL mostly use the old methods of optimization. Have you ever tried to replace the hash tables and vectors the STL uses with next-generation counterparts? Did you know that the next-gen ones are sometimes orders of magnitude faster? I use this example because those containers make up at least 20%-50% of our code nowadays. Just think of OGRE: I don't know how much impact that part of the code has on OGRE, but it sure would improve the efficiency of that code a lot...

Anyway, the conclusion is that yes, we know it's important; we just don't know how important it is, and how much more important it is becoming each day...
Ismail TARIM
Ogitor - Ogre Scene Editor
WWW:http://www.ogitor.org
Repository: https://bitbucket.org/ogitor
jjp
Silver Sponsor
Posts: 597
Joined: Sun Jan 07, 2007 11:55 pm
Location: Cologne, Germany

Re: Experiment (Lookup vs Brute Force Calculate)

Post by jjp »

stealth977 wrote: Even worse, ready-to-use common libraries like the STL mostly use the old methods of optimization. Have you ever tried to replace the hash tables and vectors the STL uses with next-generation counterparts?
Well, you are greatly exaggerating, I think :) Using a hash table to save two lousy arithmetic instructions per operation is a rather academic example. Do something more expensive and the picture changes. Also, I fail to see where the STL is designed in a way that works against modern computer architectures, or what "next generation counterparts" to hash tables and vectors would be. Both data structures are perfectly usable for efficient programming.
Enough is never enough.
_tommo_
Gnoll
Posts: 677
Joined: Tue Sep 19, 2006 6:09 pm
x 5

Re: Experiment (Lookup vs Brute Force Calculate)

Post by _tommo_ »

Well, hash tables have even more importance on "next generation" architectures... and it's obvious why:

with a hash table you do maths in order to do fewer memory accesses (potentially O(1)), while without the hash table you might need as many as O(n) memory accesses, which, as stated before, is VERY much slower.
So I do not think that general-purpose algorithms like sorts and maps will be replaced soon :D
Maybe we will just get new primitives, like scan...
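
A toy sketch of that trade-off, doing a little arithmetic up front to avoid touching memory O(n) times (the multiplicative hash constant and the open-addressing scheme are just one common choice):

Code:

#include <cstddef>
#include <utility>
#include <vector>

// Linear search: up to O(n) memory accesses per query.
int findLinear(const std::vector< std::pair<int, int> >& kv, int key)
{
    for (std::size_t i = 0; i < kv.size(); i++)
        if (kv[i].first == key) return kv[i].second;
    return -1;
}

// Open-addressed hash table: a little arithmetic, then O(1) expected accesses.
// Assumes a power-of-two table size, non-negative keys, empty slots marked
// with key == -1, and a table that is never completely full.
int findHashed(const std::vector< std::pair<int, int> >& table, int key)
{
    std::size_t mask = table.size() - 1;
    for (std::size_t i = ((unsigned)key * 2654435761u) & mask; ;
         i = (i + 1) & mask)
    {
        if (table[i].first == key) return table[i].second;  // hit
        if (table[i].first == -1)  return -1;               // empty slot: not present
    }
}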
OverMindGames Blog
IndieVault.it: The new Italian portal for Game Dev & Indie Games
stealth977
Gnoll
Posts: 638
Joined: Mon Dec 15, 2008 6:14 pm
Location: Istanbul, Turkey
x 42

Re: Experiment (Lookup vs Brute Force Calculate)

Post by stealth977 »

Umm, I guess I have been misunderstood about the hash tables and vectors and maps...

What I meant is that the next-gen algorithms for them take advantage of current architectures and get a huge speed boost. It has already been shown that the new algorithms outperform the std implementations, with differences anywhere between 2x and 10+x...

Should we use them? Maybe not... But we need to keep an eye on them; you never know when you will need them (and we are on a forum where performance is mostly a requirement, not an option)...


NOTE: For example, there exists a TREE algorithm (forms the base of hashmaps and maps) which uses a hybrid approach to reach the leaves: not only using lookups for the leaf indexes, but most of the time CALCULATING the index instead of doing a FAR RANDOM LOOKUP, achieving a huge speed boost over common tree algorithms and resulting in much faster hashmaps/maps.
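
One concrete shape such a hybrid can take (my own illustration, not necessarily the specific algorithm meant above): store a sorted tree in a flat array in breadth-first order, so each child's position is calculated from the parent's index instead of loaded through a pointer:

Code:

#include <cstddef>
#include <vector>

// Implicit (breadth-first) binary search tree in a flat array: the children
// of node i live at 2*i+1 and 2*i+2, so the descent is pure index arithmetic
// with no pointer chasing, and the first few levels of the tree stay hot in
// cache across many searches. Assumes 'tree' encodes a BST in level order.
int searchImplicit(const std::vector<int>& tree, int key)
{
    std::size_t i = 0;
    while (i < tree.size())
    {
        if (tree[i] == key) return (int)i;        // found: return the node's index
        i = 2 * i + (key < tree[i] ? 1 : 2);      // computed, not looked up
    }
    return -1;                                    // not found
}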
Ismail TARIM
Ogitor - Ogre Scene Editor
WWW:http://www.ogitor.org
Repository: https://bitbucket.org/ogitor
jjp
Silver Sponsor
Posts: 597
Joined: Sun Jan 07, 2007 11:55 pm
Location: Cologne, Germany

Re: Experiment (Lookup vs Brute Force Calculate)

Post by jjp »

stealth977 wrote:(forms the base of hashmaps and maps)
Hash maps are not based on trees. std::map is based on a tree structure and it is not a hash map. boost::unordered_map is a hash map, for example.

What you describe (calculating an index when you want to retrieve something) is broadly speaking the definition of a hash map.
Enough is never enough.
stealth977
Gnoll
Posts: 638
Joined: Mon Dec 15, 2008 6:14 pm
Location: Istanbul, Turkey
x 42

Re: Experiment (Lookup vs Brute Force Calculate)

Post by stealth977 »

jjp wrote:Hash maps are not based on trees. What you describe (calculating an index when you want to retrieve something) is broadly speaking the definition of hashing, and of hash maps.
???Are you serious??? ALL MAPS DEPEND ON TREES. Period.

HashMaps are MAPS where the index is turned into an integer KEY, and then this KEY is searched for inside the BINARY(*) TREE to reach the VALUE.

It's my mistake that I used the term INDEX in my previous explanation; I actually meant KEY. So what I mean is, there are some next-gen algorithms where, while searching for the KEY, the tree is traversed to its leaves using a hybrid approach: not only are pointers used (the old method), but the pointers are also ordered/cached to take advantage of cache coherency, plus (the most important part) some intermediate nodes are skipped on the way down by substituting calculations for lookups.


* Different types of TREEs can be substituted. In my previous post I wanted to mention that there exists a much faster tree implementation.
Ismail TARIM
Ogitor - Ogre Scene Editor
WWW:http://www.ogitor.org
Repository: https://bitbucket.org/ogitor
stealth977
Gnoll
Posts: 638
Joined: Mon Dec 15, 2008 6:14 pm
Location: Istanbul, Turkey
x 42

Re: Experiment (Lookup vs Brute Force Calculate)

Post by stealth977 »

Anyway, the idea behind this thread is not to discuss different methods or specific algorithms. The idea is simple:

1 - We all know that memory access affects our applications' performance even more than it used to.
2 - We mostly skip over this knowledge, give it the lowest priority, or only try to comply with it at a broad level.
3 - This is because we do not actually know how important it is becoming.
4 - We compare our applications with other people's applications, and since the flaw is universal, we think we are doing it right.

BUT;
At some point we see that someone's application works much more efficiently than the rest, our jaws drop, and we start to wonder what advanced methods they are using, when the answer actually lies simply in how they utilize the current architecture...

I guess my most important gain from opening this thread is the answer I got from tuan. Thank you so much; in my opinion, the articles you pointed to are a must-read for any developer...
Ismail TARIM
Ogitor - Ogre Scene Editor
WWW:http://www.ogitor.org
Repository: https://bitbucket.org/ogitor
CABAListic
OGRE Retired Team Member
Posts: 2903
Joined: Thu Jan 18, 2007 2:48 pm
x 58

Re: Experiment (Lookup vs Brute Force Calculate)

Post by CABAListic »

stealth977 wrote:
jjp wrote:Hash maps are not based on trees. What you describe (calculating an index when you want to retrieve something) is broadly speaking the definition of hashing, and of hash maps.
???Are you serious??? ALL MAPS DEPEND ON TREES. Period.

HashMaps are MAPS where the index is turned into an integer KEY, and then this KEY is searched for inside the BINARY(*) TREE to reach the VALUE.
Hash maps do not use binary trees; that would completely defeat their purpose. :)
A map is an abstract concept: mapping keys to values. Maps can be implemented in very different ways; trees are just one. You can also use skip lists or hash tables, and the latter gives you the hash map.
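
In C++ terms (boost::unordered_map at the time this was written, std::unordered_map in C++11), the same abstract map with two different implementations underneath, as a sketch:

Code:

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>   // boost/unordered_map.hpp in 2010-era code

int main()
{
    // Same abstract interface, different machinery underneath:
    std::map<std::string, int> treeMap;            // red-black tree: O(log n), kept sorted
    std::unordered_map<std::string, int> hashMap;  // hash table: O(1) expected, unsorted

    treeMap["ogre"] = 1;
    hashMap["ogre"] = 1;

    std::cout << treeMap["ogre"] + hashMap["ogre"] << std::endl;  // prints 2
    return 0;
}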
stealth977
Gnoll
Posts: 638
Joined: Mon Dec 15, 2008 6:14 pm
Location: Istanbul, Turkey
x 42

Re: Experiment (Lookup vs Brute Force Calculate)

Post by stealth977 »

CABAListic:

You are actually right: maps and hashmaps can be implemented without using TREEs, but those variants are mostly suitable either for small maps or for cases where you deliberately trade storage space for speed.

The reason is that in the optimal case, where a hashmap hashes every key to its own bucket, you get the best performance and the worst memory usage. So especially for general-purpose hashmaps, where the coder doesn't know the actual number of items to be stored, the hash is arranged so that more than one KEY can map to the same bucket, and that bucket becomes a linked list of key/value pairs.

But a lot of the time, instead of this approach, BALANCED TREES are used for the buckets, since they guarantee O(log n) rather than O(n).
(check http://en.wikipedia.org/wiki/Hash_table for the different methods used)
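
A minimal sketch of the separate-chaining scheme being described (fixed bucket count and non-negative integer keys, purely to illustrate):

Code:

#include <list>
#include <utility>
#include <vector>

// Separate chaining: each bucket is a linked list of key/value pairs.
// With a good hash and enough buckets, chains stay short (O(1) expected);
// a degenerate hash collapses everything into one O(n) chain.
class ChainedMap
{
    std::vector< std::list< std::pair<int, int> > > buckets;
public:
    explicit ChainedMap(std::size_t n) : buckets(n) {}

    void insert(int key, int value)   // assumes non-negative keys
    {
        buckets[key % buckets.size()].push_back(std::make_pair(key, value));
    }

    int* find(int key)
    {
        std::list< std::pair<int, int> >& chain = buckets[key % buckets.size()];
        for (std::list< std::pair<int, int> >::iterator it = chain.begin();
             it != chain.end(); ++it)
            if (it->first == key) return &it->second;
        return 0;   // not found
    }
};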
Ismail TARIM
Ogitor - Ogre Scene Editor
WWW:http://www.ogitor.org
Repository: https://bitbucket.org/ogitor
CABAListic
OGRE Retired Team Member
Posts: 2903
Joined: Thu Jan 18, 2007 2:48 pm
x 58

Re: Experiment (Lookup vs Brute Force Calculate)

Post by CABAListic »

I'm not sure what you are referring to. Do you mean the section about separate chaining? Sure you can use a tree there, but as the article hints, given a good hash table size and a fitting hash function, this is very likely overkill.
jjp
Silver Sponsor
Posts: 597
Joined: Sun Jan 07, 2007 11:55 pm
Location: Cologne, Germany

Re: Experiment (Lookup vs Brute Force Calculate)

Post by jjp »

stealth977 wrote:
jjp wrote:Hash maps are not based on trees. What you describe (calculating an index when you want to retrieve something) is broadly speaking the definition of hashing, and of hash maps.
???Are you serious??? ALL MAPS DEPEND ON TREES. Period.
Uhm, no. Period. As already said: trees can be used to implement maps, allowing retrieval with O(log n) memory accesses, while a hash map needs only O(1) memory accesses on average. There are of course a lot of other factors, and depending on the circumstances a tree-based map might be the better choice, but the concept of a map is in no way tied to trees. Maybe you should get yourself a refresher on the theoretical side of data structures :)? I hate quoting Wikipedia, but here it is: http://en.wikipedia.org/wiki/Associative_array
Enough is never enough.
jacmoe
OGRE Retired Moderator
Posts: 20570
Joined: Thu Jan 22, 2004 10:13 am
Location: Denmark
x 179

Re: Experiment (Lookup vs Brute Force Calculate)

Post by jacmoe »

Yes, but is hash_map not the only map not using a tree?
/* Less noise. More signal. */
Ogitor Scenebuilder - powered by Ogre, presented by Qt, fueled by Passion.
OgreAddons - the Ogre code suppository.