Failed cache friendliness.



Post by mkultra333 »

Not a question, just an observation.

My coding style is procedural rather than object oriented; I basically just have a few huge "god" objects handling the overall program structure. All my gamestate stuff, like monsters, doors, items, FX, etc., is held in huge arrays of typedef structs, which I loop through to do various updates. So there's the raw data, and functions that operate on that data.

My structs just get data added as I see the need. I mostly don't worry about padding issues or the like, or even the size of the structs. Some of them, such as for the monsters, get really big. As an example, here's a struct used for Holograms, which are more or less like billboards.

Code:

typedef struct
{
	int Type ;
	int OrientationType ;
	Ogre::Quaternion Orientation ;
	int LastUpdateFrame ;
	double BirthTime ;
	double DeathTime ;
	float Red ;
	float Green ;
	float Blue ;
	Ogre::Vector3 Corner[4] ;
	Ogre::Vector3 Centre ;
	Ogre::Vector3 Origin ;
	float U0 ;
	float V0 ;
	float U1 ;
	float V1 ;
	int Zone ; // what zone the hologram is in
}
HOLOGRAM ;
The loop that updates the holograms goes through them all and skips any that are type HOLOGRAM_TYPE_INACTIVE, in the wrong zone, or expired (past their DeathTime).
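That update loop would look roughly like this. This is a minimal sketch, not the actual code: the constants, array size, and the zone/time parameters are assumptions, and the struct is trimmed to just the fields the loop tests.

```cpp
// Hypothetical constants and a trimmed-down HOLOGRAM; names are guesses.
enum { HOLOGRAM_TYPE_INACTIVE = 0, HOLOGRAM_TYPE_BILLBOARD = 1 };
const int MAX_HOLOGRAM = 256;

struct HOLOGRAM {
    int    Type;
    int    Zone;
    double DeathTime;
    // ... orientation, corners, colour, UVs elided ...
};

HOLOGRAM Hologram[MAX_HOLOGRAM]; // zero-initialized, i.e. all INACTIVE

// Walk the whole array once per frame, skipping entries that need no work.
int UpdateHolograms(int currentZone, double now)
{
    int updated = 0;
    for (int i = 0; i < MAX_HOLOGRAM; ++i) {
        const HOLOGRAM& h = Hologram[i];
        if (h.Type == HOLOGRAM_TYPE_INACTIVE) continue; // unused slot
        if (h.Zone != currentZone) continue;            // not in this zone
        if (h.DeathTime <= now) continue;               // already expired
        ++updated; // real code would rebuild corners, UVs, colour here
    }
    return updated;
}
```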

I wasn't having any speed or performance problems with any of this; in fact it seems surprisingly zippy. But I wondered if I could improve it nonetheless by trying to be more cache friendly... despite the fact that I don't really know much about programming for cache friendliness. The L1 data cache on the computer in question was 32 KB, with 64-byte cache lines.

I did the following. First, I split the Type, Zone and DeathTime data into three separate new arrays. Then I modified and optimized the remaining data until the remaining struct was under 64 bytes in size (tested with sizeof). My reasoning was as follows:

1. By putting the loop-tested data in separate arrays, more of the tests would be loaded at once. For instance, by making Type a byte and putting it in its own array, 64 Type values would be loaded into the cache at a time, allowing the loop to quickly skim through 64 sequential HOLOGRAM_TYPE_INACTIVE holograms without any cache misses. Similarly for Zone and DeathTime (I changed their types to shorts).

2. By making the entire remaining struct under 64 bytes, whenever there was a hologram that wasn't INACTIVE and needed to be actively updated, the entire hologram would be loaded into a single cache line. All the data would be ready to go without needing to look beyond the cache. I made sure that the very first field examined was the first element in the struct, so that if it wasn't already cached the entire hologram would be loaded.
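Point 1 amounts to a structure-of-arrays split on the loop-tested fields. A sketch of what that looks like, with hypothetical names and sizes (the original changed Zone and DeathTime to shorts; DeathTime is kept as a float here just to keep the comparison readable):

```cpp
#include <stdint.h>

enum { HOLOGRAM_TYPE_INACTIVE = 0 };
const int MAX_HOLOGRAM = 256;

// Loop-tested fields pulled into their own tightly packed arrays:
// one 64-byte cache line now carries 64 Type values or 32 Zone values.
uint8_t HologramType[MAX_HOLOGRAM];
int16_t HologramZone[MAX_HOLOGRAM];
float   HologramDeathTime[MAX_HOLOGRAM];

int CountActive(int currentZone, float now)
{
    int active = 0;
    for (int i = 0; i < MAX_HOLOGRAM; ++i) {
        // Skimming a run of inactive entries touches only the Type
        // array, so 64 consecutive rejections cost one line fetch.
        if (HologramType[i] == HOLOGRAM_TYPE_INACTIVE) continue;
        if (HologramZone[i] != currentZone) continue;
        if (HologramDeathTime[i] <= now) continue;
        ++active; // the "hot" sub-64-byte struct would be touched here
    }
    return active;
}
```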


I then ran some timing tests. But what I found was that performance either showed no change at all or was a little worse. Considering it a failed experiment, I just went back to the simple, untidy, single struct I'd been using to begin with.

I wonder why it didn't work. My guess is that the compiler's optimizer had already worked out the best way to set up the loop, and my restructuring interfered with those optimizations. Also, cutting the struct size down to under 64 bytes meant several extra calculations had to be made when an active hologram was found, to reconstruct the missing data.
"In theory there is no difference between practice and theory. In practice, there is." - Psychology Textbook.

Re: Failed cache friendliness.

Post by bstone »

What about data alignment? Try padding your structure to exactly 64 bytes and then aligning the array on a 64-byte boundary. That will ensure each element fits exactly one cache line. Otherwise you might be spanning two lines per structure.
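A sketch of that suggestion, under assumptions: the field layout is invented (plain float triples standing in for Ogre::Vector3), and it uses C++11 `alignas`; on VS2008 the equivalent would be `__declspec(align(64))`.

```cpp
// Pad the hot struct to exactly 64 bytes and start it on a 64-byte
// boundary, so every array element occupies exactly one cache line.
struct alignas(64) HologramHot {
    float Corner[4][3];   // 48 bytes: stand-in for Ogre::Vector3 Corner[4]
    float U0, V0, U1, V1; // 16 bytes: 64 total, no padding needed
};

// Compile-time checks that the one-line-per-element property holds.
static_assert(sizeof(HologramHot) == 64,  "must fill one cache line");
static_assert(alignof(HologramHot) == 64, "must start on a line boundary");
```

With an unaligned 60-byte struct, by contrast, most elements straddle two cache lines, which is exactly the spanning problem described above.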

And I would recommend profiling (e.g. comparing cache misses) instead of guessing.

Re: Failed cache friendliness.

Post by Klaim »

Isn't cache friendliness dependent on the access pattern too? I mean, if you use the whole structure's data each time you process it, then adding an indirection will not help. It's when you're processing only part of it in different processes that it should potentially help.

Re: Failed cache friendliness.

Post by mkultra333 »

bstone wrote:
And I would recommend profiling (e.g. comparing cache misses) instead of guessing.
I haven't used a profiler before. I looked them up for VS2008 Express; there weren't too many free ones, but eventually I gave Very Sleepy a shot. I couldn't extract cache misses from it, though, and it didn't really give clear info about this one loop.

So I decided to just measure CPU clocks instead, with some code I found. It made me realize that the first half of my caching plan wouldn't work (the idea that I could rapidly skip through the INACTIVE holograms). Even when I hit an inactive one, I was still doing some processing, so moving through the loop was a negligible part of the time (which I could tell simply by making the loop test skip every single time).
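For reference, a portable stand-in for that kind of measurement (the poster used raw CPU clock counts from found code; this sketch uses std::chrono instead, and the workload is a dummy):

```cpp
#include <chrono>

// Time a loop body and return elapsed nanoseconds. To compare variants,
// run each one here and compare the returned times.
long long TimeLoopNanos(int iterations)
{
    using Clock = std::chrono::steady_clock;
    volatile int sink = 0; // keeps the dummy loop from being optimized away
    Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += i; // stand-in for the hologram update work
    Clock::time_point end = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

Note that wall-clock or cycle timing only shows *that* a variant is slower, not *why*; confirming cache behaviour still needs hardware counters, as suggested above.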

I now suspect the second half of my caching plan, of fitting the entire struct into a single cache line, is probably doomed as well. What I might theoretically gain with cache hits I almost definitely lose with the extra processing needed to reconstruct the data squeezed out of the shrunken struct. (Mostly to do with reconstructing those Vector3 corners, since they're the main victim of compressing the struct.)

In reality, I knew these loops weren't a hotspot; I had just wondered if I could make them better. The lion's share of CPU every frame goes to Bullet running physics, which pretty much swamps everything else on the CPU.
Klaim wrote:
Isn't cache friendliness dependent on the access pattern too? I mean, if you use the whole structure's data each time you process it, then adding an indirection will not help. It's when you're processing only part of it in different processes that it should potentially help.
Hmm, that might be true for temporal locality, but I already have pretty terrible temporal locality; that's just the nature of these kinds of things: a big array that for the most part just gets fully traversed once per frame. I was aiming more for spatial locality, so that data that needs to be accessed is generally near data that has already been accessed.
"In theory there is no difference between practice and theory. In practice, there is." - Psychology Textbook.

Re: Failed cache friendliness.

Post by bstone »

Yeah, anyway, without a decent profiling setup all optimizations will be hit and miss. Does the C++ compiler support the "/profile" linker option in VS2008 Express? I know Express lacks the profiler just like the Professional edition, but what about "/profile"? There's a way to use that if it's there.

Another idea would be to adapt Ogre's profiler to use performance counters to gather cache-miss and branching stats alongside the time spent.