My coding style is procedural rather than object oriented; I basically just have a few huge "god" objects handling the overall program structure. All my gamestate stuff, like monsters, doors, items, FX, etc, is held in huge arrays of typedef structs, which I loop through to do various updates. So there's the raw data, and functions that operate on that data.
My structs just get data added as I see the need. I mostly don't worry about padding issues or the like, or even the size of the structs. Some of them, such as for the monsters, get really big. As an example, here's a struct used for Holograms, which are more or less like billboards.
Code: Select all
typedef struct
{
	int              Type ;              // e.g. HOLOGRAM_TYPE_INACTIVE, or which kind of hologram it is
	int              OrientationType ;
	Ogre::Quaternion Orientation ;
	int              LastUpdateFrame ;
	double           BirthTime ;
	double           DeathTime ;
	float            Red ;               // colour tint
	float            Green ;
	float            Blue ;
	Ogre::Vector3    Corner[4] ;         // the four corners of the billboard quad
	Ogre::Vector3    Centre ;
	Ogre::Vector3    Origin ;
	float            U0 ;                // texture coordinates
	float            V0 ;
	float            U1 ;
	float            V1 ;
	int              Zone ;              // what zone the hologram is in
}
HOLOGRAM ;
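For context, the per-frame update over these is just a plain loop over the whole array, something along these lines (simplified, and the names here are made up for the example rather than lifted straight from my code):
Code: Select all
#define MAX_HOLOGRAM            4096   // made-up cap, just for the example
#define HOLOGRAM_TYPE_INACTIVE  0      // assumed value, only the name is real

HOLOGRAM Hologram[MAX_HOLOGRAM] ;      // the big global array

void UpdateHolograms(double CurrentTime, int CurrentZone)
{
	for(int nHolo = 0 ; nHolo < MAX_HOLOGRAM ; nHolo++)
	{
		if(Hologram[nHolo].Type == HOLOGRAM_TYPE_INACTIVE) continue ;  // empty slot, skip it
		if(Hologram[nHolo].Zone != CurrentZone)            continue ;  // not in the zone being updated

		if(CurrentTime > Hologram[nHolo].DeathTime)
		{
			Hologram[nHolo].Type = HOLOGRAM_TYPE_INACTIVE ;            // expired, free the slot
			continue ;
		}

		// ... recalculate Corner[4], colour fade, UVs etc for the live hologram ...
	}
}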
I wasn't having any speed or performance problems with any of this, in fact it all seemed surprisingly zippy, but I wondered if I could improve it nonetheless by trying to be more cache friendly... despite the fact that I don't really know much about programming for cache friendliness. The L1 data cache on the computer in question was 32KB, with cache lines of 64 bytes.
I did the following. First, I split the Type, Zone and DeathTime data into three separate new arrays. Then I modified and optimized the remaining data until the struct was under 64 bytes in size (tested with sizeof). My reasoning was as follows:
1. By putting the loop-tested data in separate arrays, more of the tests would be loaded at once. For instance, by making Type a byte and putting it in its own array, 64 Type values would be loaded into the cache at a time, allowing the loop to quickly skim through 64 sequential HOLOGRAM_TYPE_INACTIVE holograms without any cache misses. Similar for Zone and DeathTime (I changed their types to shorts).
2. By making the entire remaining struct under 64 bytes, whenever a hologram that wasn't INACTIVE needed to be actively updated, the entire hologram would be loaded into a single cache line. All the data would be ready to go without needing to look beyond the cache. I made sure that the very first info looked at was the first element in the struct, so if that wasn't in the cache already the entire hologram would be loaded. (There's a rough sketch of this layout just below.)
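To show the shape of what I ended up with (this isn't the exact layout, and the array and struct names here are invented for the example), the split version looked roughly like this:
Code: Select all
// loop-tested data pulled out into parallel arrays of small types
unsigned char HologramType[MAX_HOLOGRAM] ;       // 1 byte each, so 64 per cache line
short         HologramZone[MAX_HOLOGRAM] ;
short         HologramDeathTime[MAX_HOLOGRAM] ;  // time squeezed down into a short

// what was left of the struct, trimmed until sizeof() came in under 64 bytes
typedef struct
{
	Ogre::Quaternion Orientation ;       // 16 bytes
	Ogre::Vector3    Centre ;            // 12 bytes
	float            HalfSize ;          //  4 bytes -- Corner[4] dropped, rebuilt when needed
	float            Red, Green, Blue ;  // 12 bytes
	float            U0, V0, U1, V1 ;    // 16 bytes
	// other fields (OrientationType, LastUpdateFrame, BirthTime, Origin...) dropped,
	// packed smaller or moved elsewhere -- hence the reconstruction work mentioned below
}
HOLOGRAM_SLIM ;                          // = 60 bytes with 4-byte alignment

// note: strictly, a sub-64-byte struct only stays inside one cache line if each
// array element is also aligned to a 64 byte boundary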
I then ran some timing tests. But what I found was that the performance either hadn't changed at all, or was a little worse. Considering it a failed experiment, I just went back to the simple, untidy, single struct I'd been using to begin with.
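(For what it's worth, by "timing tests" I mean something shaped roughly like the snippet below, using the stand-in UpdateHolograms() from the earlier sketch. This is just to show the shape of the test, not my actual harness.)
Code: Select all
#include <chrono>
#include <cstdio>

void TimeHologramUpdate()
{
	auto StartTime = std::chrono::steady_clock::now() ;

	for(int nRun = 0 ; nRun < 1000 ; nRun++)      // repeat so the numbers aren't pure noise
		UpdateHolograms(0.0, 0) ;                 // placeholder time/zone arguments

	auto EndTime = std::chrono::steady_clock::now() ;

	double TotalMs = std::chrono::duration<double, std::milli>(EndTime - StartTime).count() ;
	printf("average update: %f ms\n", TotalMs / 1000.0) ;
}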
I wonder why it didn't work. My guess is that the compiler optimizer had already worked out the best way to set up the loop and my restructuring interfered with those optimizations. Also, cutting the struct size down to under 64 bytes meant several extra calculations had to be made whenever an active hologram was found, to reconstruct the missing data.