Page 1 of 2

Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 6:54 am
by Kojack
I've just finished a semi broken implementation of Tom Forsyth's "Linear-Speed Vertex Cache Optimisation" paper in Ogre (started yesterday, just before my psu died).

Most people probably don't think of it, but graphics cards have a very limited cache for transformed vertices. Different hardware has different sizes, but it's probably something small like 32. Using a rendering primitive like an indexed triangle list can give good performance gains because the indices reuse vertices rather than having 3 unique ones per vertex, but with only around 32 slots available in the cache, many vertices will still have to be recalculated (have their vertex shader run) more than once.

This paper covers a technique for high speed rearranging of the index buffer of a triangle list to make it more cache friendly, without knowing what cache size your target hardware is.

The paper is a bit vague on some bits so my version is slower than it should be (currently takes about 2 minutes to brute force process a high poly mesh, should take maybe a second or less with the proper shortcuts implemented) and I got the cache direction backwards when calculating the score for a vertex, but already the results are interesting.

So far to test it I've used the Stanford Bunny model. It has 35947 vertices and 69451 faces.
Loading the bunny's vertex and index data into a SimpleRenderable in ogre and rendering it (nothing else) I'm getting 785fps (measured in fraps on my geforce 7800gtx).
Loading the bunny's data the same way but giving the index buffer to my optimiser first, the same mesh is now rendering at 1250fps.
This is just fixed function, I'd assume heavy vertex shader use could gain even more from it.
Of course it's not suited to small meshes as much, it won't hurt but the gains wouldn't be as noticeable.

The code needs a ton of work (like speeding up, I'm doing an O(n^2) version because Tom's O(n) version was a little confusing, but I think I get it now), but I think it has potential.
Plus this is only stage 1 of the optimisation process, the full version has 2 more stages.

There's 4 possible places for this when (if) I finish it.
- stand alone code, stick it in your program to optimise what you want
- as an option in OgreXMLConverter
- as an option in meshmagic
- added to Ogre's MeshManager or some other suitable place (only if I REALLY clean it up and can show some consistent benefits).
The advantage of having it in Ogre itself instead of just in the tools would be so you could call it on a ManualObject, StaticGeometry or other runtime constructed mesh.

Damn, I've got to go out. When I get home I think I'll implement another Tom Forsyth paper I've been staring at for ages. :)

Down with teapots! Long live the Stanford Bunny!

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 7:45 am
by Wretched_Wyx
Hey, very cool work there Kojack, and quite helpful on the optimization end. I can definitely see this being extended, and definitely proving useful when dealing with large poly counts.

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 9:04 am
by Kojack
Looks like at the moment (from some debug info which shows cache hits during the optimise) I'm achieving about 1.48 vertices processed per triangle, instead of the worst case scenario of 3 per triangle. That matches the framerate change, so it's probably about right.

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 9:17 am
by AshMcConnell
Wow, some impressive gains there! Would be great to see it in Ogre :)

Thanks Kojack!

All the best,
Ash

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 10:25 am
by Shadow007
You may want to have a look at one "related" thread and paper in http://www.ogre3d.org/forums/viewtopic. ... on#p248884

The related paper is "Fast Triangle Order Optimization for Vertex Locality and Reduced Overdraw" (SIGGRAPH 2007)
Which you can find in Pedro Sander's Page at http://www.cse.ust.hk/~psander/.
Please note the mention under it :
Source code: Now available for researchers and game developers upon request.
Sander also wrote a newer related paper "Efficient Traversal of Mesh Edges using Adjacency Primitives" with the following abstract extract:
In addition, we extend two existing vertex cache optimization algorithms to produce cache-efficient traversal orderings for adjacency primitives.
It should be noted that I don't think Ogre uses "adjacency primitives" (except perhaps in the DX10 RenderSystem) so this paper may be not as generrally usefull.

In the mean-time, well done !

PS : are you sure there's not already some kind of Mesh optimization in MeshMagick ?

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 11:26 am
by SpannerMan
Wow, sounds impressive Kojack. So if this was somehow integrated into Ogre it 'could' give speed enhancements on bigger meshes for free - no catches?

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 12:01 pm
by Shadow007
It's "for free" once the optimization has taken place. The optimization in itself is computation intensive (less for the O(n) version of course), and I don't think it can/should be done "real-time" during rendering.

Or in a separate thread ? let's use those multi-cores we have ... For example a "worker thread" could work on an optimized version, while the original is displayed, and once the optimization is done, switch to the new one ...

Though at preprocess, at the scene's loading, ... these are good points in time I think.
Perhaps it can also be applied to StaticGeometry at it's creation too...

In addition to TomF's Algorithm, an other pass may be inserted, that "degrades" the Vertex Cache optimization, in favor of reduced overdraw: the sequence is split in smaller sub-sequences, and theses sub-sequences are sorted so that "outside looking" fragments are displayed first. I don't know how much impact this pass has, the paper is not completely clear about that

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 12:29 pm
by xadhoom
Hi!

My feeling is that the easiest way would be to integrate this into Meshmagick and hopefully every exporter.
There could be a log message in debug mode which reminds the user of updating the meshes (as it already happens with older mesh versions).

A "clean" application startUp shouldn´t bother around with precomputable optimisations...

xad

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 4:05 pm
by Kojack
The whole thing is pretty tiny so far (main code is only like 150 lines, and another 20 or so for some variables and stuff).

Thanks for the link Shadow. I had a quick search on here but didn't spot anything similar.

No idea if mesh magic has something similar already. This was a spur of the moment thing. Yesterday I was going to implement something from tom's page as an experiment, but as I started to write it I discovered that a bunch of Ogre's demo meshes are broken. So I started thinking about fixing them (I've got a post on the problems written, but I haven't posted it yet because I wanted to make replacements myself rather than just saying someone else should fix it), which reminded me of the optimiser article by tom, so I decided coding it for a few hours was more fun than modelling. If it turns out useful, cool. If not, or I can't get it fast, it's an interesting distraction.

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 4:29 pm
by Shadow007
I just had a quick look at MeshMagick, and it indeed has a mesh "optimization" process,
but as far as I see it mostly unduplicates vertices...
So it's far from Vertex Cache Optimisation, although as a first "prep" pass it could well do...

Which means VCOptimization is a "go"...

I think, you could perhaps integrate the code in one of the tools (perhaps MeshMagick ?), let it "mature" there, and when you've got something sufficiently clean/performant, see with Sinbad about importing it into OgreMain... That way it could be used any of the tools or at runtime if needed.

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 4:44 pm
by tuan kuranes
You can check ATI tootle on amd website, which is "sanders" paper implementation. there is a "ogretootle" integration, and I added it to ceguimeshviewer. Sanders tootle one is very fast, if option are used wisely. code is nearly all in the paper anyway.

It's hard to do benchmark on VC benefits as you might need several mesh (some are badly ordered at start, others aren't), shaders, and even scenes where bottleneck is "Vertex" bound (vc benefits) or "pixel" bound (overdraw reduce count bene

Ogre also already has a vc cache optimizer i think, check Ogre::IndexBuffer::optimize() or something.
its)

Re: Vertex Cache Optimisation

Posted: Tue Jan 20, 2009 8:21 pm
by Kojack
Damn, seems everyone has beaten me to it. :)
(I did search, but nothing seemed obvious)

Tootle does sound interesting. I'll take a look.

I just tested the built in ogre optimiser (IndexData::optimiseVertexCacheTriList).
It bumps my test scene from 785fps to 810fps.
So my one at 1200+fps isn't obsolete yet. But as you say, measuring this stuff realistically is near impossible.


Edit: Ok, I just fixed the cache direction bug (I ordered my cache as index zero is least recently used, Tom's scoring code used index zero as most recently used).
Now when optimised by my code, the scene is 1815fps.
Still a bit useless if I can't speed up the code. Time to sleep then stare at it a bit more.

Re: Vertex Cache Optimisation

Posted: Wed Jan 21, 2009 9:48 am
by _tommo_
That's a good improvement! :D

I don't find it useless at all, also because yours can be more integrated in Ogre... the question is, how much time it requires to optimize a Mesh?

Giving that it's always useful, and has to run only once for each file, for me it could be integrated at least in the mesh converter, which should use it by default.
Maybe, OgreMain could give an advice like "your mesh is unoptimized" when loading...

Re: Vertex Cache Optimisation

Posted: Wed Jan 21, 2009 10:19 am
by xadhoom
I fully agree with _tommo_. :D

Re: Vertex Cache Optimisation

Posted: Wed Jan 21, 2009 12:18 pm
by Shadow007
I agree wth _tommo_ too...

But of course, if you could get from this O(n^2) to O(n) it would be much better.

I think the "cheat" that makes it work in O(n) is to only check for inclusion the triangles that are referenced from the LRU cache list... Except of course when no triangles are left, in which case it tests the whole list.

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 9:15 am
by Kojack
After way too many hours trying to track down a bug in my new version (which turned out to be a stupid mistake), I've finally got the new optimised optimiser working.

Some performances for comparison.
Model: Standford Bunny (35947 vertices and 69451 triangles) with normals, stored in standard mesh file and loaded as an entity. One light source.
Computer: i7 940, geforce 7800gtx, vista64

No optimisation: 410fps
Ogre's optimiseVertexCacheTriList(): 704fps, 11 seconds to process
My version of Tom's optimiser: 958fps, less than 1 second to process

(previous rates were higher because model was unlit)

Not too bad. Not only is this technique giving better results than the built in one, but it's also near instant instead of taking 11 seconds (or several minutes like my last version).

Now that I have it working faster (and I haven't chucked it in my vtune demo yet), I'm going to restructure it a bit better, and put some easier support in for getting to the buffers (currently it needs to be given an IndexBuffer pointer, I'd rather give it an entity and let it find the index buffers for all sub meshes).

I tried using OgreTootle, but I can't get it to run (even if I build it myself). might be a vista problem, I'll try it on xp later.

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 10:20 am
by AshMcConnell
Wow, good work Kojack, can't wait to try it out, it will be a great addition to ogre!

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 11:11 am
by haffax
Kojack wrote: No optimisation: 410fps
Ogre's optimiseVertexCacheTriList(): 704fps, 11 seconds to process
My version of Tom's optimiser: 958fps, less than 1 second to process
Sounds very good. Congratulations!
Kojack wrote: I'd rather give it an entity and let it find the index buffers for all sub meshes).
Don't forget to also let it work on a mesh too. If you want to include it in one of the offline tools (as e.g. meshmagick optimise, hint, hint.), then you don't have an Entity.

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 11:31 am
by Kojack
Yep, it will definitely work on non entities. Renderops, index buffers, meshes, etc should all be easy to handle.

I think I might try stage two of Tom's optimiser, which manipulates the vertex buffer (so far I've only touched the index buffer). It's a much simpler and smaller algorithm, but he says the performance increase can be as big as what stage one did (assuming you've already run stage one, so the index buffer is nice).

Vertex Buffers are messy though, not only all that varied data in them, but they can be split into multiple parts. I might leave stage two until later.

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 12:15 pm
by Shadow007
Well done !!!

Before attacking stage two, you could also test the "Reduced Overdraw" reordering stage from Pedro Sander's paper. It works by ordering some "clusters" with a criterion based on the "occlusion potential" of each cluster. "highly occluding" clusters should be drawn before less occluding ones which could be discarded by the depth test before the pixel shader computations.

It's main advantage is for highly pixel shader intensive models.

The problem being that contrary to Sander's use of the ACMR to insert "soft boundaries", the ACMR is not computed in TomF's algorithm, so a different heuristic needs to be found if you wanted to insert that stage.

Of course such a "reduced overdraw" stage can also be inserted in later (or not at all).

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 7:23 pm
by jacmoe
Wow, Kojack! That's really aggressive improvements. :D

Summer holidays down under? :)

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 8:25 pm
by sinbad
Great work Kojack, we'd love to have a patch once you're happy with it.

Re: Vertex Cache Optimisation

Posted: Fri Jan 23, 2009 9:03 pm
by Kojack
Summer holidays down under?
Yep. Ends in a couple of weeks, so I'm doing some mad coding to distract me from the not as fun mad coding I should be doing to get ready for the start of classes. :)

I've packaged it up a bit easier (add one cpp to project, include one header and call the optimise function on an entity, mesh, renderoperation or hardwareindexbuffer).
But it's acting a little odd. On some ogre meshes it's fine (like ogrehead). But on others like jaiqua and athene, the optimise crashes in release mode but runs fine in debug. Argh, I hate release mode only errors. Probably an uninitialised variable some where.

Hmm, ogre says jaiqua.mesh is old and should be updated with the mesh update tool, but when I run the mesh update tool on it, it also crashes.

Optimising the knot.mesh file actually makes it slower. It was probably already pretty much perfectly ordered, since it looks like a mathematically generated continuous surface, not a polygon soup. I might need to add some form of estimate to see if the existing ordering is already optimal, so it can reject the optimise in the cases where it gives worse results.

Most of the ogre demos I've tried don't show much change, the models just aren't high enough poly to have much of a difference. I'll have to try it with some heavy vertex shaders.

Re: Vertex Cache Optimisation

Posted: Tue Feb 03, 2009 8:32 pm
by Rasengan
impressive work Kojack!

Re: Vertex Cache Optimisation

Posted: Wed Nov 10, 2010 2:20 pm
by bpaberg
Hi,

I'm just wondering about the progress of this project. As far as I understand , this kind of vertex cache optimization is just as interesting as ever.

Kojak seemed to be on good way implementing Tom's algoritm which is probably the preferred method but then it's just silent? :)

Would it be possible to get the code and try it out, even though it might not be ready for adding to Ogre?

/PÅ