Vertex Cache Optimisation
-
- OGRE Moderator
- Posts: 7157
- Joined: Sun Jan 25, 2004 7:35 am
- Location: Brisbane, Australia
- x 535
Vertex Cache Optimisation
I've just finished a semi broken implementation of Tom Forsyth's "Linear-Speed Vertex Cache Optimisation" paper in Ogre (started yesterday, just before my psu died).
Most people probably don't think of it, but graphics cards have a very limited cache for transformed vertices. Different hardware has different sizes, but it's probably something small like 32. Using a rendering primitive like an indexed triangle list can give good performance gains because the indices reuse vertices rather than having 3 unique ones per vertex, but with only around 32 slots available in the cache, many vertices will still have to be recalculated (have their vertex shader run) more than once.
This paper covers a technique for high speed rearranging of the index buffer of a triangle list to make it more cache friendly, without knowing what cache size your target hardware is.
The paper is a bit vague on some bits so my version is slower than it should be (currently takes about 2 minutes to brute force process a high poly mesh, should take maybe a second or less with the proper shortcuts implemented) and I got the cache direction backwards when calculating the score for a vertex, but already the results are interesting.
So far to test it I've used the Stanford Bunny model. It has 35947 vertices and 69451 faces.
Loading the bunny's vertex and index data into a SimpleRenderable in ogre and rendering it (nothing else) I'm getting 785fps (measured in fraps on my geforce 7800gtx).
Loading the bunny's data the same way but giving the index buffer to my optimiser first, the same mesh is now rendering at 1250fps.
This is just fixed function, I'd assume heavy vertex shader use could gain even more from it.
Of course it's not suited to small meshes as much, it won't hurt but the gains wouldn't be as noticeable.
The code needs a ton of work (like speeding up, I'm doing an O(n^2) version because Tom's O(n) version was a little confusing, but I think I get it now), but I think it has potential.
Plus this is only stage 1 of the optimisation process, the full version has 2 more stages.
There's 4 possible places for this when (if) I finish it.
- stand alone code, stick it in your program to optimise what you want
- as an option in OgreXMLConverter
- as an option in meshmagic
- added to Ogre's MeshManager or some other suitable place (only if I REALLY clean it up and can show some consistent benefits).
The advantage of having it in Ogre itself instead of just in the tools would be so you could call it on a ManualObject, StaticGeometry or other runtime constructed mesh.
Damn, I've got to go out. When I get home I think I'll implement another Tom Forsyth paper I've been staring at for ages.
Down with teapots! Long live the Stanford Bunny!
Most people probably don't think of it, but graphics cards have a very limited cache for transformed vertices. Different hardware has different sizes, but it's probably something small like 32. Using a rendering primitive like an indexed triangle list can give good performance gains because the indices reuse vertices rather than having 3 unique ones per vertex, but with only around 32 slots available in the cache, many vertices will still have to be recalculated (have their vertex shader run) more than once.
This paper covers a technique for high speed rearranging of the index buffer of a triangle list to make it more cache friendly, without knowing what cache size your target hardware is.
The paper is a bit vague on some bits so my version is slower than it should be (currently takes about 2 minutes to brute force process a high poly mesh, should take maybe a second or less with the proper shortcuts implemented) and I got the cache direction backwards when calculating the score for a vertex, but already the results are interesting.
So far to test it I've used the Stanford Bunny model. It has 35947 vertices and 69451 faces.
Loading the bunny's vertex and index data into a SimpleRenderable in ogre and rendering it (nothing else) I'm getting 785fps (measured in fraps on my geforce 7800gtx).
Loading the bunny's data the same way but giving the index buffer to my optimiser first, the same mesh is now rendering at 1250fps.
This is just fixed function, I'd assume heavy vertex shader use could gain even more from it.
Of course it's not suited to small meshes as much, it won't hurt but the gains wouldn't be as noticeable.
The code needs a ton of work (like speeding up, I'm doing an O(n^2) version because Tom's O(n) version was a little confusing, but I think I get it now), but I think it has potential.
Plus this is only stage 1 of the optimisation process, the full version has 2 more stages.
There's 4 possible places for this when (if) I finish it.
- stand alone code, stick it in your program to optimise what you want
- as an option in OgreXMLConverter
- as an option in meshmagic
- added to Ogre's MeshManager or some other suitable place (only if I REALLY clean it up and can show some consistent benefits).
The advantage of having it in Ogre itself instead of just in the tools would be so you could call it on a ManualObject, StaticGeometry or other runtime constructed mesh.
Damn, I've got to go out. When I get home I think I'll implement another Tom Forsyth paper I've been staring at for ages.
Down with teapots! Long live the Stanford Bunny!
-
- Orc
- Posts: 498
- Joined: Thu Mar 16, 2006 5:27 pm
Re: Vertex Cache Optimisation
Hey, very cool work there Kojack, and quite helpful on the optimization end. I can definitely see this being extended, and definitely proving useful when dealing with large poly counts.
-
- OGRE Moderator
- Posts: 7157
- Joined: Sun Jan 25, 2004 7:35 am
- Location: Brisbane, Australia
- x 535
Re: Vertex Cache Optimisation
Looks like at the moment (from some debug info which shows cache hits during the optimise) I'm achieving about 1.48 vertices processed per triangle, instead of the worst case scenario of 3 per triangle. That matches the framerate change, so it's probably about right.
-
- Silver Sponsor
- Posts: 605
- Joined: Fri Dec 14, 2007 11:44 am
- Location: Northern Ireland
- x 16
Re: Vertex Cache Optimisation
Wow, some impressive gains there! Would be great to see it in Ogre
Thanks Kojack!
All the best,
Ash
Thanks Kojack!
All the best,
Ash
-
- Gremlin
- Posts: 185
- Joined: Sat May 07, 2005 3:27 pm
Re: Vertex Cache Optimisation
You may want to have a look at one "related" thread and paper in http://www.ogre3d.org/forums/viewtopic. ... on#p248884
The related paper is "Fast Triangle Order Optimization for Vertex Locality and Reduced Overdraw" (SIGGRAPH 2007)
Which you can find in Pedro Sander's Page at http://www.cse.ust.hk/~psander/.
Please note the mention under it :
In the mean-time, well done !
PS : are you sure there's not already some kind of Mesh optimization in MeshMagick ?
The related paper is "Fast Triangle Order Optimization for Vertex Locality and Reduced Overdraw" (SIGGRAPH 2007)
Which you can find in Pedro Sander's Page at http://www.cse.ust.hk/~psander/.
Please note the mention under it :
Sander also wrote a newer related paper "Efficient Traversal of Mesh Edges using Adjacency Primitives" with the following abstract extract:Source code: Now available for researchers and game developers upon request.
It should be noted that I don't think Ogre uses "adjacency primitives" (except perhaps in the DX10 RenderSystem) so this paper may be not as generrally usefull.In addition, we extend two existing vertex cache optimization algorithms to produce cache-efficient traversal orderings for adjacency primitives.
In the mean-time, well done !
PS : are you sure there's not already some kind of Mesh optimization in MeshMagick ?
-
- Gold Sponsor
- Posts: 446
- Joined: Fri May 02, 2003 10:05 am
- Location: UK
Re: Vertex Cache Optimisation
Wow, sounds impressive Kojack. So if this was somehow integrated into Ogre it 'could' give speed enhancements on bigger meshes for free - no catches?
-
- Gremlin
- Posts: 185
- Joined: Sat May 07, 2005 3:27 pm
Re: Vertex Cache Optimisation
It's "for free" once the optimization has taken place. The optimization in itself is computation intensive (less for the O(n) version of course), and I don't think it can/should be done "real-time" during rendering.
Or in a separate thread ? let's use those multi-cores we have ... For example a "worker thread" could work on an optimized version, while the original is displayed, and once the optimization is done, switch to the new one ...
Though at preprocess, at the scene's loading, ... these are good points in time I think.
Perhaps it can also be applied to StaticGeometry at it's creation too...
In addition to TomF's Algorithm, an other pass may be inserted, that "degrades" the Vertex Cache optimization, in favor of reduced overdraw: the sequence is split in smaller sub-sequences, and theses sub-sequences are sorted so that "outside looking" fragments are displayed first. I don't know how much impact this pass has, the paper is not completely clear about that
Or in a separate thread ? let's use those multi-cores we have ... For example a "worker thread" could work on an optimized version, while the original is displayed, and once the optimization is done, switch to the new one ...
Though at preprocess, at the scene's loading, ... these are good points in time I think.
Perhaps it can also be applied to StaticGeometry at it's creation too...
In addition to TomF's Algorithm, an other pass may be inserted, that "degrades" the Vertex Cache optimization, in favor of reduced overdraw: the sequence is split in smaller sub-sequences, and theses sub-sequences are sorted so that "outside looking" fragments are displayed first. I don't know how much impact this pass has, the paper is not completely clear about that
-
- Minaton
- Posts: 973
- Joined: Fri Dec 28, 2007 4:35 pm
- Location: Germany
- x 1
Re: Vertex Cache Optimisation
Hi!
My feeling is that the easiest way would be to integrate this into Meshmagick and hopefully every exporter.
There could be a log message in debug mode which reminds the user of updating the meshes (as it already happens with older mesh versions).
A "clean" application startUp shouldn´t bother around with precomputable optimisations...
xad
My feeling is that the easiest way would be to integrate this into Meshmagick and hopefully every exporter.
There could be a log message in debug mode which reminds the user of updating the meshes (as it already happens with older mesh versions).
A "clean" application startUp shouldn´t bother around with precomputable optimisations...
xad
-
- OGRE Moderator
- Posts: 7157
- Joined: Sun Jan 25, 2004 7:35 am
- Location: Brisbane, Australia
- x 535
Re: Vertex Cache Optimisation
The whole thing is pretty tiny so far (main code is only like 150 lines, and another 20 or so for some variables and stuff).
Thanks for the link Shadow. I had a quick search on here but didn't spot anything similar.
No idea if mesh magic has something similar already. This was a spur of the moment thing. Yesterday I was going to implement something from tom's page as an experiment, but as I started to write it I discovered that a bunch of Ogre's demo meshes are broken. So I started thinking about fixing them (I've got a post on the problems written, but I haven't posted it yet because I wanted to make replacements myself rather than just saying someone else should fix it), which reminded me of the optimiser article by tom, so I decided coding it for a few hours was more fun than modelling. If it turns out useful, cool. If not, or I can't get it fast, it's an interesting distraction.
Thanks for the link Shadow. I had a quick search on here but didn't spot anything similar.
No idea if mesh magic has something similar already. This was a spur of the moment thing. Yesterday I was going to implement something from tom's page as an experiment, but as I started to write it I discovered that a bunch of Ogre's demo meshes are broken. So I started thinking about fixing them (I've got a post on the problems written, but I haven't posted it yet because I wanted to make replacements myself rather than just saying someone else should fix it), which reminded me of the optimiser article by tom, so I decided coding it for a few hours was more fun than modelling. If it turns out useful, cool. If not, or I can't get it fast, it's an interesting distraction.
-
- Gremlin
- Posts: 185
- Joined: Sat May 07, 2005 3:27 pm
Re: Vertex Cache Optimisation
I just had a quick look at MeshMagick, and it indeed has a mesh "optimization" process,
but as far as I see it mostly unduplicates vertices...
So it's far from Vertex Cache Optimisation, although as a first "prep" pass it could well do...
Which means VCOptimization is a "go"...
I think, you could perhaps integrate the code in one of the tools (perhaps MeshMagick ?), let it "mature" there, and when you've got something sufficiently clean/performant, see with Sinbad about importing it into OgreMain... That way it could be used any of the tools or at runtime if needed.
but as far as I see it mostly unduplicates vertices...
So it's far from Vertex Cache Optimisation, although as a first "prep" pass it could well do...
Which means VCOptimization is a "go"...
I think, you could perhaps integrate the code in one of the tools (perhaps MeshMagick ?), let it "mature" there, and when you've got something sufficiently clean/performant, see with Sinbad about importing it into OgreMain... That way it could be used any of the tools or at runtime if needed.
-
- OGRE Retired Moderator
- Posts: 2653
- Joined: Wed Sep 24, 2003 8:07 am
- Location: Haute Garonne, France
- x 4
Re: Vertex Cache Optimisation
You can check ATI tootle on amd website, which is "sanders" paper implementation. there is a "ogretootle" integration, and I added it to ceguimeshviewer. Sanders tootle one is very fast, if option are used wisely. code is nearly all in the paper anyway.
It's hard to do benchmark on VC benefits as you might need several mesh (some are badly ordered at start, others aren't), shaders, and even scenes where bottleneck is "Vertex" bound (vc benefits) or "pixel" bound (overdraw reduce count bene
Ogre also already has a vc cache optimizer i think, check Ogre::IndexBuffer::optimize() or something.
its)
It's hard to do benchmark on VC benefits as you might need several mesh (some are badly ordered at start, others aren't), shaders, and even scenes where bottleneck is "Vertex" bound (vc benefits) or "pixel" bound (overdraw reduce count bene
Ogre also already has a vc cache optimizer i think, check Ogre::IndexBuffer::optimize() or something.
its)
-
- OGRE Moderator
- Posts: 7157
- Joined: Sun Jan 25, 2004 7:35 am
- Location: Brisbane, Australia
- x 535
Re: Vertex Cache Optimisation
Damn, seems everyone has beaten me to it.
(I did search, but nothing seemed obvious)
Tootle does sound interesting. I'll take a look.
I just tested the built in ogre optimiser (IndexData::optimiseVertexCacheTriList).
It bumps my test scene from 785fps to 810fps.
So my one at 1200+fps isn't obsolete yet. But as you say, measuring this stuff realistically is near impossible.
Edit: Ok, I just fixed the cache direction bug (I ordered my cache as index zero is least recently used, Tom's scoring code used index zero as most recently used).
Now when optimised by my code, the scene is 1815fps.
Still a bit useless if I can't speed up the code. Time to sleep then stare at it a bit more.
(I did search, but nothing seemed obvious)
Tootle does sound interesting. I'll take a look.
I just tested the built in ogre optimiser (IndexData::optimiseVertexCacheTriList).
It bumps my test scene from 785fps to 810fps.
So my one at 1200+fps isn't obsolete yet. But as you say, measuring this stuff realistically is near impossible.
Edit: Ok, I just fixed the cache direction bug (I ordered my cache as index zero is least recently used, Tom's scoring code used index zero as most recently used).
Now when optimised by my code, the scene is 1815fps.
Still a bit useless if I can't speed up the code. Time to sleep then stare at it a bit more.
-
- Gnoll
- Posts: 677
- Joined: Tue Sep 19, 2006 6:09 pm
- x 5
Re: Vertex Cache Optimisation
That's a good improvement!
I don't find it useless at all, also because yours can be more integrated in Ogre... the question is, how much time it requires to optimize a Mesh?
Giving that it's always useful, and has to run only once for each file, for me it could be integrated at least in the mesh converter, which should use it by default.
Maybe, OgreMain could give an advice like "your mesh is unoptimized" when loading...
I don't find it useless at all, also because yours can be more integrated in Ogre... the question is, how much time it requires to optimize a Mesh?
Giving that it's always useful, and has to run only once for each file, for me it could be integrated at least in the mesh converter, which should use it by default.
Maybe, OgreMain could give an advice like "your mesh is unoptimized" when loading...
-
- Minaton
- Posts: 973
- Joined: Fri Dec 28, 2007 4:35 pm
- Location: Germany
- x 1
Re: Vertex Cache Optimisation
I fully agree with _tommo_.
-
- Gremlin
- Posts: 185
- Joined: Sat May 07, 2005 3:27 pm
Re: Vertex Cache Optimisation
I agree wth _tommo_ too...
But of course, if you could get from this O(n^2) to O(n) it would be much better.
I think the "cheat" that makes it work in O(n) is to only check for inclusion the triangles that are referenced from the LRU cache list... Except of course when no triangles are left, in which case it tests the whole list.
But of course, if you could get from this O(n^2) to O(n) it would be much better.
I think the "cheat" that makes it work in O(n) is to only check for inclusion the triangles that are referenced from the LRU cache list... Except of course when no triangles are left, in which case it tests the whole list.
-
- OGRE Moderator
- Posts: 7157
- Joined: Sun Jan 25, 2004 7:35 am
- Location: Brisbane, Australia
- x 535
Re: Vertex Cache Optimisation
After way too many hours trying to track down a bug in my new version (which turned out to be a stupid mistake), I've finally got the new optimised optimiser working.
Some performances for comparison.
Model: Standford Bunny (35947 vertices and 69451 triangles) with normals, stored in standard mesh file and loaded as an entity. One light source.
Computer: i7 940, geforce 7800gtx, vista64
No optimisation: 410fps
Ogre's optimiseVertexCacheTriList(): 704fps, 11 seconds to process
My version of Tom's optimiser: 958fps, less than 1 second to process
(previous rates were higher because model was unlit)
Not too bad. Not only is this technique giving better results than the built in one, but it's also near instant instead of taking 11 seconds (or several minutes like my last version).
Now that I have it working faster (and I haven't chucked it in my vtune demo yet), I'm going to restructure it a bit better, and put some easier support in for getting to the buffers (currently it needs to be given an IndexBuffer pointer, I'd rather give it an entity and let it find the index buffers for all sub meshes).
I tried using OgreTootle, but I can't get it to run (even if I build it myself). might be a vista problem, I'll try it on xp later.
Some performances for comparison.
Model: Standford Bunny (35947 vertices and 69451 triangles) with normals, stored in standard mesh file and loaded as an entity. One light source.
Computer: i7 940, geforce 7800gtx, vista64
No optimisation: 410fps
Ogre's optimiseVertexCacheTriList(): 704fps, 11 seconds to process
My version of Tom's optimiser: 958fps, less than 1 second to process
(previous rates were higher because model was unlit)
Not too bad. Not only is this technique giving better results than the built in one, but it's also near instant instead of taking 11 seconds (or several minutes like my last version).
Now that I have it working faster (and I haven't chucked it in my vtune demo yet), I'm going to restructure it a bit better, and put some easier support in for getting to the buffers (currently it needs to be given an IndexBuffer pointer, I'd rather give it an entity and let it find the index buffers for all sub meshes).
I tried using OgreTootle, but I can't get it to run (even if I build it myself). might be a vista problem, I'll try it on xp later.
-
- Silver Sponsor
- Posts: 605
- Joined: Fri Dec 14, 2007 11:44 am
- Location: Northern Ireland
- x 16
Re: Vertex Cache Optimisation
Wow, good work Kojack, can't wait to try it out, it will be a great addition to ogre!
-
- OGRE Retired Moderator
- Posts: 4823
- Joined: Fri Jun 18, 2004 1:40 pm
- Location: Berlin, Germany
- x 7
Re: Vertex Cache Optimisation
Sounds very good. Congratulations!Kojack wrote: No optimisation: 410fps
Ogre's optimiseVertexCacheTriList(): 704fps, 11 seconds to process
My version of Tom's optimiser: 958fps, less than 1 second to process
Don't forget to also let it work on a mesh too. If you want to include it in one of the offline tools (as e.g. meshmagick optimise, hint, hint.), then you don't have an Entity.Kojack wrote: I'd rather give it an entity and let it find the index buffers for all sub meshes).
-
- OGRE Moderator
- Posts: 7157
- Joined: Sun Jan 25, 2004 7:35 am
- Location: Brisbane, Australia
- x 535
Re: Vertex Cache Optimisation
Yep, it will definitely work on non entities. Renderops, index buffers, meshes, etc should all be easy to handle.
I think I might try stage two of Tom's optimiser, which manipulates the vertex buffer (so far I've only touched the index buffer). It's a much simpler and smaller algorithm, but he says the performance increase can be as big as what stage one did (assuming you've already run stage one, so the index buffer is nice).
Vertex Buffers are messy though, not only all that varied data in them, but they can be split into multiple parts. I might leave stage two until later.
I think I might try stage two of Tom's optimiser, which manipulates the vertex buffer (so far I've only touched the index buffer). It's a much simpler and smaller algorithm, but he says the performance increase can be as big as what stage one did (assuming you've already run stage one, so the index buffer is nice).
Vertex Buffers are messy though, not only all that varied data in them, but they can be split into multiple parts. I might leave stage two until later.
-
- Gremlin
- Posts: 185
- Joined: Sat May 07, 2005 3:27 pm
Re: Vertex Cache Optimisation
Well done !!!
Before attacking stage two, you could also test the "Reduced Overdraw" reordering stage from Pedro Sander's paper. It works by ordering some "clusters" with a criterion based on the "occlusion potential" of each cluster. "highly occluding" clusters should be drawn before less occluding ones which could be discarded by the depth test before the pixel shader computations.
It's main advantage is for highly pixel shader intensive models.
The problem being that contrary to Sander's use of the ACMR to insert "soft boundaries", the ACMR is not computed in TomF's algorithm, so a different heuristic needs to be found if you wanted to insert that stage.
Of course such a "reduced overdraw" stage can also be inserted in later (or not at all).
Before attacking stage two, you could also test the "Reduced Overdraw" reordering stage from Pedro Sander's paper. It works by ordering some "clusters" with a criterion based on the "occlusion potential" of each cluster. "highly occluding" clusters should be drawn before less occluding ones which could be discarded by the depth test before the pixel shader computations.
It's main advantage is for highly pixel shader intensive models.
The problem being that contrary to Sander's use of the ACMR to insert "soft boundaries", the ACMR is not computed in TomF's algorithm, so a different heuristic needs to be found if you wanted to insert that stage.
Of course such a "reduced overdraw" stage can also be inserted in later (or not at all).
-
- OGRE Retired Moderator
- Posts: 20570
- Joined: Thu Jan 22, 2004 10:13 am
- Location: Denmark
- x 179
Re: Vertex Cache Optimisation
Wow, Kojack! That's really aggressive improvements.
Summer holidays down under?
Summer holidays down under?
/* Less noise. More signal. */
Ogitor Scenebuilder - powered by Ogre, presented by Qt, fueled by Passion.
OgreAddons - the Ogre code suppository.
Ogitor Scenebuilder - powered by Ogre, presented by Qt, fueled by Passion.
OgreAddons - the Ogre code suppository.
-
- OGRE Retired Team Member
- Posts: 19269
- Joined: Sun Oct 06, 2002 11:19 pm
- Location: Guernsey, Channel Islands
- x 66
Re: Vertex Cache Optimisation
Great work Kojack, we'd love to have a patch once you're happy with it.
-
- OGRE Moderator
- Posts: 7157
- Joined: Sun Jan 25, 2004 7:35 am
- Location: Brisbane, Australia
- x 535
Re: Vertex Cache Optimisation
Yep. Ends in a couple of weeks, so I'm doing some mad coding to distract me from the not as fun mad coding I should be doing to get ready for the start of classes.Summer holidays down under?
I've packaged it up a bit easier (add one cpp to project, include one header and call the optimise function on an entity, mesh, renderoperation or hardwareindexbuffer).
But it's acting a little odd. On some ogre meshes it's fine (like ogrehead). But on others like jaiqua and athene, the optimise crashes in release mode but runs fine in debug. Argh, I hate release mode only errors. Probably an uninitialised variable some where.
Hmm, ogre says jaiqua.mesh is old and should be updated with the mesh update tool, but when I run the mesh update tool on it, it also crashes.
Optimising the knot.mesh file actually makes it slower. It was probably already pretty much perfectly ordered, since it looks like a mathematically generated continuous surface, not a polygon soup. I might need to add some form of estimate to see if the existing ordering is already optimal, so it can reject the optimise in the cases where it gives worse results.
Most of the ogre demos I've tried don't show much change, the models just aren't high enough poly to have much of a difference. I'll have to try it with some heavy vertex shaders.
-
- Halfling
- Posts: 46
- Joined: Sun Jul 13, 2008 12:21 pm
- Location: Brussels
Re: Vertex Cache Optimisation
impressive work Kojack!
-
- Gnoblar
- Posts: 2
- Joined: Wed Nov 10, 2010 11:48 am
Re: Vertex Cache Optimisation
Hi,
I'm just wondering about the progress of this project. As far as I understand , this kind of vertex cache optimization is just as interesting as ever.
Kojak seemed to be on good way implementing Tom's algoritm which is probably the preferred method but then it's just silent?
Would it be possible to get the code and try it out, even though it might not be ready for adding to Ogre?
/PÅ
I'm just wondering about the progress of this project. As far as I understand , this kind of vertex cache optimization is just as interesting as ever.
Kojak seemed to be on good way implementing Tom's algoritm which is probably the preferred method but then it's just silent?
Would it be possible to get the code and try it out, even though it might not be ready for adding to Ogre?
/PÅ