Performance research using Ogre

Performance research using Ogre

Post by Oogst »

For a research project I have tested the parameters that influence GPU performance. I did this together with another student (Olaf, also my co-programmer for The Blob). The project is part of the Computer Science master Game & Media Technology at Utrecht University. What we are looking for now is your opinion of our conclusions: we have done a lot of tests and have ideas about what causes the results, and we would like to know whether those ideas are correct. It is a long post, but we hope some of you are willing to read it in full and react to it.

The goal of our research was to find out how different parameters influence performance. So if I double the number of polygons, what happens to the framerate? Does it halve, or is the effect smaller, or maybe even worse? We did not try to measure how fast Ogre is or how fast different GPUs are compared to each other. It is all about asymptotic behaviour: if I change this one parameter and leave everything else the same, how does the framerate change?

To do this, we made a simple application that generates meshes and materials depending on a set of parameters. We have some 20 parameters to set, like polygon count, number of texture units, resolution of the textures, number of different textures, mip-mapping, filtering, etc. We then made some 200 tests and ran them on 7 different computers: 3 ATI, 3 Nvidia, and one onboard GPU. The tests record the time it takes to render frames 100 to 600 of a camera rotating around the simple scene. The first 100 frames are ignored to make sure everything has been loaded.
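For reference, the timing part of the test application boils down to an Ogre frame listener that skips the first 100 frames and measures the next 500. This is only a minimal sketch, not our actual code; the class and member names are made up:

#include <Ogre.h>

class TimingListener : public Ogre::FrameListener
{
public:
    TimingListener() : mFrame(0) {}

    bool frameStarted(const Ogre::FrameEvent&)
    {
        if (mFrame == 100)
            mTimer.reset();       // start timing at frame 100 so loading is excluded
        ++mFrame;
        return mFrame <= 600;     // returning false ends Root::startRendering()
    }

    unsigned long elapsedMs() const { return mTimer.getMilliseconds(); }

private:
    Ogre::Timer mTimer;
    int mFrame;
};

// root->addFrameListener(&timingListener);
// root->startRendering();
// ...afterwards elapsedMs() gives the time spent on frames 100 to 600.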

So, the point of this post is that I want your opinions on our conclusions. These are the things we are concluding right now and I would like to know whether they are correct. We left the very obvious things like "more polygons means a lower framerate" out, by the way; only the interesting results are listed here.

1.
Our graphs for different scenes with all objects doing alpha add, multiply or replace show that it only makes a serious difference when there are many objects overlapping in the z-order. If there are only a few objects that hardly overlap, then it hardly matters whether it is multiply, add or replace. Our conclusion: alpha blend is only expensive because it increases the number of pixels that need to be drawn, not because of the blending itself.
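For context, the three variants map onto Ogre's scene-blending settings roughly like this (the material name is made up):

Ogre::MaterialPtr mat =
    Ogre::MaterialManager::getSingleton().getByName("TestMaterial");
Ogre::Pass* pass = mat->getTechnique(0)->getPass(0);

pass->setSceneBlending(Ogre::SBT_ADD);         // add:      framebuffer + source
// pass->setSceneBlending(Ogre::SBT_MODULATE); // multiply: framebuffer * source
// pass->setSceneBlending(Ogre::SBT_REPLACE);  // replace:  plain overwrite (the default)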

2.
We implemented a very simple shader model 1.1 version of the fixed function pipeline, doing only one texture, one light, no speculars and no fog. In a scene with only one texture, one light, no speculars and no fog, our shader is faster than fixed function. Is this because current-day GPUs are faster with shaders, or is it because fixed function does a lot of things anyway, even if they are not required for the current material?

3.
We did a test where we do a varying number of render-to-textures of an empty scene before actually rendering the scene to the screen. The graph for this shows that if we do some empty render-to-textures, the framerate hardly changes. Even if we do 32 of them each frame, the framerate only drops about 25% in a scene with 64000 polygons total for 40 objects and 1 texture for all of them. Because 32 is an incredible amount of render-to-textures, we conclude that the setup cost and pipeline flush needed for a render-to-texture is very small and only becomes interesting if extreme amounts of render-to-textures are used.
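Each render-to-texture in this test is set up along these lines (sizes and names here are only illustrative):

Ogre::TexturePtr rtt = Ogre::TextureManager::getSingleton().createManual(
    "RttTexture", Ogre::ResourceGroupManager::DEFAULT_RESOURCE_GROUP_NAME,
    Ogre::TEX_TYPE_2D, 512, 512, 0, Ogre::PF_R8G8B8, Ogre::TU_RENDERTARGET);

Ogre::RenderTarget* target = rtt->getBuffer()->getRenderTarget();
target->addViewport(camera);     // camera looking at the (empty) scene
target->setAutoUpdated(true);    // re-rendered every frame before the main window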

4.
If no mip-mapping is used, the framerate drops dramatically when the texture resolution is increased. If mip-mapping is used, the framerate stays almost constant and changes only very slightly. We conclude that mip-mapping is not only for decreasing aliasing, but also for increasing performance dramatically. We also conclude that if mip-mapping is used, texture resolution is hardly important for performance, but only for memory usage and loading times.
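In Ogre the mip-mapping switch in these tests amounts to no more than this (a sketch; one or the other is set before the textures are loaded):

// with full mip chains:
Ogre::TextureManager::getSingleton().setDefaultNumMipmaps(Ogre::MIP_UNLIMITED);
// or without mip-mapping, so only the full-resolution level is ever sampled:
Ogre::TextureManager::getSingleton().setDefaultNumMipmaps(0);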

5.
We tested turning mip-mapping and filtering on and off in some different scenes. Mip-mapping always increased performance. Bilinear filtering versus no filtering hardly made a difference, while trilinear filtering cost about 20% performance. We conclude that bilinear filtering is almost for free, while trilinear filtering has a significant cost.
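The filtering variants compared here correspond to Ogre's global texture-filtering options; roughly (pick one):

Ogre::MaterialManager& mm = Ogre::MaterialManager::getSingleton();
mm.setDefaultTextureFiltering(Ogre::TFO_NONE);       // point sampling, no filtering
mm.setDefaultTextureFiltering(Ogre::TFO_BILINEAR);   // near-free in our tests
mm.setDefaultTextureFiltering(Ogre::TFO_TRILINEAR);  // also blends between mip levels; ~20% slower here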

6.
We did a test using many textures. We used twenty textures of increasing resolution. On one specific computer, with a 6600 GT 128 MB, the rendering time for 1024x1024's exploded suddenly (something like 2000% of 512x512). We have no idea why this happened, because this should only take 80 MB of texture space on a 128 MB card that is not doing anything else. However, with DXT1 texture compression, the framerate gets back to what we expected. Also, a 64 MB GPU did the test just fine and was much faster than this broken one. Why is this?
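For what it is worth, a back-of-the-envelope estimate (assuming 32-bit RGBA and counting the mip chains, which the 80 MB figure above does not) puts the working set much closer to the card's limit:

const unsigned long numTextures     = 20;
const unsigned long bytesPerTexture = 1024 * 1024 * 4;            // ~4 MB at 32-bit RGBA
const unsigned long withMipChain    = bytesPerTexture * 4 / 3;    // full mip chain adds ~33%
const unsigned long totalBytes      = numTextures * withMipChain; // ~107 MB
// Add the front/back buffers, z-buffer and vertex data and 128 MB gets very tight,
// so thrashing is at least plausible; DXT1 (roughly 1:8 versus raw RGBA) shrinks
// the texture data to around 13 MB, which fits comfortably again.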

7.
The performance actually increases when we go from textures at 1024x1024 to 16384x64. This seems strange, but we think this might be because the texture is actually scaled back to 2048x8 by Ogre, the driver or the GPU. Is this true?
My dev blog
Awesomenauts: platforming MOBA (PC/Mac/Linux/XBox360/X1/PS3/PS4)
Blightbound: coop online dungeon crawler (PC)
Swords & Soldiers: side-scrolling RTS (Switch/PS3/Wii/PC/Mac/Linux/iPhone/iPad/Android)
Proun: abstract racing game (PC)
Cello Fortress: mixing game and live cello performance
The Ageless Gate: cello album

Post by Thamas »

(Hi everybody. First post here. Oogst asked me to reply :lol:)

1. Our conclusion: alpha blend is only expensive because it increases the number of pixels that need to be drawn, not because of the blending itself.
So you conclude that you basically pay for the overdraw, and not the blending? (Does Ogre do good z sorting? How about drawing back-to-front and then comparing blending versus no blending? That would give a better comparison of the raw cost of blending. (For a realistic comparison, though, your current setup is good.))

2. Is this because current day GPU's are faster with shaders, or is it because fixed function does a lot of things anyway, even if they are not required for the current material?
On recent cards the fixed function pipeline is implemented in their shader hardware anyway iirc. This would suggest that the difference is that your shader is simpler than their fixed-function pipeline. (The question would remain why they don't `optimize' their shader for the current render state.)


3. Because 32 is an incredible amount of render-to-textures, we conclude that the setup cost and pipeline flush needed for a render-to-texture is very small and only becomes interesting if extreme amounts of render-to-textures are used.
I agree that 32 is an incredible amount of render-to-textures, and that you show that setup cost is epsilon. I don't think these measurements support your claim that pipeline flushes aren't expensive though: there's hardly anything to flush. I'm not so sure these measurements mean anything useful. Perhaps you could just render the same scene to lots of textures, instead of rendering nothing?


4. We conclude that mip-mapping is not only for decreasing aliasing, but also for increasing performance dramatically.
Yes. This would be because for minified textures the stride through texture memory is much greater. The cache might also factor in here. Not sure which is the real killer.




Point 5 corresponds to my experience. Afaik bilinear filtering is ``hardware that's just there, so you might as well use it.''


6. the rendering time for 1024x1024's exploded suddenly
No idea. Driver issue perhaps?


7. Is this true?
I would suppose it is, because last time I looked no video card did textures over 2048 in size. (Which is admittedly some time ago, but it would be a reasonable explanation.)



Hope to help,
Thamas

Re: Performance research using Ogre

Post by DWORD »

Interesting tests. :) Overall I agree with your and Thamas' explanations.
Oogst wrote:1. (...) Our conclusion: alpha blend is only expensive because it increases the number of pixels that need to be drawn, not because of the blending itself.
I'm not convinced about this. The only reason alpha blending would cause more overdraw is because Ogre sorts transparent objects from back to front. I think you should make a test where the render order is exactly the same for both transparent and opaque objects, so you can compare the results directly.
Oogst wrote:2. (...) Is this because current day GPU's are faster with shaders, or is it because fixed function does a lot of things anyway, even if they are not required for the current material?
Thamas wrote:On recent cards the fixed function pipeline is implemented in their shader hardware anyway iirc. This would suggest that the difference is that your shader is simpler than their fixed-function pipeline. (The question would remain why they don't `optimize' their shader for the current render state.)
Regarding optimization, I think this is for speed. ;) Although the fixed-function "shader" may be slower, it would probably be even slower to optimize it for each change in render state. And there are probably too many different combinations to have them all pregenerated. Just my guess, though.
Oogst wrote:7. The performance actually increases when we go from textures at 1024x1024 to 16384x64. This seems strange, but we think this might be because the texture is actually scaled back to 2048x8 by Ogre, the driver or the GPU. Is this true?
I think you're right. The maximum texture size is more like 2048-4096 than 16384. I'm not sure whether it's the driver or Ogre that does the downsampling, though.

Post by sinbad »

Thamas covered most of it, here's my additions:

1. Really it's the read-modify-write pixel overhead required when using any kind of scene_blend. It's also why alpha_rejection is such a good idea, since it can discard a lot of RMW pixels whose alpha fails the threshold test.
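In code terms, alpha_rejection on a pass amounts to something like this (material name and threshold are just examples):

Ogre::MaterialPtr mat =
    Ogre::MaterialManager::getSingleton().getByName("FoliageMaterial");
Ogre::Pass* pass = mat->getTechnique(0)->getPass(0);

pass->setSceneBlending(Ogre::SBT_TRANSPARENT_ALPHA);
pass->setAlphaRejectSettings(Ogre::CMPF_GREATER_EQUAL, 128); // fragments with alpha below 128
                                                             // are discarded before the RMW blend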

4. Yes, I've pointed this out elsewhere, it's the impact of the texture cache mainly, smaller mips fit into the texture cache whilst sparsely accessing pixels in a larger texture causes tons of cache misses.

6. Sounds like texture thrashing. Not sure why the 64Mb card performed better though, time to get nvPerfKit on the case.

7. Yes, the texture is actually smaller once resized.

Post by Oogst »

OK, thanks for the reactions and taking the time to read my long post! :)

1. Alpha blend also requires read/write overhead when blending with the background, so why is it that there is hardly a difference between blend and replace when objects hardly overlap? As for sorting: the objects in our scene are so intertwined that the distance order is semi-random. It hardly changes anything because of the kind of scene we are using. I added a screenshot of the add-version of this test to show what I am talking about. My point is: with alpha replace, there is a serious chance that a closer pixel has already been drawn and the z-test fails. With alpha blend, this never happens. So when there is a lot of overlap in the z-direction, replace actually requires far fewer pixels to be drawn.

Hardly any overlap, hardly any difference in performance between blend and replace.
[screenshot]

Lots of overlap, replace is much faster than blend.
[screenshot]
[screenshot]

The same test with several settings. The difference between blend and replace is only large when there are many objects. (Time is the time it takes to render 500 frames, so lower is faster.)
[graph]

7. OK, that is clear, just for completeness in the report: who does the resizing? Hardware, driver, or Ogre?

A new point:

8.
We did some tests looking at the number of textures in the scene and the results of this suggested that sorting render calls by texture is not as useful as it may seem. Take a look at these two graphs:

This test shows the same scene, with 64 objects, with a varying number of different textures. All objects are textured; what varies is whether they all share the same texture or each uses a different one. The graph only shows ATI because the broken test on my Nvidia card clutters the results too much.
[graph]

This shows the effect of using texture atlases, i.e. using a few high-res textures versus many low-res textures. The mapping is done so that at high resolution each object uses only a part of the texture, as is done with texture atlases.
[graph]

These two graphs suggest that the number of textures is actually not very important, because the performance hardly changes. This makes me wonder: why does Ogre do material-sorting and not depth-sorting? Is this backed up by tests that show that depth-sorting gives worse performance?

Post by Thamas »

This makes me wonder: why does Ogre do material-sorting and not depth-sorting?
I have a few possible explanations for this, but they are all from what people are saying you should do; not based on actual experience. So yeah, I would also be interested to know whether the choice to do it like this in Ogre is based on `conventional wisdom' or actual measurements.
  • a) How many objects are you rendering? You can't change textures within a draw call, and for interesting numbers of objects the increase in draw calls might kill you. (This would be a point in favour of atlases. (Though atlases have issues. (But you know that.)))

    b) You are really going to take a hit once your textures don't fit into memory anymore. (But you will probably say that you don't want to go over vid memory anyway.)

    c) Perhaps because changing materials can mean a lot more than just textures, and those other changes could be more expensive. Changing shader is supposed to be pretty expensive.

Post by sinbad »

Changing textures is usually the most expensive state change you can make, but knowing this, it's also the one which has had significant optimisation done on it in more recent cards - so I'm not surprised it's not a big deal if your cards are recent. Plus, 64 textures really isn't very many, so I don't think the test was that realistic as a high-water mark - ten times that would have been a much more realistic test. You certainly shouldn't expect any variation between 1, 4 and 8, that's far too small.

However, even if the results are still fairly flat, don't extrapolate this to 'use as many textures as you like'. The benefits of a texture atlas are not the reduction in texture numbers, but in the increasing amount of batching you can do. Modern cards are heavily affected by batching performance, and you can't batch if your objects are using different texture sets.

Depth sorting is irrelevant when you're majoring on batching performance. You can only usefully depth sort if you break things down into discrete sortable chunks, which defeats batching, which is far more important. We give you high-level sorting facilities in the render queues for things like skies and other large blocks of rendering you would like to sort to avoid overdraw. Plus, we give you the ability to do a depth-first pass in materials (with colour_write off) which pretty much eliminates overdraw (for non-transparent objects) at the expense of extra passes.
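The depth-first trick looks roughly like this in code (the material name is just an example, and the same can be set up in a material script):

Ogre::MaterialPtr mat =
    Ogre::MaterialManager::getSingleton().getByName("SomeOpaqueMaterial");
Ogre::Technique* tech = mat->getTechnique(0);

Ogre::Pass* depthPass = tech->getPass(0);      // first pass: writes depth only
depthPass->setColourWriteEnabled(false);       // colour_write off
depthPass->setDepthWriteEnabled(true);

Ogre::Pass* shadePass = tech->createPass();    // second pass: the real shading
shadePass->setDepthWriteEnabled(false);
shadePass->setDepthFunction(Ogre::CMPF_EQUAL); // only shade pixels that survived the depth pass
// ...set up textures / lighting on shadePass as usual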

There is never a cast-iron set of rules for optimisation, just guides. Exactly what blend of sorting, batching, LOD etc you need to get best performance depends massively on the scene, there is no silver bullet. That's why you must use tools like nvPerfKit to identify bottlenecks.

Varying some factor and finding it does not affect frame rate is not a solid basis for asserting it is not relevant to speed, because it may not make any difference until a bottleneck elsewhere is cleared. If your scene is fillrate limited, reducing or increasing the vertex count won't make any difference (within upper bounds), but that doesn't mean it's irrelevant. Eliminate the fillrate clog and suddenly it will make a difference. So extrapolate results with care, particularly those which seem not to alter. Linear they are not ;)

Post by Oogst »

I had not checked it and thought 64 textures was a nice amount for a level. I checked back with the Blob though, which has more like a 'real-world' level-size and complexity and was surprised that we were doing 274 different textures during gameplay. So the test really should have been extended a lot to be interesting.

As for batching: many objects in games are dynamic. Batching is not of much use for dynamic objects, right? Or would you say that altering vertices is faster when it reduces the object count significantly? I recently heard from a tech-artist at a games company in the Netherlands that render calls really are not that exciting anymore on recent hardware, but I guess he is wrong there, right? With the Blob we were not exceeding 400 render calls on the high setting (which had a very far viewing distance), which I figured was a lot already.

If batching is only interesting for static objects, then am I correct to think that atlases are much less interesting for dynamic objects? The article about them on Gamasutra showed character textures in an atlas, but that seems less interesting, then.

As for extrapolating: I know I cannot do that, the whole point of our research is that theoretical stuff is not that relevant on GPU's, so we wanted to see things more closely in practice. Stating things here very subtly does not bring up that much discussion and interesting reactions, though. ;)

Post by sinbad »

I think your friend needs to read this: http://download.nvidia.com/developer/pr ... zation.pdf ;) Like anything, it's no big deal until you hit the ceiling of the particular limited resource you have, then it's a huge deal. 400 batches won't hit the ceiling on most machines unless they have under a 1 GHz CPU (batches are eventually CPU limited since it's all about driver submission overhead, and the driver runs on the CPU). Again, don't assume, measure using tools like nvPerfKit.

Yes, dynamic objects are harder to batch, but not impossible by any means. Once again it's all a balancing act, beware of sweeping generalisations. You can deal with the dynamic elements by using hardware instancing (true or faked through shaders), shader use generally, or splitting dynamic / static objects up. Like anything, over-batching has problems of its own (culling efficiency, the munging required to batch in the first place) - all these factors need to be balanced to get best performance.

Post by xavier »

Oogst wrote: As for batching: many objects in games are dynamic. Batching is not of much use for dynamic objects, right?
A "batch" is a single draw call (call to D3D's DrawIndexedPrimitive() or OpenGL's glDrawElements() functions, for example).

The API has no concept of "dynamic" or "static" geometry -- it only knows polygons (primarily triangles). The geometry that is "dynamic" to you is just one of a series of draw calls each frame to the API.

Post by Oogst »

A bit late, but: thanks all for the help! This is the final result we handed in (we have not yet received a grade for it):

Real-time scene complexity - An empirical study of scene complexity versus performance in real-time computer graphics
(English, DOC, 0.5mb)

And the abstract:

Abstract
By performing a large number of tests on different computers, we analyzed how the complexity of a game's graphics influences the frame rate. The conclusion is that artists and programmers should keep these guidelines in mind while creating a game:
- keep the polygon-count low;
- keep the vertex-count low;
- keep the number of separate objects/render-calls low;
- mip-mapping dramatically increases the performance of high-resolution textures;
- as long as mip-mapping is used and all textures fit in memory, the texture resolution hardly influences the performance;
- bilinear filtering is almost for free, but trilinear filtering has a significant cost;
- multi-passing is quite expensive and should be used as little as possible;
- multi-texturing is much less expensive than multi-passing;
- alpha-blending is very expensive if there is a lot of object-overlap;
- environment mapping using a cube-map or spherical map is hardly more expensive than a standard 2D UV-mapped texture;
- vertex and pixel shaders are not expensive, only the complexity of a specific shader might make it expensive.

Post by raicuandi »

Thanks for the tips man!
If it's speed you want, you might consider writing the Direct3D 10 render system for Ogre. It is said that it will really speed up Ogre.
(I *know* I wanted to tell you something more, but I just can't remember it now... uh, so tired... :? )