Concerns about performances (CPU)

Problems building or running the engine, queries about how to use features etc.
Crashy
Google Summer of Code Student
Posts: 996
Joined: Wed Jan 08, 2003 9:15 pm
Location: Lyon, France

Concerns about performances (CPU)

Post by Crashy » Thu Sep 05, 2019 3:11 pm

Ogre Version: 2.0
Operating System: Windows 7
Render System: Gl3+
Hardware: Core i7 6700, GTX 1070


Hi,

After doing a bunch of optimizations on my shaders, I've been able to drastically reduce the GPU usage of my game (~6ms according to Nsight).
However, Root::renderOneFrame still takes ~10ms, which I think is quite high.


Note: to get stable timings while profiling, I temporarily moved most of the worker thread tasks to the main thread. This doesn't reduce overall performance for my "simple" scene, and even improves it a little bit.

Here is a picture from Optick showing one frame:

Image

_updateAllRenderTargets takes 7.556ms in total here.

If we only look at the part I called "main scene" in the picture:
  • 99 draw calls
  • RenderPhase02: 1.938ms
  • _renderVisibleObjects: 1.155ms. More precise profiling tells me RenderSystem::_setPass takes around 30% of the time; the rest is SceneManager::renderSingleObject.
  • If I comment out the final glDraw* calls in RenderSystem::_render, I save 0.5ms.
Basically, I find it problematic to spend more than 1ms of CPU time on only 99 draw calls.


The GL State Cache Manager is enabled; disabling it slightly reduces performance, so nothing magical there.

Compositors seem to take more CPU time than I initially expected, too.

Am I doing something wrong somewhere, or do these timings look normal to you?
Follow la Moustache on Twitter or on Facebook
Image

dark_sylinc
OGRE Team Member
Posts: 4138
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina

Re: Concerns about performances (CPU)

Post by dark_sylinc » Thu Sep 05, 2019 3:54 pm

Hi!

The parts you're highlighting were exactly the portions that Ogre 2.1 focused on optimizing. As you pointed out, the amount of overhead needed to draw even a single element was outrageous.

The OpenGL RenderSystem was particularly bad performance-wise before 2.1 because of all the stalls, inefficient communication with the shaders, and the massive amount of pointless state changes. D3D9 used to perform much better than GL, though.

Heavy use of InstanceManager was a way to work around some of these issues because it would bypass the slow path. But it was clunky, and performance was still far from ideal.

Regarding Compositors: they shouldn't show up that much in the graph, which looks odd, but it could be due to stalls in OpenGL draw calls, or to the RenderTarget changes if there's a lot of postprocessing going on.

But you're including GUI rendering alongside the compositor, and GUI libs often take a lot of time. Sadly, it requires a lot of experience and dedication to write a GUI library that doesn't hog performance.

So, given the circumstances, "it sounds about right" for Ogre 2.0 (and 1.x). I'm afraid that to improve these numbers my recommendation would be "start looking into Ogre 2.1".
Btw, I suspect you use a lot of custom shaders (which we call low-level materials). We now support using low-level materials on regular objects for rendering (which we didn't when you moved from 1.9 to 2.0). Still, adapting your shaders to Hlms will give you maximum performance, but being able to use low-level materials with regular objects could ease porting for you.
A few features are not available though, such as FFP (all low-level materials must use shaders) and multipass.

Cheers
Matias


Re: Concerns about performances (CPU)

Post by Crashy » Thu Sep 05, 2019 4:38 pm

Thanks for your reply.
dark_sylinc wrote:
The parts you're highlighting were exactly the portions that Ogre 2.1 focused on optimizing. As you pointed out, the amount of overhead needed to draw even a single element was outrageous.
I must admit that's what I feared the most.
dark_sylinc wrote:
The OpenGL RenderSystem was particularly bad performance-wise before 2.1 because of all the stalls, inefficient communication with the shaders, and the massive amount of pointless state changes. D3D9 used to perform much better than GL, though.
Back in 1.9+, GL was slower than DX9 but faster than DX11; that's why I picked GL3+ (and to ease Linux support, of course).
dark_sylinc wrote:
Heavy use of InstanceManager was a way to work around some of these issues because it would bypass the slow path. But it was clunky, and performance was still far from ideal.
Instancing did help a lot indeed!
dark_sylinc wrote:
But you're including GUI rendering alongside the compositor, and GUI libs often take a lot of time. Sadly, it requires a lot of experience and dedication to write a GUI library that doesn't hog performance.
Right now I only use Dear ImGui, which looks very efficient.
dark_sylinc wrote:
Regarding Compositors: they shouldn't show up that much in the graph, which looks odd, but it could be due to stalls in OpenGL draw calls, or to the RenderTarget changes if there's a lot of postprocessing going on.
Yes, I've got a few render targets that I use in a ping-pong style, and I know GL can be slow when switching between RTs.
Otherwise, what could cause the stalls?
dark_sylinc wrote:
So, given the circumstances, "it sounds about right" for Ogre 2.0 (and 1.x). I'm afraid that to improve these numbers my recommendation would be "start looking into Ogre 2.1".
Btw, I suspect you use a lot of custom shaders (which we call low-level materials). We now support using low-level materials on regular objects for rendering (which we didn't when you moved from 1.9 to 2.0). Still, adapting your shaders to Hlms will give you maximum performance, but being able to use low-level materials with regular objects could ease porting for you.
A few features are not available though, such as FFP (all low-level materials must use shaders) and multipass.
I can't afford to switch to 2.1 right now. Unfortunately, I moved to 2.x too soon, and 2.1 came out just as I was working on the game engine, so it's too much work for me to switch. In the near future, if everything goes as expected for me, maybe, especially now that there is support for custom shaders :)

I think it'll be faster for me to write some "quick path" to set up draw calls. What's astonishing is that the content of _setPass (and all the underlying render system calls) isn't that complicated, and with the state cache manager many redundant GL calls are bypassed, but I suspect it's quite inefficient in the way it manages the cache.


I've also merged some recent VAO-related changes that were made to the GL3+ RS in the 1.1x branch, which improved performance a little bit.

I guess I'm going to re-run some Nsight analysis to see what calls are made and why.


Re: Concerns about performances (CPU)

Post by dark_sylinc » Thu Sep 05, 2019 4:52 pm

Crashy wrote:
Thu Sep 05, 2019 4:38 pm
Otherwise, what could cause the stalls?
Usually the stalls were caused by glMap* calls.

Ogre relied on GL_MAP_INVALIDATE_BUFFER_BIT to get discard semantics like in D3D9 and 11, but how GL drivers manage the invalidate bit is very sketchy, and it often ends up stalling.

In Ogre 2.1 we switched almost everything to unsynchronized or persistent mapping, which relies on us doing all the sync work (instead of relying on the driver).
Crashy wrote:
Thu Sep 05, 2019 4:38 pm
I guess I'm going to re-run some nSight analysis to see what calls are made and why.
Good luck! I hope you get the performance you need.

Cheers
Matias


Re: Concerns about performances (CPU)

Post by Crashy » Thu Sep 05, 2019 5:08 pm

Right, thank you.

Some relevant info from Nsight, for the record:

A view of a few draw calls.
Image
The state cache manager does its job and there are only a few state changes between each glDraw*, but there is also a lot of time "wasted" between each call. Not that much time in absolute terms, but repeat this hundreds of times...
Common calls are glActiveTexture, glBindTexture, glBindProgramPipeline and glBindVertexArray.

And for compositors:
Image
glBindBuffer is taking a lot of time, but I don't think I'll be able to do anything about it, and there is also time spent doing other things. Fortunately, the time for compositors is constant, so if I can just render more objects faster this won't be such a big deal.


Re: Concerns about performances (CPU)

Post by Crashy » Fri Sep 06, 2019 6:27 pm

Some more insights following my various profiling sessions and tests.

Image

glBindFramebuffer takes the most time, and I don't know how to truly solve this. I've tried not attaching a depth buffer when it's unnecessary, without any improvement. The only thing I can see now is to merge multiple targets into one single large "atlas" and use smaller viewports to render to the right area.
This is sometimes suggested, but I know that on some low-end GPUs rendering to a large texture drastically reduces the fillrate :?
Anyway, I could save a few FB changes when calculating the HDR luminance texture, bloom and some other effects that use low-res RTs.

Then comes glTexSubImage2D. The reason it's called so often is HW VTF instancing. I've tried using a PBO instead of a basic glTexImage call, as everybody recommends, but still no gain, even when using double buffering. I guess new drivers are doing the same thing under the hood.

One idea I've got is to merge multiple VTFs into a single larger texture whenever possible, thus uploading it only once to the GPU, but larger textures also take more time to upload.


After that come the various glDraw commands. No idea here.

Finally, the glBindTexture calls. Some of them are caused by the HW VTF updates too. They could be reduced for other materials by using a texture array, but this requires a lot of work to set up dynamically.

That's all for today. I'm stuck, but I've learnt a lot of things about modern OpenGL today :lol:


Re: Concerns about performances (CPU)

Post by Crashy » Sat Sep 07, 2019 12:05 pm

More metrics, from a sample in https://github.com/g-truc/ogl-samples

I've modified the sample gl-320-texture-2d to do 1000 draw calls per frame and watched the results.
Image
Image

On average all API calls take the same time as in Ogre, but the overall performance for 1000 draw calls is better. Without Nsight attached, a single frame takes 2ms, vs 1.1ms for only 99 draw calls in my case in Ogre.

Finally, I guess I'll need to write a fast pass-setup function to avoid spending too much time between each API call.


Re: Concerns about performances (CPU)

Post by Crashy » Tue Sep 10, 2019 1:33 pm

Hi guys, it's me again :)

Some news.

I've gained a little bit by disabling the RTSS. I was almost not using it, and it had some callbacks that may be costly, at least in my old implementation.
I've also disabled sorting for a few transparent materials, avoiding many calls to _setPass.
Reducing the VTF batch size to 100 entities (I don't really need more for now) also helped a lot.
Finally, I added a kind of "fast pass pipeline" to do all the setup in one single function, avoiding going back and forth in the stack. Gains were barely noticeable, but still, about 0.1ms was saved.
In total, I've gained about 1.1ms, which isn't that bad.

My future trials will be to replace some of the post-processes with compute shaders to avoid switching render targets too often. Another big win would be to implement truly shared uniform blocks to send the various matrices or shared params without doing it for each renderable.

Another interesting thing: until now I was only using VTF instancing, but as I have some non-animated objects I decided to use basic HW instancing for them, hoping for some gains.
I was wrong: basic instancing was a little bit slower than VTF.

I did some profiling to see what's wrong:

Image
(This test scene is much simpler than the previous one, hence the lower timings.)

Three cases:
1 - Hardware Basic: 4.7ms/frame
2 - Hardware VTF: 4.2ms/frame
3 - Hardware VTF + glFlush: 4.8ms/frame

Why this test with glFlush? Well, as one can see, the final draw call in the VTF case takes a looong time: 0.8ms.
However, the other draw calls and even glBindFramebuffer were much faster, resulting in a faster frame.

I suspected something related to the command queue was happening, so I added a glFlush call before glBindFramebuffer, and boom, everything became much more comparable to the results using HW basic instancing.
These timings are also comparable to the ones I had before in the previous, more complex scene.

What the driver does with the command queue is a bit of a mystery to me.



EDIT: I've done a quick test implementation of vertex buffers with GL_MAP_PERSISTENT_BIT.
Now my HW Basic test scene runs between 3.8ms and 4.0ms/frame. :D
Increasing the number of instances per batch to 400 instead of 100 improved everything too, whereas without the persistent bit, increasing the batch size reduces performance. :D
