[2.1] Render 10 000 objects per frame - Number of tris per object?



123iamking
Gremlin
Posts: 152
Joined: Sat Aug 12, 2017 4:16 pm
x 4

[2.1] Render 10 000 objects per frame - Number of tris per object?

Post by 123iamking »

According to "What version to choose?":
Version: 2.1
Why choose it?: You need many objects (order of 10 000) per frame.
To create a good-looking human model, the number of tris can reach ~400,000.
Image
That number of tris is far higher than that of a low-poly model.
So I want to render about 10,000 objects per frame, as Ogre 2.1 offers. What is the target number of tris for each model?

Thanks for reading :)
al2950
OGRE Expert User
Posts: 1227
Joined: Thu Dec 11, 2008 7:56 pm
Location: Bristol, UK
x 157

Re: [2.1] Render 10 000 objects per frame - Number of tris per object?

Post by al2950 »

123iamking wrote: Fri Dec 01, 2017 3:27 am To create a good-looking human model, the number of tris can reach ~400,000.
Really.....!?
123iamking wrote: Fri Dec 01, 2017 3:27 am That number of tris is far higher than that of a low-poly model.
So I want to render about 10,000 objects per frame, as Ogre 2.1 offers. What is the target number of tris for each model?
In short, GPUs are way more complicated than 'GPU x can render 1,000,000 tris per second'. In fact there are so many variables it's basically pointless thinking about tris. So there is no sensible answer to your question. However....

If you have high poly meshes you will have a high vertex count. So this might put pressure on your vertex shaders, which will be made worse if you are using skeleton animations. This can be alleviated with decent mesh LODs, which Ogre does support. You may also want to look at various tools, like Tootle, which optimize meshes so they are laid out in memory in a more GPU-friendly way.

The other thing you might want to worry about is pixel overdraw, given you have many, many objects. Again, Tootle can help with this on a per-mesh basis. But keep an eye on the pressure on your pixel shaders; again, there are so many variables, like what BRDF you are using, how many lights and transparent objects you have, and whether you use a depth pre-pass.

To be blunt, I have barely touched on the info that may have an impact on your use case, so I suggest you start trying stuff. If you hit a bottleneck, post in the forums and we can try to advise.

One thing I have not mentioned is the rasterization part of the pipeline, because I don't actually know too much about it (Magic black box!). Perhaps dark_sylinc might have some more pointers as he knows way more than me!
Hrenli
Halfling
Posts: 73
Joined: Tue Jun 14, 2016 12:26 pm
x 19

Re: [2.1] Render 10 000 objects per frame - Number of tris per object?

Post by Hrenli »

I just want to add that "good looking" is very subjective and has next to no relation to polygon count. "Fast enough" is also not set in stone. Some people (or even whole platforms) are happy with 30 fps, some call anything below 100 fps unplayable. Add to this the variety in hardware specs and it makes no sense to talk about numbers of polygons or objects in the scene. It all depends on a lot of stuff. You have to try and experiment with some POCs if you have concerns about the viability of your idea. And if we talk characters, I would be more worried about animations than polygon counts. A wooden dummy made of 500k polygons is still a wooden dummy. If you want to bring a character to life, you need to animate it, and here come things like the number of bones, complexity of morph shapes, how you animate clothes and hair, all that stuff. And it has to be done well. Polygon count? Irrelevant. It just has to be possible to run on your target hardware, that's it.

And yes, there are AAA titles which already use more than 400k tris for characters (first thing that comes to mind is Horizon Zero Dawn). But having such detailed characters doesn't mean that a lot of them are in the scene at the same time, or even that all of the details of one model are used in the same scene.

BTW, the character in your screenshot would look the same from that angle with 10k triangles or even fewer. Most modern characters are originally sculpted in high poly, but those are not the models used at runtime.
123iamking
Gremlin
Posts: 152
Joined: Sat Aug 12, 2017 4:16 pm
x 4

Re: [2.1] Render 10 000 objects per frame - Number of tris per object?

Post by 123iamking »

al2950 wrote: Fri Dec 01, 2017 10:33 am
123iamking wrote: Fri Dec 01, 2017 3:27 am To create a good-looking human model, the number of tris can reach ~400,000.
Really.....!?
Mmm... maybe that's the worst case. If optimized, maybe below 100,000 tris (I took a look at this article).
al2950 wrote: Fri Dec 01, 2017 10:33 am In short, GPUs are way more complicated than 'GPU x can render 1,000,000 tris per second'. In fact there are so many variables it's basically pointless thinking about tris. So there is no sensible answer to your question. However....
I know there is no exact answer, but I just want a range of triangle counts to keep in mind when I create the mesh. I want to get closer to having 10,000 NPCs in my scene (I hope).
al2950 wrote: Fri Dec 01, 2017 10:33 am If you have high poly meshes you will have a high vertex count. So this might put pressure on your vertex shaders, which will be made worse if you are using skeleton animations. This can be alleviated with decent mesh LODs, which Ogre does support. You may also want to look at various tools, like Tootle, which optimize meshes so they are laid out in memory in a more GPU-friendly way.

The other thing you might want to worry about is pixel overdraw, given you have many, many objects. Again, Tootle can help with this on a per-mesh basis. But keep an eye on the pressure on your pixel shaders; again, there are so many variables, like what BRDF you are using, how many lights and transparent objects you have, and whether you use a depth pre-pass.

To be blunt, I have barely touched on the info that may have an impact on your use case, so I suggest you start trying stuff. If you hit a bottleneck, post in the forums and we can try to advise.

One thing I have not mentioned is the rasterization part of the pipeline, because I don't actually know too much about it (Magic black box!). Perhaps dark_sylinc might have some more pointers as he knows way more than me!
Thanks for the tool recommendation :)
dark_sylinc
OGRE Team Member
Posts: 5296
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1278

Re: [2.1] Render 10 000 objects per frame - Number of tris per object?

Post by dark_sylinc »

I see my name, I'm being summoned!

<Casts dark_sylinc spell>

I am writing a book, so I'll just copy paste the relevant section:
Bottlenecks of today and tomorrow
10 years ago, if someone had asked me what the usual rendering bottlenecks were, I would've answered without a doubt: drawcall overhead and render state changes. A lot of time was spent calling the functions, preparing the stack, validating commands, managing hardware-specific commands inside the drivers, etc.

If you follow more or less the guidelines written here, you'll notice these problems are no longer a concern. The CPU will take far less time than the GPU spends rendering.
Nonetheless you may notice your performance is still directly tied to the number of objects submitted, regardless of the number of vertices or triangles on screen, textures, etc.
This would suggest drawcall overhead, yet profiling tools say the CPU is idle. What's going on?

Minimum vertices per draw
Modern GPUs process many vertices in parallel, and that often means there's a minimum number of vertices per draw. Having fewer than that will result in the GPU idling or doing wasted work. In other words, rendering 60 objects of 60 vertices each takes more or less the same as rendering 60 objects of 240 vertices each, despite each object having four times as many vertices.

Just to clarify, this is not about the number of drawcalls. You could fire one drawcall with the instance count set to 10,000 and 1 triangle per instance, and you would still hit this problem. The only solution is to set the triangle count to 10,000 and the instance count to 1.
Certain drivers are able to optimize your instancing (i.e. turn tri_count = 1 instance_count = 10,000 into tri_count = 10,000 instance_count = 1) if several conditions are met that guarantee the results will be the same and that the hardware is capable of doing it. But don't count on this happening automatically in anything but the simplest cases.

Christophe Riccio performed a few tests in 2014 and concluded that a safe minimum sweet spot is around 256 triangles per draw. That's triangles.

Each triangle has 3 vertices. So the minimum is around 768 vertices per draw. That's a lot of vertices!

This has a lot of different implications, but the most obvious one is that object LOD gives you very little benefit once you're below that threshold.
Another less obvious implication is that if you're planning on building a procedural house where each wall is its own draw of 8 vertices, your performance won't be good.
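
One rough way to picture the effect described above is a toy cost model in which every draw is charged for at least the threshold number of vertices. This is only an illustrative sketch: `MIN_EFFECTIVE_VERTICES` reuses the ~768 figure quoted above as an assumption, and `effective_vertex_cost` and `utilization` are hypothetical helpers, not measurements of any real GPU.

```python
# Toy model: each draw (or instance) is charged for at least a minimum
# number of vertices. The threshold reuses the ~768 figure quoted in the
# text; real hardware varies, so treat it as an assumption.
MIN_EFFECTIVE_VERTICES = 768


def effective_vertex_cost(vertices_per_object: int, object_count: int) -> int:
    """Vertex work charged to the GPU under the toy model."""
    per_object = max(vertices_per_object, MIN_EFFECTIVE_VERTICES)
    return per_object * object_count


def utilization(vertices_per_object: int, object_count: int) -> float:
    """Fraction of the charged work that is actually useful."""
    useful = vertices_per_object * object_count
    return useful / effective_vertex_cost(vertices_per_object, object_count)


if __name__ == "__main__":
    for verts in (60, 240, 768, 3000):
        cost = effective_vertex_cost(verts, 60)
        print(f"{verts:5d} verts/object -> cost {cost}, "
              f"utilization {utilization(verts, 60):.1%}")
```

Under this model, 60 objects of 60 vertices and 60 objects of 240 vertices are charged the same amount, which is the behaviour the paragraph above describes. The model deliberately ignores everything else (pixel work, bandwidth, culling).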

Game engines are taking more aggressive approaches to address this problem.
For example, Just Cause 3 takes several buildings in the same block and merges them together. This way 6 buildings become one draw, with enough vertices to fully utilize the hardware. And because all the objects are in the same block, frustum vs AABB culling still works reasonably well.
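
The merging idea can be sketched on the CPU side as follows. This is a minimal illustration, not how Just Cause 3 or Ogre does it: `Mesh`, `merge_meshes` and `merge_by_block` are hypothetical names, meshes are assumed to be already transformed into world space, and the "block" is assumed to be a simple square grid cell.

```python
from dataclasses import dataclass


@dataclass
class Mesh:
    """Minimal indexed mesh: positions plus triangle indices."""
    vertices: list  # list of (x, y, z) tuples, assumed to be in world space
    indices: list   # flat list of ints, 3 per triangle


def merge_meshes(meshes):
    """Concatenate meshes into one draw, rebasing each mesh's indices."""
    merged = Mesh(vertices=[], indices=[])
    for mesh in meshes:
        base = len(merged.vertices)  # index offset for this mesh
        merged.vertices.extend(mesh.vertices)
        merged.indices.extend(base + i for i in mesh.indices)
    return merged


def merge_by_block(placed_meshes, block_size):
    """Group (x, z, mesh) placements by grid cell and merge each cell."""
    blocks = {}
    for x, z, mesh in placed_meshes:
        key = (int(x // block_size), int(z // block_size))
        blocks.setdefault(key, []).append(mesh)
    return {key: merge_meshes(group) for key, group in blocks.items()}
```

Each merged block would then get a single bounding box for culling, which is why keeping the merged objects spatially close matters.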

Merge Instancing
But this introduces a new problem: memory. Just Cause 3, like many games, reuses assets a lot. And I mean A LOT. The same building geometry with a different texture or randomized colour, or with different window decorations, or doors placed somewhere else, or with crates placed to hide the repetition.
One reason for doing this is reducing art costs and meeting deadlines. But another reason is that this keeps VRAM consumption low since the underlying vertex geometry can be shared. This is no longer possible when you merge the buildings into one bigger vertex buffer, unless we reuse the whole block (good luck on hiding the repetition there!)... or can't we?

That's where “merge instancing” comes to the rescue! Which is by the way what Just Cause 3 uses.
Under the hood, merge instancing is just one application of manual vertex pulling.
Manual vertex pulling consists of manually fetching the vertices from the index and vertex buffers within the vertex shader using SV_VertexID/gl_VertexID, instead of relying on the input assembler to do it for you.
TBD write small sample
Merge instancing uses this approach not only to fetch the vertices, but also to select at runtime which submeshes to render, by using offsets and having all meshes laid out contiguously in memory.
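
To make the lookups concrete, here is a CPU-side emulation of the idea in Python. It is only a sketch under stated assumptions: on a real GPU these fetches would live in the vertex shader and be keyed by SV_VertexID/gl_VertexID, and the buffer contents, the `SUBMESHES` table layout, and the `pull_vertex`/`draw_merged` helpers are all made up for illustration.

```python
# All meshes share one vertex buffer and one index buffer, laid out
# contiguously (hypothetical data).
VERTEX_BUFFER = [
    # mesh A: a triangle
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
    # mesh B: a quad
    (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0), (0.0, 2.0, 0.0),
]
INDEX_BUFFER = [
    0, 1, 2,           # mesh A, indices relative to its first vertex
    0, 1, 2, 0, 2, 3,  # mesh B
]

# Per-submesh offsets into both buffers.
SUBMESHES = {
    "A": {"first_index": 0, "index_count": 3, "base_vertex": 0},
    "B": {"first_index": 3, "index_count": 6, "base_vertex": 3},
}


def pull_vertex(vertex_id, submesh, offset):
    """Fetch one vertex by hand, as a vertex shader would with a vertex ID."""
    entry = SUBMESHES[submesh]
    index = INDEX_BUFFER[entry["first_index"] + vertex_id]
    x, y, z = VERTEX_BUFFER[entry["base_vertex"] + index]
    return (x + offset[0], y + offset[1], z + offset[2])


def draw_merged(instances):
    """Emulate one draw covering several (submesh, offset) instances."""
    output = []
    for submesh, offset in instances:
        for vertex_id in range(SUBMESHES[submesh]["index_count"]):
            output.append(pull_vertex(vertex_id, submesh, offset))
    return output
```

For example, `draw_merged([("A", (0, 0, 0)), ("B", (10, 0, 0)), ("A", (20, 0, 0))])` produces 12 vertices from a single emulated draw while mesh A's geometry is stored only once, which is the memory saving described above.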

Note: Ogre 2.1 does not implement merge instancing (yet)

Triangles too small
Another modern performance problem is triangles being too small. By nature, pixel shaders need to be executed on blocks of at least 2x2 pixels. This isn't a problem if the triangle covers all 4 pixels. But if it covers only one pixel, the GPU must still run the shader for three additional pixels, often called 'helper invocations'. These helpers are needed in order to properly calculate derivatives, which are needed for proper mipmap LOD calculation and for trilinear and anisotropic filtering.
So when a triangle covers only one pixel, your pixel shader runs for four pixels. That's a lot of waste!
But what if the triangle is so small it doesn't cover any pixel? Well… modern GPUs still use a fixed-function unit, the rasterizer, which runs right before the pixel shader. If you've got too many triangles that aren't actually visible, the rasterizer may become the bottleneck!
This issue is so severe that AMD created a compute shader solution called 'GeometryFX' that removes invisible triangles every frame. The solution relies on the fact that there are a lot more shader cores, which can perform the job faster than the rasterizer can. The rasterizer can't just do that automatically for you because it performs other tasks that aren't very shader/parallel friendly. And besides, GeometryFX requires storing results somewhere, and it is only a performance win when you've got lots of tiny triangles, something the rasterizer can't know beforehand.
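
The 2x2 quad cost described above can be counted with a few lines of Python. This is a simplified sketch: it assumes quads are aligned to even pixel coordinates and that every touched quad is shaded in full, and `quad_shading_cost` is a hypothetical helper, not part of any API.

```python
def quad_shading_cost(covered_pixels):
    """Return (useful, shaded) pixel-shader invocations for one triangle.

    covered_pixels: iterable of integer (x, y) pixels the triangle covers.
    Every 2x2 quad touched by the triangle is shaded in full, so the
    pixels in a quad that are not covered become helper invocations.
    """
    covered = set(covered_pixels)
    quads = {(x // 2, y // 2) for x, y in covered}
    return len(covered), 4 * len(quads)


if __name__ == "__main__":
    useful, shaded = quad_shading_cost([(5, 7)])
    print(f"{useful} useful pixel, {shaded} shader invocations")
```

A triangle that covers two pixels in diagonally adjacent quads, e.g. `(1, 1)` and `(2, 2)`, costs 8 invocations for 2 useful pixels under this model.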

Triangles being too small can also be fought with LODs. But beware: overly aggressive LODding can result in fewer vertices than the ideal per draw.
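
One possible way to act on that warning is to stop descending the LOD chain once a level has already dropped below the per-draw minimum, since going further saves little vertex work. The sketch below is an assumption-laden illustration: the `lods` layout, the 768-vertex default, and `pick_lod` itself are hypothetical and unrelated to Ogre's actual LOD strategy classes.

```python
import bisect


def pick_lod(lods, distance, min_vertices=768):
    """Pick a LOD level index for an object at the given distance.

    lods: list of (start_distance, vertex_count), sorted by start_distance,
          with level 0 the most detailed. Layout is an assumption.
    min_vertices: per-draw floor below which extra reduction is assumed
                  to give no further vertex-processing benefit.
    """
    starts = [start for start, _ in lods]
    level = max(bisect.bisect_right(starts, distance) - 1, 0)

    # Deepest level worth using: the first one that falls below the floor.
    deepest_useful = len(lods) - 1
    for i, (_, vertex_count) in enumerate(lods):
        if vertex_count < min_vertices:
            deepest_useful = i
            break
    return min(level, deepest_useful)
```

With `lods = [(0, 20000), (50, 5000), (150, 900), (400, 300), (800, 80)]`, an object at distance 1000 gets level 3 rather than level 4. Note this only looks at the vertex side; a lower level may still help with tiny triangles, so whether to clamp at all is something to measure.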

Inefficient wave utilization
The previous problems we've covered are actually subsets of this one. It just happens that those problems were very specific, well understood, and have known solutions.
GPUs are massively parallel machines. This means that maximum efficiency is achieved when all shader cores are utilized (i.e. not idle waiting) and all of them are producing useful work that won't be thrown away or discarded. Amdahl's Law shows how brutally an efficiency issue can cost you: if only 95% of the work is parallel you'll never exceed 20x scalability (and even at 99% the ceiling is 100x), no matter how many cores you throw at it.
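
Amdahl's Law is easy to check numerically. With a parallel fraction p and n cores, the speedup is 1 / ((1 - p) + p / n), and the ceiling as n grows is 1 / (1 - p). The function names below are just for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup on `cores` cores when `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup with an unlimited number of cores."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    for p in (0.90, 0.95, 0.99):
        print(f"p={p:.2f}: 4096 cores -> {amdahl_speedup(p, 4096):.1f}x, "
              f"limit {amdahl_limit(p):.0f}x")
```

So a 95% parallel workload tops out at 20x and a 99% parallel one at 100x, regardless of how many shader cores are available.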

Few vertices per batch? That's an inefficient wave utilization problem
Triangles covering 1 pixel consume 4 pixels worth of work? That's an inefficient wave utilization problem

This problem is so general that there is no single piece of advice to give. AMD is pushing async compute in Vulkan and D3D12 as a means to fully utilize idle resources. Inefficient wave utilization can happen because:
  • The shader waits for data (e.g. texture reads). The GPU couldn't hide all the latency, so it stalls. A common solution is to improve wave occupancy (shader code should use fewer registers), pack data more tightly, or use smaller formats (e.g. 16-bit halfs instead of floats)
  • Frequent L1 & L2 cache misses. This can happen if your texture reads are too sparse (e.g. a common problem with SSAO) or your wave occupancy is too high (because the GPU keeps jumping between shaders, thrashing the caches)
  • You're issuing independent work in separate pieces, in between other work that forces dependencies, when it could be packed together.
There are no golden tools unfortunately, unless you're a console developer.
In addition to that, you can read my older post Vertex count vs Poly count.
TL;DR: the triangle count is mostly meaningless (has little value). It's more important to know the in-engine vertex count (i.e. after the mesh has been exported to a game engine, where the vertex count may be higher than what Maya reports)
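
A small sketch of why the in-engine count can be higher: a modelling tool typically counts unique positions, while a GPU vertex is a unique combination of all attributes, so a position is duplicated wherever normals or UVs differ. The example below builds a hard-edged cube to show this; the helper names and the attribute layout are assumptions for illustration.

```python
def hard_edged_cube():
    """Return one (position, normal, uv) tuple per triangle corner."""
    corners = []
    for axis in range(3):
        for sign in (-1.0, 1.0):
            normal = tuple(sign if i == axis else 0.0 for i in range(3))
            u_axis, v_axis = [i for i in range(3) if i != axis]
            quad = []
            for u, v in ((-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)):
                position = [0.0, 0.0, 0.0]
                position[axis] = sign
                position[u_axis] = u
                position[v_axis] = v
                uv = ((u + 1.0) / 2.0, (v + 1.0) / 2.0)
                quad.append((tuple(position), normal, uv))
            # Two triangles per face.
            corners.extend(quad[i] for i in (0, 1, 2, 0, 2, 3))
    return corners


def modeling_tool_vertex_count(corners):
    """Unique positions, roughly what a modelling tool reports."""
    return len({position for position, _, _ in corners})


def engine_vertex_count(corners):
    """Unique (position, normal, uv) combinations, what the GPU processes."""
    return len(set(corners))


if __name__ == "__main__":
    cube = hard_edged_cube()
    print(len(cube) // 3, "triangles,",
          modeling_tool_vertex_count(cube), "positions,",
          engine_vertex_count(cube), "GPU vertices")
```

The cube has 12 triangles and 8 positions, but 24 GPU vertices, because each corner is shared by three faces with different normals.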

OK, now after all that heavy reading, the TL;DR answer is:
  1. GPU performance is complex. Don't ask us. Just try it. Take a few generic humans that resemble your target and spawn them like crazy. Basically, make a very quick mockup that resembles the payload the real game will have to handle. It doesn't matter that it looks ugly.
  2. Skeleton animation is expensive. So that will be an issue. Our v2 skeleton is highly optimized, but that doesn't mean it won't be a problem.
  3. Most games focus a lot of detail on what you will see most, and then give shitty vertex counts to everything else. For example Warcraft III in 2002 fooled your brain into thinking everything was high quality because the avatar of the selected character looked very high poly, while the actual unit (that you often see in the distance) had less detail than a lego model.
Cheers
Matias
al2950
OGRE Expert User
Posts: 1227
Joined: Thu Dec 11, 2008 7:56 pm
Location: Bristol, UK
x 157

Re: [2.1] Render 10 000 objects per frame - Number of tris per object?

Post by al2950 »

dark_sylinc wrote: Fri Dec 01, 2017 5:18 pm I see my name, I'm being summoned!

<Casts dark_sylinc spell>
Yay, I assume that there are a limited number of times that spell can be used :D
dark_sylinc wrote: Fri Dec 01, 2017 5:18 pm I am writing a book, so I'll just copy paste the relevant section:
Interesting, although it means yet another book I will have to buy and read! :lol:
123iamking
Gremlin
Posts: 152
Joined: Sat Aug 12, 2017 4:16 pm
x 4

Re: [2.1] Render 10 000 objects per frame - Number of tris per object?

Post by 123iamking »

al2950 wrote: Fri Dec 01, 2017 10:33 am If you have high poly meshes you will have a high vertex count. So this might put pressure on your vertex shaders, which will be made worse if you are using skeleton animations.
dark_sylinc wrote: Fri Dec 01, 2017 5:18 pm Skeleton animation is expensive. So that will be an issue. Our v2 skeleton is highly optimized, but that doesn't mean it won't be a problem.
This is new to me. But if I understand correctly, isn't skeleton animation the only way to animate a character in Ogre? So does that mean animating characters in Ogre is expensive?
I want to animate cloth instead of using a physics engine to simulate it, like in these videos:
Animating cloth physics in Maya
Or making breeze effect with Maya
Normally, I would think animating cloth statically (instead of simulating it at run time) would be more efficient. But I have my doubts now. Please confirm :)