Bottlenecks of today and tomorrow
10 years ago, if someone had asked me what the usual rendering bottlenecks were, I would've answered without a doubt: drawcall overhead and render state changes. A lot of time was spent calling the API functions, preparing the stack, validating commands, managing hardware-specific commands inside the drivers, etc.
If you more or less follow the guidelines written here, you'll notice these problems are no longer a concern. The CPU will take far less time than what the GPU spends rendering.
Nonetheless you may notice your performance is still directly tied to the number of objects submitted, regardless of the number of vertices or triangles on screen, textures, etc.
This would suggest drawcall overhead, yet profiling tools say the CPU is idle. What's going on?
Minimum vertices per draw
Modern GPUs process many vertices in parallel, and that often means there's a minimum number of vertices per draw. Having fewer than that will result in the GPU doing idle/wasted work. In other words, rendering 60 objects of 60 vertices each takes more or less the same time as rendering 60 objects of 240 vertices each, despite the latter having four times as many vertices per object.
Just to clarify, this is not about the number of drawcalls. You could fire one drawcall with the instance count set to 10,000 and 1 triangle per instance, and you will still hit this problem. The only solution is to set the triangle count to 10,000 and the instance count to 1.
Certain drivers are able to optimize your instancing (i.e. turn tri_count = 1, instance_count = 10,000 into tri_count = 10,000, instance_count = 1) if several conditions are met that guarantee the results will be the same and that the hardware is capable of doing it. But don't count on this happening automatically for you except in the simplest cases.
Christophe Riccio performed a few tests in 2014 and concluded that a safe minimum sweet spot is around 256 triangles per draw. That's triangles, not vertices.
Each triangle has 3 vertices. So the minimum is around 768 vertices per draw. That's a lot of vertices!
This has a lot of different implications, but the most obvious one is that object LOD gives you very little benefit once you're below that threshold.
Another less obvious implication is that if you're planning on building a procedural house where each wall is its own draw of 8 vertices, your performance won't be good.
Game engines are taking more aggressive approaches to address this problem.
For example, Just Cause 3 takes several buildings in the same block and merges them together. This way 6 buildings become one draw, with enough vertices to fully utilize the hardware. And because all the objects are in the same block, frustum vs AABB culling still works reasonably well.
Merge Instancing
But this introduces a new problem: memory. Just Cause 3, like many games, reuses assets a lot. And I mean A LOT. Same building geometry with a different texture or randomized colour, or with different window decorations, or doors placed somewhere else, or with crates placed to hide the repetition.
One reason for doing this is reducing art costs and meeting deadlines. But another reason is that it keeps VRAM consumption low, since the underlying vertex geometry can be shared. This is no longer possible when you merge the buildings into one bigger vertex buffer, unless we reuse the whole block (good luck hiding the repetition there!)... or is it?
That's where “merge instancing” comes to the rescue! Which is by the way what Just Cause 3 uses.
Under the hood merge instancing is just one application of manual vertex pulling.
Manual vertex pulling consists of manually fetching the vertices from the index and vertex buffers within the vertex shader using SV_VertexID/gl_VertexID, instead of relying on the Input Assembler to do it for you.
For example, a small GLSL sketch of manual vertex pulling could look like the following (the buffer bindings and the position + normal vertex layout here are assumptions for illustration, not a fixed requirement):
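```glsl
#version 430

// Vertex and index data bound as plain buffers instead of through the Input Assembler.
// Binding points and the float-based layout are assumptions for this example.
layout( std430, binding = 0 ) readonly buffer VertexBuf { float vertexData[]; };
layout( std430, binding = 1 ) readonly buffer IndexBuf  { uint  indexData[];  };

uniform mat4 worldViewProj;

out vec3 vNormal;

void main()
{
    // Fetch the index ourselves, then the vertex it points to.
    // The draw is issued non-indexed, so gl_VertexID simply runs 0..numIndices-1.
    uint idx = indexData[gl_VertexID];

    // Assumed layout: 3 floats position + 3 floats normal, tightly packed (stride of 6 floats).
    vec3 position = vec3( vertexData[idx * 6u + 0u],
                          vertexData[idx * 6u + 1u],
                          vertexData[idx * 6u + 2u] );
    vNormal       = vec3( vertexData[idx * 6u + 3u],
                          vertexData[idx * 6u + 4u],
                          vertexData[idx * 6u + 5u] );

    gl_Position = worldViewProj * vec4( position, 1.0 );
}
```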
Merge instancing uses this approach not only to fetch the vertices, but also to select at runtime which submeshes to render, by using offsets and having all meshes laid out contiguously in memory.
Note: Ogre 2.1 does not implement merge instancing (yet)
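To give an idea of how that could look, here is a rough GLSL sketch that builds on the vertex pulling sample above. The per-instance record, its fields and the padding-to-a-fixed-size strategy are assumptions made up for this example; they are not how Just Cause 3 or any particular engine lays out its data:

```glsl
#version 430

layout( std430, binding = 0 ) readonly buffer MergedVertexBuf { float vertexData[]; };
layout( std430, binding = 1 ) readonly buffer MergedIndexBuf  { uint  indexData[];  };

// Hypothetical per-submesh-instance record: where the submesh lives inside the
// merged buffers and which world matrix to use.
struct SubmeshInstance
{
    uint indexStart;    // first index of this submesh in the merged index buffer
    uint indexCount;    // how many indices this submesh really has
    uint transformIdx;  // which world matrix to apply
    uint padding;
};
layout( std430, binding = 2 ) readonly buffer InstanceBuf  { SubmeshInstance instances[]; };
layout( std430, binding = 3 ) readonly buffer TransformBuf { mat4 worldMatrices[]; };

uniform mat4 viewProj;
// All "instances" are drawn with the same fixed index count (the largest submesh);
// smaller submeshes are padded with degenerate triangles.
uniform uint indicesPerInstance;

void main()
{
    uint instanceId = uint( gl_VertexID ) / indicesPerInstance;
    uint localIdx   = uint( gl_VertexID ) % indicesPerInstance;

    SubmeshInstance inst = instances[instanceId];

    // Clamp the padded range: the extra vertices collapse into zero-area triangles.
    localIdx = min( localIdx, inst.indexCount - 1u );

    uint idx = indexData[inst.indexStart + localIdx];
    vec3 position = vec3( vertexData[idx * 6u + 0u],
                          vertexData[idx * 6u + 1u],
                          vertexData[idx * 6u + 2u] );

    gl_Position = viewProj * worldMatrices[inst.transformIdx] * vec4( position, 1.0 );
}
```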
Triangles too small
Another modern performance problem is triangles being too small. By nature, pixel shaders need to be executed on blocks of at least 2x2 pixels. This isn't a problem if the triangle covers all 4 pixels. But if it covers only one pixel, the GPU must still run the other three, often called 'helper invocations'. These helpers are needed in order to properly calculate derivatives, which are needed for correct mipmap LOD calculation, trilinear and anisotropic filtering.
So when a triangle covers only one pixel, your pixel shader runs for four pixels. That's a lot of waste!
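For reference, this is roughly how those derivatives get used. The following fragment shader sketch mimics the mip level selection the hardware performs (the exact formula varies per GPU; this is the textbook approximation):

```glsl
#version 430

uniform sampler2D diffuseTex;
in vec2 uv;
out vec4 fragColour;

void main()
{
    // The 2x2 quad exists so dFdx/dFdy can compare the UVs of neighbouring pixels;
    // without the helper lanes these derivatives would be undefined.
    vec2 dx = dFdx( uv );
    vec2 dy = dFdy( uv );

    // Textbook approximation of the mip level the sampler derives from those
    // derivatives, scaled to texel space.
    vec2 texSize   = vec2( textureSize( diffuseTex, 0 ) );
    vec2 dxTexels  = dx * texSize;
    vec2 dyTexels  = dy * texSize;
    float mipLevel = 0.5 * log2( max( dot( dxTexels, dxTexels ), dot( dyTexels, dyTexels ) ) );

    fragColour = textureLod( diffuseTex, uv, mipLevel );
}
```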
But what if the triangle is so small it doesn't cover any pixel? Well… modern GPUs still use a fixed-function unit, the rasterizer, which runs right before the pixel shader. If you've got too many triangles that aren't actually visible, the rasterizer may become the bottleneck!
This issue is so severe that AMD created a compute shader solution called 'GeometryFX' that removes invisible triangles every frame. The solution relies on the fact that there are a lot more shader cores, which can perform the job faster than the rasterizer can. The rasterizer can't just do that automatically for you because it performs other tasks that aren't very shader/parallel friendly. And besides, GeometryFX requires storing the results somewhere, and it is only a performance win when you've got lots of tiny triangles, something the rasterizer can't know beforehand.
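The gist of such a filter, heavily simplified, is a compute shader that tests each triangle and only appends the survivors to a new index buffer. The GLSL sketch below only shows the core idea (a backface test via a signed-area check after projection); it is not AMD's actual GeometryFX code, and the buffer layout and the single culling test are assumptions for illustration:

```glsl
#version 430

layout( local_size_x = 64 ) in;

layout( std430, binding = 0 ) readonly  buffer SrcIndices { uint srcIndices[]; };
layout( std430, binding = 1 ) readonly  buffer Positions  { vec4 positions[];  };
layout( std430, binding = 2 ) writeonly buffer DstIndices { uint dstIndices[]; };
layout( binding = 0 ) uniform atomic_uint visibleTriCount;

uniform mat4 worldViewProj;
uniform uint numTriangles;

void main()
{
    uint triId = gl_GlobalInvocationID.x;
    if( triId >= numTriangles )
        return;

    uint i0 = srcIndices[triId * 3u + 0u];
    uint i1 = srcIndices[triId * 3u + 1u];
    uint i2 = srcIndices[triId * 3u + 2u];

    vec4 v0 = worldViewProj * positions[i0];
    vec4 v1 = worldViewProj * positions[i1];
    vec4 v2 = worldViewProj * positions[i2];

    // Backface test: signed area of the projected triangle (only valid when w > 0;
    // a real implementation also handles near-plane clipping, frustum culling and
    // tiny triangles that miss every sample point).
    vec2 a = v0.xy / v0.w;
    vec2 b = v1.xy / v1.w;
    vec2 c = v2.xy / v2.w;
    float signedArea = ( b.x - a.x ) * ( c.y - a.y ) - ( b.y - a.y ) * ( c.x - a.x );

    if( signedArea <= 0.0 )
        return; // culled

    // Compact the surviving triangle into the output index buffer.
    uint slot = atomicCounterIncrement( visibleTriCount );
    dstIndices[slot * 3u + 0u] = i0;
    dstIndices[slot * 3u + 1u] = i1;
    dstIndices[slot * 3u + 2u] = i2;
}
```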
Triangles being too small can also be fought with LODs. But beware: too aggressive LODding can result in fewer vertices per draw than the ideal.
Inefficient wave utilization
The previous problems we've covered are actually subsets of this one. It just happens that those problems are very specific, well understood, and have known solutions.
GPUs are massively parallel machines. This means that maximum efficiency is achieved when all shader cores are utilized (i.e. not idle waiting) and all of them are producing useful work that won't be thrown away or discarded.
Amdahl's Law points out how brutal the cost of an efficiency issue can be: even if 99% of the work is parallel, you'll never exceed a 100x speedup no matter how many cores you throw at it.
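In formula form, with P being the parallel fraction of the work and N the number of cores:

Speedup(N) = 1 / ( (1 − P) + P / N ), which tends to 1 / (1 − P) as N grows. With P = 0.99 that limit is 1 / 0.01 = 100, no matter how many cores you add.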
Few vertices per batch? That's an inefficient wave utilization problem
Triangles covering 1 pixel consume 4 pixels worth of work? That's an inefficient wave utilization problem
This problem is so general that there is no single piece of advice to give. AMD is pushing async compute in Vulkan and D3D12 as a means to fully utilize idle resources. Inefficient wave utilization can happen because:
- The shader waits for data (e.g. texture reads). The GPU couldn't hide all the latency, so it stalls. A common solution is to improve wave occupancy (shader code should use fewer registers), or to pack data more tightly or use smaller formats (e.g. 16-bit halves instead of 32-bit floats; see the sketch after this list).
- Frequent L1 & L2 cache misses. This can happen if your texture reads are too sparse (e.g. a common problem with SSAO) or your wave occupancy is too high (because the GPU keeps jumping between shaders, thrashing the caches).
- You're issuing independent pieces of work separately, in between other work that forces dependencies, when they could have been packed together.
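As an example of the "pack data more tightly" advice above, a small GLSL sketch; the velocity data being packed is made up purely for illustration. Two floats that don't need full 32-bit precision can be stored as one uint via packHalf2x16, halving the space and bandwidth they consume:

```glsl
#version 430

// Writing: pack two floats into a single uint (two 16-bit halves) before storing.
layout( std430, binding = 0 ) buffer PackedData { uint packedVelocityXY[]; };

void storeVelocity( uint idx, vec2 velocity )
{
    packedVelocityXY[idx] = packHalf2x16( velocity );
}

// Reading: unpack back to two floats.
vec2 loadVelocity( uint idx )
{
    return unpackHalf2x16( packedVelocityXY[idx] );
}
```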
There are no golden tools unfortunately, unless you're a console developer.