Vertex & Index Buffer refactoring

Discussion area about developing with Ogre-Next (2.1, 2.2 and beyond)


dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Vertex & Index Buffer refactoring

Post by dark_sylinc »

So, I'm refactoring the OpenGL 3+ code so that it's no longer dead slow (and by extension, GLES 2, although to a lesser extent as some features aren't available).

I've hit a giant wall: VAOs.
First, some background: VAOs are mandatory in GL3+, and they force coupling of vertex buffers (VBOs in GL jargon), index buffers (element arrays in GL jargon) and the vertex declaration (VertexDeclaration in Ogre jargon). Which is sad.
  • In the Vertex Array Performance benchmark, baked VAOs are shown to beat a single global VAO, which is to be expected. The problem is that...
  • Like G-Truc says, if the vertex declaration were decoupled from VBOs, we could bind the VAO's vertex declaration once and use it for a thousand different meshes (binding only the VBOs and index buffers per mesh), like D3D does, achieving similar performance with more flexibility. Sadly, bindless GPU vertex buffers aren't in Core (supplying a wrong GPU address = system hang). I have several ideas on how this could easily be fixed in OpenGL, but I'm not part of the board. We're stuck with VAOs.
  • In a comment on Anteru's blog, a user explains that they've adopted the VAO model and forced D3D11 to mimic its behavior. I like this idea.
Now, the Ogre problem is that building the VAOs requires a lot of information. Namely, we need:
  • All that is inside VertexData (The VertexBuffer, VertexDeclaration, VertexBufferBinding)
  • The Index Buffer (IndexData)
All of this is in one place: RenderOperation. But there's a problem: RenderOperation is quite inconsistent and mutable across Ogre.
For Entities, the SubEntity grabs the RenderOp from the SubMesh and modifies it on the fly depending on whether SW skeletal animation, SW morph, or HW morph is being used. SimpleRenderable owns its RenderOp. Same with InstancedBatch.
If the RenderOp is modified, the RenderSystem is not notified.

There's another problem we need to tackle: rendering multiple meshes with a single call. MultiDrawIndirect can do this, but it needs all the data packed in a single vertex buffer. D3D12 should be getting something similar too, but I'm afraid I'm only speculating. I don't know how Mantle approaches it.

There is one way to solve this that should work in all three, though (GL, D3D12, Mantle... and it should help a lot in D3D11!): packing everything together.
To Ogre, a mesh would look like a VertexBuffer with 500 vertices that starts at vertex 0 and finishes at vertex 499.
Under the hood, one big fat VBO is created (e.g. with capacity for 10,000 vertices), and this particular mesh occupies vertices 2521 through 3020. Same with the index buffer.
When the 10,000-vertex capacity is exhausted, another VBO is created. VBOs would be created in categories, so that Ogre's STATIC_WRITE_ONLY buffers don't go to the same VBO pool as DYNAMIC buffers.
We would have different tweakable pools with different settings (i.e. default vertex count per batch, update flags, etc), just like I did with the HLMS Texture Manager.

To render each mesh, we just bind the one VAO that rules them all (there will probably be a couple of VAOs) and supply specific vertexStart and indexStart parameters. In D3D11, GLES2 & GL3+ this would reduce API overhead significantly, as binding happens very infrequently while the render calls are still made per object.
In GL 4.4, with MultiDrawIndirect, one single draw call would suffice (API overhead is reduced even more).
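
Roughly, the per-object path would look like this (an untested sketch; globalVao, visibleMeshes, indexStart & vertexStart are made-up names):

Code:

// All meshes live in the same pooled VBO/IBO, described by one VAO.
glBindVertexArray( globalVao ); // bound once, not per mesh

for( const Mesh &mesh : visibleMeshes )
{
    // The index offset is passed as a byte offset into the pooled index
    // buffer; baseVertex shifts every fetched index into this mesh's region.
    glDrawElementsBaseVertex(
        GL_TRIANGLES, mesh.indexCount, GL_UNSIGNED_SHORT,
        reinterpret_cast<void *>( mesh.indexStart * sizeof( uint16_t ) ),
        mesh.vertexStart );
}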

This can work wonders, but it works fundamentally differently from how Ogre treats its buffers. In other words, it needs a big refactor in the following areas:
  • VertexData (all of it)
  • IndexData
  • RenderOperation
  • RenderSystem (so that it can control the changes to RenderOperation)
  • All the classes that deal with it
This could take a while, and longer than I anticipated. The changes could also be far-reaching. But the result could be very sweet.

Any feedback?
holocronweaver
Google Summer of Code Student
Posts: 273
Joined: Mon Oct 29, 2012 8:52 pm
Location: Princeton, NJ
x 47

Re: Vertex & Index Buffer refactoring

Post by holocronweaver »

First, I want to thank you for giving me the kick in the rear needed to finally start thinking about this again.

For MultiDrawIndirect my bible is a blog post by Graham Sellers and the associated AZDO slides. As you will see, using shader storage block arrays with individual buffers at each index, or else packing VBOs, are the only paths currently available for MultiDrawIndirect. There are some other limitations to this approach, all outlined in the blog post. I list the main ones here:
  • users will have to use uber-shaders
  • all vertex data needs to be available from the outset if you choose to pack the vertex buffers, or else you have to handle buffer mapping
  • vertex data formats must be the same within a draw, or else use shader storage buffers to store vertex data
  • uniform and shader storage blocks and texture arrays are a necessity for most per-entity constants storage
  • state changes cannot be made within a draw, though I believe the Hlms will handle this by batching draws with the same parameters
I am in favor of the shader storage buffer array approach, which allows us to have individual vertex buffers per subentity. I suspect this is roughly the approach GL5, D3D12, and Mantle will take. Is there an equivalent for D3D11?

I want to draw particular attention to the Generating Draws section of the above blog post. The options outlined are:
  • Traverse a scene graph on the CPU in the traditional way and bucket draw calls rather than executing them. Once the tree is fully traversed, send the bucketed draws to the GPU using MultiDrawIndirect. This allows parallel scene graph threads without fear of driver sync issues.
  • Make the scene graph GPU-accessible so a compute shader or transform feedback can traverse it and generate draw bucket lists.
I know the latter is a bit extreme, but I was floored by the notion that it is currently doable. The first option does not require dramatic changes to the traditional scene graph traversal on the CPU. If I had my way, we would go with the compute shader approach, but I am not sure how feasible it is in D3D11.

So the real question here is how well all of this will translate into D3D11, so we can keep the rendering engine unified. So... please step up, D3D people!
dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

holocronweaver wrote:As you will see, using shader storage block arrays with individual buffers at each index or else packing VBOs are the only paths currently available for MultiDrawIndirect. There are some other limitations from this approach, all outlined in the blog post. I list the main ones here:
(...)
I am in favor of the shader storage buffer array approach which allows us to have individual vertex buffers per subentity. I suspect this is roughly the approach GL5, D3D12, and Mantle will take. Is there an equivalent for D3D11?
I get what you mean, but I haven't found a single example of the "shader storage block arrays with individual buffers at each index" method, so I cannot assess it properly.
How is the index buffer even being dealt with?

As for the cons you mention, the Hlms can deal with them with no problem.
holocronweaver
Google Summer of Code Student
Posts: 273
Joined: Mon Oct 29, 2012 8:52 pm
Location: Princeton, NJ
x 47

Re: Vertex & Index Buffer refactoring

Post by holocronweaver »

dark_sylinc wrote: I get what you mean, but I haven't found a single example of the "shader storage block arrays with individual buffers at each index" method, so I cannot assess it properly.
How is the index buffer even being dealt with?
No matter your approach, you have to index your per-draw data somehow. Current GL does NOT iterate per-draw data for you; instead, the user must use a per-draw index in the shader. There are several methods of creating a per-draw index, but I prefer gl_DrawID since it is standardized, clear, and should soon be fast across all vendors.

Here are C++ and GLSL snippets demonstrating per-draw indexed shader storage buffers:

Code:

// Bind per-draw buffer data to consecutive shader storage binding points.
// Assumes one buffer per entity, each packed with all pertinent data.
const GLuint blockIndexOffset = 0;
const GLintptr bufferOffset = 0;
for( size_t index = 0; index < entities.size(); ++index )
{
    glBindBufferRange( GL_SHADER_STORAGE_BUFFER, blockIndexOffset + index,
                       entities[index].getBuffer(), bufferOffset,
                       entities[index].bufferSize );
}

Code:

#version 430
#extension GL_ARB_shader_draw_parameters : require

buffer DrawData
{
    mat4 MVP;
    vec3 vertex;
    vec2 texCoord;
    float myProperty;
} draw[];

out gl_PerVertex
{
    vec4 gl_Position;
};

// 'out' is a reserved word in GLSL, so the output block instance
// needs a different name.
out block
{
    vec2 texCoord;
} outBlock;

void main()
{
    // The extension spells the built-in gl_DrawIDARB.
    gl_Position = draw[gl_DrawIDARB].MVP * vec4(draw[gl_DrawIDARB].vertex, 1);
    outBlock.texCoord = draw[gl_DrawIDARB].texCoord;
}
Note that the draw block has no defined array length. The length is determined by the bound buffer.

I can expand on this example if it is still not clear how this would work.
dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

I see the point, but I fail to see how the index buffer is being managed. As far as I know, the index buffers would have to be merged, unless this is non-indexed geometry.

Perhaps a simple pure-GL sample (no need for Ogre) showing different meshes being drawn in the same call could convince me (I already like the idea, but I'm not seeing the whole picture).
lunkhound
Gremlin
Posts: 169
Joined: Sun Apr 29, 2012 1:03 am
Location: Santa Monica, California
x 19

Re: Vertex & Index Buffer refactoring

Post by lunkhound »

A major refactor like this just to optimize for OpenGL seems like a bad idea, given the sorry state of OpenGL.

To quote from Rich's blog:
Most devs will take the easy path and port their PS4/Xbone rendering code to D3D12/Mantle. They will not bother to re-write their entire rendering pipeline to use super-aggressive batching, etc. like the GL community has been recently recommending to get perf up. GL will be treated like a second-class citizen and porting target until the API is modernized and greatly simplified.
It sounds like this is really a problem with OpenGL. If we embark on a huge refactor of OGRE's architecture to work around this one problem with OpenGL and then it gets fixed, it will have been a waste.
dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

lunkhound wrote:If we embark on a huge refactor of OGRE's architecture to work around this one problem with OpenGL and then it gets fixed, it will have been a waste.
This is something I fear.

However, there are three points:
1. I've been thinking the problem over. A huge refactor is suicide and would take too long (or we might as well write another engine from scratch). So... let's keep doing what we've been doing in 2.0: code the new stuff, and slowly start phasing out the old stuff (i.e. OldBone vs Bone).
The idea revolves around creating a new structure that is very similar to RenderOperation but designed around VAOs (let's call it RenderPackage), and creating a new Entity (which uses these VAO structs, plus hardware morph, hardware skeletal animation, etc.), while the old Entity becomes OldEntity.
When a RenderOperation is used, the global VAO method is used. When this RenderPackage is used, it just sets the VAO and goes.
This means evaluating whether to use RenderOperation or RenderPackage for every object that needs to rebind; in other words, RenderPackage's performance won't be affected (since changing the binding rarely happens) and RenderOperation would be slower (it is already slow anyway).

This approach is much more incremental and doesn't involve huge porting efforts. Plus, we may get to see the benefits sooner. Eventually, in the long term, RenderOperation would be eradicated.

2. GLES2 is a huge market. It's the only way to render on Android and iOS. No Mantle, no D3D12 there. I am convinced that Ogre needs to focus on the mobile segment to survive. Let's admit it: having the title of "fastest engine in the mobile space" has a nice ring to it.
At the rate OpenGL moves forward, I wouldn't expect OpenGL ES 4 (or OpenGL 5) to include a solution to this problem.
Also, this method should help D3D11 too. D3D12 works on D3D11-capable hardware, but let's not forget you may want to stay on D3D11 if you want to support 10- and 10.1-level hardware.

3. This technique (aggressive batching) would still work in D3D12 & Mantle.
TheSHEEEP
OGRE Retired Team Member
Posts: 972
Joined: Mon Jun 02, 2008 6:52 pm
Location: Berlin
x 65

Re: Vertex & Index Buffer refactoring

Post by TheSHEEEP »

lunkhound wrote:If we embark on a huge refactor of OGRE's architecture to work around this one problem with OpenGL and then it gets fixed, it will have been a waste.
I do not really see much of an alternative, though.
If we have to focus on VAOs to determine how rendering works in Ogre (and it seems like we have to), it can only be OpenGL.
DirectX does not and never will work on all platforms. Mantle may or may not take off, so modeling anything after it is just too risky. So only OpenGL remains.
If I got dark_sylinc right, it just so happens that the model that fits OpenGL will also work for DirectX, and even improve things!

We could say "aw, screw it, OpenGL remains slower than DirectX". But that would not really be wise, considering the mobile market, plus the growing non-Windows desktop market (mainly OSX; Linux more slowly).
dark_sylinc wrote:So... let's keep doing what we've been doing in 2.0: code the new stuff, and slowly start phasing out the old stuff (i.e. OldBone vs Bone)
Yep!
My site! - Have a look :)
Also on Twitter - extra fluffy
dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

TheSHEEEP wrote:If we have to focus on VAOs to determine how rendering works in Ogre (and it seems like we have to), it can only be OpenGL.
DirectX does not and never will work on all platforms. Mantle may or may not take off, so modeling anything after it is just too risky. So only OpenGL remains.
If I got dark_sylinc right, it just so happens that the model that fits OpenGL will also work for DirectX, and even improve things!
Yep. This boils down to a typical management decision scenario where we have to take a risk and choose, because there is no clear "winner" (each one has a benefit, a risk, and an uncertainty factor):
  • Mantle - Benefit: superfast, to-the-metal programming. Every programmer's dream. Could even replace OpenGL. - Risk: may not take off; it's vendor-specific. - Uncertainty: no public API or documentation available.
  • Direct3D12 - Benefit: it's to the metal (though not as much as Mantle). Traditionally DX has had better tools. - Risk: the growing market of non-MS platforms; release is a year and a half away; it may only work on Windows 8.1 (or only Windows 9?). - Uncertainty: no public API or documentation available (though this video is quite enlightening).
  • OpenGL - Benefit: it's here. Not as powerful as the other two, but we know how to stay on the fast track for minimal API overhead, and the work can be reused for D3D11. - Risk: requires more work, and OpenGL's evolution is rather slow and could stagnate compared to the other two. - Uncertainty: work done to adapt to OpenGL may be wasted if D3D12 or Mantle wins by a big margin.
As someone who has studied management, I'd say the worst thing to do here is not making a decision. There is no right answer here; just weighing options, gambling, and taking a risk.
I'm personally deciding for OpenGL. It's the only one we can work with right now, it can aim at most of the platforms we target (mobile, desktop, Windows XP/7, Linux & Mac), and work on it helps D3D11 too. Time will judge whether I'm wrong.

Even if I considered OpenGL my enemy and DX12 and Mantle my allies, today I know far more about my enemy than about my supposed allies.
lunkhound
Gremlin
Posts: 169
Joined: Sun Apr 29, 2012 1:03 am
Location: Santa Monica, California
x 19

Re: Vertex & Index Buffer refactoring

Post by lunkhound »

Yep, Mantle is not worth investing any effort in as long as it remains AMD-only. D3D12 won't be very useful as long as it is locked to Windows 8 or later. Even D3D11.1 and 11.2 are not really worth bothering with at the moment, for the same reason.
The only APIs worth bothering with for the immediate future are the various OpenGLs and D3D11.

Here's a question: If we switch OGRE over to doing this VAO thing, how confident are we that it will work across all OpenGL implementations that are of interest?

Another question: will this VAO stuff make OGRE's instancing stuff obsolete?
dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

D3D11.1 & 11.2 have minor (although important for some) features. I'm pretty sure Assaf will be maintaining them since he needs them. Not really a big deal.
lunkhound wrote:If we switch OGRE over to doing this VAO thing, how confident are we that it will work across all OpenGL implementations that are of interest?
Very. Even on GL implementations without VAOs (e.g. GL ES 2.0 without extensions, or Direct3D) we benefit from having one huge buffer (which means infrequent binding & rebinding of buffers).
lunkhound wrote:will this VAO stuff make OGRE's instancing stuff obsolete?
Yes and no. If we're lucky, it will make it unnecessary, as instancing would be automatic (and on GL4 cards, even instancing of different meshes). No need to set anything up.
However, the InstanceManager can make a few assumptions (i.e. it knows beforehand whether all its objects have skeletons; it can avoid, through hierarchy culling, submitting all of its instances if the whole batch is not visible; etc.) and may keep its niche uses.
Zonder
Ogre Magi
Posts: 1168
Joined: Mon Aug 04, 2008 7:51 pm
Location: Manchester - England
x 73

Re: Vertex & Index Buffer refactoring

Post by Zonder »

After reading through the stuff so far I think it would be daft not to implement VAO.
There are 10 types of people in the world: Those who understand binary, and those who don't...
holocronweaver
Google Summer of Code Student
Posts: 273
Joined: Mon Oct 29, 2012 8:52 pm
Location: Princeton, NJ
x 47

Re: Vertex & Index Buffer refactoring

Post by holocronweaver »

Just want to mention that I am finishing up a pure OpenGL sample which demonstrates my suggested approach to indexed draws (via MultiDrawElementsIndirect). It should be done Friday or Saturday.
gsellers
Gnoblar
Posts: 3
Joined: Thu May 29, 2014 2:46 am
x 7

Re: Vertex & Index Buffer refactoring

Post by gsellers »

I applaud this effort.

Here are a few high level thoughts:

1) You don't need to support multiple APIs on the same target hardware. If it were me, I'd make a mobile renderer with OpenGL ES with limited capabilities and a desktop renderer with all the bells and whistles for OpenGL 4.3+ and whatever extensions you need. I'd actually drop DX11, and certainly drop DX versions prior. Implementing a DX12 or Mantle renderer as well is just going to be duplicated effort for limited gain on a reduced target platform set. In addition to the extra effort, maintaining support for multiple APIs will mean one of three things: a lowest-common-denominator feature set, an inconsistent feature set where some features are available on one API and not another, or one API significantly outperforming another as the second attempts to emulate the first.

2) You need to be aggressive. The stuff we've been advocating - batching, etc. - isn't a workaround for the API. It's really about how the hardware actually works. State changes really are quite expensive. That expense varies from vendor to vendor and what's more expensive might also be different between them, but messing with new APIs really only addresses the software cost of state updates, not the hardware cost. Traditional APIs which generally have a function call per state change encourage bad behavior as seen by the GPU. Wrapping blobs of state into state objects or pushing the work of building them onto other threads only addresses the CPU side of the problem. The GPU still eats the same work. In some cases, it will eat more - the big, monolithic state object approach is likely to push a lot of redundancy into the pipe because a large number of states will be the same between objects. Feeding the GPU efficiently is going to require engine-level architectural changes that might be hard to hide behind a rendering abstraction layer.

3) While batching is important and we've been talking a lot about uber-shaders, multi-draw indirect, and that stuff, there comes a point of diminishing returns. You don't have to be able to draw the whole scene in a single API call. There's going to be a cutoff point where the GPU can hide the state change overhead behind useful work. On recent AMD hardware, that's going to be the equivalent of around 50K vertices, 400K pixels or somesuch, unless you're bound by something else (ALU, texel fetch, etc.). If you can do at least that much work between state changes, then those state changes won't be the bottleneck.

Now, on the topic of VAO, vertex buffers, memory management and that stuff. Here are the key points:

1) Memory - in particular represented by buffers - is just memory. Think of a buffer object as a big memory allocation. You can put whatever you want in there. There is nothing to stop you from putting vertex data, indices, and even constants (via UBO) in the same buffer. Don't say "{vertex buffer} object", say "vertex {buffer object}"... the stress on the words can make all the difference.

2) There is nothing to stop you from pointing multiple VAOs at the same buffer object(s).

3) Zero copy is _really_ important. I can't stress that enough. A lot of people have been going on about asynchronous uploads, DMA queues and that sort of thing. Those are great ways to deal with copies. The one thing that's faster than the fastest copy in the world is not having to do that copy at all. That's why we put persistent maps into buffer storage. If you want to load data into a buffer object, put it into a persistent mapped buffer and then do CopyBufferSubData. Likewise, you can persistent map a PBO and do TexSubImage out of it. The driver will choose the best method of getting your data into place. If you can use the data in-place, then great!
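
In code, that upload path looks roughly like this (a sketch; STAGING_SIZE, pooledVbo and dstOffset are placeholders):

Code:

// Created once: a persistently mapped staging buffer.
GLuint stagingVbo;
glGenBuffers( 1, &stagingVbo );
glBindBuffer( GL_COPY_READ_BUFFER, stagingVbo );
const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                         GL_MAP_COHERENT_BIT;
glBufferStorage( GL_COPY_READ_BUFFER, STAGING_SIZE, 0, flags );
void *stagingPtr = glMapBufferRange( GL_COPY_READ_BUFFER, 0,
                                     STAGING_SIZE, flags );

// Per upload: write into the persistent map, then let the driver place it.
memcpy( stagingPtr, vertexData, vertexDataBytes );
glBindBuffer( GL_COPY_WRITE_BUFFER, pooledVbo );
glCopyBufferSubData( GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                     0 /*readOffset*/, dstOffset, vertexDataBytes );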

Given this, what I would do is create a manager that deals with GPU memory allocations in pools using buffer objects. It would allocate large buffers (say 128MB at a time or so - maybe larger, maybe smaller) and when a request for GPU memory comes in, hand back an object that contains a reference to a buffer object and an offset. If you can't satisfy an allocation request out of a particular buffer, allocate a new one. Put all the data for a particular item in the engine into chunks allocated this way - very likely landing in the same buffer object.
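
A bare-bones sketch of such a manager (all names are made up; it ignores freeing, thread safety and usage categories):

Code:

#include <vector>

struct BufferLease
{
    GLuint bufferObject; // the big shared buffer this lease lives in
    size_t offset;       // byte offset of the allocation inside it
};

class GpuPoolAllocator
{
    static const size_t PoolSize = 128u * 1024u * 1024u; // 128MB chunks
    std::vector<GLuint> mPools;
    size_t mTop = PoolSize; // forces a pool on the first request

    void newPool()
    {
        GLuint vbo;
        glGenBuffers( 1, &vbo );
        glBindBuffer( GL_ARRAY_BUFFER, vbo );
        // Immutable, device-local storage for STATIC-style data.
        glBufferStorage( GL_ARRAY_BUFFER, PoolSize, 0, 0 );
        mPools.push_back( vbo );
    }

public:
    BufferLease allocate( size_t bytes, size_t alignment )
    {
        size_t top = ( mTop + alignment - 1 ) / alignment * alignment;
        if( mPools.empty() || top + bytes > PoolSize )
        {
            newPool(); // couldn't satisfy the request: open a new pool
            top = 0;
        }
        mTop = top + bytes;
        return BufferLease{ mPools.back(), top };
    }
};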

For VAO, you can create one VAO _per vertex format_, all pointed at offset zero in the same buffer objects. You do not need to create a VAO per format/buffer permutation because there is only one buffer. The base vertex parameter to indexed draw calls can be used to start at the appropriate offset in the buffer. You only need to make sure that the offsets at which you store the vertex attribute data in the buffers are aligned correctly to the stride of the vertex attribute specified in the VAO. Likewise, you can store the index data for a mesh in the same buffer as its vertex data, using the <indices> parameter to a regular DrawElements* call or firstIndex in the *DrawElementsIndirect call.

Vertex fetch is rarely a bottleneck. It's almost certainly going to be a win if you can reduce the number of VAO switches even at the expense of fetching unused vertex data, for example. It'd be good to settle on a small number of vertex formats or layouts. Of course, that's an application issue, but in practice, it seems that most applications do this even if they end up using multiple VAOs with the same format (or don't use VAO at all).

Even if you can bake all your vertex data together, you're still left with two big ticket items: textures and shaders. For textures, bindless is your friend. Commit to that. There's discussion of DX12 and Mantle here - anything that can run those can do bindless. Now, the bindless handles end up in the UBO (which is baked together with the vertex and index data, right?), so that's effectively a solved problem if you can commit to that hardware level. Array textures also open up some possibilities. There's a couple of ways to deal with array textures:

In the first method, you could have pre-bake step in asset build that takes all the textures used in certain segments of a scene and packs them into array textures ahead of time. This is then loaded by the engine. The second idea is to combine with sparse texture, and have your texture manager dynamically allocate layers of texture on the fly. This is similar to the buffer object approach. When you need a texture, ask the texture manager to allocate it for you. It creates very large, un-populated array textures ahead of time and at run-time, selects a texture of the appropriate dimension, finds an unused array layer and makes it resident. It then hands back the texture object and the base layer to the rest of the engine. If you need textures of multiple formats, you can always create texture views to alias the data. If you have chunks of engine code that expect regular non-array textures for some reason, you can hand back a non-array view of the appropriate layer. You'd end up with one or more (array) textures for each size of texture. This doesn't even need to be bindless. You have 16 texture units. If you have 16 or fewer unique size/format permutations at one time, you can statically bind the arrays and be done.
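
A sketch of the dynamic variant (names are made up; a real version would add the sparse-texture bits and recycle freed layers):

Code:

struct TextureLease
{
    GLuint texArray; // the shared GL_TEXTURE_2D_ARRAY
    GLint  layer;    // the slice handed to the caller
};

class TexturePool // one pool per size/format permutation
{
    GLuint  mTexArray;
    GLint   mNextFreeLayer = 0;
    static const GLsizei NumLayers = 64;

public:
    TexturePool( GLsizei width, GLsizei height, GLenum format )
    {
        glGenTextures( 1, &mTexArray );
        glBindTexture( GL_TEXTURE_2D_ARRAY, mTexArray );
        // With ARB_sparse_texture you'd also set GL_TEXTURE_SPARSE_ARB
        // and commit individual layers on demand instead of up front.
        glTexStorage3D( GL_TEXTURE_2D_ARRAY, 1, format,
                        width, height, NumLayers );
    }

    TextureLease allocate()
    {
        // Hand back the texture object and the layer the caller may use;
        // a non-array view of the layer could be made with glTextureView.
        return TextureLease{ mTexArray, mNextFreeLayer++ };
    }
};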

The shader thing is more of an issue. Right now, we don't have an answer beyond uber-shaders. It's not as bad as it seems, though. For starters, branching is not nearly as bad as it once was. In particular, if branching is coherent based on the value of a constant (pulled from a UBO, of course), then the cost is negligible. You don't need one huge uber-shader, but a small palette of shaders, where each is effectively a number of similarly complex shaders merged together. Again, an offline tool should be able to deal with this for you - take all the shaders, sort by relative complexity, rename entry points and merge, and generate a stub entry point that branches to the correct variant.
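
A toy example of what such a generated palette shader could look like (the variant bodies here are placeholders):

Code:

#version 430

// Per-draw constant selecting the variant; since the branch is coherent
// on a value pulled from a UBO, its cost is negligible.
layout(binding = 0) uniform DrawParams
{
    uint materialVariant;
};

out vec4 fragColour;

// Formerly separate shaders, renamed and merged by the offline tool.
vec4 shadeStandard()  { return vec4( 1.0, 0.0, 0.0, 1.0 ); /* ... */ }
vec4 shadeDetailMap() { return vec4( 0.0, 1.0, 0.0, 1.0 ); /* ... */ }

// Generated stub entry point: branch to the correct variant.
void main()
{
    if( materialVariant == 0u )
        fragColour = shadeStandard();
    else
        fragColour = shadeDetailMap();
}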

I think you'd be surprised what can be achieved with very few shaders. In particular, this kind of approach is well suited to fully deferred, Forward+, and tiled-deferred architected rendering. In these kinds of renderers, the first pass is generally very similar for all objects. There may be changes in the front end (especially if you start getting into tessellation and that kind of stuff), but the back end shader generally fetches a few textures and fills some g-buffers. This is where the "many small draws" occur, and is ideal for MultiDrawIndirect. The lighting passes are also similar. Often, there's a shader permutation per screen-space region, or per light combination (or count), or something. These passes draw over large screen space areas and so it shouldn't be an issue to make a couple of state changes between.

Now, even for regular forward rendering, there can be big wins here. Consider a depth-only pre-pass. This doesn't generally even use the fragment shader. Push all the meshes together and draw the entire depth pre-pass with a single call to MultiDraw*Indirect. You can use that same indirect command buffer later without re-traversing the scene hierarchy, even if you don't draw it all at once. For each state bucket, keep a list of ranges of where those draws lie in the indirect commands and issue a bunch of indirect draws. Beware, though, that there's a startup cost to the indirect draws where the GPU has to start fetching the commands. If you can keep a system memory copy of the commands, you might want to send down regular draws when the indirect count would be less than some threshold.
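
In GL 4.3+, the single-call pre-pass looks roughly like this (buffer and count names are placeholders):

Code:

// The command layout GL expects inside the GL_DRAW_INDIRECT_BUFFER.
struct DrawElementsIndirectCommand
{
    GLuint count;         // number of indices for this mesh
    GLuint instanceCount; // usually 1
    GLuint firstIndex;    // offset into the shared index buffer
    GLint  baseVertex;    // offset into the shared vertex buffer
    GLuint baseInstance;
};

// One command per visible mesh was written into indirectBuffer while
// traversing the scene; the whole depth pre-pass is then one call:
glBindVertexArray( depthOnlyVao ); // one VAO for the entire pass
glBindBuffer( GL_DRAW_INDIRECT_BUFFER, indirectBuffer );
glMultiDrawElementsIndirect( GL_TRIANGLES, GL_UNSIGNED_INT,
                             0,                // offset into indirect buffer
                             numVisibleMeshes, // drawcount
                             0 );              // stride 0 = tightly packed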

Most of this boils down to sorting into buckets and then rendering in state order. It's the classic "batch, batch, batch", which still holds today. We're just trying to make it easier to do that (bindless, arrays, sparse, etc.), and when you _do_ do that, make it fast to render those batches (multi-draw, etc.). A lot of render states are also going to be order independent. The obvious stuff is blending off, depth test on, etc.. However, even with blending on, many states are commutative (ADD, SRC_ALPHA, ONE). The engine should be able to keep track of when those states are order independent. Besides, most of the time, you'd want to sort geometry and at least draw the transparent stuff back to front, after the opaque geometry, so presumably there's some kind of state sorting anyway. If you are doing OIT, then it's presumably order-independent anyway.
dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

Whoa! Thanks for taking the time for such a detailed explanation.
gsellers wrote:1) You don't need to support multiple APIs on the same target hardware. If it were me, I'd make a mobile renderer with OpenGL ES with limited capabilities and a desktop renderer with all the bells and whistles for OpenGL 4.3+ and whatever extensions you need. I'd actually drop DX11, and certainly drop DX versions prior.
Well, I'd agree, but AFAIK the only way to render to Windows Phone 8 is through DirectX. Also, I know from Assaf (another Ogre dev) that DX11.1 has so far been the only API that allowed him to render VSync'd, tear-free and stutter-free in a multi-monitor setup on Windows 8.1 (the only OS that allowed this).
I'm not keen on the details, but his application is not mass market (not games).
Wrapping blobs of state into state objects or pushing the work of building them onto other threads only addresses the CPU side of the problem. The GPU still eats the same work. In some cases, it will eat more - the big, monolithic state object approach is likely to push a lot of redundancy into the pipe because a large number of states will be the same between objects. Feeding the GPU efficiently is going to require engine-level architectural changes that might be hard to hide behind a rendering abstraction layer.
Thanks for describing this. It's really useful for understanding a few things about the hardware, and good to hear something besides the typical DX talk that acts as if PSOs (big blobs, as you called them) would solve everything. To be honest, I suspected the big blobs could involve a bigger memcpy inside the GPU. Your words confirmed it :)
2) There is nothing to stop you from pointing multiple VAOs at the same buffer object(s).
Yep, I'm writing the new code around that.
Given this, what I would do is create a manager that deals with GPU memory allocations in pools using buffer objects. It would allocate large buffers (say 128MB at a time or so - maybe larger, maybe smaller) and when a request for GPU memory comes in, hand back an object that contains a reference to a buffer object and an offset. If you can't satisfy an allocation request out of a particular buffer, allocate a new one. Put all the data for a particular item in the engine into chunks allocated this way - very likely landing in the same buffer object.
Yes, this is exactly what I had in mind. Preallocate a huge chunk, then start leasing it out as requests come in (requests need to be preprocessed so that we have enough information on how to handle them; that's on our end). Interestingly, I was thinking of the magic number 128MB too.
3) Zero copy is _really_ important. I can't stress that enough. A lot of people have been going on about asynchronous uploads, DMA queues and that sort of thing. Those are great ways to deal with copies. The one thing that's faster than the fastest copy in the world is not having to do that copy at all. That's why we put persistent maps into buffer storage. If you want to load data into a buffer object, put it into a persistent mapped buffer and then do CopyBufferSubData. Likewise, you can persistent map a PBO and do TexSubImage out of it. The driver will choose the best method of getting your data into place. If you can use the data in-place, then great!
Thanks a lot!!! Yesterday I was thinking about exactly this problem (async uploads to and downloads from the GPU). I was staring blankly at the monitor wondering how to approach it. I knew how to do it in D3D11, but not in GL. While this article is great, I was disappointed that it mentions 3 different methods for downloading (28.4, 28.5 & 28.6), and the most promising last one was tested on AMD drivers that are now rather old. It's good to hear this is now the recommended approach (combined with persistent buffers).
For VAO, you can create one VAO _per vertex format_, all pointed at offset zero in the same buffer objects. You do not need to create a VAO per format/buffer permutation because there is only one buffer. The base vertex parameter to indexed draw calls can be used to start at the appropriate offset in the buffer. You only need to make sure that the offsets at which you store the vertex attribute data in the buffers are aligned correctly to the stride of the vertex attribute specified in the VAO. Likewise, you can store the index data for a mesh in the same buffer as its vertex data, using the <indices> parameter to a regular DrawElements* call or firstIndex in the *DrawElementsIndirect call.
YOU BLEW MY MIND :o :o :o
I did not notice the VAO was that flexible (actually, it seemed rather inflexible to me). Again, yesterday I was met with the problem that if mesh A uses two vertex buffers (i.e. one for Position & Normals -> stride = 24, another for UVs -> stride = 8) and mesh B uses one vertex buffer (stride = 32), we would need 3 pools of VBOs as the offsets would not match (I'm talking about vertexStart & vertexCount). But you made me realize I can still put them all in the same pool: by switching VAOs/declarations the offsets can now match, and I may just need to add some padding to make the start of each stride line up.
Also, the index buffer living in the VBO is another mind-blower.
Array textures also open up some possibilities. There's a couple of ways to deal with array textures:
In the first method, you could have pre-bake step in asset build that takes all the textures used in certain segments of a scene and packs them into array textures ahead of time. This is then loaded by the engine. (...)
That is exactly what the HlmsTextureManager currently does (still in development). I'll be honest: the array section hasn't been tested yet. I like this approach because it scales to mobile very well: on mobile we use a UV atlas and pile new textures next to each other; on desktop we just use the next free slice. With this approach the number of texture rebinds drops dramatically.
Besides, I don't need to be a rocket scientist to realize that bindless looks great and easy on paper but requires an extra level of indirection, while an array's memory location can be calculated.
The shader thing is more of an issue (...)
I am honestly not worried about it, since the trend is to use "one shader to rule them all" thanks to PBS approaches and the ability of modern hardware to index almost anything in memory at runtime. Even with non-photorealistic rendering, artistic styles rely on being consistent, so it's not much of a big deal (i.e. I'm not expecting many different shaders).
Furthermore, the Hlms handles creating different shaders on the fly if required (pretty much what drivers did 10 years ago).
Push all the meshes together and draw the entire depth pre-pass with a single call to MultiDraw*Indirect. You can use that same indirect command buffer later without re-traversing the scene hierarchy, even if you don't draw it all at once.
Not a mind-blower, but when you say it out loud it's quite striking. Mind you, it's probably going to need two calls, because alpha-tested objects are a common way to render some transparent-looking objects with depth. But I get the idea.
It also makes me wonder about Forward+. This could certainly reduce the disadvantage of the pre-pass.
I was planning to implement an idea of mine for a new forward lighting approach, similar to Forward+, that doesn't require the pre-pass; but I need to write the proof of concept first.
The engine should be able to keep track of when those states are order independent.
About that... something that is not entirely clear to me: does MultiDrawIndirect respect the order in which the DrawArraysIndirectCommand array is sorted, or may the GPU/driver render in any random order?

Thanks a lot for this helpful post. It was very useful in sorting out a few implementation details that were stalling me.
gsellers
Gnoblar
Posts: 3
Joined: Thu May 29, 2014 2:46 am
x 7

Re: Vertex & Index Buffer refactoring

Post by gsellers »

dark_sylinc wrote:
The engine should be able to keep track of when those states are order independent.
About that... something that is not entirely clear to me: does MultiDrawIndirect respect the order in which the DrawArraysIndirectCommand array is sorted, or may the GPU/driver render in any random order?
It will render in the order that the commands appear in the buffer. It is exactly as if you had made the same sequence of draws in regular old GL. You can even sort the commands in the buffer using a compute shader or something to get a more efficient draw order.
dark_sylinc wrote:Thanks a lot for this helpful post. It was very useful in sorting out a few implementation details that were stalling me.
Sure, no problem. Let me know if you need anything else.
slime73
Gnoblar
Posts: 1
Joined: Fri May 30, 2014 10:38 pm

Re: Vertex & Index Buffer refactoring

Post by slime73 »

dark_sylinc wrote:Well, I'd agree, but AFAIK the only way to render to Windows Phone 8 is through DirectX.
Microsoft has developed a fork of ANGLE which works on WP8 / WinRT: https://github.com/MSOpenTech/angle
TheSHEEEP
OGRE Retired Team Member
Posts: 972
Joined: Mon Jun 02, 2008 6:52 pm
Location: Berlin
x 65

Re: Vertex & Index Buffer refactoring

Post by TheSHEEEP »

Reading this thread increases anyone's knowledge by +2.
That is truly awesome.

That MS ANGLE version is interesting. Is it used somewhere?
My site! - Have a look :)
Also on Twitter - extra fluffy
dark_sylinc
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

gsellers wrote:Sure, no problem. Let me know if you need anything else.
Thanks! Whether MultiDrawIndirect respected the order was really bugging me.

I've started writing the GL-specific code for the buffer management, and I've got a couple of follow-up questions:

1. It is normally advised to split certain vertex elements into multiple buffers. In our case that means one VBO with multiple offsets in the VAO (unless you see a different approach).
For example, detaching position from the rest. The reasoning is that it should perform faster in shadow mapping passes and early Z pre-passes, while having a negligible impact during the normal pass. Sometimes the reason is that one of the attributes is calculated on the CPU and sent every frame.

So at some point in the buffer we would have "pos pos pos pos pos pos" and at another offset "uv uv uv uv uv uv", and so on.
What I'm struggling with is that this doesn't bode well with the one-VBO, few-VAOs approach:
  • It requires us to reserve chunks for X amount of vertices. If we surpass the reserved amount, we'll need another chunk in which the distance from the start of the "position" attribute to the start of the "uv" attribute remains the same; otherwise we'll need another VAO.
  • If mesh A uses one buffer for Position and one for UVs, we can use the reserved chunks mentioned in the previous point. But if mesh B uses one buffer for Position + Normals and one for UVs, we'll have to use another chunk, because otherwise the distance between the buffers gets out of sync (it matters if we then create another mesh C whose buffer layout is exactly the same as mesh A's; unless we are careful to load meshes A and C first, but this is a lot of work and the information is not always available).
Maybe I'm overestimating the importance of separating elements, because like you said, applications are rarely vertex-fetch bound. But shadow mapping is very common and is the biggest risk of becoming vertex-fetch bound. And there's always going to be more than one Ogre user who wants this feature.

If I understand the ARB_shader_storage_buffer_object extension correctly, this isn't a problem there. I can upload the distances between each attribute (position, uvs, etc.) as uniforms and load the vertices by "reinterpreting their meaning" in the shader (in other words, making glVertexAttribPointer almost useless), something like this (not tested):

Code:

#version 430

// Binding point indices; VERTEX and DISTANCES are placeholder values.
#define VERTEX 0
#define DISTANCES 1

struct vertex
{
    vec4 data;
};

layout(binding = VERTEX) buffer mesh
{
    vertex Vertex[];
} Mesh;

layout(binding = DISTANCES) uniform distances
{
    //int distPosition; //not needed, equals 0
    int distNormals;
    int distUv;
} Distances;

void main()
{
    vec3 vPosition = Mesh.Vertex[gl_VertexID].data.xyz;
    vec3 vNormals  = Mesh.Vertex[gl_VertexID + Distances.distNormals].data.xyz;
    vec2 UVs       = Mesh.Vertex[gl_VertexID + Distances.distUv].data.xy;
    /* ... */
}
If my understanding is correct, I can live with this. The gains from having one VAO, one VBO, and a packed vertex layout far outweigh decoupling the vertex layout so that shadow map passes end up a little faster (which may be negated by regular passes being a little slower...).

And by the time we need to push more performance, we can use this ARB_shader_storage_buffer_object pattern to decouple the vertex layout. I'm thinking long term here: eventually 50K draws per frame is going to sound like too little, and the bottleneck will be somewhere else.

So, the questions are...
  1. Is there anything that would help prevent VAO and memory management from becoming a nightmare if these "decoupled" vertex layouts are required?
  2. Am I correct that ARB_shader_storage_buffer_object could be used in the way I described?
2. PERSISTENT STORAGE.
I've read the ARB_buffer_storage spec, read the wiki, and analyzed G-Truc's samples. I'm going to talk here from a client-to-server transfer perspective (i.e. CPU writes to GPU).
So far I see there are two types of persistent storage: coherent and non-coherent. Unfortunately, G-Truc's code doesn't include examples of the coherent versions.

If I've understood it correctly, with non-coherent mapping I must call glFlushMappedBufferRange after I'm done updating the buffer. And that's it. On the next frame I'll update the same region of memory again and call glFlushMappedBufferRange. Repeat every frame. This strongly suggests that the persistent map I'm getting is not truly a pointer to GPU memory, but rather some CPU driver memory that gets memcpy'd on glFlushMappedBufferRange (or a GPU pointer that gets memcpy'd GPU to GPU).
The advantage here is that there is no need to ask the OS to allocate a virtual address for every map (and if the memory is on the GPU, the GPU-to-GPU memcpy enjoys the faster transfer rates of GDDR). Another advantage is that I don't need to care about synchronization at all.
The disadvantage is that this memcpy is not free (whether GPU-GPU or CPU-GPU), and I'm at the whim of poorly implemented drivers, which (worst case) could stall me on every glFlushMappedBufferRange. In other words, this is just an upload through a staging buffer and a copy, masked behind persistently mapped pointers.
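
In code, I picture the non-coherent flow like this (untested; sizes and names are made up):

Code:

// Create and persistently map a non-coherent buffer, once.
const GLbitfield storageFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT;
glBufferStorage( GL_ARRAY_BUFFER, bufferSize, 0, storageFlags );
void *ptr = glMapBufferRange( GL_ARRAY_BUFFER, 0, bufferSize,
                              storageFlags | GL_MAP_FLUSH_EXPLICIT_BIT );

// Every frame: write, then tell the driver which range changed.
memcpy( ptr, cpuData, bytesWritten );
glFlushMappedBufferRange( GL_ARRAY_BUFFER, 0, bytesWritten );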

At the other end is coherent mapping. I need to allocate ~3x the memory and use fences and wait objects to avoid overwriting memory that may currently be in use by the GPU. There is *nothing* going on behind my back. I'm truly getting a pointer to the GPU resource. I don't need to call anything, glFlushMappedBufferRange or any other function; everything is instantaneous (which makes trying to read GPU-to-CPU from this kind of object a minefield of fences and memory barriers; better not to do it).

The advantage is that there is no hidden memcpy: it's the fastest path available for highly dynamic data.
The disadvantage is that I need to track resource usage and use fences. Sure, poor driver implementations could stall me on every call to glFenceSync or glClientWaitSync; but the former would be a surprise if it happened (really? stalling on a glFenceSync?) and, most importantly, the latter IS expected to possibly stall (even genuinely, if we're GPU bound), but at least I'm in control of when and where that happens. Also, I'll most likely end up having 3 fences, one per frame (up to 3 frames in flight) for all objects. The driver cannot screw me up too much. I like that.
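
The coherent flow with the 3x buffer and fences, as I picture it (untested sketch; frameSize etc. are made up):

Code:

const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                         GL_MAP_COHERENT_BIT;
glBufferStorage( GL_ARRAY_BUFFER, 3 * frameSize, 0, flags );
char *base = (char *)glMapBufferRange( GL_ARRAY_BUFFER, 0,
                                       3 * frameSize, flags );
GLsync fences[3] = { 0, 0, 0 };
int frame = 0;

// Each frame: wait until the GPU has released this third of the buffer,
// write into it (no flush needed), draw, then fence it.
if( fences[frame] )
{
    glClientWaitSync( fences[frame], GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX );
    glDeleteSync( fences[frame] );
}
memcpy( base + frame * frameSize, cpuData, frameSize );
// ... issue the draws that read from this region ...
fences[frame] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
frame = ( frame + 1 ) % 3;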

Another disadvantage I see is that coherent persistent buffers cannot be accurately traced by tools like VOGL, apitrace, etc. Sure, they can make aggressive copies on every glDraw* call, use heuristics, or put hardware write breakpoints at certain region intervals to trap changes to the buffer; but they may not be able to faithfully record exactly what I wrote to the buffer without a severe performance impact (especially the bugs, i.e. writing to a location without checking whether the GPU could be using it; with a bit of bad luck, a race condition could go unnoticed for a long time).
As far as I can see, coherent mapping doesn't seem tracer-friendly. There is no way for me to notify an attached tracer that I'm done modifying a region of a persistent buffer.
In other words, it would be wise to leave a toggle so the code can switch to persistent non-coherent mapping (with calls to glFlushMappedBufferRange) to check for bugs and be friendly to GL tracers.

Are all of these understandings and deductions about persistent storage correct?

Thanks in advance,
Matias
Owen
Google Summer of Code Student
Posts: 91
Joined: Mon May 01, 2006 11:36 am
x 21

Re: Vertex & Index Buffer refactoring

Post by Owen »

dark_sylinc wrote:1. It is normally advised to split certain vertex elements into multiple buffers. (...)
So, the questions are...
  1. Is there anything that would help prevent VAO and memory management from becoming a nightmare if these "decoupled" vertex layouts are required?
  2. Am I correct that ARB_shader_storage_buffer_object could be used in the way I described?
Why not use two VBOs? The cost of binding both shouldn't be too expensive, to my understanding, especially if we are doing it rarely?

(Incidentally, I think this kind of thing highlights some of the complaints about VAOs being tightly bound to specific VBOs. If they weren't, it would be a case of
bind VAO
foreach(object) bind VBO, bind uniforms, render

Which is what we effectively end up simulating. Yes, the approach we end up taking is cheaper, but much of the cost is because of things which should be dropped from the spec.)
dark_sylinc wrote:2. PERSISTENT STORAGE.
(...)
Are all of these understandings and deductions about persistent storage correct?
Things to be aware of:
  • Coherent storage probably has a cost. Either it will be in CPU RAM (CLIENT_STORAGE_BIT or driver whim), in which case CPU caches will work (on x86 platforms* - all bets are off for ARM platforms with non-Mali GPUs, and possibly even then), or it will be in GPU RAM, and CPU caches will be disabled (with the possible exception of ARM platforms with Mali GPUs**). Basically, if you're doing dynamic updates, hit every variable once and in order if at all possible (to take advantage of the CPU write combiners, which have hopefully been enabled), or try to memcpy cache-line-sized slabs into the buffer.
  • The text of the buffer storage extension seems to imply that even when you don't pass the coherent flag, the pointer you get back might point at the actual buffer; the difference is the permitted cacheability of the RAM. I'd assume the driver might map the RAM with write-back caching (i.e. normal caching), so any writes and reads you do are unpredictable until the memory barrier and any fence sync operations are invoked.
  • I'd retry any attempt to create a coherent buffer as a non-coherent buffer before hard failing. It's not hard to imagine cases where coherent mapping might not be possible but non-coherent might be (GPUs with limited MMUs might not be able to coherently map large buffers when there isn't contiguous system RAM available, for example).
Assume coherent buffers will be faster if you're doing per-frame updates, non-coherent faster if you're doing more "occasional" stuff. Watch out for address space exhaustion on 32-bit platforms.
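
Something like this (rough sketch, error handling simplified; remember glBufferStorage can only be called once per buffer name, hence the recreate):

Code: Select all

// Try coherent first; recreate and retry non-coherent before hard failing.
GLuint makePersistentBuffer( GLsizeiptr size, bool *outIsCoherent )
{
    GLuint buffer;
    GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                       GL_MAP_COHERENT_BIT;
    glGenBuffers( 1, &buffer );
    glBindBuffer( GL_ARRAY_BUFFER, buffer );
    glBufferStorage( GL_ARRAY_BUFFER, size, 0, flags );
    if( glGetError() != GL_NO_ERROR )
    {
        // Immutable storage can only be specified once per buffer name,
        // so recreate it and retry without the coherent bit.
        glDeleteBuffers( 1, &buffer );
        glGenBuffers( 1, &buffer );
        glBindBuffer( GL_ARRAY_BUFFER, buffer );
        flags &= ~GL_MAP_COHERENT_BIT;
        glBufferStorage( GL_ARRAY_BUFFER, size, 0, flags );
    }
    *outIsCoherent = ( flags & GL_MAP_COHERENT_BIT ) != 0;
    return buffer;
}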

Do I understand correctly that HLMS can do sparse array textures on modern hardware now? That's cool.

* For x86 CPUs, DMA accesses to RAM are snooped by the caches.
** Mali GPUs implement AXI coherency, so should be cache coherent with ARM CPUs. I have no idea for other GPUs. Don't count on it.
I reference ARM in this post mainly for future-proofing reasons: this kind of stuff is likely to land in ES in the future, or full OpenGL will start appearing in mobile hardware (I'm looking at Tegra, even with how few sales I predict for it).
User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279
Contact:

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

Owen wrote:Why not use two VBOs? Binding both shouldn't be too expensive, to my understanding, especially if we are doing it rarely?
The problem remains the same. As soon as we start having vertex formats that are not contiguous (i.e. decoupled), we have a hard time maintaining the "one VAO per vertex format" philosophy.

It's not a really huge deal, but since I'm designing/writing the system that manages memory and buffers, I'd like to know some background so that I don't accidentally shoot myself in the foot.
Owen wrote:Things to be aware of:
  • Coherent storage probably has a cost. (...) CPU caches will be disabled (...)
I already know that. I'm counting on write combining to do its job. The question here is more about knowing whether non-coherent mapping works the way I believe it does.
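
For reference, the write pattern I have in mind (illustrative sketch; the names are made up):

Code: Select all

// Write-combiner friendly: one pass, in order, write-only.
float *dst = reinterpret_cast<float*>( persistentPtr );
memcpy( dst, srcVertices, numVertices * floatsPerVertex * sizeof(float) );

// What to avoid: scattered writes, or anything that reads back from the
// mapping; reads from write-combined memory are painfully slow.
//dst[i] += delta; // don't do this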
Owen wrote:
  • I'd retry any attempt to create a coherent buffer with a non-coherent buffer before hard failing. It's not hard to imagine cases where coherent mapping might not be possible but non-coherent might be (GPUs with limited MMUs might not be able to coherently map large buffers when there isn't contiguous system RAM available, for example).
Good point. GPU memory is tiled, persistent mapping may force some linearity, etc. I'll take this into consideration.
Owen wrote:The text of the buffer storage extension seems to imply that even when you don't pass the coherent flag, the pointer you get back might be the actual buffer. The difference is the permitted cacheability of the RAM. I'd assume the driver might map the RAM with write-back caching (i.e. normal caching), so any writes and reads you do are unpredictable until the memory barrier and any fence sync operations are invoked.
Looking at G-Truc's code, it clearly says otherwise. Basically, it looks like this:

Code: Select all

void everyFrame()
{
    // Write straight through the persistently mapped pointer,
    // flush the written range, then draw.
    *uniformPtrPersistentlyMapped = worldViewProjMatrix;
    glFlushMappedBufferRange( ... );
    glDraw( ... );
}
If the GPU can't consume fast enough, "*uniformPtrPersistentlyMapped = worldViewProjMatrix;" will overwrite a buffer that is currently in use. The only way to prevent this is to either stall inside the previous glFlushMappedBufferRange (or previous glDraw) or copy the contents from uniformPtrPersistentlyMapped to another buffer when glFlushMappedBufferRange gets called.
Unless the driver remaps the virtual address of uniformPtrPersistentlyMapped to a different physical region on the fly (I highly doubt this is happening).
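
If one wanted to keep a single region and still be safe, the loop would need an explicit fence; something like this (untested sketch):

Code: Select all

// Serializing fix: fence after the draw, wait before the next overwrite.
GLsync drawFence = 0;

void everyFrame()
{
    if( drawFence )
    {
        // Blocks until the GPU has consumed last frame's data.
        glClientWaitSync( drawFence, GL_SYNC_FLUSH_COMMANDS_BIT,
                          1000000000 ); // 1 second cap
        glDeleteSync( drawFence );
    }
    *uniformPtrPersistentlyMapped = worldViewProjMatrix;
    glFlushMappedBufferRange( ... );
    glDraw( ... );
    drawFence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
}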
Owen
Google Summer of Code Student
Google Summer of Code Student
Posts: 91
Joined: Mon May 01, 2006 11:36 am
x 21

Re: Vertex & Index Buffer refactoring

Post by Owen »

Does G-Truc's example feature some form of implicit flush? (e.g. a glFinish (IIRC?) between frames would force serialization)
User avatar
holocronweaver
Google Summer of Code Student
Google Summer of Code Student
Posts: 273
Joined: Mon Oct 29, 2012 8:52 pm
Location: Princeton, NJ
x 47

Re: Vertex & Index Buffer refactoring

Post by holocronweaver »

It seems gtruc has already implemented my basic approach to accessing vertex data via shader storage buffers - see his gl-440-multi-draw-indirect-*-arb.cpp examples and accompanying shaders. One version uses gl_DrawID, while another uses a custom-rolled draw ID (the latter shouldn't be necessary on Radeon 5000+ and GTX 400+, IIRC, which is roughly the target audience for GL3+ and OGRE 2.0).
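
For completeness, the custom-rolled draw ID boils down to a per-instance integer attribute stepped via baseInstance (rough sketch; the attribute location and buffer names are made up):

Code: Select all

// Host side: a buffer containing 0, 1, 2, ..., N-1 fed as a per-instance
// attribute; each indirect command's baseInstance selects the right
// element. In the shader: layout(location = 7) in int drawId;
glBindBuffer( GL_ARRAY_BUFFER, drawIdBuffer );
glVertexAttribIPointer( 7, 1, GL_INT, 0, 0 );
glVertexAttribDivisor( 7, 1 ); // advance once per instance, not per vertex
glEnableVertexAttribArray( 7 );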

A buffer is packed with the indices and vertex data of every renderable which is to be drawn (in this case they are separate buffers, but they don't have to be). The index buffer is bound to a VAO while the vertex buffer is bound to a shader storage block. Another buffer is filled with a struct of parameters for each draw:

Code: Select all

typedef struct {
    // Number of indices (a.k.a. elements of the index array).
    GLuint count;
    // Number of instances to render. Per-instance data is indexed in the
    // shader using gl_InstanceID.
    GLuint instanceCount;
    // Offset into the index buffer for this renderable.
    GLuint firstIndex;
    // Offset into the vertex buffer for this renderable (signed, per the
    // GL spec).
    GLint  baseVertex;
    // Offset into the vertex attribute buffers for the first instance.
    GLuint baseInstance;
} DrawElementsIndirectCommand;
Then a MultiDrawElementsIndirect command is called using the index and draw parameter buffers as follows:

Code: Select all

void glMultiDrawElementsIndirect(GLenum mode, GLenum type, const void *indirect, GLsizei drawcount, GLsizei stride);
Here we have the primitive type (mode), index data type (type), offset into draw parameters buffer (indirect), the number of draws in the buffer (drawcount), and the stride of the draw parameters buffer (stride).
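
Putting it together on the host side looks roughly like this (sketch; buffer names are illustrative):

Code: Select all

// Bind the VAO (which owns the index buffer) and the command buffer,
// then issue every draw in a single call.
glBindVertexArray( vao );
glBindBuffer( GL_DRAW_INDIRECT_BUFFER, indirectBuffer );
glBufferData( GL_DRAW_INDIRECT_BUFFER,
              numDraws * sizeof(DrawElementsIndirectCommand),
              commands, GL_DYNAMIC_DRAW );
glMultiDrawElementsIndirect( GL_TRIANGLES, GL_UNSIGNED_INT,
                             0, // byte offset into the indirect buffer
                             numDraws,
                             sizeof(DrawElementsIndirectCommand) );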

The shader has access to three very useful indices:
  • gl_DrawID == renderable index
  • gl_VertexID == vertex index (from the indices bound to the VAO)
  • gl_InstanceID == instance index
With these, you can index any arrays available to the shader via uniform and shader storage blocks, including per-mesh data. Here is an example (adapted from gtruc):

Code: Select all

buffer indirection
{
    int Transform[DRAW_MAX];
} Indirection;

uniform transform
{
    mat4 MVP[DRAW_MAX];
} Transform;

struct vertex
{
    vec2 Position;
    vec2 Texcoord;
    vec3 Normal;
};

buffer mesh
{
    vertex Vertex[];
} Mesh;

// Only the last member of a storage block may be an unsized array, so the
// per-vertex flag lives in its own block.
buffer meshFlags
{
    bool Purty[];
} MeshFlags;

out block
{
    vec2 Texcoord;
    vec3 Normal;
    bool Purty;
} Out;

void main()
{
    Out.Texcoord = Mesh.Vertex[gl_VertexID].Texcoord.st;
    Out.Normal = Mesh.Vertex[gl_VertexID].Normal;
    Out.Purty = MeshFlags.Purty[gl_VertexID];
    gl_Position = Transform.MVP[Indirection.Transform[gl_DrawID]] * vec4(Mesh.Vertex[gl_VertexID].Position, 0.0, 1.0);
}
Note that this approach makes vertex attributes unnecessary and indeed annoying. The only thing the VAO should contain is indices.

Basically, we do not need an Index or VertexBuffer class. A buffer is a buffer is a buffer. What we DO need is Buffer and BufferManager abstract classes which render systems can implement and which OgreMain can call with buffer data and an intent (vertex, index, etc.), receiving back a Buffer object that acts as a virtual buffer (though in reality the data is usually packed inside a much larger buffer). Renderables contain a Buffer object for each associated property.
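
As a strawman, the interface could be as thin as this (names hypothetical, obviously not a final API):

Code: Select all

// A buffer is a buffer is a buffer: one virtual buffer type, one manager.
enum BufferIntent { BI_VERTEX, BI_INDEX, BI_INDIRECT, BI_SHADER_STORAGE };

class Buffer
{
public:
    virtual ~Buffer() {}
    // Byte offset of this virtual buffer inside the big packed one.
    virtual size_t getOffset() const = 0;
    virtual size_t getSizeBytes() const = 0;
    virtual void upload( const void *data, size_t bytes,
                         size_t dstOffset ) = 0;
};

class BufferManager
{
public:
    virtual ~BufferManager() {}
    // Render systems implement this; OgreMain passes data plus an intent
    // and receives a virtual buffer (usually backed by a much larger
    // real one).
    virtual Buffer* createBuffer( const void *data, size_t bytes,
                                  BufferIntent intent ) = 0;
};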
gsellers
Gnoblar
Posts: 3
Joined: Thu May 29, 2014 2:46 am
x 7

Re: Vertex & Index Buffer refactoring

Post by gsellers »

dark_sylinc wrote:
gsellers wrote:Sure, no problem. Let me know if you need anything else.
Thanks! Whether MultiDrawIndirect respected the order was really bugging me.

I've started writing the GL specific code for the buffer management, and I've got a couple of follow up questions:

1. It is normally advised to split certain vertex elements into multiple buffers. In our case, that's one VBO with multiple offsets in the VAO (unless you see a different approach).
For example, detaching position from the rest. The reasoning behind it is that it should perform faster on shadow mapping passes and the early Z pre-pass, while having a negligible impact during the normal pass. Sometimes the reason is that one of the attributes is calculated on the CPU and sent every frame.

So we would have at some point in the buffer "pos pos pos pos pos pos" and in another offset "uv uv uv uv uv uv", and so on.
What I'm struggling with is that this doesn't play well with the one-VBO, few-VAOs approach:
  • It requires us to reserve chunks for X amount of vertices. If we surpass the reserved amount, we'll need another chunk in which the distance from the start of the "position" attribute to the start of the "uv" attribute remains the same; otherwise we'll need another VAO.
  • If mesh A uses one buffer for position and one for UVs, we can use the reserved chunks mentioned in the previous point. But if mesh B uses one buffer for position + normals and one for UVs, we'll have to use another chunk, because otherwise the distance between the buffers gets out of sync (it matters if we then create another mesh C whose buffer layout is exactly the same as mesh A's; unless we are careful to load meshes A and C first, but this is a lot of work and this information is not always available).
Maybe I'm overestimating the importance of separating elements, because, like you said, applications are rarely vertex-fetch bound. But shadow mapping is very common and is the biggest risk of becoming vertex-fetch bound. And there's always going to be more than one Ogre user who will want this feature.
Let's assume that you know ahead of time how much vertex data you have. So, with interleaved attributes, you'd have "pos uv norm pos uv norm pos uv norm" or something. You'd allocate a big enough chunk for all attributes. What you really want to do is refactor the data such that it's "pos pos pos uv norm uv norm uv norm", with all the position data packed together at the beginning of the chunk, and everything else interleaved towards the end. Now, create _two_ VAOs. One has a single attribute, tightly packed just for the position data. The second has multiple attributes, position, uv, normal, etc. During the depth pre-pass, bind the "position-only" VAO, during the final pass(es), bind the "everything enabled" VAO. Like I said, we're not trying to collapse the whole scene into a single draw. We're just trying to minimize state updates.
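
In code, that refactored chunk with its two VAOs would look roughly like this (untested sketch; attribute locations and the uv+normal layout are assumptions):

Code: Select all

// One chunk: numVerts packed positions first, then interleaved uv+normal.
GLuint vaoPosOnly = 0, vaoFull = 0;
const GLsizeiptr posBytes    = numVerts * 3 * sizeof(float);
const GLsizei    interStride = ( 2 + 3 ) * sizeof(float); // uv + normal

glBindBuffer( GL_ARRAY_BUFFER, vbo );

// "Position-only" VAO for the depth pre-pass / shadow passes.
glGenVertexArrays( 1, &vaoPosOnly );
glBindVertexArray( vaoPosOnly );
glVertexAttribPointer( 0, 3, GL_FLOAT, GL_FALSE, 0, 0 );
glEnableVertexAttribArray( 0 );

// "Everything enabled" VAO for the final pass(es).
glGenVertexArrays( 1, &vaoFull );
glBindVertexArray( vaoFull );
glVertexAttribPointer( 0, 3, GL_FLOAT, GL_FALSE, 0, 0 );
glVertexAttribPointer( 1, 2, GL_FLOAT, GL_FALSE, interStride,
                       (const void*)( posBytes ) );                     // uv
glVertexAttribPointer( 2, 3, GL_FLOAT, GL_FALSE, interStride,
                       (const void*)( posBytes + 2 * sizeof(float) ) ); // normal
glEnableVertexAttribArray( 0 );
glEnableVertexAttribArray( 1 );
glEnableVertexAttribArray( 2 );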
dark_sylinc wrote:If I understand correctly, with the ARB_shader_storage_buffer_object extension this isn't a problem. I can load into uniforms the distances between each attribute (position, uvs, etc.) and load the vertices by "reinterpreting their meaning" in the shader (in other words, making glVertexAttribPointer almost useless), something like this (not tested):

Code: Select all

struct vertex
{
	vec4 data;
};

layout(binding = VERTEX) buffer mesh
{
	vertex Vertex[];
} Mesh;

layout(binding = DISTANCES) uniform distances
{
	//int distPosition; //not needed, equals 0
	int distNormals;
	int distUv;
} Distances;

void main()
{
	vec3 vPosition = Mesh.Vertex[gl_VertexID].data.xyz;
	vec3 vNormals = Mesh.Vertex[gl_VertexID + Distances.distNormals].data.xyz;
	vec2 UVs = Mesh.Vertex[gl_VertexID + Distances.distUv].data.xy;
	/* ... */
}
If my understanding is correct, I can live with this. The gains from having one VAO, one VBO, and a packed vertex layout far outweigh decoupling the vertex layout so that shadow map passes end up a little faster (which may be negated by regular passes being a little slower...).
I'm not sure what you're trying to achieve there. It doesn't seem right, though. You can't just alias data like that because "Distances.distNormals" is going to be in units of vec4, not bytes or anything. I guess it would work, but you'd be burning space between verts. Also, you'd have a hard time with anything but basic float data. I presume that you want to be able to support integers and lower bit-depth vertex attributes, right?
dark_sylinc wrote:And by the time we need to **push more** performance, we can use this ARB_shader_storage_buffer_object pattern to decouple the vertex layout. I'm thinking on the long term here, eventually 50K draws per frame is going to sound too little, and the bottleneck will be somewhere else.
Yes. That's likely, unless your draws are _really_ small and simple. However, given the simplest possible draw, I expect you can hit 100K draws/frame at 60Hz fairly easily. I'd shoot for 200K draws as a reasonable target with a pathologically simple shader.
dark_sylinc wrote:So, the questions are...
  1. Is there anything that would help prevent VAO and memory management from becoming a nightmare if these "decoupled" vertex layouts are required?
  2. Am I correct in that ARB_shader_storage_buffer_object could be used in the way I described?
I don't think this would affect memory management, no. It's going to be a problem if you want to physically switch vertex buffers without changing the vertex format, but that kind of defeats the point. The idea here is to use one buffer (or as few buffers as possible), avoid buffer switches, and use the various offset parameters that are already part of the API and basically free. You could use shader_storage like you describe, but it wouldn't be efficient. Given the point sampling implicit in buffer fetches, you could represent everything as integers and implement bit packing explicitly using the unpack* functions, which are mostly no-ops (they just reinterpret the data).
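
To illustrate, a rough GLSL sketch of that bit-packing idea (the 4-words-per-vertex layout is invented for the example; the unpack* built-ins are core in GLSL 4.30 via ARB_shading_language_packing):

Code: Select all

layout(std430, binding = 0) buffer packedMesh
{
    uint Words[]; // 4 words per vertex in this made-up layout
} Mesh;

void main()
{
    uint base = uint(gl_VertexID) * 4u;
    vec2 posXY  = unpackHalf2x16( Mesh.Words[base + 0u] );      // 2 halfs
    vec2 posZW  = unpackHalf2x16( Mesh.Words[base + 1u] );
    vec3 normal = unpackSnorm4x8( Mesh.Words[base + 2u] ).xyz;  // snorm8
    vec2 uv     = unpackUnorm2x16( Mesh.Words[base + 3u] );     // unorm16
    gl_Position = vec4( posXY, posZW.x, 1.0 );
    // normal and uv would be passed on to the next stage here.
}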
dark_sylinc wrote: 2. PERSISTENT STORAGE.
I've read the ARB_buffer_storage spec, read the Wiki, and analyzed G-Truc's samples. I'm going to be talking here from a client-to-server transfer perspective (i.e. writing CPU to GPU).
So far I see there are two types of persistent storage: Coherent and Non-coherent. Unfortunately, gtruc's code doesn't include examples of coherent versions.

If I've understood it correctly, with non-coherent mapping I must call glFlushMappedBufferRange after I'm done updating the buffer. And that's it. On the next frame I'll update the same region of memory again and call glFlushMappedBufferRange. Repeat every frame. This strongly suggests that the persistent map I'm getting is not truly a pointer to GPU memory, but rather some CPU driver memory that gets memcpy'd on glFlushMappedBufferRange (or a GPU pointer that gets memcpy'd from GPU to GPU).
That's not necessarily the case. It may be, and that would be a reasonable implementation. However, it could just simply be a signal to the driver to flush CPU caches for writable regions and invalidate source caches on the GPU side so that the GPU 'sees' the updated data. If you're on a truly UMA (cache-coherent) system, this could be a no-op. On many platforms, there is some form of uncached, snooped or otherwise coherent mapping available that would be selected by a driver when you ask for a coherent map, but it would likely come with some non-negligible performance penalty that can be avoided with a non-coherent map and explicit flushes. That is the point of that flag.
dark_sylinc wrote:The advantage here is that there is no need to ask the OS to allocate a virtual address for every map (and if the memory is on GPU, GPU-to-GPU memcpy enjoys the faster transfer rates of GDDR). Another advantage is that I don't need to care about synchronization at all.
The disadvantage is that this memcpy is not free (whether GPU-GPU or CPU-GPU), and I'm at the whim of poorly implemented drivers, which (worst case) could stall me on every glFlushMappedBufferRange. In other words, this is just masking an upload using a staging buffer and a copy behind persistently mapped pointers.
Maybe, but probably not. In all likelihood, a non-coherent map is simply going to be the same as a coherent map, but with non-coherent caches enabled. It's unlikely that you'd see a full copy. In particular, after the Flush returns, you're entitled to trash the data. Any copy would have to have completed before then, and that would imply a fence, which would be bad for performance. I don't think anyone would implement it that way.
dark_sylinc wrote:On the other end, coherent mapping. I need to allocate ~3x memory, and use fences and wait objects to avoid overwriting memory that may currently be in use by the GPU. There is *nothing* going on behind my back. I'm truly getting a pointer to the GPU resource. I don't need to call anything: no glFlushMappedBufferRange or any other function. Everything is instantaneous (which makes trying to read GPU to CPU from this kind of object a minefield of fences and memory barriers; better not to do it).

The advantage is that there is no hidden memcpy; it's the fastest path available for highly dynamic data.
The disadvantage is that I need to track resource usage and use fences. Sure, poor driver implementations could stall me on every call to glFenceSync or glClientWaitSync; but the former would be a surprise if it happens (really? stalling on a glFenceSync?) and, most importantly, the latter IS expected to possibly stall (even genuinely, if we're GPU bound); but at least I'm in control of when and where this happens. Also, I'll most likely end up having 3 fences, one per frame (up to 3 frames) for all objects. The driver cannot screw me up too much. I like that.
This is basically what a driver does, though. The big difference is that the driver has to do it totally transparently to you, get it right _always_, no matter what the application does, without penalizing applications that don't do this, and with no foreknowledge of what might be about to be used. It's very, very hard, and given that we have to be correct, we're going to err on the side of conservatism rather than performance. Handing control to the application is the right thing to do here.
dark_sylinc wrote:Another disadvantage I see is that coherent persistent buffers cannot be accurately traced by tools like VOGL, apitrace, etc. Sure, they can use aggressive copies on every glDraw* call, heuristics, or put hardware write memory breakpoints at certain region intervals to trap changes to the buffer; but they may not be able to faithfully record exactly what I wrote to the buffer without severe performance impact (especially the bugs, i.e. writing to a location without checking whether the GPU could be using it. With a bit of bad luck, such a race condition could go unnoticed for a long time).
As far as I can see, coherent mapping doesn't seem to be tracer-friendly. There is no way for me to notify an attached tracer that I'm done modifying a region of a persistent buffer.
In other words, it would be wise for me to leave a toggle so that the code switches to using persistent non-coherent mapping (with calls to glFlushMappedBufferRange) to check for bugs and to be friendly with GL tracers.
Yep. Asynchronous behavior and UMA are going to be hard to trace, debug and so on. It's like single-stepping a multi-threaded program. Debuggers are good, but not that good, and the more convoluted you make things, the harder it is to debug. Such is the price of performance.
dark_sylinc wrote:Are my understanding and deductions about persistent storage correct?
Yup, pretty much.
dark_sylinc wrote:Thanks in advance,
Matias
User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5299
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1279
Contact:

Re: Vertex & Index Buffer refactoring

Post by dark_sylinc »

Thanks again!

Thanks for clarifying the coherent vs non-coherent differences. It makes a lot of sense and is much clearer now.
So the glFlush* commands neither stall nor take the burden of syncing off our shoulders.
I had missed examining the apitest repository from the AZDO talk, and now it makes even more sense. That repo is gold.

Some of gtruc's samples seem to be broken. I had first assumed they were correct, hence so much confusion.
I don't see glFinish or explicit fences in some of the samples whose maps were unsynchronized or persistent. "Coincidentally", all of these samples aren't working on AMD hardware as per his own May 2014 report. The problem could be there.
gsellers wrote:Yep. Asynchronous behavior and UMA are going to be hard to trace, debug and so on. It's like single-stepping a multi-threaded program. Debuggers are good, but not that good, and the more convoluted you make things, the harder it is to debug. Such is the price of performance.
Oh, good to know. This means I'm doing well in leaving a toggle to switch to synchronized, non-persistent mapping.
gsellers wrote:Let's assume that you know ahead of time how much vertex data you have.
I've been thinking a lot about it and I see it now. If I know ahead of time about ALL the meshes I'm going to load, it's doable. Otherwise it's a nightmare (if meshes A & B both have position and UVs, mesh A has 100 vertices and mesh B has 235, then mesh A's offset between position and UV is 1200 bytes but mesh B's is 2820 bytes. If I reserve space for 335 vertices, I can put A & B together while keeping the same offset for the start of the UVs).
Now it looks obvious, which makes me feel a little dumb.

I'll just have to rethink a few things.

Thanks again!