gsellers wrote: Sure, no problem. Let me know if you need anything else.
Thanks! Whether MultiDrawIndirect respected the order was really bugging me.
I've started writing the GL-specific code for the buffer management, and I've got a couple of follow-up questions:
1. It is normally advised to split certain vertex elements into multiple buffers; in our case, that means one VBO with multiple offsets in the VAO (unless you see a different approach).
For example, detaching position from the rest. The reasoning is that it should perform faster in shadow-mapping passes and the early Z-prepass, while having a negligible impact during the normal pass. Sometimes the reason is that one of the attributes is calculated on the CPU and sent every frame.
So we would have at some point in the buffer "pos pos pos pos pos pos" and at another offset "uv uv uv uv uv uv", and so on.
What I'm struggling with is that this doesn't play well with the one-VBO, few-VAOs approach:
- It requires us to reserve chunks for X amount of vertices. If we surpass the reserved amount, we'll need another chunk in which the distance from the start of the "position" attribute to the start of the "uv" attribute remains the same; otherwise we'll need another VAO.
- If mesh A uses one buffer for positions and one for UVs, we can reuse the reserved chunks mentioned in the previous point. But if mesh B uses one buffer for positions + normals and one for UVs, we'll have to use another chunk, because otherwise the distance between the buffers gets out of sync. (This matters if we then create a mesh C whose buffer layout is exactly the same as mesh A's; unless we are careful to load meshes A and C first, but that is a lot of work and this information is not always available.)
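To make that constraint concrete, here's a rough sketch of what I have in mind for the reserved chunks — all names (CHUNK_MAX_VERTS, ChunkLayout, chunkLayout) are made up, and it assumes float3 positions plus float2 UVs:

```c
#include <stddef.h>

/* Hypothetical sketch: every chunk reserves room for a fixed number of
 * vertices and uses the same internal layout (all positions, then all uvs).
 * The pos->uv distance is then identical in every chunk, so one VAO with
 * fixed relative offsets can address any of them. */
enum { CHUNK_MAX_VERTS = 65536 };

typedef struct
{
    size_t posOffset;   /* byte offset of the position array */
    size_t uvOffset;    /* byte offset of the uv array       */
} ChunkLayout;

ChunkLayout chunkLayout( size_t chunkIdx )
{
    const size_t posBytes   = CHUNK_MAX_VERTS * sizeof(float) * 3;
    const size_t uvBytes    = CHUNK_MAX_VERTS * sizeof(float) * 2;
    const size_t chunkBytes = posBytes + uvBytes;

    ChunkLayout layout;
    layout.posOffset = chunkIdx * chunkBytes;
    layout.uvOffset  = layout.posOffset + posBytes;
    return layout;
}
```

The pos-to-uv distance is the same for every chunk, which is exactly why the layout breaks as soon as mesh B interleaves normals with positions.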
Maybe I'm overestimating the importance of splitting elements apart, because, like you said, applications are rarely vertex-fetch bound. But shadow mapping is very common and carries the biggest risk of getting vertex-fetch bound. And there's always going to be more than one Ogre user who will want this feature.
If I understand correctly, with the ARB_shader_storage_buffer_object extension this isn't a problem. I can upload the distances between each attribute (position, UVs, etc.) as uniforms and fetch the vertices by "reinterpreting their meaning" in the shader (in other words, making glVertexAttribPointer almost useless), something like this (not tested):
Code:

layout(std430, binding = VERTEX) buffer VertexBuffer
{
    vec4 data[];
} Mesh;
layout(std140, binding = DISTANCES) uniform DistanceBuffer
{
    //int distPosition; //not needed, equals 0
    int distNormals;
    int distUv;
} Distances;

vec3 vPosition = Mesh.data[gl_VertexID].xyz;
vec3 vNormals  = Mesh.data[gl_VertexID + Distances.distNormals].xyz;
vec2 UVs       = Mesh.data[gl_VertexID + Distances.distUv].xy;
/* ... */
If my understanding is correct, I can live with this. The gains from having one VAO and one VBO with a packed vertex layout far outweigh those of decoupling the vertex layout so that shadow-map passes end up a little faster (which may be negated by regular passes being a little slower...).
And by the time we need to push more performance, we can use this ARB_shader_storage_buffer_object pattern to decouple the vertex layout. I'm thinking in the long term here: eventually 50K draws per frame is going to sound too little, and the bottleneck will be somewhere else.
So, the questions are...
- Is there anything that would help prevent VAO and memory management from becoming a nightmare if these "decoupled" vertex layouts are required?
- Am I correct that ARB_shader_storage_buffer_object could be used in the way I described?
2. PERSISTENT STORAGE.
I've read the ARB_buffer_storage spec, read the Wiki, and analyzed G-Truc's samples. I'm going to be talking here from a client-to-server transfer perspective (i.e. write CPU to GPU).
So far I see there are two types of persistent storage: coherent and non-coherent. Unfortunately, G-Truc's code doesn't include examples of the coherent version.
If I've understood correctly, with a non-coherent mapping I must call glFlushMappedBufferRange after I'm done updating the buffer. And that's it. On the next frame I'll update the same region of memory again and call glFlushMappedBufferRange. Repeat every frame. This strongly suggests that the persistent map I'm getting is not truly a pointer to GPU memory, but rather some CPU driver memory that gets memcpy'd on glFlushMappedBufferRange (or a GPU pointer that gets memcpy'd GPU to GPU).
The advantage here is that there is no need to ask the OS to allocate a virtual address for every map (and if the memory is on the GPU, the GPU-to-GPU memcpy enjoys the faster transfer rates of GDDR). Another advantage is that I don't need to care about synchronization at all.
The disadvantage is that this memcpy is not free (whether GPU-to-GPU or CPU-to-GPU), and I'm at the whim of poorly implemented drivers, which (worst case) could stall me on every glFlushMappedBufferRange. In other words, this is just masking an upload through a staging buffer and a copy behind a persistently mapped pointer.
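If the driver really does memcpy on flush, at least the cost can be kept down by flushing only the bytes actually written. A minimal sketch of that bookkeeping — the helper names are made up, and the glFlushMappedBufferRange call itself is left as a comment since it needs a live GL context:

```c
#include <stddef.h>

/* Dirty-range bookkeeping for a non-coherent persistent map. Track the
 * min/max bytes written this frame, then flush only that span once, e.g.
 *   glFlushMappedBufferRange( GL_ARRAY_BUFFER, r.begin, r.end - r.begin );
 */
typedef struct
{
    size_t begin;   /* first dirty byte             */
    size_t end;     /* one past the last dirty byte */
} DirtyRange;

void resetRange( DirtyRange *r )
{
    r->begin = (size_t)-1;  /* empty range */
    r->end   = 0;
}

void markWritten( DirtyRange *r, size_t offset, size_t numBytes )
{
    if( offset < r->begin )
        r->begin = offset;
    if( offset + numBytes > r->end )
        r->end = offset + numBytes;
}
```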
At the other end there is coherent mapping. I need to allocate ~3x the memory, and use fences and wait objects to avoid overwriting memory that may currently be in use by the GPU. There is *nothing* going on behind my back: I'm truly getting a pointer to the GPU resource. I don't need to call anything, glFlushMappedBufferRange or any other function. Everything is instantaneous (which makes trying to read GPU-to-CPU from this kind of object a minefield of fences and memory barriers; better not to do it).
The advantage is that there is no hidden memcpy; this is the fastest path available for highly dynamic data.
The disadvantage is that I need to track resource usage and use fences. Sure, poor driver implementations could stall me on every call to glFenceSync or glClientWaitSync; but the former would be a surprise if it happened (really? stalling on a glFenceSync?) and, most importantly, the latter IS expected to possibly stall (even genuinely, if we're GPU bound); but at least I'm in control of when and where this happens. Also, I'll most likely end up with just three fences, one per frame (up to 3 frames in flight), for all objects. The driver cannot screw me up too much. I like that.
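The three-fences-for-everything scheme I have in mind looks roughly like this — the names are made up, and only the offset arithmetic is real code; the GLsync calls are shown as comments because they need a live GL context:

```c
#include <stddef.h>

/* Sketch of per-frame fence bookkeeping for a coherent persistent map
 * split into NUM_FRAMES regions, reusing each region every 3rd frame. */
enum { NUM_FRAMES = 3 };

typedef struct
{
    size_t regionBytes;  /* size of one per-frame region of the buffer */
    size_t frameIdx;     /* monotonically increasing frame counter     */
    /* GLsync fences[NUM_FRAMES]; -- one fence per in-flight frame */
} DynamicBuffer;

/* Returns the byte offset this frame's writes should target. */
size_t beginFrame( DynamicBuffer *buf )
{
    const size_t slot = buf->frameIdx % NUM_FRAMES;
    /* if( buf->fences[slot] )
           glClientWaitSync( buf->fences[slot], 0, timeout ); */
    return slot * buf->regionBytes;
}

void endFrame( DynamicBuffer *buf )
{
    /* buf->fences[buf->frameIdx % NUM_FRAMES] =
           glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 ); */
    ++buf->frameIdx;
}
```

The only place a wait can happen is beginFrame, which is the "in control of when and where" part I was referring to.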
Another disadvantage I see is that coherent persistent buffers cannot be accurately traced by tools like VOGL, apitrace, etc. Sure, they can use aggressive copies on every glDraw* call, heuristics, or hardware write breakpoints at certain region intervals to trap changes to the buffer; but they may not be able to faithfully record exactly what I wrote to the buffer without severe performance impact (especially the bugs, i.e. writing to a location without checking whether the GPU could be using it; with a bit of bad luck, a race condition could go unnoticed for a long time).
As far as I can see, coherent mapping doesn't seem to be tracer-friendly. There is no way for me to notify an attached tracer that I'm done modifying a region of a persistent buffer.
In other words, it would be wise for me to leave a toggle so that the code can switch to persistent non-coherent mapping (with calls to glFlushMappedBufferRange) to check for bugs and to be friendly to GL tracers.
Are all of these deductions about persistent storage correct?
Thanks in advance,