HlmsBufferManager binds const buffers to vs and ps

Discussion area about developing with Ogre-Next (2.1, 2.2 and beyond)


User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

Why the HlmsBufferManager::mapNextConstBuffer binds the const buffer to vertex and pixel shader? I checked the connection to HlmsPbs and HlmsUnlit and both are implemented in the way that the const buffers are same in both shaders, but it doesn't make sense to me. I mean the HlmsXXX can be implemented that way, but why HlmsBufferManager forces such implementation?

If I need to extend the Hlms by using also geometry shader and use data there via const buffer, the HlmsBufferManager doesn't help me and either I have to touch the source code (modify and recompile OGRE) or make my own implementation - actually reinventing wheel by copying code from HlmsBufferManager.

I see that HlmsPbs uses the const buffers through HlmsBufferManager to store worldMaterialIdx there which is used by vertex and pixel shaders, but the HlmsBufferManager shouldn't force the bindings - it is up to Hlms implementation how the const buffers will be used and bound. Same is also for tex buffers through mapNextTexBuffer. It shouldn't bind the buffer to particular shader and specific hard-coded slot/register.

Am I missing some important details?

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

Hi!

I don't fully understand the point you're trying to make, but first I'll make a brief context and then I'll ask a couple questions to try to understand what are you asking:

Ultimately I suspect your grip is with, is that shaders and C++ have a problem: They need to be in agreement of how data is passed around (which for VS & PS it is mostly CPU -> GPU rather than the other way around).

Unlike CUDA, we can't pass pointers or call functions directly from C++, so we have to bind buffers to slots (well it's more complex than that because Vulkan internally doesn't have a concept of slots; but OgreNext exposes bindings as slots for all backends).

There is no one true way to organize this. What decides the slot idx for a buffer should be?

  • Hardcoded to slot 2? (Arbitrary). Both C++ & Shader code must be in sync.
  • C++ sets the slot and the shader reads it?
  • The shader sets the number and C++ reads it?

OgreNext supports all of those approaches but each one has their advantages and disadvantages. The only thing is clear is that if one isn't careful, it can become a mess when you have lots of buffers to bind (and IMO it's already a mess).

Harcoding slot to 2 (or any other number)

The advantage of this approach is that it's simple and fast: both C++ and shaders know slot 2 is taken.
This approach makes sense for a buffer that is (almost) always present for all shaders.

It is also the most efficient because it reduces the number of indirections and need for retrieval; so it makes sense for the hottest paths (i.e. the buffers containing world matrix data and worldMaterialIdx). "Hottest paths" means something that is executed for every object every frame.

Just 4 bytes of data with 100k objects at 60FPS becomes 22MB/s of bandwidth. In VR at 90hz (one for each eye) 4 bytes becomes 70MB/s.

The disadvantage is that if a shader doesn't need that slot idx, that slot idx is taken and you can't use it. A reasonable trade off if you expect > 90% of your objects to need it. You still have ~120 other free slots on PC (around ~30 on some platforms like iOS).

C++ sets the slot and the shader reads it

This is what Hlms implementations do for most bindings, either via setProperty( "my_buffer_slot_idx", n ); for normal buffers (and then use @value( my_buffer_slot_idx ) in Hlms) or setTextureReg() for textures and tex buffers.

Then it's just a matter of C++ properly organizing the assigned slots based on the features enabled; and ensuring we don't get mixed them up during actual binding.

Most of this happens in preparePassHash, notifyPropertiesMergedPreGenerationStep and the beginning of fillBuffersFor.

The advantage is that C++ is very flexible for organizing all the slots and meeting the complex demands of different feature configurations.

The disadvantage is that this sucks for bindings that should be per object. Because we would somehow have to figure out what slot idx to use for the next object. Note that to generate a shader you need the following information:

  • Object's info (like vertex format)
  • Material info (i.e. HlmsDatablock)
  • Pass info (e.g. does this pass use shadow maps? this is handled in preparePassHash)

Therefore this data is not directly available anywhere except the Hlms hash as it is the result of the combination of these 3 sources of data.

Which means that for every object, every frame we would need to do (pseudocode):

Code: Select all

const auto &data = data_map.find( renderable[i].mHlmsHash );
if( !buffer.is_bound( data.slot_idx ) )
    buffer.bind( data.slot_idx );

And we also have to ensure that binding buffer at data.slot_idx doesn't mess up a binding that is meant to be global (i.e. set inside if( OGRE_EXTRACT_HLMS_TYPE_FROM_CACHE_HASH( lastCacheHash ) != mType )). Otherwise we would later have to restore that again for the next object.

Note that in some cases C++ might set a base value, and assumes the shader will generate the rest consecutively.

The shader sets the number and C++ reads it

It's basically the same thing as the previous one (C++ sets it, shader reads it), except now C++'s knowledge about the slot indices gets delayed until after shader generation; which makes it harder to write the logic because the Hlms syntax is not the best for this sort of thing.

Why the HlmsBufferManager::mapNextConstBuffer binds the const buffer to vertex and pixel shader? I checked the connection to HlmsPbs and HlmsUnlit and both are implemented in the way that the const buffers are same in both shaders, but it doesn't make sense to me. I mean the HlmsXXX can be implemented that way, but why HlmsBufferManager forces such implementation?

I'm having trouble understanding what you're asking here:

  1. Are you asking why is it hardcoded? (explained above in my post).
  2. Are you asking why is it both vertex & pixel shader when there are cases when e.g. only the vertex shader uses it?
  3. Are you asking why is both vertex & pixel share the same value when they don't have to and could be different numbers? (this one is simple: driver bugs on certain platforms; plus mental simplicity by knowing a specific buffer is always bound to the same slot in all shader stages).
  4. Are you asking why is it only VS & PS, and not also GS/DS/HS?
  5. Something else...?

Cheers

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

but the HlmsBufferManager shouldn't force the bindings - it is up to Hlms implementation how the const buffers will be used and bound

I forgot to mention something: HlmsBufferManager has a task that sounds simple: Guarantee that there is a mapped void*/float* pointer (i.e. mCurrentMappedConstBuffer / mCurrentMappedTexBuffer) always available to write data that corresponds to the current Renderable being iterated, and the shader will see that data when indexed by bound_buffer[inVS_drawId]).

However what made this task awfully difficult is that D3D11 does not support binding const buffers by relative offsets. It must always start from 0 (binding by relative offset was added later IIRC D3D11.1 but for unknown reasons it is optional unfortunately).

That is troublesome because we may have created a 64kb const buffer but for some reason after just writing 16kb we were forced to split the draw call and inVS_drawId will be reset to 0.
Instead of binding region 16kb..64kb, we must throw away that const buffer (wasting 48kb) and use a new one from 0. Also if the tex buffer was almost full, it's problematic because soon we will be forced to split the draw call again and do all of this again.

The other thing that made this task awfully difficult are skeleton animated objects: With regular objects we know that every object needs 64 bytes of data for a float4x4 matrix. Then we know bound_buffer[inVS_drawId] is just from C++'s perspective uint8* buffer[renderable_i * 64].

However when Mesh A may have 3 bones per instance and Mesh B can have 10 bones per instance; it becomes a nightmare to track & keep the offsets correctly. And we may run out of space (e.g. we have space for 8 more bones but we need 10) so we must break the bindings even though the buffer isn't full.

It took months to discover & fix the edge cases where the bindings would be off by 1 and things like that.

Originally I wanted HlmsBufferManager to be a generic class that would help developers maintain buffers that keep track of the current object and easly send per-object data. However the complexities of everything involved (different API limitations, skeleton objects coming before/after regular objects) made HlmsBufferManager much more inflexible and much less reusable than originally planned.

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

Hi, thanks for very detailed explanation. I think we are speaking about the same - my point was only to move the responsibility from HlmsBufferManager to Hlms implementation. I think it possible to reimplement HlmsBufferManager so it doesn't inherit from Hlms class - the usage in HlmsPbs and HlmsUnlit will be by composition (using a member of HlmsBufferManager) instead of inheritance (but this is a detail - it would work in both cases).

The logic described by you - hard-coded slots and binding to vs/ps should be split from the class - I would expect there is a base class like BaseHlmsBufferManager which contains only logic of creating the buffers, mapping/unmapping, but there is no connection to CommandBuffer and bindings to shaders. Then there will be HlmsBufferManager which contains logic needed by HlmsPbs and HlmsUnlit (hard-coded slots, bindings of the buffers to slot 0 and 2). Then it is possible to reuse the logic of buffer management by other implementations of Hlms which are independent of HlmsPbs and HlmsUnlit - including e.g. binding the buffers to GS (and possibly other stages). The only complication of implementing BaseHlmsBufferManager is to handle the tex buffers - in HlmsBufferManager by unmapping the currently mapped tex buffer, updates the offsets in the command inserted by mapNextTexBuffer which wouldn't be available there. But it is possible to implement it in HlmsBufferManager (inherited from BaseHlmsBufferManager) by hiding the methods from BaseHlmsBufferManager and adding new methods (like the one which are there now) which accepts also command buffer.

I would like to see HlmsBufferManager to behave like ConstBufferPool - it just manages the buffers, but nothing further. When I implement Hlms, I decide in the shaders how the buffers will be used/mapped and then reflect that usage/mapping in c++. Like you wrote both sides must be in sync - but it is responsibility of Hlms implementation (like HlmsPbs) to make the "sync" and not HlmsBufferManager. I suppose the HlmsBufferManager was written/implemented specifically for the needs of HlmsPbs and HlmsUnlit, but it is not very suitable for the techniques which I have in mind - at the moment it disturbs me only the missing binding of the const buffers to GS, but I want to keep focus on generality of the class so I don't want to blindly add that line in HlmsBufferManager::mapNextConstBuffer - again that command should be part of my Hlms implementation and not buffer management.

I would like to bring it little bit further - I see that in GL3Plus RS the ConstBufferPacked, ReadOnlyBufferPacked and TexBufferPacked have all identical bindings for all stages - actually the command CbShaderBuffer( PixelShader, 2, constBuffer, 0, 0 ) in HlmsBufferManager::mapNextConstBuffer is redundant in the case of GL3Plust RS but needed when D3D11 RD is used. According to the comments in those mentioned classes, it is not recommended to bind different buffers to same slots in different stages (like VS use buffer A at slot 0 and PS use buffer B at slot 0), because it won't work with GL3Plus RS. From that perspective, the CbShaderBuffer should rather accepts a bit-set of shaders where it should be bound - the GL3Plus implementation will bind it once (and asserts that the bit-set cannot be 0) and D3D11 will set the bindings according to the bit-set value. The BaseHlmsBufferManager can set have members to which stages the const buffers and read-only buffers must be bound (if the binding must be part of the implementation) and same for the slots - whether it is hard-coded in c++ or set by HlmsPbs/HlmsUlit to a member I think, make no difference in the performance.

Even better would be also to decouple the const buffer/read-only buffers to have them wrapped in some kind of "pool". The class manages the creation/destruction of the buffers and allows to set externally the bindings to the stages and a slot. This way if Hlms implementation needs e.g. 2 const buffers, it is possible to add 2 members of such pool, set the bindings to the stages and slots. Same for read-only buffers.

If it is still too unclear what I mean, I will write the implementation and send it to you. I am trying to implement my own Hlms (basically rewriting my shader generator which I wrote for GL and another engine, but now using ogre-next and I would like to target d3d11). I am studying the HlmsPbs and HlmsUnlit and trying to extract all those pre-conditions and hard-coded values/binding as they are not well documented. Hopefully I will come with something useful at the end :)

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

Hi!

In one hand, I want to ask "why?".

But on the other hand, it sounds like your work can produce an actually reusable component and you seem to have thought this through. Best case scenario it can replace HlmsBufferManager. Worst case scenario users can have your code to handle an alternate buffer they want to use for having per-instance data without worrying about the details.

So if you feel strongly about it, go ahead and keep me posted of your progress so I can point out the potential issues.

Cheers

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

Thanks, I will give it a try. When I will have something useful, I will post it somewhere :-)

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

The HlmsBufferManager::rebindTexBuffer with resetOffset = false seems strange to me. I am trying to "debug" it just looking into the code so I didn't reproduce any bug yet, but let me try to explain my thoughts.

The function with resetOffset set to false is called only from HlmsPbs::fillBuffersFor and HlmsUnlit::fillBuffersFor when change of hlms is detected - there are identical code for that in both classes:

Code: Select all

            //layout(binding = 2) uniform InstanceBuffer {} instance
            if( mCurrentConstBuffer < mConstBuffers.size() &&
                (size_t)((mCurrentMappedConstBuffer - mStartMappedConstBuffer) + 4) <=
                    mCurrentConstBufferSize )
            {
                *commandBuffer->addCommand<CbShaderBuffer>() =
                        CbShaderBuffer( VertexShader, 2, mConstBuffers[mCurrentConstBuffer], 0, 0 );
                *commandBuffer->addCommand<CbShaderBuffer>() =
                        CbShaderBuffer( PixelShader, 2, mConstBuffers[mCurrentConstBuffer], 0, 0 );
            }

        rebindTexBuffer( commandBuffer );

Let's consider rendering of objects in the following order: pbsA, unlitB, pbsC, unlitD - so there is change of current hlms between each object.

  • HlmsPbs::fillBuffersFor(..., pbsA, ...) starts filling of the data from the beginning of the buffers created by HlmsBufferManager and fills something to const and tex buffer - but far from filling them to full capacity. --> the const and text buffers are mapped (this is important)
  • HlmsUnlit::fillBuffersFor(..., unlitB, ...) - here the change of hlms doesn't binds const buffers, but tex buffer will be bound through command buffer - seem not necessary, but it won't cause any problems. It starts filling of the data from the beginning of the buffers created by HlmsBufferManager - it fills something and same as above - doesn't even reach the full capacity.
  • HlmsPbs::fillBuffersFor(..., pbsC, ...) - here the change of hlms is already I think wrong: The const buffer is mapped from beginning (offset 0), where data for pbsA are written, but instancing was broken/restarted - the pbsC will get drawId 0, right? Same for rebindTexBuffer does the same - it binds the tex buffer using offset where pbsA wrote some data.
  • very similar problems are with next rendering of pbsD

In other words - the command CbShaderBuffer before draw call for pbsA binds tex buffer from offset X to X+n and before draw call pbsC from offset again X to X+n+n (expecting to write same amount of data for both objects).

Is it working in different way? If any command in inserted in fillBuffersForV2, then the instancing is restarted in RenderQueue::renderGL3.

If it is ensured that first all objects from HlmsPbs are rendered and then from HlmsUnlit, then inside fillBuffersFor when hlms change is detected, the buffers are empty - but in this case it is not necessary to bind anything (except pass buffer) while the filling of the data further to const and tex buffer will detect, that the sizes of both buffers are 0 and mapNextConstBuffer/mapNextTexBuffer are called anyway.

I thinks there it is sufficient to call unmapConstBuffer and rebindTexBuffer with resetOffset=true (which means that resetOffset is not needed as all the call would set it to true).

This particular functionality I have problem to re-implement in by own buffer management due to above "problems". If there is no problem at all, then I think I missed somewhere a very important detail.

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

No. There are two issues with your assumptions:

HlmsUnlit::fillBuffersFor(..., unlitB, ...) - here the change of hlms doesn't binds const buffers, but tex buffer will be bound through command buffer - seem not necessary, but it won't cause any problems

Here's your first issue: Binding the tex buffer is absolutely necessary. The tex buffer that HlmsUnlit is managing is a different one from the one HlmsPbs is managing.
Therefore binding another one becomes necessary.

HlmsPbs::fillBuffersFor(..., pbsC, ...) - here the change of hlms is already I think wrong: The const buffer is mapped from beginning (offset 0), where data for pbsA are written, but instancing was broken/restarted - the pbsC will get drawId 0, right?

Actually no. That's what's causing you your biggest issue. Thanks to some API functionality called "baseInstance", we can resume the drawID. We do not have to restart it.

We can pretend that drawID starts from an arbitrary number (like say drawId = 54) as long as all our buffer indexing isn't out of bounds (i.e. buffer[drawId * whatever]).
The drawId is derived from the return value of Hlms::fillBuffersFor:

Code: Select all

uint32 HlmsPbs::fillBuffersFor( /*...*/ )
{
// >> 2u is the same as dividing by 4. We divide by 4 because in the shader we use uint4 worldMaterialIdx[] and mCurrentMappedConstBuffer is an uint32*.
// - 1 because we must return the drawId of the instance we just processed. mCurrentMappedConstBuffer has already been incremented
return uint32( ( ( mCurrentMappedConstBuffer - mStartMappedConstBuffer ) >> 2u ) - 1u );
}

Code: Select all

// RenderQueue obtains the baseInstance:
uint32 baseInstance = hlms->fillBuffersForV2( hlmsCache, queuedRenderable, casterPass,
                                                  lastHlmsCacheHash, mCommandBuffer );
/* ... */                                                  
drawIndexedPtr->baseInstance = baseInstance << baseInstanceShift; // baseInstanceShift is always 0 for non VR rendering.

And that's what resetOffset is for. We can always resume the drawID. But there are cases when we must restart the drawId to 0:

  1. We must bind a new tex buffer because it is full (d'oh!).
  2. We must bind a new const buffer because it is full. Since the const buffer must go back to 0, the drawId must go back to 0. However the const buffers often have a 64kb limit while tex buffers can be very big like 4MB or more, so we can still use the same tex buffer but bind a different subregion of it so the GPU sees the subregion as starting from 0 and thus tex_buffer[drawId] works.
User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

I missed that drawId is not always restarted from 0. It is clear that when hlms is changed, it is necessary to ensure that the correct buffers are bound, but isn't it however possible, that within the scope of

Code: Select all

if( OGRE_EXTRACT_HLMS_TYPE_FROM_CACHE_HASH( lastCacheHash ) != mType )
{
   ...
}

the buffers are bound and immediately after the scope, it is detected, that the buffers are full and mapNextConstBuffer/mapNextTexBuffer (or rebindTexBuffer) are called? From the code it seems like it is possible.

I think I have already enough information to get my ideas in the files - hopefully I will be able to send you something soon.

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

bishopnator wrote: Thu Jun 13, 2024 7:59 pm

the buffers are bound and immediately after the scope, it is detected, that the buffers are full and mapNextConstBuffer/mapNextTexBuffer (or rebindTexBuffer) are called? From the code it seems like it is possible.

That is possible, but it should only happen by coincidence. If it happens 100% of the time it sounds like a bug.

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

bishopnator wrote: Thu Jun 13, 2024 7:59 pm

the buffers are bound and immediately after the scope, it is detected, that the buffers are full and mapNextConstBuffer/mapNextTexBuffer (or rebindTexBuffer) are called? From the code it seems like it is possible.

That is possible, but it should only happen by coincidence. If it happens 100% of the time it sounds like a bug.

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

I put the ideas in cpp so you can see the direction of my thinking.

HlmsBufferPool - contains management of a single buffer type - it provides access to the buffers like the current HlmsBufferManager for const buffers (mapped always from offset 0, automatic advanceFrame calls) and read-only buffers (allows to bind same buffer multiple times using different offsets within the buffer).
HlmsConstBufferPool - implementation of HlmsBufferPool using ConstBufferPacked - internal size fixed to 65536b, allow to bind always only from offset 0
HlmsReadOnlyBufferPool - implementation of HlmsBufferPool using ReadOnlyBufferPacker - internal size defined by caller, allow to bind same buffer multiple times from arbitrary offset
HlmsExt - implementation of Hlms using the buffer pools

I see advantage there to allow reusing of the functionality - current implementations of HlmsPbs and HlmsUnlit use also identical management also for pass buffers and light buffers (in case of PBS).

Implementation of HlmsExt can in its constructor define the buffers and binding points and also assign proper names for the buffers (instead of referring to mConstBuffers and mTexBuffers, it is possible to write:

Code: Select all

MyCustomHlms::MyCustomHlms(...)
{
   mPassBuffers = &createConstBufferPool(VertexShader | PixelShader, 0);
   mInstanceBuffer = &createConstBufferPool(VertexShader | PixelShader, 2);
   mInstaceData = &createReadOnlyBufferPool(PixelShader, 0);
}
There is clear overview which buffers are needed by the Hlms, to which states and slots they will be bound.

The HlmsExt can replace current HlmsBufferManager, but I didn't want to use that name yet.

The code is compilable, but not tested yet - I use MOD_ASSERT library in my project so you can get some compilation errors - just comment out the MOD_ASSERT_XXX macros.

You do not have the required permissions to view the files attached to this post.
User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

Hi!

I haven't been able to go into full detail but mostly because of these reasons:

  1. OgreComponentExport.h was not included (although it was easy to figure out)
  2. Be careful which version of OgreNext you're targetting. For example the class GeneralAllocatedObject no longer exists. Ideally target master branch which is the latest one. But if you are not feeling comfortable, try with 3-0 branch instead.

I didn't see anything I didn't like :D ; but I guess I would have to wait until you've done more.

Other misc:

  1. Do not use auto (or at least, the final PR should not use it).
  2. Remember that OgreNext has a few practices like member variables being mFoo instead of m_Foo; and variables are initialized in the constructor instead of in the header.
  3. "Direct" & "Indirect" names are already heavily overloaded in computer graphics. Perhaps "Bulk" vs "SubRegion" would fit better?

Keep it up!

Cheers!

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

Soon I will hopefully post a link to git repository with my changes. In the meantime, I am getting an exception from my HlmsBufferPool::postCommandBufferExecution which is actually identical to HlmsBufferManager::postCommandBufferExecution. The only difference is that I don't create a buffer in preparePassHash and I avoid the 'out of bounds' exception with some additional check. But the problem is I think also in HlmsBufferManager, which is normally not hit exactly due to creating first buffer.

If between the calls HlmsBufferManager::preCommandBufferExecution and HlmsBufferManager::postCommandBufferExecution a new buffer is created and added to mTexBuffers, the advanceFrame was not called on that buffer and hence in postCommandBufferExecution, the regressFrame throws an exception. Probably I am missing something again, but can you check it? Is it possible to call regressFrame somehow without previous advanceFrame?

I think I need little bit explanation about the advanceFrame and regressFrame. As the read-only buffer is mapped multiple times within a single frame, the map call doesn't automatically call advanceFrame internally. Hence the preCommandBufferExecution calls advanceFrame so all the buffers will map to new range (frame) within the larger buffer so the data used by previous frame are not overwritten (potentially still used by gpu). But why is it necessary to call regressFrame in postCommandBufferExecution and adjanceFrame in frameEnded() implementation? Those 2 calls puzzle me little bit. I suppose that the pre/postCommandBufferExecution can be repeated multiple times before frameEnded() call (due to e.g. multiple rendered passes). So the postCommandBufferExecution calls regressFrame only that if preCommandBufferExecution is called, its advanceFrame gets the buffers to the same "frame" as previously rendered pass within the currently rendered frame. And the frameEnded() must call advanceFrame to compensate the last postCommandBufferExecution's regressFrame. Is it correct?

If yes, it just sounds, that there should be also "frameStarted" virtual method where the advanceFrame is called and then there is no need to call advanceFrame/regressFrame in any of those currently implemented methods. If there is no "frameStarted", it is possible to call advanceFrame within preCommandBufferExecution, but only conditionally (bool flag) which is reset in frameEnded(). And then regressFrame will be avoided completely.

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

If between the calls HlmsBufferManager::preCommandBufferExecution and HlmsBufferManager::postCommandBufferExecution a new buffer is created and added to mTexBuffers

There must not be any further buffer creation between preCommandBufferExecution & postCommandBufferExecution:

Code: Select all

for( size_t i = 0; i < HLMS_MAX; ++i )
{
	Hlms *hlms = mHlmsManager->getHlms( static_cast<HlmsTypes>( i ) );
	if( hlms )
	    hlms->preCommandBufferExecution( mCommandBuffer );
}

// Can't add more buffers here!
mCommandBuffer->execute();
// Can't add more buffers here!

for( size_t i = 0; i < HLMS_MAX; ++i )
{
	Hlms *hlms = mHlmsManager->getHlms( static_cast<HlmsTypes>( i ) );
	if( hlms )
	    hlms->postCommandBufferExecution( mCommandBuffer );
}

Commands have already been recorded and only need to be executed so they can be sent to the driver. The data should've already been uploaded.

How are you running into such situation?

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

Hi again!

I forgot to answer some more questions:

I suppose that the pre/postCommandBufferExecution can be repeated multiple times before frameEnded()

That is correct. They exist because multiple passes are possible.

But why is it necessary to call regressFrame in postCommandBufferExecution and adjanceFrame in frameEnded() implementation? Those 2 calls puzzle me little bit.

Ok NORMALLY the idea of a dynamic buffer is that you can only map it once per frame:

Code: Select all

while( true )
{
   buf->map()
   upload_data()
   buf->unmap()
   render();
}

Trying to map the same buffer again in the same frame will throw an exception.

However there are cases where we do need to map again. That's where the idea of "explicitly advancing" and "regressing" comes in.

If map( bAdvanceFrame = true ) was called, the point of regressing a buffer is to go back to how it was before buf->unmap(); so that we can call buf->map() and we promise (pinky swear) to not overwrite regions we already wrote to.
Failing to do so could potentially cause the CPU to write to data that the GPU is already reading from (in practice this is unlikely since the GPU is always quite far enough behind and the driver commands are being parepared in the CPU & may still not have hit the GPU, but still... it's a race condition).

When we call map( bAdvanceFrame = false ); then we must manually call advanceFrame() before rendering.

Q: Why do we need to call advanceFrame and then regressFrame again?

The reason is simple: Because when we send render commands to the driver, advanceFrame must've already been called (otherwise we'll be telling the GPU to look at the wrong offsets). This happens in mCommandBuffer->execute(); which executes all the CbShaderBuffer & CbDraw* commands we've just recorded.

However, there may be more compositor passes next; so after mCommandBuffer->execute() we call regressFrame again.

The steps are like the following (let's assume 2 compositor render_scene passes):

  1. Pass 0: buf->map( bAdvanceFrame = false )

  2. Pass 0: uploads data & records commands into mCommandBuffer

  3. Pass 0: buf->unmap()

  4. Pass 0: buf->advanceFrame();

  5. Pass 0: mCommandBuffer->execute();

  6. Pass 0: buf->regressFrame();

  7. ==== NEW PASS ===

  8. Pass 1: buf->map( bAdvanceFrame = false )

  9. Pass 1: uploads data (we promise to write to a different region) & records commands into mCommandBuffer

  10. Pass 1: buf->unmap()

  11. Pass 1: buf->advanceFrame();

  12. Pass 1: mCommandBuffer->execute();

  13. Pass 1: buf->regressFrame();

  14. END OF FRAME: frameEnded() gets called which performs the last and final call to buf->advanceFrame();

So shortened:

  1. buf->advanceFrame(); // +1, counter = 1

  2. mCommandBuffer->execute(); // Executes with counter = 1

  3. buf->regressFrame(); // -1, counter = 0

  4. buf->advanceFrame(); // +1, counter is +1

  5. mCommandBuffer->execute(); // Executes with counter = 1

  6. buf->regressFrame(); // -1, counter = 0

  7. buf->advanceFrame(); // +1, counter = 1

Ultimately by the end of the frame, counter = 1. And the next frame this process is repeated and will end up with counter = 2. The next frame this repeats and counter warps back to 0 so counter = 0 (assuming VaoManager::getDynamicBufferMultiplier == 3)

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

Thanks for all details. Actually I made nasty typo - from my Ogre::HlmsExt::preCommandBufferExecution, I called pBufferPool->postCommandBufferExecution :-(

But thanks for pointing out, that between pre/postCommandBufferExecution there shouldn't be any creation of the buffers - it is correct and forced me to search for the bug.

Finally I have some running state - I committed everything to my ogre-next copy on github - you can check the progress and current state at https://github.com/bishopnator/ogre-nex ... ffer-pools.

I updated the code style little bit - using mXXX for member names instead of m_XXX, using lower-case first letter for all methods. There are still some usages of 'auto' keyword and initialization of members in headers - more updates will follow.

There are some issues:

  • in hlsl vertex shader I have to unpack from 4x4 matrix instead of 4x3 and I don't understand why:

    Code: Select all

      ogre_float4x3 worldMat = UNPACK_MAT4x3(worldMatricesBuf, input.drawId);
      float4 worldPos = float4(mul(input.position_local, worldMat).xyz, 1.0f);
    

    should be exactly the same as:

    Code: Select all

      float4x4 worldMat = UNPACK_MAT4(worldMatricesBuf, input.drawId);
      float4 worldPos = mul(float4(input.position_local, 1.0), worldMat);
    

    Only the second code snipped works. I checked also in code that the last row is really always (0,0,0,1). With the 4x3 matrix somehow the translation is always ignored (but I see the boxes rotating at (0, 0, 0))

In Ogre::HlmsPbs the world transformation matrix is uploaded like:

Code: Select all

            memcpy( currentMappedTexBuffer, &worldMat, 4 * 3 * sizeof( float ) );
            currentMappedTexBuffer += 16;

and in my implementation as:

Code: Select all

            memcpy( currentMappedTexBuffer, &worldMat, sizeof(Ogre::Matrix4) );
            currentMappedTexBuffer += 16;
  • objects using my hlms are wrongly rendered in shadow maps - here I have to say I didn't analyse code in HlmsPbs for rendering the objects in shadow maps - I will get to it later
    Image
User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

Hi!

I forgot to reply.

Only the second code snipped works. I checked also in code that the last row is really always (0,0,0,1). With the 4x3 matrix somehow the translation is always ignored (but I see the boxes rotating at (0, 0, 0))

This sounds like row major vs column major. I always confuse them, we made the macros to make sure all shader languages match in how matrices are constructed.

Basically it's not working because you must be loading or multiplying a 3x4 matrix instead of a 4x3 matrix.

I suspect the problem will go aways if you change:

Code: Select all

float4 worldPos = float4(mul(input.position_local, worldMat).xyz, 1.0f);

with:

Code: Select all

float4 worldPos = float4(mul(worldMat, input.position_local).xyz, 1.0f);

But if that doesn't work perhaps from C++ the matrix needs to be transposed before sending to GPU.

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

I will try to play with it. I am trying to extract the code for proper rendering of shadow caster(s) and it looks like a mess there between Hlms, HlmsUnlit and HlmsPbs implementations. There are a lot of duplicated code between HlmsUnlit and HlmsPbs and even between Hlms and HlmsUnlit.

I would say that all properties from HlmsBaseProp should be handled only and only by Hlms (as base class) - no setProperty call of those properties should be done by HlmsUnlit and HlmsPbs.

From the implementation side it looks to me that HlmsUnlit was the first and then somehow extracted to Hlms - at least HlmsUnlit::preparePassHash contains some handling which is done by Hlms::preparePassHashBase like the handling of item in mPassCache (including the assertion that too many items are there).

There are (probably only few) some identical properties defined in HlmsUnlit and HlmsPbs - why aren't they part of HlmsBaseProp? Like "exponential_shadow_maps". I didn't compare them, but when I tried to use a shader piece DoShadowCastPS, no code is generated by my CustomHlms due to missing exponential_shadow_maps.

Such details are really hard to extract from the code and it is necessary to use huge effort to bring it to life.

My goal is to write a new base class inherited from Hlms (like in my example code HlmsExt) which contain somehow a clearer information what must be fulfilled to render geometry with the very simplest shader - e.g. support properly automatic instancing done by Ogre::RenderQueue, support shadow casters and optionally also shadow receivers.

I noticed from the debugging, that filling pass buffers in preparePassHash implementation is also not very good idea - in my example code I registered HlmsPbs, HlmsUnlit and my CustomHlms - when items in the scene use only CustomHlms, the HlmsPbs and HlmsUnlit still create and fill pass buffers, even if their fillBuffersFor is not called at all. I think the good approach is to only accumulate the data in member variable (like it is done in HlmsUnlit::mPreparedPass), but don't fill the GUP buffers yet. The filling of the buffers can be done in fillBuffersFor when hlms type change is detected - there the pass buffers are bound. It is of course also better to guard the filling of the pass buffers to ensure that they are filled only once.

I am not sure how to implement all the details yet, but trying hard to come with something useful. Slowly but surely I am also converging to use the same kind of buffers like HlmsPbs and HlmsUnlit (pass buffer, instance data, world matrices and buffers which hold the data from Hlms data blocks) - but trying to find a solution how to indicate such using through the base class - I am thinking here about providing more "base" classes - the Ogre::Hlms without forcing anything - it will be completely up to the developer, then base class which provides the basic support with suggesting the mentioned buffers, etc. The developer should be able to pick a base class and start writing his/her shaders without huge effort.

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

And another note about the shadow casters - I think the support should be verified by the Hlms itself - if Item has set a bool flag that it casts shadows, but Hlms doesn't support shadow casters, the Item shouldn't be rendered in the shadow maps. This way it would be possible to have a mixture of Item instances in the scene (and same render queue id) with the shadows but if Item is rendered through Hlms without proper (or missing) implementation for shadow casters, the shadows would be still relatively ok - some objects will be missing there, but they will be missing for good as such artifacts like in my screenshot would be possible to simply dismiss by telling Ogre that the Hlms implementation doesn't support shadow casters.

Is something like that already supported by Ogre? It is also possible that I missed that feature of Hlms - I didn't look explicitly into this as I would expect that by default it should be turned-off (no support for shadow casters by default) and it would be turn-on by the developer of Hlms when the proper shaders are implemented.

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

Hi!

There are a lot of duplicated code between HlmsUnlit and HlmsPbs and even between Hlms and HlmsUnlit.

The fundamental issue is that the requirements keep shifting and changing.
And in many cases sharing becomes worse: Unlit is supposed to be lightweight (as in simple) aside from not supporting lights.

Once the code begins to share more, you'll hit the issue that the shared code will begin to pull in more heavy weight code that is only meant for PBS.

If the advanced features of PBS are needed, one option is to use Emissive's element of PBS with no diffuse, no specular and no ambient; which will result in the same look but with all the other available options.

From the implementation side it looks to me that HlmsUnlit was the first and then somehow extracted to Hlms - at least HlmsUnlit::preparePassHash contains some handling which is done by Hlms::preparePassHashBase like the handling of item in mPassCache (including the assertion that too many items are there).

It actually happened the other way around: HlmsPbs came first and then Unlit was a stripped down version.

As for the assert, if you mean this one:

Code: Select all

        assert( mPassCache.size() <= (size_t)HlmsBits::PassMask &&
                "Too many passes combinations, we'll overflow the bits assigned in the hash!" );

This is because the Hlms final hash is a 32-bit value, and we encode 8 bits for passes and 21 bits for renderables. The goal is to have O(1) lookups without having to implement or use a hash table.

There are (probably only few) some identical properties defined in HlmsUnlit and HlmsPbs - why aren't they part of HlmsBaseProp? Like "exponential_shadow_maps". I didn't compare them, but when I tried to use a shader piece DoShadowCastPS, no code is generated by my CustomHlms due to missing exponential_shadow_maps.

Unlit was originally written to have no shadows. But the issue arose that it was useful for Unlit to cast shadows (but not receive them).

Exponential shadows introduced the problem that the caster must be aware of this setting, thus Unlit had to clone certain (but not all) functionality that is from PBS.

Exponential shadows is at its origin, a PBS feature. We could rethink the issue and probably refactor it as its own component that can be plugged into HlmsPbs and HlmsUnlit (and whatever other Hlms implementation). But there is not enough interest in exponential shadows to begin with, to allocate resources to do such refactor.

I noticed from the debugging, that filling pass buffers in preparePassHash implementation is also not very good idea - in my example code I registered HlmsPbs, HlmsUnlit and my CustomHlms - when items in the scene use only CustomHlms, the HlmsPbs and HlmsUnlit still create and fill pass buffers, even if their fillBuffersFor is not called at all. I think the good approach is to only accumulate the data in member variable (like it is done in HlmsUnlit::mPreparedPass), but don't fill the GUP buffers yet. The filling of the buffers can be done in fillBuffersFor when hlms type change is detected - there the pass buffers are bound. It is of course also better to guard the filling of the pass buffers to ensure that they are filled only once.

It is certainly a minor(\*) problem that could be fixed, but not an easy to fix:

  1. Several of the operations in preparePassHash can NOT be done in fillBuffersFor: Vulkan & Metal require buffer copy operations to not happen while a "render pass" is open. Once RenderSystem::executeRenderPassDescriptorDelayedActions is called, those operations can no longer be issued until RenderSystem::endRenderPassDescriptor is called.

  2. For the data that can be uploaded in fillBuffersFor, copying the data to a CPU buffer and then uploading on demand is pointless, because you'll be doubling bandwidth usage for no reason (read data from various sources, write into CPU buffer, then read from that CPU buffer, write into GPU buffer).

  3. You could split Pass data generation from the uploading tasks that can be done in fillBuffersFor. But you'll quickly end up with a lot of duplicated work, like evaluating the size of mapSize.

(\*) It's "minor" because a normal rendering pipeline usually has less than 10 passes. At a max of < 64kb each (usually around 1kb typically) and 3 Hlms implementations, that's around 10 x 64 x 3 = 1.88MB of max data per frame, 112MB/s at 60Hz.
Typically 1.75MB/s if we consider 1kb instead of 64kb.

What you're proposing is to evaluate if we have already uploaded the pass data in fillBuffersFor. Which gets called per object, per pass. That's almost certainly a net loss. Even with a branch missprediction of 1%, at 100k objects that'd be the same as calling preparePasshHash 1000 times per pass per frame.

What's counterintuitive about modern computer performance is that it's sometimes (but not always) more efficient to do something 10 times and discard it because it was never needed after all, than to check if we need to ever do it... but check it 100,000 times.

So far the only solution I've found to this problem is to have the CompositorPass only call the prepare pass hash that it needs.
CompositorPassScene does not support this (I'm worried that people could disable it and then by accident use it, which can cause more unnecessary support tickets), but custom passes can do it. For example Colibri only calls HlmsColibri::preparePassHash (which derives from HlmsUnlit) in its compositor pass used to draw the GUI exclusively. This happens inside of m_colibriManager->prepareRenderCommands.

And another note about the shadow casters - I think the support should be verified by the Hlms itself - if Item has set a bool flag that it casts shadows, but Hlms doesn't support shadow casters, the Item shouldn't be rendered in the shadow maps.

Hlms implementations have no say on what gets rendered (at least with the default RenderQueue implementation).
They just say how to render it.

The best they can do if they want something gone is to replace the vertex shader with a dummy one or fill the WVP matrix with zeroes so that it creates degenerate triangles, thus nothing gets drawn on screen. However if your goal is to save performance, the GPU still has to process it.

Either that, or assert / log when they encounter a shadow casting object that shouldn't be casting shadow, to inform the object is incorrectly setup.

The analysis / assertion can happen either early in calculateHashForPreCaster (which is called once when a datablock is assigned to a renderable, but it will miss if the setting is later enabled) or late in fillBuffersFor (which is called every time the object is rendered).

Other ways to avoid this issue is to use Render Queue IDs or visibility masks in the shadow nodes passes.
"What" gets rendered and in which order is the Compositor's job.

Cheers

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

With passBuf I didn't mean to check for it by every fillBufferFor call, but only inside the scope where the change of hlms type is detected:

Code: Select all

if( OGRE_EXTRACT_HLMS_TYPE_FROM_CACHE_HASH( lastCacheHash ) != mType )
{
           ...
            // layout(binding = 0) uniform PassBuffer {} pass
            ConstBufferPacked *passBuffer = mPassBuffers[mCurrentPassBuffer - 1];
            *commandBuffer->addCommand<CbShaderBuffer>() = CbShaderBuffer(
                VertexShader, 0, passBuffer, 0, (uint32)passBuffer->getTotalSizeBytes() );
            *commandBuffer->addCommand<CbShaderBuffer>() =
                CbShaderBuffer( PixelShader, 0, passBuffer, 0, (uint32)passBuffer->getTotalSizeBytes() );
           ...
           // additionally check whether the passBuffer is necessary to fill
           ...
}

But if there are any problems with vulkan/metal, then of course it is out of discussion.

As for the assert, if you mean this one:

Code: Select all

        assert( mPassCache.size() <= (size_t)HlmsBits::PassMask &&
                "Too many passes combinations, we'll overflow the bits assigned in the hash!" );

This is because the Hlms final hash is a 32-bit value, and we encode 8 bits for passes and 21 bits for renderables. The goal is to have O(1) lookups without having to implement or use a hash table.

After the assert there is std::find which is definitely not O(1). Probably you meant another lookup using the final (combined) hash. Anyway my comment was more regarding of same implementation in HlmsUnlit and base class Hlms. If Hlms::preparePassHashBase is not called there only due to setting too many properties, then Hlms should contain less "base" properties. I don't see another difference there.

The HlmsUnlit actually generates same "empty" pixel shaders as my CustomHlms:

Code: Select all

struct PS_INPUT
{
};

struct PS_OUTPUT
{
	float colour0 : SV_Depth;
};

void main(PS_INPUT inPs)
{
	PS_OUTPUT outPs;
}

Does it have any meaning? Is the depth somehow automatically written to the output in this case?

User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5429
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1337

Re: HlmsBufferManager binds const buffers to vs and ps

Post by dark_sylinc »

The HlmsUnlit actually generates same "empty" pixel shaders as my CustomHlms:

The Hlms should be setting hlms_disable_stage. The Hlms will continue to parse and generate a shader, but if by then hlms_disable_stage is still non-zero, it won't compile it or use it at all. The generated shader will be discarded (but still dumped to disk for debugging the output in case its necessary).

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

I think I am getting closer and closer the result which I had initially in mind. If you find some time, you can check the current version in my git. The CustomHlms supports only a simple color (as Ogre::Vector4) and in the app there are 16 items with different colors set through the datablocks.

I like following points in the implementation:

  • most of the logic is placed in HlmsExt - I extracted the logic from HlmsPbs and HlmsUnlit
  • the specification for the used buffers are placed in CustomHlms::CustomHlms
  • the binding of the buffers is automatic through HlmsExt and there is no needed to take care about it in CustomHlms

I still don't like some details:

  • requesting of the slot is necessary to do explicitly in datablock implementation (constructor of CustomHlms)
  • necessity of releasing the slot from ConstBufferPoolUser in my datablock - why something like that is not done in the ConstBufferPoolUser destructor?
  • connection between HlmsExt and the ConstBufferPool is not automatic - everything can be defined by the datablock class - e.g. if HlmsExt would be a template class, it is possible to implement there a lot of logic basic on the type of the used datablock (for now I avoid such complex templates - but I wanted to mention the idea). The current implementation expects the CustomHlms does the connection between datablocks and the ConstBufferPool
  • uploading of data from the datablocks is not nice by giving direct access to raw mapped memory - I would prefer the interface like I created in HlmsBufferPool with writeData, insertValue and insertValues - then the execution is safer as at least in debug it is possible to assert when buffer overflows (writing more data than specified by number of bytes per slot).

I didn't get proper rendering of shadows from shadow casters - I think I copies everything which is used by HlmsUnlit, but it seems that I am still missing something as the rendering of the shadows is identical to the screenshot which I sent in one of my previous post.

User avatar
bishopnator
Goblin
Posts: 299
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 11

Re: HlmsBufferManager binds const buffers to vs and ps

Post by bishopnator »

The current output:
Image

Note that I still didn't write GLSL shaders so only d3d11 render system works.