Bind uav buffers in more dynamic way

Discussion area about developing with Ogre-Next (2.1, 2.2 and beyond)


bishopnator
Gnome
Posts: 327
Joined: Thu Apr 26, 2007 11:43 am
Location: Slovakia / Switzerland
x 15

Bind uav buffers in more dynamic way

Post by bishopnator »

Hi guys, is it possible to bind UAV buffers in HLMS in a similar manner to read-only buffers? I would like to write some data in the pixel shader using RWStructuredBuffer, but the setup should be similar to the preparation of read-only buffers: in fillBuffersForV2 I will reserve some space in the current UAV buffer and store the offset and size in my instance data. If there is no free space in the UAV buffer, I want to create a new UAV buffer, bind it and reserve space there, etc. (so a UAV buffer pool). I see that UAV buffers are bound through compositors, but in a more static way: I have to specify a UAV buffer for the whole rendering, which is not what I need. In OpenGL it is definitely possible using SSBO. Is D3D11 limited here? How can I bind a buffer to an RWStructuredBuffer by myself? Is there another read/write buffer type in Ogre which I should use?

note: I would like to render the scene (just some of the objects, e.g. filtered by render queue id) in one pass. There won't be any color output (all pixels discarded), but the pixel shaders will write to the RWStructuredBuffer. In the next pass the scene is rendered again, but now the shader(s) will read the written data (read-only now). I also considered compute shaders, but I need to process data from vertex/index buffers, which doesn't seem to be supported by Ogre (they need special bind flags to be bound to compute shaders, and Ogre doesn't set them).

note2: It seems that it is possible to call OMSetRenderTargetsAndUnorderedAccessViews with D3D11_KEEP_RENDER_TARGETS_AND_DEPTH_STENCIL to just bind new UAVs, but it is not used in Ogre at all. What am I missing here?
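
The pooling scheme described in this post (reserve a range in the current UAV buffer, move on to a new buffer when it is full) can be sketched independently of Ogre. This is a minimal CPU-side model, not OgreNext code: `UavBufferPool`, `UavReservation` and their members are hypothetical names, each pooled "buffer" is only a byte counter, and alignment requirements of real structured buffers are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result of a reservation: which pooled buffer, and where inside it.
struct UavReservation
{
    size_t bufferIdx;
    size_t offsetBytes;
    size_t sizeBytes;
};

// Hypothetical pool; each entry of mUsedBytes stands in for one GPU UAV buffer.
class UavBufferPool
{
public:
    explicit UavBufferPool( size_t bufferCapacityBytes ) :
        mCapacityBytes( bufferCapacityBytes ),
        mCurrent( 0u )
    {
    }

    // Bump-allocates sizeBytes from the current buffer. When the request does not
    // fit, moves on to the next buffer (creating it if needed). A real
    // implementation would also have to bind that buffer before the next draw.
    UavReservation reserve( size_t sizeBytes )
    {
        assert( sizeBytes <= mCapacityBytes && "request exceeds one pooled buffer" );
        if( mUsedBytes.empty() )
            mUsedBytes.push_back( 0u );
        if( mUsedBytes[mCurrent] + sizeBytes > mCapacityBytes )
        {
            ++mCurrent;
            if( mCurrent == mUsedBytes.size() )
                mUsedBytes.push_back( 0u );
        }

        const UavReservation result = { mCurrent, mUsedBytes[mCurrent], sizeBytes };
        mUsedBytes[mCurrent] += sizeBytes;
        return result;
    }

    // Makes every buffer reusable again, e.g. at the start of a frame.
    void reset()
    {
        for( size_t i = 0u; i < mUsedBytes.size(); ++i )
            mUsedBytes[i] = 0u;
        mCurrent = 0u;
    }

    size_t getNumBuffers() const { return mUsedBytes.size(); }

private:
    size_t mCapacityBytes;
    size_t mCurrent;
    std::vector<size_t> mUsedBytes;
};
```

The returned `bufferIdx` and `offsetBytes` are what would go into the per-instance data; a change of `bufferIdx` between two reservations is the point where a new UAV would have to be bound mid-pass.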

dark_sylinc
OGRE Team Member
Posts: 5476
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1359

Re: Bind uav buffers in more dynamic way

Post by dark_sylinc »

Is D3D11 limited here?

Yes.

The problem is that D3D11 is a "safe" API (i.e. compared to D3D12 and Vulkan, where you can do whatever you want and if it's not safe from race conditions, it's your problem) and UAVs imply write.

To put it into context, D3D11 implicitly issues a full barrier between each compute dispatch.

In the context of Graphics (i.e., not Compute), traditionally the only Write operations were to the Colour, Depth and Stencil targets. Everything else was read only.

D3D11 needs to perform various checks. For example feedback loops are forbidden: All bound textures must not be equal to the currently bound Colour, Depth or Stencil targets. Depth target has an exception if depth writes are disabled and the depth buffer was bound with D3D11_DSV_READ_ONLY_DEPTH (the same for Stencil).

When UAVs were introduced to Graphics, an arbitrary number of textures ended up in the same situation as regular colour targets: a texture can't be bound as a UAV and as a Colour Target at the same time, and no texture can be bound for sampling while it is bound as a UAV. The same goes for buffers, which can't be bound as an SRV at the same time.

This also means that if you want to switch the UAV to read only, it must be unbound as UAV, and bound to read/sample. That will force D3D11 to issue a barrier.

From the perspective of hazard tracking, it makes sense that Microsoft decided to put UAV binding together with OMSetRenderTargets: they are so similar that all the hazard tracking from RenderTargets can be reused for UAVs.

But it also meant D3D11's interface for binding UAVs is clunky and quite expensive due to all the hazard tracking and validation.

The truth is that even if we move to D3D12/Vulkan, that awkwardness from D3D11 is gone only because the responsibility was shifted to us. From our perspective, we're met with the exact same problems D3D11 had. It gets worse if mobile/TBDR are considered (render passes must be closed to issue the barrier, which is problematic if the pass has load + store actions that aren't Load and Store respectively), which is why iOS for a long time didn't support binding UAVs to graphics, and in our Vulkan backend IIRC we don't support all types of UAV bindings for Graphics.

This is why OIT algorithms like per-pixel linked lists are still uncommon in modern day rasterizers: it's a PITA to set up the UAV. We have to correctly guess the space beforehand (and risk glitching if we run out of space, or go the extra mile at engine design to support adding more UAVs mid render).

The general belief is that operations that require UAVs are best rewritten as compute shaders when possible.

Due to all these hurdles, I didn't focus too much on improving UAVs in Graphics. There are possibly better approaches (and maybe even unknown solutions?) than what we're doing right now. I just gave up due to difficulty + lack of demand.

If you're wondering what the difference with Compute is, I explained recently in a Reddit post that Graphics follows a set of ordering rules that Compute does not have to follow.
It's also easier to remap the discrepancies between OpenGL and D3D11 in HlmsComputeJob (see next block).

In OpenGL it is definitely possible using SSBO. Is D3D11 limited here?

Another issue I forgot to mention is that OpenGL and D3D11 differ vastly in how UAVs are bound, OpenGL being a lot more flexible (OpenGL will happily let you glitch the render if you cause a feedback loop). But the problem is that the slots don't even match.

D3D11 puts both buffers and textures in slots prefixed with u, so u0 and u1 could be a buffer followed by a texture. OpenGL, on the other hand, uses separate binding points for buffers, but UAV textures share binding slots with regular textures. This discrepancy between APIs is incredibly infuriating.

Nowadays I created the concept of RootLayouts for Vulkan, and RootLayouts would easily solve those discrepancies if we used RootLayouts on D3D11 and OpenGL as well; but that class did not exist when our Graphics UAV binding model was written (which begrudgingly ended up using the Compositor).

RootLayouts would solve the problem because they create fake/virtual binding points that are later remapped to the HW slots.
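
To make the idea of virtual binding points concrete, here is a small sketch of such a remapping. It is not OgreNext's RootLayout class or API: `remapUavSlots`, `HwBinding` and the parameters are hypothetical, and the slot rules are only a model of the description above (D3D11: buffers and textures share the u-slots; OpenGL: buffers get their own binding points, UAV textures are placed after the regular textures).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class UavKind { Buffer, Texture };
enum class TargetApi { D3D11, OpenGL };

// Hypothetical hardware binding produced by the remap.
struct HwBinding
{
    UavKind kind;
    uint32_t slot;
};

// The index into virtualUavs is the virtual slot the shader author sees.
// firstUavSlot: first free u-slot on D3D11 (assumption: lower ones are taken by
// render targets). numSampledTextures: regular textures already bound on OpenGL.
std::vector<HwBinding> remapUavSlots( const std::vector<UavKind> &virtualUavs, TargetApi api,
                                      uint32_t firstUavSlot, uint32_t numSampledTextures )
{
    std::vector<HwBinding> result;
    result.reserve( virtualUavs.size() );

    uint32_t nextD3D11Slot = firstUavSlot;            // shared by buffers and textures
    uint32_t nextGlBufferSlot = 0u;                   // separate binding points
    uint32_t nextGlTextureSlot = numSampledTextures;  // shared with regular textures

    for( size_t i = 0u; i < virtualUavs.size(); ++i )
    {
        HwBinding binding;
        binding.kind = virtualUavs[i];
        if( api == TargetApi::D3D11 )
            binding.slot = nextD3D11Slot++;
        else if( virtualUavs[i] == UavKind::Buffer )
            binding.slot = nextGlBufferSlot++;
        else
            binding.slot = nextGlTextureSlot++;
        result.push_back( binding );
    }
    return result;
}
```

With the virtual layout {buffer, texture, buffer}, one UAV-free render target and four sampled textures, this model yields u1, u2, u3 on D3D11, and buffer binding 0, texture slot 4, buffer binding 1 on OpenGL, which is exactly the mismatch the post complains about.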


Re: Bind uav buffers in more dynamic way

Post by dark_sylinc »

TL;DR: I'm not happy with how UAVs in Graphics turned out in OgreNext. And a lot of the problems could perhaps be solved if we used RootLayouts everywhere, or we ignored OpenGL and D3D11 and focused on the newer APIs.


Re: Bind uav buffers in more dynamic way

Post by bishopnator »

I was thinking about the following: in CompositorPassScene::execute there is a call (not direct, but it is there) to OMSetRenderTargetsAndUnorderedAccessViews, and later from the same method the whole scene is rendered, which is where HLMS actually collects the objects and prepares the command buffer for execution. If we set up a "table" of bound UAVs to track what is actually bound, then later, when a request to bind a new UAV comes during HLMS processing, we update the "table" and refresh the binding of all UAVs (store a new command in the command buffer with a copy of the UAV table, which calls OMSetRenderTargetsAndUnorderedAccessViews when executed). Could something like this work?
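
The "table plus snapshot command" idea can be modelled without any graphics API. In this sketch all names (`UavBindingRecorder`, `RebindUavsCommand`, `UavHandle`, `kMaxUavSlots`) are hypothetical and nothing here is OgreNext code; each recorded command carries its own copy of the table, so executing it later does not depend on what the table looks like by then.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef uint32_t UavHandle;       // hypothetical handle; 0 means "nothing bound"
const size_t kMaxUavSlots = 8u;   // assumed slot count, for illustration only
typedef std::array<UavHandle, kMaxUavSlots> UavTable;

// Recorded command: rebinding all UAVs from a snapshot of the table.
struct RebindUavsCommand
{
    UavTable snapshot;
};

class UavBindingRecorder
{
public:
    UavBindingRecorder() { mTable.fill( 0u ); }

    // Called while commands are being recorded. Returns true when a rebind
    // command had to be recorded, false when the slot already held this UAV.
    bool requestUav( size_t slot, UavHandle uav )
    {
        assert( slot < kMaxUavSlots );
        if( mTable[slot] == uav )
            return false;

        mTable[slot] = uav;
        RebindUavsCommand cmd;
        cmd.snapshot = mTable;  // copy, so later table changes don't affect it
        mCommands.push_back( cmd );
        return true;
    }

    const std::vector<RebindUavsCommand> &getCommands() const { return mCommands; }

private:
    UavTable mTable;
    std::vector<RebindUavsCommand> mCommands;
};
```

At execution time, each `RebindUavsCommand` would translate into one call that rebinds the whole snapshot; the model deliberately says nothing about barriers or about unbinding conflicting read-only views.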

Now a little explanation of what I would like to achieve. Currently I am able to convert a mesh to thick lines with a line pattern (in the image there are 2 line patterns and different thicknesses on the boxes):
[Image: boxes rendered as thick lines with two line patterns and different thicknesses]
For the line patterns, it is necessary to know the pattern distance along the path. I prepare the mesh for that statically: I create a graph from all visible edges and extract the "longest" paths, for which I initialize the distances from the beginning (1st vertex) to the end of the path. Even a mesh as simple as a box must have multiple paths encoded. This results in line patterns in world space.

I would also like to support line patterns in screen space. For that I need to recalculate the paths dynamically for the current camera and for each instance of the mesh individually. As I know how many paths there are and how big they are (in number of edges), during the preparation phase in fillBuffersForV2 I can check my current UAV, reserve space at its end and store those offsets in my instance data. If there is not enough space in the UAV, I have to take a new one (from the pool, or allocate it), issue new bindings in the command buffer and continue. The rendering will store a screen-space distance for each edge in the UAV. This will be done in the 1st compositor scene pass. In the 2nd compositor scene pass I will bind those UAVs as read-only and read the precalculated distances in the shaders. In my mesh (in a separate buffer, which I call TriangleDataBuffer) I store the paths of edges like a linked list: each edge in TriangleDataBuffer will store its predecessor, so I can accumulate the screen-space distance for the current edge in the geometry shader. In the preprocessing phase (on the CPU, when the mesh is loaded and prepared for my HLMS) I can statically limit the number of edges in a path. If you consider e.g. a very finely tessellated circle, there is only a single, very long path; for performance I can break it up, so there will be a couple of glitches in the applied line pattern.
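
As a CPU-side illustration of the two passes described above, the following sketch first computes a screen-space length per edge (the data the first pass would write to the UAV) and then walks the predecessor links to get the distance at the start of an edge (what the second pass would do). It is one reading of the post, not the author's shader code; `PathEdge`, `ScreenPos` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ScreenPos
{
    float x, y;
};

// One edge of a path. predecessor is the index of the previous edge in the
// same path, or -1 for the first edge (the linked-list layout from the post).
struct PathEdge
{
    uint32_t v0, v1;
    int32_t predecessor;
};

// Stand-in for pass 1: one screen-space length per edge.
std::vector<float> computeEdgeLengths( const std::vector<PathEdge> &edges,
                                       const std::vector<ScreenPos> &positions )
{
    std::vector<float> lengths( edges.size() );
    for( size_t i = 0u; i < edges.size(); ++i )
    {
        const ScreenPos &a = positions[edges[i].v0];
        const ScreenPos &b = positions[edges[i].v1];
        lengths[i] = std::hypot( b.x - a.x, b.y - a.y );
    }
    return lengths;
}

// Stand-in for pass 2: distance from the path start to the start of edgeIdx.
// The loop length equals the number of predecessors, which is why capping the
// number of edges per path bounds the per-edge cost.
float distanceAtEdgeStart( const std::vector<PathEdge> &edges,
                           const std::vector<float> &lengths, size_t edgeIdx )
{
    float distance = 0.0f;
    for( int32_t cur = edges[edgeIdx].predecessor; cur >= 0;
         cur = edges[static_cast<size_t>( cur )].predecessor )
    {
        distance += lengths[static_cast<size_t>( cur )];
    }
    return distance;
}
```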

I considered compute shaders from every perspective and didn't come up with a reasonable solution: the number of objects is dynamic, I cannot limit the size of the UAV, and I definitely need multiple UAVs bound dynamically according to the content of the scene. I would need to bind vertex/index buffers in the compute shader to iterate over the vertices and triangles and compute screen-space distances there, but Ogre doesn't set some of the bind flags the vb/ib need, so I discarded the idea.

No other solution came to me for supporting screen-space line patterns (and hence for computing the screen-space distances along the paths on the GPU).


Re: Bind uav buffers in more dynamic way

Post by bishopnator »

I am playing further with the code and wanted to share some ideas with you. Maybe you can point out some crucial details why it cannot work, or where there are potential critical problems.

Adding RenderSystem::_setUavPS with all the necessary overrides:
OgreRenderSystem.h

Code:

virtual void _setUavPS( uint32 slotStart, const DescriptorSetUav *set ) = 0;

OgreD3D11RenderSystem.cpp

Code:

    //---------------------------------------------------------------------
    void D3D11RenderSystem::_setUavPS( uint32 slotStart, const DescriptorSetUav *set )
    {
        ComPtr<ID3D11UnorderedAccessView> *uavList =
            reinterpret_cast<ComPtr<ID3D11UnorderedAccessView> *>( set->mRsData );
        ID3D11DeviceContextN *context = mDevice.GetImmediateContext();
        // Keep the current render targets and depth-stencil; only replace the UAV slots.
        context->OMSetRenderTargetsAndUnorderedAccessViews(
            D3D11_KEEP_RENDER_TARGETS_AND_DEPTH_STENCIL, nullptr, nullptr, slotStart,
            static_cast<UINT>( set->mUavs.size() ), uavList[0].GetAddressOf(), 0 );
    }

OgreGL3PlusRenderSystem.cpp

Code:

    void GL3PlusRenderSystem::_setUavPS( uint32 slotStart, const DescriptorSetUav *set )
    {
        // setting the buffers in pixel shader is actually identical to setting the buffers in compute shaders
        _setUavCS( slotStart, set );
    }

OgreNULLRenderSystem.h

Code:

void NULLRenderSystem::_setUavPS( uint32 slotStart, const DescriptorSetUav *set ) {}

The main idea in the above code snippets is to expose the OMSetRenderTargetsAndUnorderedAccessViews call with the D3D11_KEEP_RENDER_TARGETS_AND_DEPTH_STENCIL value.

Now it is necessary to make the new RenderSystem method callable from CommandBuffer. Adding a new member to CommandBuffer:

Code:

static CommandBufferExecuteFunc execute_setUavsPS;

And a new value in CbType in OgreCbCommon, at the end just before MAX_COMMAND_BUFFER:

Code:

CB_SET_UAVS_PS

Adding CbSetUavs:
OgreCbSetUavs.h

Code:

#ifndef _OgreCbSetUavs_H_
#define _OgreCbSetUavs_H_

#include "CommandBuffer/OgreCbCommon.h"

namespace Ogre
{
    struct _OgreExport CbSetUavs : public CbBase
    {
        uint32 slotStart;
        const DescriptorSetUav *set;
        CbSetUavs( uint32 _slotStart, const DescriptorSetUav *_set );
    };
}  // namespace Ogre
#endif

OgreCbSetUavs.cpp

Code:

#include "OgreStableHeaders.h"

#include "CommandBuffer/OgreCbSetUavs.h"
#include "CommandBuffer/OgreCommandBuffer.h"
#include "OgreRenderSystem.h"

namespace Ogre
{
    void CommandBuffer::execute_setUavsPS( CommandBuffer *_this, const CbBase *RESTRICT_ALIAS _cmd )
    {
        const CbSetUavs *cmd = static_cast<const CbSetUavs *>( _cmd );
        _this->mRenderSystem->_setUavPS( cmd->slotStart, cmd->set );
    }

    CbSetUavs::CbSetUavs( uint32 _slotStart, const DescriptorSetUav *_set ) :
        CbBase( CB_SET_UAVS_PS ),
        slotStart( _slotStart ),
        set( _set )
    {
    }
}  // namespace Ogre

In the HLMS implementation there will be a DescriptorSetUav member which tracks the current binding of UAVs. Every time it is necessary to bind a new UAV (or multiple UAVs), this descriptor is updated and a pointer obtained from HlmsManager is returned:

Code:

//////////////////////////////////////////////////////////////////////////
const DescriptorSetUav* HlmsExt::updateDescriptorUavSet(uint16_t slot, UavBufferPacked& buffer, size_t bindOffset, size_t sizeBytes)
{
    if (slot >= mDescriptorSetUav.mUavs.size())
        mDescriptorSetUav.mUavs.resize(slot + 1);

    mDescriptorSetUav.mUavs[slot].slotType = DescriptorSetUav::SlotTypeBuffer;
    mDescriptorSetUav.mUavs[slot].getBuffer().makeEmpty();
    mDescriptorSetUav.mUavs[slot].getBuffer().buffer = &buffer;
    mDescriptorSetUav.mUavs[slot].getBuffer().offset = bindOffset;
    mDescriptorSetUav.mUavs[slot].getBuffer().sizeBytes = sizeBytes;
    mDescriptorSetUav.mUavs[slot].getBuffer().access = ResourceAccess::Write;
    return mHlmsManager->getDescriptorSetUav(mDescriptorSetUav);
}

I know the initialization of the currently bound UAVs from CompositorPassUav is missing, but my idea is to avoid using that pass: rather bind my own UAVs as needed from fillBuffersForV2 and let the RenderSystem manage the bindings properly. The idea is that the UAV buffers are always bound for writing (ResourceAccess::Write in the above code), and later, from another CompositorPassScene, the UAVs will be bound for reading using UavBufferPacked::getAsTexBufferView or getAsReadOnlyBufferView. A memory barrier must be issued between those 2 CompositorPassScene passes (if not issued automatically by Ogre; I didn't check this yet), and there must be 2 scene passes (one for writing to the UAVs and another for reading them).

If necessary, I can put the changes in a separate branch for a better overview.


Re: Bind uav buffers in more dynamic way

Post by dark_sylinc »

It looks about right.

The main issues I see are three:

  1. I don't remember the details about slotStart. That was a PITA and was about making OpenGL & D3D11 somewhat more compatible. I might be overreacting here.
  2. In GL you need to unbind the previous UAVs. GL3PlusRenderSystem::flushUAVs is in charge of that, but watch out, it might produce unwanted side effects.
  3. You're ignoring memory barriers, which are important for all APIs except D3D11. This is the biggest issue. As you've laid out the API right now, nothing in OgreNext notices that the UAV slots changed. For example, if you bind textureA as a UAV and now want to sample from it, you can't do that. You must call endRenderPassDescriptor() first. There's nothing preventing this or informing you of this (though in Vulkan you'll probably get an Exception if you try to do it wrong; or, if all else failed to notice the mistake, a Vulkan Validation Layer error). Also, if you previously bound textureA for sampling, a barrier must transition it to a UAV before it can be bound, and this transition must happen before beginRenderPassDescriptor. UAV buffers don't need to transition, but they still need a barrier to prevent a pass from writing into a buffer while another pass is still reading from it.

In the current API, HlmsComputeJob::analyzeBarriers makes sure UAV barriers are taken into account for Compute, and CompositorPass::analyzeBarriers makes sure UAV dependencies are accounted for via mDefinition->mUavDependencies.

Except D3D11 (which issues a full barrier between operations), you can think of GPUs as devices that execute all passes in parallel unless you explicitly tell them to wait before starting the next set of passes. That's why UAVs are so bothersome. Having arbitrary write access needs to be dealt with or else you get into race conditions.
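
The rule of thumb behind the barrier discussion (a pass must wait whenever it touches something a previous pass wrote, or writes something a previous pass read) can be sketched as a tiny tracker. This is a simplified model, not OgreNext's BarrierSolver: `BarrierTracker`, `Access` and the resource ids are hypothetical, and texture layout transitions are left out entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

enum class Access { Undefined, Read, Write };

class BarrierTracker
{
public:
    // Declares how the next pass uses a resource. Returns true when a barrier
    // is needed before that pass may start:
    //   Write -> Read   the reader must see the finished writes
    //   Write -> Write  the writes must not overlap
    //   Read  -> Write  the writer must not overtake a pass still reading
    // Read -> Read, and the first use of a resource, need no barrier here.
    bool declareUse( uint32_t resourceId, Access next )
    {
        Access &last = mLastAccess[resourceId];  // new entries start as Undefined
        const bool needsBarrier =
            last == Access::Write || ( last == Access::Read && next == Access::Write );
        last = next;
        return needsBarrier;
    }

private:
    std::unordered_map<uint32_t, Access> mLastAccess;
};
```

In the two-pass setup from this thread, the write pass followed by the read pass hits the Write -> Read case, and the next frame's write pass hits Read -> Write, so both directions need a barrier on APIs that do not insert one implicitly.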


Re: Bind uav buffers in more dynamic way

Post by bishopnator »

  1. I will check the usage of slotStart once I have a first running example using the new dynamic approach to UAVs.
  2. I will test the GL3Plus RS at the end, because I am using automatic conversion of HLSL shaders to GLSL, which so far works perfectly, but whenever I use something new I have to revisit the conversion (it goes HLSL -> SPIR-V -> GLSL).
  3. I looked deeper into memory barriers and how I can smoothly integrate them into Hlms. I think everything is doable in an overridden Hlms::analyzeBarriers. In my Hlms implementation I extracted the creation of buffers into helper classes (pools), so I have better control over the creation and binding of the buffers. Everything is meant to be prepared in the constructor: the implementation creates the desired buffer pools and also states the bindings (slots and shader stages). For a UAV buffer pool it is necessary to state the list of scene passes where it is active and the type of access (read / write). With this information it is possible to implement the override of analyzeBarriers in the base class (I call it HlmsExt), because, according to the currently executed scene pass, it can notify the barrier solver about the usage of all buffers stored in the pools. The compositor scripts must then identify the scene passes that HlmsExt will access (by setting 'identifier' in the compositor script).
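
The per-pool declaration described in point 3 (which scene passes use the pool, and with which access) could be as small as a lookup keyed by the pass identifier. The sketch below is only an illustration of that bookkeeping; `UavPoolPassUsage` and `PoolAccess` are hypothetical names, not part of OgreNext or of the HlmsExt code discussed here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

enum class PoolAccess { None, Read, Write };

// Hypothetical record kept per UAV pool, filled once (e.g. in a constructor).
class UavPoolPassUsage
{
public:
    // passIdentifier is assumed to match the 'identifier' of a scene pass
    // in the compositor script.
    void declare( uint32_t passIdentifier, PoolAccess access )
    {
        mAccessByPass[passIdentifier] = access;
    }

    // Looked up while analyzing barriers for the pass about to execute.
    // Passes that were never declared do not touch the pool.
    PoolAccess getAccess( uint32_t passIdentifier ) const
    {
        std::map<uint32_t, PoolAccess>::const_iterator it = mAccessByPass.find( passIdentifier );
        return it != mAccessByPass.end() ? it->second : PoolAccess::None;
    }

private:
    std::map<uint32_t, PoolAccess> mAccessByPass;
};
```

An analyzeBarriers override would query `getAccess()` with the identifier of the current pass and report every buffer in the pool to the barrier solver with that access, skipping pools that return `PoolAccess::None`.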

I think in analyzeBarriers I can also access the current state of the DescriptorSetUav, which is initialized by the CompositorPassUav instances preceding the currently executed CompositorPassScene. The Hlms implementation can then set up additional bindings of some UAVs, and if the slots do not interfere, the binding should be correct and properly updated.

edited: Is there a reason why the sceneManager->_setCurrentCompositorPass( this ) call comes after analyzeBarriers in CompositorPassScene? I would need to access the currently executed compositor scene pass from my Hlms::analyzeBarriers override.


Re: Bind uav buffers in more dynamic way

Post by bishopnator »

I am still trying to get a running sample with my customized UAV bindings. In the second frame, the D3D11RenderSystem complains at the context->OMSetRenderTargetsAndUnorderedAccessViews call that the UAV buffer is still bound as input. It seems that some synchronization point is missing and D3D11 is not doing it automatically.

I have two CompositorPassScene passes with assigned identifiers, so my HLMS implementation can react to those passes.

  1. pass A activates the UAV in the DescriptorSetUav, notifies the BarrierSolver that the resource will be used for writing, and stores my new CbSetUavs command in the CommandBuffer (the UAV buffer is bound to u1, as the first slot is the render target's color buffer).
  2. pass B notifies the BarrierSolver that the buffer will be used for reading and binds it as a read-only buffer (it is bound to slot t3).
  3. (new frame) pass A is activated again, and here OMSetRenderTargetsAndUnorderedAccessViews complains that the buffer is still bound as input in t3.

In the debugger I see the execution of D3D11RenderSystem::endRenderPassDescriptor, which resets the slots (but not all of them). D3D11ReadOnlyBufferPacked::bindBufferXX doesn't update D3D11RenderSystem::mMaxSrvCount. Is that correct? How can I automatically unbind the slots between the passes?

note: D3D11ReadOnlyBufferPacked::bindBufferXX is called from the execute_setReadOnlyBufferXX functions, and those commands are regularly used by Ogre.


Re: Bind uav buffers in more dynamic way

Post by bishopnator »

Just some additional notes here regarding D3D11RenderSystem (I didn't try it with GL3PlusRenderSystem yet). mMaxSrvCount is updated only from D3D11RenderSystem::setTexture and D3D11RenderSystem::setTextures, and the resources are automatically unbound in D3D11RenderSystem::beginRenderPassDescriptor() / endRenderPassDescriptor().

The CbShaderBuffer commands issued from HLMS don't update mMaxSrvCount. I suppose the idea here is that if there are only objects from a single HLMS, the bound resources can be reused between frames; however, they can be unbound by Ogre without the HLMS knowing about it. It works now because in almost all cases both the unlit and pbs HLMS are active (UI vs scene), so the HLMS gets notified that the HLMS changed (through the lastCacheHash input passed to fillBuffersForV2) and rebinds the resources.

Now, moving to my case with dynamic binding of UAVs, I can pin the problem down more closely: between the scene passes I have to unbind the UAV's read-only buffer view. If I do it when the HLMS changes, there will be a problem in scenes where all objects are rendered with only the custom HLMS. If I just forcibly unbind the slot where my UAV's read-only buffer view is bound, I risk unbinding another buffer (if another HLMS bound something there in the meantime).

Just thinking out loud: wouldn't it be possible/feasible to have some kind of slot binding management in D3D11RenderSystem, which would make it easy to track what is bound where and also avoid useless bindings when the same buffer is already bound at the given slot? UAVs should also track where they are bound as a read-only buffer or texture buffer, and when one is about to be bound as a UAV (OMSetRenderTargetsAndUnorderedAccessViews), Ogre would have to unbind its read-only and texture buffer views. Right now, avoiding unnecessary bindings is handled separately in different classes and different HLMS implementations (even in my implementation I had to take care of it).
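
A rough model of the slot binding management proposed above: one place that remembers what sits in each read-only (t) and UAV (u) slot, skips redundant binds, and clears read-only views of a resource before that resource is bound as a UAV. Everything here is hypothetical (`SlotBindingTracker`, `ResourceId`, the slot counts); it only counts the API calls it would issue and is not D3D11RenderSystem code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef uint32_t ResourceId;  // hypothetical id; 0 means "slot is empty"

class SlotBindingTracker
{
public:
    SlotBindingTracker() : mNumApiCalls( 0u )
    {
        mSrv.fill( 0u );
        mUav.fill( 0u );
    }

    // Binds a read-only view; does nothing if the slot already holds it.
    void bindSrv( size_t slot, ResourceId resource )
    {
        assert( slot < mSrv.size() );
        if( mSrv[slot] == resource )
            return;
        mSrv[slot] = resource;
        ++mNumApiCalls;
    }

    // Binds a UAV. Any read-only view of the same resource is unbound first,
    // so the write binding never coexists with a read binding.
    void bindUav( size_t slot, ResourceId resource )
    {
        assert( slot < mUav.size() );
        if( resource != 0u )
        {
            for( size_t i = 0u; i < mSrv.size(); ++i )
            {
                if( mSrv[i] == resource )
                {
                    mSrv[i] = 0u;
                    ++mNumApiCalls;
                }
            }
        }
        if( mUav[slot] == resource )
            return;
        mUav[slot] = resource;
        ++mNumApiCalls;
    }

    ResourceId getSrv( size_t slot ) const { return mSrv[slot]; }
    ResourceId getUav( size_t slot ) const { return mUav[slot]; }
    uint32_t getNumApiCalls() const { return mNumApiCalls; }

private:
    std::array<ResourceId, 16> mSrv;  // assumed number of t-slots
    std::array<ResourceId, 8> mUav;   // assumed number of u-slots
    uint32_t mNumApiCalls;
};
```

Replaying the failing sequence from the earlier post (read-only bind to t3 in pass B, UAV bind to u1 in the next frame's pass A) leaves t3 empty before u1 is set. The symmetric case, binding a read-only view of something still bound as a UAV, is not handled in this sketch.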