Page 1 of 1

setNumThreadGroupsBasedOn function for compute shaders [SOLVED]

Posted: Sat Nov 09, 2019 5:51 pm
by SolarPortal
Hey, me again with more tweaks that i feel were missing from the source.
This time its related to the setNumThreadGroupsBasedOn() function which makes it easy to set the num thread groups for the job based on a texture or uav size. However, when i was working with mipmaps it was not functioning as expected as it did not take the mip level into account and also the number of thread groups that came out at the end was always "1,1,(depth)".
So i made a few tweaks that make it much easier to work with uavs and compute passes that need to edit each mip level without having to change the thread groups per mip resolution.

so ill get straight to the tweaks. All the changes are based in "OgreHlmsComputeJob.cpp" on function "_calculateNumThreadGroupsBasedOnSetting()"

after this line,

Code: Select all

const TexturePtr &tex = texSlots[mThreadGroupsBasedOnTexSlot].texture;
i added:

Code: Select all

const int32 mipLevel = texSlots[mThreadGroupsBasedOnTexSlot].mipmapLevel;
i then altered the resolution section to accomadate the changes:

Code: Select all

      
       uint32 resolution[3];
       resolution[0] = tex->getWidth() >> mipLevel;
       resolution[1] = tex->getHeight() >> mipLevel;
       resolution[2] = tex->getDepth();// >> mipLevel;
And finally to make sure that the correct number of threads are created based on the divisor factor i changed,

Code: Select all

                    resolution[i] = (resolution[i] + mThreadGroupsBasedDivisor[i] - 1u) /
                                    mThreadGroupsBasedDivisor[i];

                    uint32 numThreadGroups = (resolution[i] + mThreadsPerGroup[i] - 1u) /
                                             mThreadsPerGroup[i];

to this: (as what was there simply did not make any sense and i believe it was a bug)

Code: Select all

uint32 numThreadGroups = (resolution[i] + mThreadGroupsBasedDivisor[i] - 1u) / mThreadGroupsBasedDivisor[i];
finally here is my version of the function which is working nicely on binding a uav's different mipmaps:

Code: Select all

    void HlmsComputeJob::_calculateNumThreadGroupsBasedOnSetting()
    {
        bool hasChanged = false;

        if( mThreadGroupsBasedOnTexture != ThreadGroupsBasedOnNothing )
        {
            const TextureSlotVec &texSlots = mThreadGroupsBasedOnTexture == ThreadGroupsBasedOnTexture ?
                        mTextureSlots : mUavSlots;

            if( mThreadGroupsBasedOnTexSlot < texSlots.size() &&
                !texSlots[mThreadGroupsBasedOnTexSlot].texture.isNull() )
            {
                const TexturePtr &tex = texSlots[mThreadGroupsBasedOnTexSlot].texture;
				const int32 mipLevel = texSlots[mThreadGroupsBasedOnTexSlot].mipmapLevel;

                uint32 resolution[3];
                resolution[0] = tex->getWidth() >> mipLevel;
                resolution[1] = tex->getHeight() >> mipLevel;
		resolution[2] = tex->getDepth();// >> mipLevel;

                if( tex->getTextureType() == TEX_TYPE_CUBE_MAP )
                    resolution[2] = tex->getNumFaces();

                for( int i=0; i<3; ++i )
                {
                    uint32 numThreadGroups = (resolution[i] + mThreadGroupsBasedDivisor[i] - 1u) / mThreadGroupsBasedDivisor[i];

                    if( mNumThreadGroups[i] != numThreadGroups )
                    {
                        mNumThreadGroups[i] = numThreadGroups;
                        hasChanged = true;
                    }
                }
            }
            else
            {
                LogManager::getSingleton().logMessage(
                            "WARNING: No texture/uav bound to compute job '" + mName.getFriendlyText() +
                            "' at slot " + StringConverter::toString(mThreadGroupsBasedOnTexSlot) +
                            " while calculating number of thread groups based on texture");
            }
        }

        if( hasChanged )
            mPsoCacheHash = -1;
    }
Hope this makes sense and helps someone. :D
Now all i need to figure out is how to bind a constant buffer and change the value of a property per mip map. :P

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Sat Nov 09, 2019 6:31 pm
by dark_sylinc
Sounds like you would greatly benefit from Ogre 2.2

The issues you're having all are texture related, which were fixed in 2.2 8)

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Sat Nov 09, 2019 6:33 pm
by dark_sylinc
SolarPortal wrote: Sat Nov 09, 2019 5:51 pm Now all i need to figure out is how to bind a constant buffer and change the value of a property per mip map. :P
I want to make this easier when using the compositor. But for the time being, solutions like that one have to either rely on a custom compositor pass (like the mipmapping filter does) or manually using compute (which is much more flexible may be a bit hard if you're unaware of how barriers work and when are they needed)

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Sat Nov 09, 2019 9:51 pm
by SolarPortal
I see :) We have been thinking of moving to 2.2 but with the work involved to change the texturing system over to match, it is just to much work for the time being. But definitely will be moving to 2.2 at some point in the near future.

I think i have a solution using the compositor pass listeners and accessing the passEarlyPreExecute and changing the single shared constant buffer data per pass which means the compute job only has the one buffer to manage but each pass could access and change the data to match. Unless there is something wrong with doing it that way.. only recently got into compute shaders lol :P

Thanks for responding though :)

Re: [2.2] setNumThreadGroupsBasedOn function for compute shaders

Posted: Wed Jul 15, 2020 7:29 pm
by SolarPortal
reviving this old thread as im now on ogre 2.2 as @dark_sylinc mentioned above and still struggling with this function.
The error that was fixed above is only partially fixed on 2.2, the mip levels are taken into account yes, but the part which still does not work is simply this part in: _calculateNumThreadGroupsBasedOnSetting()
Line 537:

Code: Select all

       
       resolution[i] = (resolution[i] + mThreadGroupsBasedDivisor[i] - 1u) /
                  mThreadGroupsBasedDivisor[i];

       uint32 numThreadGroups = (resolution[i] + mThreadsPerGroup[i] - 1u) /
                  mThreadsPerGroup[i];
The resulting math from this if you have 32 threads per group is always going to return 1 not 32 thread groups on a 1024 texture or 16 thread groups on a 512 mip etc...

e.g. code before worked out would be like this:
res = (1024 + 32 -1) / 32; //(= 32)
numthreadgroups = (res + 32 -1) / 32; //(= 1)

my proposed code change:
numthreadgroups = (1024 + 32 -1) / 32; //(= 32)
numthreadgroups = (512 + 32 -1) / 32; //(= 16)

The bit of code should be more like this:

Code: Select all

for( int i=0; i<3; ++i )
{
++	uint32 numThreadGroups = (resolution[i] + mThreadGroupsBasedDivisor[i] - 1u) / mThreadGroupsBasedDivisor[i];
        if( mNumThreadGroups[i] != numThreadGroups )
        {
              mNumThreadGroups[i] = numThreadGroups;
              hasChanged = true;
        }
}
What do you think? as my compute shader will only ever process the first thread group as the returned numThreadGroups is always 1

Edit: Removed patch link
Edit2: Do not use code tweak, read lower down for more information if you encounter this problem.

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Wed Jul 15, 2020 11:19 pm
by dark_sylinc
I think you misunderstood what the divisor means. You almost always want the divisor to be 1.

e.g.

Code: Select all

setNumThreadGroupsBasedOn( source, texSlot, 1u, 1u, 1u );
The divisor should NOT be 32 if you have 32 threads per group.

The divisor exists because oftentimes you want each thread (not threadgroup) in a compute shader to process more than one pixel. For example if each thread processes a block of 2x2, then you want the divisor to be set to 2:

Code: Select all

setNumThreadGroupsBasedOn( source, texSlot, 2u, 2u, 1u );

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Thu Jul 16, 2020 12:00 am
by SolarPortal
that doesnt make sense to me lol :P

Im working with the example of writing to a 1024 x 1024 texture in a compute shader.
I have "threads per group" set as 32, 32, 1
and have "thread groups" set as 32, 32, 1

so i can process 32 pixels on the x and y at the same time, and i need 32 groups in order to process the full 1024x1024 image.
threads per group = 32x32x1 = 1024,
thread groups = 32x32x1 = 1024
1024 x 1024 = 1,048,576 which is the resolution of the image in number of pixels.

in case of a cubemap:
threads per group is 32, 32, 1
threads groups 32,32, 6

which allows sampling each face and the full image resolution per slice.

If thread groups is only ever set to 1, then you will only ever sample 1 group on the x, y and z.

Using the function does NOT work until the mip level resolution is less than 32x32 in size, it only computes the first block in the top left of the image, rest is left as the clear colour
Using the code i gave allowed the full computation over the entire image and returns the correct group size over each mip level.

its not right to use 1 thread group and 1024 threads in a group. Nvidia limit of 32, AMD limit of 64 and Intel limit of 8.

Try it yourself and you will see what i mean.

Not meaning to sound harsh or argumentative, but it simply does not work as intended as it is right now in ogre, unless you have somewhere which does show it working for each mip level and processing a full image.

Edit: For example, i am using a render node with several passes under a target to run the compute job on each mip of a texture. Without that function i cannot change the number of thread groups per mip level unless i have a different compute job per mip level where i can set the required thread group count which is overkill. Using that function allows me to run the same job on each pass and bind the uav of the mip to the pass at the time of creating the render node and so far its working flawlessly with the change.

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Thu Jul 16, 2020 12:17 am
by dark_sylinc
You're confusing the divisor with thread group count.

If you have:
1024x1024 image
threads_per_group = 32x32x1

Then call:

Code: Select all

setNumThreadGroupsBasedOn( source, texSlot, 1u, 1u, 1u );
This will automatically set thread_groups to 32x32x1 for mip 0, then to 16x16x1 for mip 1 and so on.

That is the purpose of setNumThreadGroupsBasedOn: to automatically calculate thread_groups for you.

If you want to manually control the number of thread_groups, then you should never call setNumThreadGroupsBasedOn (or its json counterpart)
unless you have somewhere which does show it working for each mip level and processing a full image.
Look for "thread_groups_based_on_texture" and "thread_groups_based_on_uav" in our .json files in Ogre.
For example IblSpecular/Integrate is run on all mipmaps.

It has a threads_per_group of 8x8x1 and then automatically calculates thread_groups based on texture resolution and mipmap bound.

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Thu Jul 16, 2020 12:20 am
by dark_sylinc
I forgot one bit:
Nvidia limit of 32, AMD limit of 64 and Intel limit of 8.
Please note that this is the total number of threads, not per axis.

For NVIDIA that means "threads_per_group" : [4, 8, 1] or [8,4,1].
For AMD that means "threads_per_group" : [8, 8, 1]

Of course, 32x32 is not wrong as it still satisfies both GPUs. It's just that you don't have to making them that large. Usually 8x8 or 16x16 is OK (you should profile what value works best. 32x32 may work faster depending on what you're doing and how you're doing it)

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Thu Jul 16, 2020 12:23 am
by SolarPortal
ok, i will give that a go and see how it pans out :)

and i imagine for a cubemap you would have to use a six on the end:
setNumThreadGroupsBasedOn( source, texSlot, 1u, 1u, 6u );

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Thu Jul 16, 2020 12:34 am
by dark_sylinc
For a cubemap, if you set the last value to 6, that means Ogre expects your compute shader to compute 6 slices per thread. I doubt that's what you're doing.

I'll try to explain the divisor again:

Example 1: Colour invert filter
Each thread:
  1. Accesses 1 pixel
  2. Calculates output = 1.0 - input
  3. Writes 1 pixel as the result.
If you have a 8x8x1 threadgroup, then each threadgroup reads 16 pixels and outputs 16 pixels

In this case I want to call setNumThreadGroupsBasedOn( ..., 1u, 1u, 1u );
On a 1024x1024 texture, it will calculate 128x128x1 threadgroups (1024 / 8 )

Example 2: BC1 encoder
For example a BC1 encoder. In a BC1 encoder, each thread:
  1. Accesses a block of 4x4 pixels (16 pixels)
  2. Processes them, and computes a single 64-bit value which is the BC1 compression calculated for this 4x4 block
  3. Outputs a single value of 64-bit
If you have a 8x8x1 threadgroup, then each threadgroup reads 256 pixels and outputs 16 values.

In this case I want to call setNumThreadGroupsBasedOn( ..., 4u, 4u, 1u );
because each thread reads a block of 4x4

On a 1024x1024 texture, it will calculate 32x32x1 threadgroups ((1024 / 4) / 8 )

Re: setNumThreadGroupsBasedOn function for compute shaders

Posted: Thu Jul 16, 2020 12:44 am
by SolarPortal
i have it working, and yeah didnt need the six.
All makes sense to me now... the bc1 encoding did it for me :P
The slices are also automatically set so dont have to worry about that one either.

Yeah, swapped my json to use the "thread_groups_based_on_uav", set the "threads per group" and deleted the "thread groups" and its working fine.

Very confusing when reading on compute shaders and learning about threads per group and thread groups, put it into practice and it was fine, but the divisor got me good :face_palm: then see the source editing thread groups and reading what seemed an incorrect value... lol :P
At least it works automatically and i get your concept of it now. just feels wrong setting thread groups 1,1,1, just have to remind myself next time that i don't have to do it :P Used to doing these things manually.

Thanks for the very valid information and good job! :) This thread will help a lot of others do it too.