[2.3] VctVoxelizer has holes in the voxelized result (GL3+)

Slamy
Gnoblar
Posts: 22
Joined: Sat Mar 27, 2021 10:49 pm
Location: Bochum, Germany
x 10


Post by Slamy »

I suspect this one is machine dependent.

The news article shows the reference result for Test_Voxelizer.
There, the small house is voxelized solidly for both normal and albedo.

I made sure I was on master and noticed that something was not right.
Image Image Image

As I don't know anything about the algorithm behind the voxelization, I can't debug this myself at the moment.
My first assumption was a regression, so I went back to 2019 and tested some git states from the time the article was written.
But the bug is present on my machine from the moment the debug visualization mode was added.

The test was performed on Windows 10 with GL3+ on an RTX 2070, in case this depends on the graphics API used.

It would be interesting to hear from other Ogre users whether they experience the same. Just start Test_Voxelizer and cycle through the debug visualizations with F2.

With a small experiment I can even make it worse: I increased the resolution from the default 128^3 to 256^3.

Code:

mVoxelizer->autoCalculateRegion();
mVoxelizer->dividideOctants( 1u, 1u, 1u );
mVoxelizer->setResolution( 256u, 256u, 256u );
This gives me this.
Image
The voxelized result covers only a corner of the scene.
Now if I press F8 twice, I get this:
Image

It's more complete but still has holes. Could it be a timing issue? Maybe the storage process is incomplete?

Sometimes I feel like the QA department here. 8)

EDIT: This issue occurs only with GL3+. Direct3D11 shows something completely different which looks like the reference picture.

Post by Slamy »

I remembered that I also have Ogre-Next installed on my Linux laptop. The GPU is "NVIDIA Corporation GP108M [GeForce MX150]".
First with Vulkan:
Image
Then with GL3+:
Image

It's interesting to note that the Vulkan result on my laptop looks very similar to the GL3+ result on my PC.
On the other hand, GL3+ is a disaster on my laptop.
I can therefore answer one question myself: this is machine dependent and probably related to timing.
dark_sylinc
OGRE Team Member
Posts: 4705
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1024

Post by dark_sylinc »

I was theorizing the other day that our voxelizer shader may be using shader subgroups incorrectly, and this (your pictures) is what would happen if I were right.

We are assuming the subgroup size is 64, which is not true on NVIDIA.
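
To illustrate the suspicion above, here is a minimal CPU-side sketch in C++. It is not Ogre code: countProcessedHits and its parameters are hypothetical names. It assumes the shader walks the triangles in batches of 64 while each vote only sees the lanes of one subgroup, and it replaces the shader's index clamping with a plain bounds check. Under those assumptions, a 32-wide subgroup can skip a batch whose only hit belongs to the other half of the work group, which would show up as holes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified model of the culling loop (hypothetical, not the actual shader).
// hits[i] is true when triangle i overlaps the work group's combined AABB.
// stride:       triangles the loop assumes one vote covers (64 in the shader)
// subgroupSize: lanes that really take part in one vote
// subgroupIdx:  which subgroup of the work group we are simulating
// Returns how many overlapping triangles this subgroup ends up processing.
inline uint32_t countProcessedHits( const std::vector<bool> &hits, uint32_t stride,
                                    uint32_t subgroupSize, uint32_t subgroupIdx )
{
    assert( subgroupSize > 0u && ( subgroupIdx + 1u ) * subgroupSize <= stride );

    const uint32_t numTris = static_cast<uint32_t>( hits.size() );
    uint32_t processed = 0u;

    for( uint32_t base = 0u; base < numTris; base += stride )
    {
        // Vote: only the lanes of this subgroup contribute their result.
        bool vote = false;
        const uint32_t first = base + subgroupIdx * subgroupSize;
        for( uint32_t lane = 0u; lane < subgroupSize; ++lane )
        {
            const uint32_t tri = first + lane;
            if( tri < numTris && hits[tri] )
                vote = true;
        }

        if( !vote )
            continue;  // the whole batch of `stride` triangles is skipped

        // The batch was not skipped: visit every triangle in it.
        for( uint32_t sub = 0u; sub < stride; ++sub )
        {
            const uint32_t tri = base + sub;
            if( tri < numTris && hits[tri] )
                ++processed;
        }
    }
    return processed;
}
```

With a single hit at triangle 40, a 64-wide vote processes it, while subgroup 0 of a 32-wide split returns 0 because its vote never sees that triangle.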

I'll post a patch for you to try in a couple of hours to see if it fixes it.

Edit:

Does this patch fix it? (Note: It will break D3D11)

Code:

diff --git a/Samples/Media/VCT/Voxelizer_piece_cs.any b/Samples/Media/VCT/Voxelizer_piece_cs.any
index 3e51013cc3..ecd3642135 100644
--- a/Samples/Media/VCT/Voxelizer_piece_cs.any
+++ b/Samples/Media/VCT/Voxelizer_piece_cs.any
@@ -377,7 +377,7 @@
 
 			uint numTris = instance.mesh.numIndices / 3u;
 
-			for( uint triIdxBase=0u; triIdxBase<numTris; triIdxBase += 64u )
+			for( uint triIdxBase=0u; triIdxBase<numTris; triIdxBase += gl_SubGroupSizeARB )
 			{
 				//First make the 64 threads check if any of the triangles intersects
 				//against the combined AABB of the 4x4x4 block. If none do, we can skip
@@ -389,7 +389,7 @@
 
 				if( anyInvocationARB( groupTriInters ) )
 				{
-					for( uint subTri=0u; subTri<64u; ++subTri )
+					for( uint subTri=0u; subTri<gl_SubGroupSizeARB; ++subTri )
 					{
 						uint triIdx = min( triIdxBase + subTri, numTris - 1u );
 						Triangle tri = getTriangle( triIdx, instance.mesh PARAMS_ARG );
Edit 2:
This patch should work even better (probably). I'm thinking perhaps we should use the emulated path in GL as well, since the hardware doesn't guarantee this will work. The exception is Vulkan, but only if VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT is set at pipeline creation (which we currently don't do).

Code:

diff --git a/Samples/Media/VCT/Voxelizer_piece_cs.any b/Samples/Media/VCT/Voxelizer_piece_cs.any
index 3e51013cc3..a28e0cc586 100644
--- a/Samples/Media/VCT/Voxelizer_piece_cs.any
+++ b/Samples/Media/VCT/Voxelizer_piece_cs.any
@@ -377,7 +377,9 @@
 
 			uint numTris = instance.mesh.numIndices / 3u;
 
-			for( uint triIdxBase=0u; triIdxBase<numTris; triIdxBase += 64u )
+			uint highestActiveID = findMSB( ballotARB( true ) );
+
+			for( uint triIdxBase=0u; triIdxBase<numTris; triIdxBase += highestActiveID )
 			{
 				//First make the 64 threads check if any of the triangles intersects
 				//against the combined AABB of the 4x4x4 block. If none do, we can skip
@@ -389,7 +391,7 @@
 
 				if( anyInvocationARB( groupTriInters ) )
 				{
-					for( uint subTri=0u; subTri<64u; ++subTri )
+					for( uint subTri=0u; subTri<highestActiveID; ++subTri )
 					{
 						uint triIdx = min( triIdxBase + subTri, numTris - 1u );
 						Triangle tri = getTriangle( triIdx, instance.mesh PARAMS_ARG );

Post by Slamy »

I first tried the Edit 2 patch.
Sadly, it didn't compile:

Code:

0(574) : error C1115: unable to find compatible overloaded function "findMSB(uint64_t)"
This might be a known issue.
I used this code to work around it:

Code:

			// The compiler rejected findMSB(uint64_t), so check the upper and lower 32 bits separately.
			uint64_t num = ballotARB( true );
			int msbFar = findMSB(uint(num >> 32));
			int msbFinal = (msbFar == -1) ? findMSB(uint(num)) : msbFar + 32;
			uint highestActiveID = msbFinal;
			for( uint triIdxBase=0u; triIdxBase<numTris; triIdxBase += highestActiveID )
Sadly the results were worse than before.
Image

I also tried out Edit 1:
Image

But it also didn't look better.
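
For reference, the 64-bit split used in the workaround above can be modelled on the CPU. This is only a sketch in C++ with hypothetical helper names (findMsb32, findMsb64, laneCountFromBallot); it mirrors the GLSL findMSB convention of returning the index of the highest set bit, or -1 for zero. One thing it makes visible is that the result is a bit index, so a full 32-lane ballot yields 31 and the lane count would be the index plus one. Whether that difference contributed to the worse result is not established in this thread.

```cpp
#include <cstdint>

// Index of the highest set bit of a 32-bit value, -1 if no bit is set
// (same convention as GLSL findMSB on a uint).
constexpr int findMsb32( uint32_t v )
{
    return v == 0u ? -1 : 1 + findMsb32( v >> 1u );
}

// 64-bit variant built from two 32-bit halves, as in the workaround above.
constexpr int findMsb64( uint64_t v )
{
    return findMsb32( static_cast<uint32_t>( v >> 32u ) ) == -1
               ? findMsb32( static_cast<uint32_t>( v ) )
               : findMsb32( static_cast<uint32_t>( v >> 32u ) ) + 32;
}

// Hypothetical helper: number of lanes up to and including the highest
// active one, i.e. the bit index plus one (0 for an empty ballot).
constexpr uint32_t laneCountFromBallot( uint64_t ballot )
{
    return static_cast<uint32_t>( findMsb64( ballot ) + 1 );
}
```

For a ballot of 0xFFFFFFFF (32 active lanes), findMsb64 gives 31 and laneCountFromBallot gives 32.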

Post by dark_sylinc »

Thanks for trying it out!

This is something that is borderline between a driver bug and our own bug (what we do is risky if we got something wrong, but drivers also tend to have bugs in this area).

I suppose we have no choice but to use the fallback path written for D3D11, which uses shared memory and barriers instead of subgroup ops. Luckily, comparing D3D11 vs GL performance, it doesn't seem like there will be a perf hit.

Post by dark_sylinc »

After playing with an NVIDIA card, I managed to "fix" the issue with this patch (NV only):

Code:

diff --git a/Samples/Media/VCT/Voxelizer.material.json b/Samples/Media/VCT/Voxelizer.material.json
index 9f232767c9..168d2eb7f7 100644
--- a/Samples/Media/VCT/Voxelizer.material.json
+++ b/Samples/Media/VCT/Voxelizer.material.json
@@ -3,7 +3,7 @@
 	{
         "VCT/Voxelizer" :
 		{
-            "threads_per_group" : [4, 4, 4],
+            "threads_per_group" : [4, 4, 2],
 
             "source" : "Voxelizer_cs",
             "pieces" : ["CrossPlatformSettings_piece_all", "Matrix_piece_all", "Voxelizer_piece_cs.any"],
diff --git a/Samples/Media/VCT/Voxelizer_piece_cs.any b/Samples/Media/VCT/Voxelizer_piece_cs.any
index 3e51013cc3..11d6e69740 100644
--- a/Samples/Media/VCT/Voxelizer_piece_cs.any
+++ b/Samples/Media/VCT/Voxelizer_piece_cs.any
@@ -354,8 +354,8 @@
 
 	Aabb groupVoxelAabb;
 	groupVoxelAabb.center	= p_voxelOrigin +
-							  4.0f * p_voxelCellSize * (float3( gl_WorkGroupID.xyz ) + 0.5f);
-	groupVoxelAabb.halfSize	= 4.0f * p_voxelCellSize * 0.5f;
+							  float3(4.0f, 4.0f, 2.0f) * p_voxelCellSize * (float3( gl_WorkGroupID.xyz ) + 0.5f);
+	groupVoxelAabb.halfSize	= float3(4.0f, 4.0f, 2.0f) * p_voxelCellSize * 0.5f;
 
 	bool doubleSided = false;
 	float accumTris = 0;
@@ -377,11 +377,11 @@
 
 			uint numTris = instance.mesh.numIndices / 3u;
 
-			for( uint triIdxBase=0u; triIdxBase<numTris; triIdxBase += 64u )
+			for( uint triIdxBase=0u; triIdxBase<numTris; triIdxBase += 32u )
 			{
-				//First make the 64 threads check if any of the triangles intersects
+				//First make the 32 threads check if any of the triangles intersects
 				//against the combined AABB of the 4x4x4 block. If none do, we can skip
-				//64 triangles in one go (actual performance improvement is around 32x).
+				//32 triangles in one go (actual performance improvement is around 32x).
 				uint tmpTriIdx = min( triIdxBase + gl_LocalInvocationIndex, numTris - 1u );
 				Aabb tmpTriAabb = getTriangleAabb( tmpTriIdx, instance.mesh, instance.worldTransform PARAMS_ARG );
 
@@ -389,7 +389,7 @@
 
 				if( anyInvocationARB( groupTriInters ) )
 				{
-					for( uint subTri=0u; subTri<64u; ++subTri )
+					for( uint subTri=0u; subTri<32u; ++subTri )
 					{
 						uint triIdx = min( triIdxBase + subTri, numTris - 1u );
 						Triangle tri = getTriangle( triIdx, instance.mesh PARAMS_ARG );
It was not enough to force triIdxBase += 32; I also had to force threads_per_group to 32 instead of 64. But by doing that, the groupVoxelAabb center and size need adjusting, since we're now processing blocks with half the volume.
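
The adjustment described above can be checked with a small C++ sketch. It is not the shader itself: computeGroupVoxelAabb, Float3 and UInt3 are hypothetical names, and the cell size is assumed to be a per-axis value. The block extent per axis is simply threads_per_group times the cell size, which gives both the center and the half size of the group's AABB.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Float3 = std::array<float, 3>;
using UInt3 = std::array<uint32_t, 3>;

struct Aabb
{
    Float3 center;
    Float3 halfSize;
};

// Bounding box of the block of voxels handled by one work group.
// threadsPerGroup is e.g. {4, 4, 4} originally or {4, 4, 2} with the patch.
inline Aabb computeGroupVoxelAabb( const Float3 &voxelOrigin, const Float3 &voxelCellSize,
                                   const UInt3 &workGroupId, const UInt3 &threadsPerGroup )
{
    Aabb aabb{};
    for( std::size_t i = 0u; i < 3u; ++i )
    {
        assert( threadsPerGroup[i] > 0u );
        // Extent of one block along this axis, in world units.
        const float blockSize = static_cast<float>( threadsPerGroup[i] ) * voxelCellSize[i];
        aabb.center[i] =
            voxelOrigin[i] + blockSize * ( static_cast<float>( workGroupId[i] ) + 0.5f );
        aabb.halfSize[i] = blockSize * 0.5f;
    }
    return aabb;
}
```

With a unit cell size and work group (0, 0, 1), a 4x4x2 block is centered at z = 3 with half size 1, while a 4x4x4 block is centered at z = 6 with half size 2.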

I'm currently wondering whether to:
  1. Query gl_SubGroupSizeARB. If it's 64 or 32, use these optimized paths; otherwise use a fallback path like D3D11's
  2. Always use the fallback path
Performance-wise, 1. is better. Compatibility-wise, however, 2. may be the better option, because GPUs keep getting wilder. For example, RDNA can switch between 32 and 64. Mesa's GL driver, when running in Wave32, will always report gl_SubGroupSizeARB = 64 and mark the last 32 threads as always inactive (this is spec compliant); for now, though, it defaults to always using Wave64 (with an environment variable to allow switching).

The Vulkan driver has VK_EXT_subgroup_size_control to select between Wave32 and Wave64.


Update:

After thinking this through, we should always use the fallback path, except on Vulkan when both VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT and VK_EXT_subgroup_size_control are supported, as this gives us future-proof support.
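
As a rough sketch of that rule in C++ (all names here, such as RenderApi, VulkanSubgroupSupport and chooseVoxelizerPath, are hypothetical and not part of Ogre's API; how the two Vulkan capabilities are actually queried is left out):

```cpp
// Hypothetical enumeration of the render systems mentioned in this thread.
enum class RenderApi
{
    GL3Plus,
    D3D11,
    Vulkan
};

enum class VoxelizerPath
{
    SubgroupOps,           // optimized path using subgroup votes
    SharedMemoryFallback   // shared memory and barriers, as used for D3D11
};

// Assumed to be filled in by the caller after querying the Vulkan device.
struct VulkanSubgroupSupport
{
    bool requireFullSubgroups;  // REQUIRE_FULL_SUBGROUPS flag can be used
    bool subgroupSizeControl;   // VK_EXT_subgroup_size_control is available
};

// Subgroup ops only on Vulkan with both features; fallback everywhere else.
constexpr VoxelizerPath chooseVoxelizerPath( RenderApi api, const VulkanSubgroupSupport &vk )
{
    return ( api == RenderApi::Vulkan && vk.requireFullSubgroups && vk.subgroupSizeControl )
               ? VoxelizerPath::SubgroupOps
               : VoxelizerPath::SharedMemoryFallback;
}
```

GL3+ and D3D11 always end up on the fallback here, and so does Vulkan if either capability is missing.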

Post by dark_sylinc »

A fix for this issue has been pushed to both 2.2 and 2.3 branches.

Thanks for the report!