OK!
I forked your sample and did a few things:
- Ported to 3.0
- Ported to GL and Vulkan (neither is working yet; it builds and runs, but nothing shows up)
- The hardcoded buffers in HlmsParticle are no longer 11-14 but rather 2-5 (achieved by increasing mTexUnitSlotStart & mSamplerUnitSlotStart)
- Builds & runs on Linux (I tested it)
- Builds & runs on Windows (I tested it)
- Set up to run with an OgreNext built from source, i.e. use the Quick Start script and have the Ogre repo point to Dependencies/Ogre like this:
OgreNext.jpg
I'd also like to apply clang-format to homogenize the code formatting (I didn't do that now since it would show up as a ton of changes). I also wanted to do more 3.0 stuff (like adding more "override" keywords where appropriate).
It's not showing up on Vulkan or GL3+ though. However, the reason is very simple: packing rules.
It's very obvious when you compare side by side on RenderDoc (D3D11 vs Vulkan).
For example, take struct EmitterCoreData: Vulkan & D3D11 don't agree on its padding, hence at some point the data goes out of sync and emitterCoreData[1] no longer matches between Vk & D3D11. That's the main reason I often use either all float4 or all float (or two float2 in a row), but almost never a float3 or a single float2.
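To make the hazard concrete, here is a minimal sketch; the member names are made up (they are not the actual EmitterCoreData fields), and it assumes the data lives in a structured buffer on D3D11 and an std430 buffer on Vulkan/GL:
[code]
// Hypothetical members, NOT the real EmitterCoreData; this only illustrates the hazard.
struct EmitterCoreDataRisky
{
	float  timeSinceLast; // offset 0 on every API
	float3 direction;     // D3D11 structured-buffer rules: packed right after, at offset 4.
	                      // Vulkan/GL std430 rules: a float3 is 16-byte aligned -> offset 16.
	float  lifetime;      // ...so from here on everything is read at the wrong offset,
	                      // and emitterCoreData[1] no longer matches between APIs.
};

// The "all float4" version: every API agrees on both the offsets and the array stride.
struct EmitterCoreDataSafe
{
	float4 directionAndTime; // xyz = direction, w = timeSinceLast
	float4 lifetimeAndPad;   // x = lifetime, yzw = unused
};
[/code]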
This is a bit tedious, and since you wrote the code you should be able to fix it more quickly than I could. Would you be able to look at it? With RenderDoc it should be a breeze for you:
RenderDoc.jpg
Pay attention to the places where something dangerous happens:
- A float3 followed by something else.
- The end of arrays (they have a tendency to add padding up to the next float4).
- A float2 followed by anything other than another float2.
For example, I'm almost certain there is a desync after uint useSpriteTrack and float spriteTrackTimes[8].
Remember that GPUs love padding to align to float4 (i.e. 16 bytes), and some languages/APIs overpad more than necessary.
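As a sketch of that particular suspect (only useSpriteTrack and spriteTrackTimes[8] are from your code; the wrapper struct and the member after the array are made up):
[code]
// Hypothetical wrapper; only useSpriteTrack and spriteTrackTimes[8] come from the sample.
struct SpriteTrackSketch
{
	uint  useSpriteTrack;
	float spriteTrackTimes[8]; // The rules for array elements and for the padding after the
	                           // last element differ between APIs (e.g. std140 pads every
	                           // float element to 16 bytes), so...
	float whateverComesNext;   // ...this member is a prime desync candidate.
};

// Staying with the "all float4" rule: the same 8 values as two float4,
// so every API agrees on the array's size and on where the next member starts.
struct SpriteTrackSafeSketch
{
	uint4  useSpriteTrack4;      // x = useSpriteTrack, yzw unused
	float4 spriteTrackTimes4[2]; // [0].xyzw = times 0-3, [1].xyzw = times 4-7
	float4 whateverComesNext4;
};
[/code]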
I also found out that the compute shader generates multiple files, which differ in the 'num_thread_groups_x' property. Does that mean compute jobs internally use multiple calls and have similar overhead?
Unfortunately on some APIs we need to create a new PSO (I think it was Metal).
However everything should be cached so this should be relatively fast.
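For context on why several files show up at all: thread-group values are baked into the generated shader source at compile time rather than passed at dispatch, so each distinct count is a distinct shader (and a distinct PSO where that matters), compiled once and then served from the cache. A plain HLSL sketch, not the sample's actual shader:
[code]
// Plain HLSL sketch (not the sample's actual shader): the thread-group size is a
// compile-time attribute, and values such as num_thread_groups_x are likewise baked
// into the generated source as properties, so each distinct count yields a different
// shader file / PSO, compiled once and then cached.
[numthreads( 64, 1, 1 )]
void main( uint3 dispatchThreadId : SV_DispatchThreadID )
{
	// ...particle update work for one thread would go here...
}
[/code]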
Btw I saw that your RNG is a PRNG. Nathan Reed has a good read on RNGs on the GPU; basically you often want to go wide instead of deep (also see his other post, Hash Functions for GPU Rendering).
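"Going wide" means giving every particle/thread its own independently hashed value (e.g. from the particle index and frame number) instead of advancing one PRNG sequence many steps. A minimal HLSL sketch using the PCG hash recommended in that second post; the particleIdx/frameNumber combination is just an illustration, not your sample's scheme:
[code]
// PCG hash (Jarzynski & Olano), the one recommended in "Hash Functions for GPU Rendering".
uint pcgHash( uint v )
{
	uint state = v * 747796405u + 2891336453u;
	uint word = ( ( state >> ( ( state >> 28u ) + 4u ) ) ^ state ) * 277803737u;
	return ( word >> 22u ) ^ word;
}

// "Go wide": derive an independent value per particle per frame from indices,
// instead of stepping a single PRNG state deeply. The combination below is illustrative.
float randForParticle( uint particleIdx, uint frameNumber )
{
	uint h = pcgHash( particleIdx ^ pcgHash( frameNumber ) );
	return h * ( 1.0f / 4294967296.0f ); // map to [0, 1)
}
[/code]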