A cheaper PCF shader?

Post by areay »

Hi guys,

I've been playing around a bit with PCF for sampling shadow maps and I've got some pseudo-science to share with you.

Most of the setups I've seen on this forum and in the Ogre Samples start a lot like this, by getting a set of depth samples...

Code: Select all

	// someOffset stands in for a different texel offset on each tap
	float4 depths = float4(
		tex2D(shadowMap, shadowMapPos.xy + someOffset).r,
		tex2D(shadowMap, shadowMapPos.xy + someOffset).r,
		tex2D(shadowMap, shadowMapPos.xy + someOffset).r,
		tex2D(shadowMap, shadowMapPos.xy + someOffset).r);

Then the comparisons start, and the examples I've seen all look something like this:

Code: Select all

	float final = 0.0f;	// must start at zero before accumulating
	final += (depths.x > fragmentDepth) ? 1.0f : 0.0f;
	final += (depths.y > fragmentDepth) ? 1.0f : 0.0f;
	final += (depths.z > fragmentDepth) ? 1.0f : 0.0f;
	final += (depths.w > fragmentDepth) ? 1.0f : 0.0f;
	final *= 0.25f;
	return final;

But if you do that comparison like this instead (using a 4-float compare then a dot product):

Code: Select all

	// component-wise compare: 1 for each lit sample, 0 otherwise
	float4 inlight = (depths > fragmentDepth);
	float final = dot(inlight, float4(.25, .25, .25, .25));
	return final;

Then you get a cheaper shader! I'm basing this on the instruction count from the compiled shader.

Using a single-pass 2-split PSSM shader taking 8 taps per sample, I get the following compilations:

Standard method: 184 instructions and 4 R-regs using arbfp1; 112 instructions, 4 R-regs and 0 H-regs using FP40.
4-float compare + dot: 106 instructions and 5 R-regs using arbfp1; 98 instructions, 5 R-regs and 2 H-regs using FP40.
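
As a sanity check that the two comparison styles above compute the same shadow factor, here is a minimal CPU-side sketch in Python rather than Cg. The function names and the sample depths are made up for illustration; it says nothing about how either form performs on a GPU.

```python
def pcf_branchy(depths, fragment_depth):
    """Four scalar compares, accumulated and then scaled by 0.25."""
    total = 0.0
    for d in depths:
        total += 1.0 if d > fragment_depth else 0.0
    return total * 0.25


def pcf_dot(depths, fragment_depth):
    """One vector-style compare, then a dot product with 0.25 weights."""
    in_light = [float(d > fragment_depth) for d in depths]
    weights = [0.25, 0.25, 0.25, 0.25]
    return sum(lit * w for lit, w in zip(in_light, weights))


# Made-up shadow-map samples: two of the four are farther than the fragment.
depths = [0.30, 0.55, 0.70, 0.45]
print(pcf_branchy(depths, 0.5), pcf_dot(depths, 0.5))  # prints "0.5 0.5"
```

Both functions return the fraction of the four taps that pass the depth test, so any difference between the two shader forms is in the generated instructions, not in the result.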

I'm no expert; perhaps the cost in extra registers offsets the instruction saving, or perhaps the instructions being executed are more expensive.

I know that in CPU-land different instructions have differing clock-cycle costs. Does this also apply in GPU-land?

Here's where I found this little optimization:
http://amd-dev.wpengine.netdna-cdn.com/ ... apping.pdf There are some other interesting things in that paper regarding something called silhouette maps; I'm wondering if anyone has tried to get that working with Ogre. Here's another silhouette map paper: https://graphics.stanford.edu/papers/silmap/silmap.pdf

Re: A cheaper PCF shader?

Post by dark_sylinc »

Both approaches will boil down to pretty much the same thing on modern hardware because it's superscalar, not vector. The HLSL or Cg asm code is not a relevant metric when it comes to micro-optimizations like this one. Older hardware like the AMD Radeon HD 2000 through 5000 may benefit from it though.

You should look at GPUPerfStudio's shader analyzer or PowerVR's tool to check out the actual ISA code (sadly the other vendors don't provide tools to see what the actual shader looks like).

If you're looking for a better PCF-filtered shader (higher quality), you may want to look at the output produced by Ogre 2.1's shaders when not using depth maps.

The GLSL output is in the form of:

Code: Select all

//The 0.00196 is a magic number that prevents floating point
//precision problems ("1000" becomes "999.999" causing fW to
//be 0.999 instead of 0, hence ugly pixel-sized dot artifacts
//appear at the edge of the shadow).
fW = fract( uv * invShadowMapSize.zw + 0.00196 );

vec4 c;
c.w = texture(shadowMap, uv ).r;
c.z = texture(shadowMap, uv + vec2( invShadowMapSize.x, 0.0 ) ).r;
c.x = texture(shadowMap, uv + vec2( 0.0, invShadowMapSize.y ) ).r;
c.y = texture(shadowMap, uv + vec2( invShadowMapSize.x, invShadowMapSize.y ) ).r;

c = step( fDepth, c );

retVal += mix(
			mix( c.w, c.z, fW.x ),
			mix( c.x, c.y, fW.x ),
			fW.y );
"mix" is lerp in HLSL/Cg.

The differences can be seen in this post.
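
To make the weighting in the GLSL snippet above easier to follow, here is a rough CPU-side sketch in Python. The helpers `step`, `mix` and `fract` mirror the GLSL built-ins of the same names; `pcf_bilinear` and its parameter names are hypothetical, and `texel_pos` assumes that `uv * invShadowMapSize.zw` gives the position in texel units.

```python
import math


def step(edge, x):
    """GLSL step(): 0.0 if x < edge, otherwise 1.0."""
    return 1.0 if x >= edge else 0.0


def mix(a, b, t):
    """GLSL mix() (lerp in HLSL/Cg): linear blend from a to b."""
    return a * (1.0 - t) + b * t


def fract(x):
    """GLSL fract(): fractional part of x."""
    return x - math.floor(x)


def pcf_bilinear(c_w, c_z, c_x, c_y, f_depth, texel_pos):
    """Blend four depth-test results by the sub-texel position.

    c_w, c_z, c_x, c_y are the four sampled depths, named after the
    components used in the GLSL snippet. texel_pos is assumed to be the
    sample position in texel units.
    """
    fw_x = fract(texel_pos[0] + 0.00196)
    fw_y = fract(texel_pos[1] + 0.00196)
    w, z, x, y = (step(f_depth, c) for c in (c_w, c_z, c_x, c_y))
    return mix(mix(w, z, fw_x), mix(x, y, fw_x), fw_y)


# Only the first tap is lit; its weight shrinks as the sample position
# moves away from that texel.
print(pcf_bilinear(0.9, 0.1, 0.1, 0.1, 0.5, (10.5, 20.25)))
```

Unlike the plain average of four comparisons, the result changes smoothly with the sub-texel position, so the shadow edge is interpolated instead of moving in 0.25 steps.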
