Hardware PCF shadows

Discussion area about developing or extending OGRE, adding plugins for it or building applications on it. No newbie questions please, use the Help forum for that.
sparkprime
Ogre Magi
Posts: 1137
Joined: Mon May 07, 2007 3:43 am
Location: Ossining, New York
x 13

Hardware PCF shadows

Post by sparkprime »

Having shadows that are many times slower than competing engines is not really acceptable, so I've been trying to find out what it would take to add hardware PCF to OGRE. The lowest GPU spec I care about is the Geforce 8 series and equivalent ATI hardware.

It seems that a different approach is called for, depending on the hardware (ATI/Nvidia) and the rendersystem (D3D9/GL).

Nvidia/(D3D9+GL):

Shadows are received using hardware shadow fetch to do PCF for the four values. In Cg this looks like tex2D(tex, float3(uv, depth_from_light_to_fragment));

Does this only work with depth24 textures? Does Ogre support them? Can they be used for shadow cast targets?

The Ogre::Texture would have to have some flag that could tell the rendersystem to set the renderstate flag to allow the depth test. Is there any support for this yet?

Afaict the support is only superficially different in GL and D3D9.
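For reference, here's a minimal sketch of the receiver side in Cg (names are hypothetical; assumes the shadow map is bound as a depth texture with the depth-compare render state and bilinear filtering enabled):

```
// Sketch: hardware depth-compare fetch on Nvidia (hypothetical names).
float hwShadowTest(sampler2D shadowMap, float4 vPosL)
{
    // Projective divide: xy is the shadow map uv, z the reference depth.
    float3 uvd = vPosL.xyz / vPosL.w;
    // The texture unit compares uvd.z against the stored depth; with
    // bilinear filtering on, the driver returns the PCF-weighted average
    // of the four nearest comparison results.
    return tex2D(shadowMap, uvd).x;
}
```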


ATI/D3D9

Shadows are received using fetch4 and the PCF is done in the shader. Does Cg (ps_3_x target) support fetch4? Will tex2D compile down to fetch4 on ATI hardware?

What formats are supported? In this thread (http://www.ogre3d.org/forums/viewtopic. ... 14#p348249) Sinbad suggested that ATI demos use float32 but then said the format had to be 'DF24' which suggests a depth buffer texture, not a float32.


ATI/GL:

No support that I know of. But then I can't use Cg with ATI anyway (the GLSL target is not supported by Ogre, and arbfp1 is not enough) so this battle is already lost.


Is there a single format for shadow textures that works well in all cases?
dark_sylinc
OGRE Team Member
Posts: 5509
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1378

Re: Hardware PCF shadows

Post by dark_sylinc »

Eh? On NVIDIA you get HW PCF by using tex2Dproj (the float4 version). Turning bilinear filtering on for the texture unit will make the driver automatically perform the 4-tap check.

Dunno about ATI. However tex2Dproj will at least give you a HW shadow test (worst case, no PCF)

Edit: The shadow performance isn't that bad compared to other engines, although I admit it's below average. Note the biggest bottleneck is the CPU SceneGraph and draw call overhead (by a great margin), not the sampling code at the final stage when rendering.
Wolfmanfx
OGRE Team Member
Posts: 1525
Joined: Fri Feb 03, 2006 10:37 pm
Location: Austria - Leoben
x 100

Re: Hardware PCF shadows

Post by Wolfmanfx »

On Nvidia you have to declare the surface as D24S8_SHADOWMAP so that the driver performs PCF.

Re: Hardware PCF shadows

Post by sparkprime »

dark_sylinc wrote:Eh? On NVIDIA you get HW PCF by using tex2Dproj (the float4 version).
You can either use the proj variant and give it uvw or you can divide by w yourself and use the standard call with just the uv. The important thing is the extra depth param. Since I'm also playing with the uvs a bit (trying various filters) i need to divide by w myself anyway so there's no need to use proj.
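In other words, the two forms amount to the same thing (a sketch, hypothetical names):

```
// Sketch: the two equivalent ways to do the shadow fetch.
float a = tex2Dproj(shadowMap, vPosL).x;    // driver divides by vPosL.w
float3 uvd = vPosL.xyz / vPosL.w;           // manual divide, leaves the uv
float b = tex2D(shadowMap, uvd).x;          // free for further filter tricks
```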
Turning bilinear filtering on for the texture unit will make the driver automatically perform the 4-tap check.
Yes on nvidia it's mainly an issue of working out how to make the depth check actually happen.
Dunno about ATI. However tex2Dproj will at least give you a HW shadow test (worst case, no PCF)
They could in principle compile it to a fetch4 on ATI hardware, however I can't find any evidence that this is actually done. In fact I can't find any reference to fetch4 at all in the Cg documentation, which makes me think that I will have to forfeit either Cg, shadow quality, or shadow performance on ATI cards.
Edit: The shadow performance isn't that bad compared to other engines, although I admit it's below average. Note the biggest bottleneck is the CPU SceneGraph and draw call overhead (by a great margin), not the sampling code at the final stage when rendering.
It's not particularly worse than other renderers in Ogre; my problem is that doing 4 texture fetches to do the PCF manually is very slow (17fps instead of 65 or something like that). This is for scenes with a negligible number of draw calls. One thing at a time :)
Last edited by sparkprime on Fri Mar 18, 2011 11:39 pm, edited 1 time in total.

Re: Hardware PCF shadows

Post by sparkprime »

Wolfmanfx wrote:On Nvidia you have to declare the surface as D24S8_SHADOWMAP so that the driver performs PCF.
That's what I got from googling, but I have no idea what that actually is -- is it a depth texture (i.e. one that should be rendered to without colour, etc.)? In that case I can't store linear depth? That would be annoying.

I've never explicitly handled depth textures before. Even with my deferred shading code I am storing the depth in the colour buffer.

Re: Hardware PCF shadows

Post by sparkprime »

From deep in Cg documentation:
It is also possible to use other texture formats for shadow mapping, though your shader will (a) have to render the depth as a color map and (b) will need to do any texture filtering in the shader, rather than through the dedicated hardware channel used by D24S8_SHADOWMAP.
So I need to find out how to use depth buffers in ogre then...

Re: Hardware PCF shadows

Post by dark_sylinc »

sparkprime wrote: They could in principle compile it to a fetch4 on ATI hardware, however I can't find any evidence that this is actually done. In fact I can't find any reference to fetch4 at all in the Cg documentation, which makes me think that I will have to forfeit either Cg, shadow quality, or shadow performance on ATI cards.
It's done at the instruction level. Cg translates a tex2Dproj function to a tex2Dproj instruction, not to a fetch4. It's the driver that works the magic on tex2Dproj.
Good luck making it work.
sparkprime wrote:It's not particularly worse than other renderers in Ogre; my problem is that doing 4 texture fetches to do the PCF manually is very slow (17fps instead of 65 or something like that). This is for scenes with a negligible number of draw calls. One thing at a time :)
You must be doing something wrong. I get negligible performance difference between using a 4-tap PCF and a non-filtered one; and I've tried simple scenes to very heavy ones (GeForce 8600 GTS 512MB).
Bear in mind, the loop code included with Ogre samples gets very badly optimized (dead slow!).
This is my PSSM + PCF handwritten code and works like a charm (ok, the code is not originally mine, it's an adaptation from the original Ogre PSSM demo):

Code: Select all

float getShadow( sampler2D shadowMap, float4 vPosLN, float2 texelOffset, float2 invTexSize )
{
	const float fDepth = vPosLN.z;
	const float4 uv = (vPosLN.xy / vPosLN.w + texelOffset * invTexSize).xyyy;
	const float3 o = float3(invTexSize, -invTexSize.x) * 0.3f;

	//Uncomment this to do 1x1 shadow mapping
	//return fDepth <= tex2Dproj( shadowMap, vPosLN ).x ? 1 : 0;

	// Perform 2x2 PCF
	float c =	(fDepth <= tex2Dlod(shadowMap, uv - o.xyyy).r) ? 1 : 0; // top left
	c +=		(fDepth <= tex2Dlod(shadowMap, uv - o.zyyy).r) ? 1 : 0; // top right
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.zyyy).r) ? 1 : 0; // bottom left
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.xyyy).r) ? 1 : 0; // bottom right
	return c / 4;

	/*Uncomment this code to enable 3x3 PCF
	const float4 o = float4(invTexSize.xy, -invTexSize.x, 0.0f) * 0.5f;

	// Note: 3x3 PCF (9 taps). The 2x2 version above is good enough and a lot faster.
	float c =	(fDepth <= tex2Dlod(shadowMap, uv).r) ? 1 : 0; //center
	c +=		(fDepth <= tex2Dlod(shadowMap, uv - o.xwww).r) ? 1 : 0; // left
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.xwww).r) ? 1 : 0; // right
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.zyww).r) ? 1 : 0; // bottom-left
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.wyww).r) ? 1 : 0; // bottom
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.xyww).r) ? 1 : 0; // bottom-right
	c +=		(fDepth <= tex2Dlod(shadowMap, uv - o.xyww).r) ? 1 : 0; // top-left
	c +=		(fDepth <= tex2Dlod(shadowMap, uv - o.wyww).r) ? 1 : 0; // top
	c +=		(fDepth <= tex2Dlod(shadowMap, uv - o.zyww).r) ? 1 : 0; // top-right
	return c / 9;*/
}

float getPSSMShadow( sampler2D shadowMap0, float4 vPosL0, float2 invTexSize0,
					 sampler2D shadowMap1, float4 vPosL1, float2 invTexSize1,
					 sampler2D shadowMap2, float4 vPosL2, float2 invTexSize2,
					 float fDepth, float4 pssmSplitPoints, float2 texelOffset )
{
	half fShadow = 1.0f;
	if( fDepth <= pssmSplitPoints.y )
		fShadow = getShadow( shadowMap0, vPosL0, texelOffset.xy, invTexSize0.xy );
	else if( fDepth <= pssmSplitPoints.z )
		fShadow = getShadow( shadowMap1, vPosL1, texelOffset.xy, invTexSize1.xy );
	else if( fDepth <= pssmSplitPoints.w )
		fShadow = getShadow( shadowMap2, vPosL2, texelOffset.xy, invTexSize2.xy );
							 
	return fShadow;
}
"vPosL0" & co. contain the position in shadow camera's space.
texelOffset uses texel_offsets binding.
fDepth contains Z depth for the PSSM split calculation
invTexSize0-N contains the inverse_texture_size binding for each shadow texture.
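A rough sketch of the material-script bindings those parameters correspond to (program and parameter names are hypothetical):

```
fragment_program_ref myPSSMReceiverFP
{
    param_named_auto texelOffset texel_offsets
    param_named_auto invTexSize0 inverse_texture_size 0
    param_named_auto invTexSize1 inverse_texture_size 1
    param_named_auto invTexSize2 inverse_texture_size 2
    // pssmSplitPoints has no auto binding; it is typically set from
    // application code, as in the Ogre PSSM demo.
}
```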

The sad thing about doing the sampling by hand instead of using the loop is that you can't increase/decrease at will, just uncomment code. But the performance gain is astonishing.

Hope this helps

Re: Hardware PCF shadows

Post by sparkprime »

dark_sylinc wrote: It's done at the instruction level. Cg translates a tex2Dproj function to a tex2Dproj instruction, not to a fetch4. It's the driver that works the magic on tex2Dproj.
Good luck making it work.
True, in principle the ATI driver could implement the same 'bilinear filtering' trick in their sm3 -> native code compiler, by trapping the texture fetch instruction and instead doing a fetch4, 4 comparisons, 4 adds, and a *0.25, but I think that's unlikely. If it did this only when bilinear filtering is turned on for the texture, it would be providing the same semantics as Nvidia. I think it's more likely that I will have to get Cg to emit the fetch4 instruction myself somehow, and implement my own texture fetch function that on Nvidia just defers to the regular one, and on ATI does the emulation with fetch4. However I have no idea how to get Cg to emit such a fetch4 instruction.

It's very telling that if I google for 'cg fetch4' the first result is this thread :)

You must be doing something wrong. I get negligible performance difference between using a 4-tap PCF and a non-filtered one; and I've tried simple scenes to very heavy ones (GeForce 8600 GTS 512MB).
That can be explained if your shader is compute-bound instead of being bound by texture fetches. If you increase the tap size, or used a rotating poisson disc (as I do) then you may well get my performance characteristics. It's possible my gpu (quadro 3700M) has a different texel fetch / compute ratio too.

We're probably also using different cg targets, heh.
Bear in mind, the loop code included with Ogre samples gets very badly optimized (dead slow!).
Interesting, can you paste the Cg here as I'm not sure exactly which one you're referring to. I've noticed several bugs with Cg. I am a compiler engineer for my day job, so I usually try and push the envelope because I've got a good idea of what it damn well ought to be doing, but I always check the generated code to make sure it's doing the optimisations that I expect of it. I haven't seen it screw up a simple constant bounded loop before.
This is my PSSM + PCF handwritten code and works like a charm (ok, the code is not originally mine, it's an adaptation from the original Ogre PSSM demo):
yeah that's very similar to my code except I don't use lod / proj and I have a noise texture to get the offsets.

Why are you using lod? The depth map will not have mipmaps. You should be able to use any fetch call and get what you want.
The sad thing about doing the sampling by hand instead of using the loop is that you can't increase/decrease at will, just uncomment code. But the performance gain is astonishing.
you can always unroll it manually but cut it off with #if
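Something like this sketch, say (hypothetical macro name; the variables are from the getShadow code above):

```
// Sketch: manual unroll with the tap count chosen by the preprocessor,
// e.g. compile with -DSHADOW_TAPS=4 or -DSHADOW_TAPS=9.
float c = (fDepth <= tex2Dlod(shadowMap, uv - o.xyyy).r) ? 1 : 0;
c +=      (fDepth <= tex2Dlod(shadowMap, uv - o.zyyy).r) ? 1 : 0;
c +=      (fDepth <= tex2Dlod(shadowMap, uv + o.zyyy).r) ? 1 : 0;
c +=      (fDepth <= tex2Dlod(shadowMap, uv + o.xyyy).r) ? 1 : 0;
#if SHADOW_TAPS > 4
// ... the five extra taps of the 3x3 kernel go here ...
#endif
return c / SHADOW_TAPS;
```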

Re: Hardware PCF shadows

Post by dark_sylinc »

If you increase the tap size, or used a rotating poisson disc (as I do) then you may well get my performance characteristics.
A 9-tap (3x3) still works fast in my machine. Same results in my Mobile Radeon HD 4650

However your poisson disc offsets might be very cache unfriendly. You may be surprised by the performance improvement you can get just by reordering the values in the declaration or slightly tweaking them (so that they won't follow exactly a poisson distribution, but are still close enough).
Sample from lower left to lower right, from furthest to closest, from bottom to top. In that order (in relation to the center of the texel you want to sample)
Edit: In other words, texture units cache your samples as if you were doing a box blur sampling pattern. They don't predict what you're trying to do, therefore the more you diverge from that box access pattern, the worse it becomes. That's one of the top reasons CS (compute shaders) are becoming popular for performing post process operations.

I'm using PS 3.0
Interesting, can you paste the Cg here as I'm not sure exactly which one you're referring to
I'm talking about this code in shadows.cg (found in the Media folder):

Code: Select all

// Simple PCF 
// Number of samples in one dimension (square for total samples)
#define NUM_SHADOW_SAMPLES_1D 2.0
#define SHADOW_FILTER_SCALE 1

#define SHADOW_SAMPLES NUM_SHADOW_SAMPLES_1D*NUM_SHADOW_SAMPLES_1D

float4 offsetSample(float4 uv, float2 offset, float invMapSize)
{
	return float4(uv.xy + offset * invMapSize * uv.w, uv.z, uv.w);
}
float calcDepthShadow(sampler2D shadowMap, float4 uv, float invShadowMapSize)
{
	// 4-sample PCF
	
	float shadow = 0.0;
	float offset = (NUM_SHADOW_SAMPLES_1D/2 - 0.5) * SHADOW_FILTER_SCALE;
	for (float y = -offset; y <= offset; y += SHADOW_FILTER_SCALE)
		for (float x = -offset; x <= offset; x += SHADOW_FILTER_SCALE)
		{
			float depth = tex2Dproj(shadowMap, offsetSample(uv, float2(x, y), invShadowMapSize)).x;
			if (depth >= 1 || depth >= uv.z)
				shadow += 1.0;
		}

	shadow /= SHADOW_SAMPLES;

	return shadow;
}
Why are you using lod? The depth map will not have mipmaps. You should be able to use any fetch call and get what you want.
To prevent "ddx" instruction used inside a branch or loop (I use branches to choose the right PSSM split, as you've seen in my code).
If you're using plain tex2D (can't recall if tex2Dproj is bound to the same problem) and not experiencing the "ddx inside a branch or loop" error, then your Cg compiler isn't generating branches for your PSSM (I assume you're using PSSM). A 4-tap PCF with 3 pssm splits means 12 texture fetches + your noise offset fetches. That ain't going to be fast...
yeah that's very similar to my code except I don't use lod / proj and I have a noise texture to get the offsets.
Whoa! Wait! Your texture read is dependent on another texture read?? That's a lot of latency!!
If you're fetching one noise value per PCF tap, I suggest you fetch one noise value for the 3 splits (that is, for the 12 fetches) and fake randomization using that single value as a seed.
Keep your noise texture at 8-bit 64x64 (Use L8 format) and you'll improve performance a lot thanks to cache usage.
Also check your noise texture access pattern, you may be doing it in a very cache unfriendly way.
Another possibility is to randomize using something different than a texture filled with noise. But a 64x64 noise is usually good.
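For instance, a sketch of the single-seed idea (hypothetical names, untested):

```
// Sketch: one noise fetch per fragment, reused by every tap.
float2 seed = tex2D(noiseTex, screenPos.xy / 64.0).xy * 2.0 - 1.0;
// Derive per-tap offsets arithmetically from the single seed instead of
// fetching fresh noise for each of the 12 taps, e.g. by rotating/scaling
// a fixed kernel with the seed:
float2x2 rot = float2x2(seed.x, -seed.y, seed.y, seed.x);
float2 off = mul(rot, float2(-1, 1)) * invTexSize;  // one tap of the kernel
```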

Nevertheless, it looks like you could make use of VSM (Variance Shadow Mapping) or ESM (Exponential Shadow Mapping). Note I couldn't make ESM useful in my case. IIRC I was overflowing too fast, and I needed substantial changes in Ogre code to make the encoding correct and prevent overflowing.

Cheers
Dark Sylinc

Re: Hardware PCF shadows

Post by sparkprime »

dark_sylinc wrote:
Why are you using lod? The depth map will not have mipmaps. You should be able to use any fetch call and get what you want.
To prevent "ddx" instruction used inside a branch or loop (I use branches to choose the right PSSM split, as you've seen in my code).
If you're using plain tex2D (can't recall if tex2Dproj is bound to the same problem) and not experiencing the "ddx inside a branch or loop" error, then your Cg compiler isn't generating branches for your PSSM (I assume you're using PSSM). A 4-tap PCF with 3 pssm splits means 12 texture fetches + your noise offset fetches. That ain't going to be fast...
Ah yes, I forgot about the branching problem. Now that you've jogged my memory, I recall that I give the ddx/ddy values as constant 0 -- that shuts up the compiler and is nice and tidy.

Also I did some more tests and my 4x figure is not correct -- it's actually more like 2x (80 fps vs 40).

I have a problem with my fan at the moment, it causes the gpu to be throttled which is not really ideal when doing benchmarking :)

Re: Hardware PCF shadows

Post by sparkprime »

yeah that's very similar to my code except I don't use lod / proj and I have a noise texture to get the offsets.
Whoa! Wait! Your texture read is dependent on another texture read?? That's a lot of latency!!
It's a single noise fetch per fragment (the noise uv is just the screen pos / 64.0). (I also have another approach where the noise is a sort of procedural dither pattern but there is only about 1 fps difference between the two, curiously.) The noise fetch is used to offset every tap, i.e. the whole tap grid is offset. So the overhead per tap is very low.

If using PCF then there will be 4 fetches per tap, otherwise just 1.
Keep your noise texture at 8-bit 64x64 (Use L8 format) and you'll improve performance a lot thanks to cache usage.
It is 64x64 but it's 2 channels -- there is no relationship like x^2 + y^2 = 1 because the vectors are random points within the unit circle, and since we're on the topic, I found that high frequency noise with a gaussian distribution works pretty well -- the effect on the shadow is like a gaussian blur. I'll try dropping the res to 32x32 and see if it affects the FPS. (edit: no change to FPS)
Also check your noise texture access pattern, you may be doing it in a very cache unfriendly way.
Another possibility is to randomize using something different than a texture filled with noise. But a 64x64 noise is usually good.
It is my understanding from working with CUDA that GPU texture caches operate using a morton curve -- a single memory operation will bring in a n*n (for 2d texture) area, and subsequent fetches that fall into that area will hit the cache and not contend the bus. I've never heard before that the order of reads is important (for CPU it sometimes is due to prefetching), and in fact for graphics this doesn't make much sense since it would affect performance depending how you transformed your geometry (e.g. rendering the floor whilst looking north vs looking south).
Nevertheless, looks you could use of VSM (Variance Shadow Mapping) or ESM (Exponential shadow mapping). Note I couldn't make ESM useful in my case. IIRC I was overflowing too fast; and I needed substantial changes in Ogre code to make encoding correct that would prevent overflowing.
Yeah it is something I have basically punted thinking about for now. I read a year or two ago that in Crysis they used VSM for the landscape, because there were not so much artifact-causing "surfaces behind surfaces" in that geometry. But they used PCF for the regular geometry (props, etc). That makes me think I ought to at least perfect the PCF shadows first.

Re: Hardware PCF shadows

Post by dark_sylinc »

sparkprime wrote: Ah yes, I forgot about the branching problem. Now that you've jogged my memory, I recall that I give the ddx/ddy values as constant 0 -- that shuts up the compiler and is nice and tidy.
Try tex2Dlod. Ought to be slightly faster, mainly on ATIs.
sparkprime wrote: Also I did some more tests and my 4x figure is not correct -- it's actually more like 2x (80 fps vs 40).
That's more reasonable, though still not entirely.
sparkprime wrote: I have a problem with my fan at the moment, it causes the gpu to be throttled which is not really ideal when doing benchmarking :)
Well, that can produce lots of distortions. :?
sparkprime wrote: It is my understanding from working with CUDA that GPU texture caches operate using a morton curve -- a single memory operation will bring in a n*n (for 2d texture) area, and subsequent fetches that fall into that area will hit the cache and not contend the bus. I've never heard before that the order of reads is important (for CPU it sometimes is due to prefetching)
That's correct. However, keep in mind:
a. A sampling pattern may fall outside the N*N area multiple times, while a different pattern may reduce the number of times an N*N area is sent to the cache.

b. You're running multiple threads/pixels at the same time, and the N*N area you're talking about is shared for the whole group. This makes the order problem more important.

So in theory it looks like it shouldn't matter, but practice quickly shows the order does matter. Often it doesn't at all, but sometimes it makes a big difference. I've found one can easily screw it up when using poisson discs, unless they're all really, really close to the center sample.

By the way, I use 3 splits: 2048x2048; 1024x1024; 512x512 respectively. All of them PF_FLOAT32_R
It's an important factor I forgot to mention. If they're all 4096x4096 (or 2048x2048) it's gonna be slow.

Some games (Assassin's Creed, Just Cause 2) allow manually adjusting the splits' resolutions.

Re: Hardware PCF shadows

Post by sparkprime »

dark_sylinc wrote:

Code: Select all

        const float3 o = float3(invTexSize, -invTexSize.x) * 0.3f;

	// Perform 2x2 PCF
	float c =	(fDepth <= tex2Dlod(shadowMap, uv - o.xyyy).r) ? 1 : 0; // top left
	c +=		(fDepth <= tex2Dlod(shadowMap, uv - o.zyyy).r) ? 1 : 0; // top right
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.zyyy).r) ? 1 : 0; // bottom left
	c +=		(fDepth <= tex2Dlod(shadowMap, uv + o.xyyy).r) ? 1 : 0; // bottom right
	return c / 4;
Just realised this is not what I've been calling PCF. When I say with and without PCF I mean the emulation of the hardware operation -- i.e. adjacent texels. What you're doing is a spacing of 0.6 texels, so it's equivalent to what I'm doing with shadowEmulatePCF turned off and a filter size of 0.6. I can recreate that in my engine and post screenshots.

Re: Hardware PCF shadows

Post by sparkprime »

OK This is my reconstruction of your shader:

Note this is PSSM with 3 1024x1024 float32 textures, with linear depth normalised between -1 and 1 by dividing by 4000. This is a deferred shading engine but that's not going to make much difference here.

The scene is very close to the camera though of course.

user_cfg.shadowFilterSize = 0.6
user_cfg.shadowFilterTaps = 4
user_cfg.shadowEmulatePCF = false
user_cfg.shadowFilterNoise = false
user_cfg.shadowFilterDither = false

Image

This looks like crap so I suspect I'm missing something from your approach.

Here is a single tap:

user_cfg.shadowFilterTaps = 1
user_cfg.shadowEmulatePCF = false
user_cfg.shadowFilterNoise = false
user_cfg.shadowFilterDither = false

Image

And with the emulated PCF:

user_cfg.shadowFilterTaps = 1
user_cfg.shadowEmulatePCF = true
user_cfg.shadowFilterNoise = false
user_cfg.shadowFilterDither = false

Image

This is analogous to bilinear magnification filtering on a 2 colour mask texture.
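The emulation amounts to something like this sketch (hypothetical names; not necessarily the exact code used here) -- four depth comparisons blended with the bilinear weights, just like bilinear magnification of a binary mask:

```
float emulatePCF(sampler2D shadowMap, float2 uv, float fDepth,
                 float2 texSize, float2 invTexSize)
{
    float2 texel = uv * texSize - 0.5;
    float2 f     = frac(texel);                        // bilinear weights
    float2 base  = (floor(texel) + 0.5) * invTexSize;  // top-left texel centre
    float s00 = (fDepth <= tex2D(shadowMap, base).r)                           ? 1 : 0;
    float s10 = (fDepth <= tex2D(shadowMap, base + float2(invTexSize.x, 0)).r) ? 1 : 0;
    float s01 = (fDepth <= tex2D(shadowMap, base + float2(0, invTexSize.y)).r) ? 1 : 0;
    float s11 = (fDepth <= tex2D(shadowMap, base + invTexSize).r)              ? 1 : 0;
    return lerp(lerp(s00, s10, f.x), lerp(s01, s11, f.x), f.y);
}
```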

With 2x2 taps, the filter size starts to matter. I usually run it with 4 texels to get nice soft shadows:

user_cfg.shadowFilterSize = 4
user_cfg.shadowFilterTaps = 4
user_cfg.shadowEmulatePCF = false
user_cfg.shadowFilterNoise = false
user_cfg.shadowFilterDither = false

Image

And again with the emulated hardware PCF operation:

user_cfg.shadowFilterSize = 4
user_cfg.shadowFilterTaps = 4
user_cfg.shadowEmulatePCF = true
user_cfg.shadowFilterNoise = false
user_cfg.shadowFilterDither = false

Image

And the same two things for 3x3 taps:

Image
Image

So the FPS has really tanked by this point. I'd rather stick with a 2x2 filter and some noise. The first noise thing I have is this procedural dither thing which is equivalent to a 2x2 noise texture that yields (1,0) (-1,0) (0,1), (0,-1).

user_cfg.shadowFilterSize = 4
user_cfg.shadowFilterTaps = 1
user_cfg.shadowEmulatePCF = false (and then true)
user_cfg.shadowFilterNoise = false
user_cfg.shadowFilterDither = true

Image
Image

user_cfg.shadowFilterTaps = 4

Image
Image

user_cfg.shadowFilterTaps = 9

Image
Image

This doesn't look very appealing but I have a hunch if I ever do the trick with a post blur pass on a rendered shadow mask, then such a dither pattern will be much easier (and faster) to blur effectively.

However without that, I think a richer noise texture looks better. The noise texture I'm using is 64x64 and each texel is a random 2d vector whose length is less than 1. If uniformly distributed, these vectors would form a nice disc, however I prefer the look of a gaussian distributed noise, so there are more vectors that are close to the centre of the disc than near the edge.

user_cfg.shadowFilterSize = 4
user_cfg.shadowFilterTaps = 1
user_cfg.shadowEmulatePCF = false (and then true)
user_cfg.shadowFilterNoise = true
user_cfg.shadowFilterDither = false

Image
Image

user_cfg.shadowFilterTaps = 4

Image
Image

user_cfg.shadowFilterTaps = 9

Image
Image

These last two images (with and without emulated hardware PCF) are the two whose FPS I mentioned earlier.

Hope this goes some way to eliminating confusion :)

Re: Hardware PCF shadows

Post by sparkprime »

Here is the noise texture Image

I made it with the following code: http://gritengine.svn.sourceforge.net/v ... iew=markup

Also my shader code is here:

http://gritengine.svn.sourceforge.net/v ... iew=markup (utilities)
http://gritengine.svn.sourceforge.net/v ... iew=markup (stuff factored out from deferred and forward shading, like shadows)
http://gritengine.svn.sourceforge.net/v ... iew=markup (actual deferred and forward shading code)
moagames
Halfling
Posts: 70
Joined: Thu Apr 10, 2008 8:10 pm
x 1

Re: Hardware PCF shadows

Post by moagames »

I can't remember where I read it, but in one paper I read that one of the reasons they used PCF instead of VSM for the main shadows in their engine was that with PCF they could turn off colour writing, which gave them a performance gain of a factor of 2 compared to rendering colour too.
So a great advantage of hardware PCF is not only in applying the shadows, but already in rendering the shadow maps (which in my case takes most of the rendering time). By faking PCF you have to turn colour writing on and lose this performance gain.

Re: Hardware PCF shadows

Post by sparkprime »

If there is such a gain then I would definitely consider using the depth buffer instead of the linear-depth-in-float32 configuration. I would like to know if Ogre supports this.

Re: Hardware PCF shadows

Post by dark_sylinc »

sparkprime wrote:If there is such a gain then I would definitely consider using the depth buffer instead of the linear-depth-in-float32 configuration. I would like to know if Ogre supports this.
It doesn't. The reason was that D3D9 didn't support this (not without non-standardized hacks), and in OpenGL there are a few ill-defined issues with the compatible version (IIRC you get a lot less precision than what you asked for, and other sampling rules that aren't consistent across GPUs). D3D10 supports it.

In Ogre, the pixel format PF_DEPTH is reserved for this use, albeit not used.

The situation has changed since this was last reviewed. If you look at this site, the FOURCC "INTZ" hack is consistently supported by all 3 major vendors (ATI, NVIDIA, Intel) for D3D10-level cards running under D3D9. You're aiming for D3D10-level GPUs, so this is cool.

Also in OpenGL I believe there's a better version (for D3D10-level cards) of using depth as a texture, with similarly consistent support and requirements.

D3D10 supports it natively, and it's well defined.

If you want support for it, submit a patch ;)

By the way, you can still use linear depth in the depth buffer by tweaking the projection matrix. See this guide on how to do it. Like the article suggests, there are two ways:
1. Adding a mul & a div in the VS or
2. Adding a mul & slightly changing the projection matrix before sending it to the VS
The first option is easier to perform and to experiment with, while the latter ought to be faster.
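The first option boils down to something like this Cg sketch (hypothetical names; 'farClip' is an assumed uniform holding the far clip distance):

```
// Sketch of option 1: output linear depth from the vertex shader.
float4 clipPos   = mul(worldViewProj, localPos);
float  viewDepth = mul(worldView, localPos).z;  // view-space depth
// Multiplying by w means that after the hardware's perspective divide,
// depth = viewDepth / farClip, i.e. linear in view space.
// (Watch sign conventions: view-space z is negative in GL.)
clipPos.z = (viewDepth / farClip) * clipPos.w;
```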

As a side-note when using 3-split PSSM, the bottleneck is first the CPU drawcall overhead, second the GPU rendering to the shadow maps. Although this is highly dependent on the scene and they're always competing to see who's the main bottleneck. Depending on yours, you may see quite a performance improvement, or barely none. NVPerfHUD will tell you.

Edit: Don't bother with the RAWZ, DF16 & DF24 FOURCCs. They're the non-standardized hacks I told you about at the beginning. Each one has its own requirements and they are difficult to maintain. They were the reason depth pixel formats weren't considered for the D3D9 render system (and as a result, other render systems as well).
If you're going to go ahead with the Ogre patch, remember to add a render capability flag to indicate that depth textures are supported.
so0os
Bugbear
Posts: 833
Joined: Thu Apr 15, 2010 7:42 am
Location: Poznan, Poland
x 33

Re: Hardware PCF shadows

Post by so0os »

Doesn't tex2Dproj() compile down to stuff like fetch4?
Sos Sosowski :)
http://www.sos.gd

Re: Hardware PCF shadows

Post by sparkprime »

dark_sylinc: thanks, I'll look into this at the weekend (assuming Friday at work goes to plan so I actually have a weekend :) )
so0os wrote:Doesn't tex2Dproj() compile down to stuff like fetch4?
tex2Dproj(tex, uvz) == tex2D(tex, uv/z) (for normal colour maps)
tex2Dproj(tex, uvzw) == tex2D(tex, uvz/w) (for shadow maps)

Re: Hardware PCF shadows

Post by so0os »

I know that, but since they're intended for stuff like shadows, they should compile to that thingie. I read that somewhere, but I can't remember where now. The best way is to check it out, I guess.
Sos Sosowski :)
http://www.sos.gd
User avatar
sparkprime
Ogre Magi
Posts: 1137
Joined: Mon May 07, 2007 3:43 am
Location: Ossining, New York
x 13

Re: Hardware PCF shadows

Post by sparkprime »

so0os wrote:I know that, but since they're intended for stuff like shadows, they should compile to that thingie. I read that somewhere, but I can't remember where now. The best way is to check it out, I guess.
They're not intended for shadows, they're intended for projective textures.

Shadow textures are often projected, but that is only a coincidence.

The texture operation that *is* intended for shadows is the one with the depth component, and there is one of those for standard texture fetch, fetch with ddx/ddy, 2d, 3d, projective, rect and whatever else.
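For illustration, here's a rough software model of what that depth-compare fetch does on the GPU (a simplified sketch with hypothetical names: no wrap/clamp addressing, NV-style 2x2 PCF where the four binary compare results are bilinearly blended):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Software model of a hardware PCF fetch on a depth texture.
struct DepthTexture {
    int width, height;
    std::vector<float> texels; // stored depth values, row-major

    float fetch(int x, int y) const { return texels[y * width + x]; }

    // Model of tex2D(tex, float3(u, v, ref)) on shadow-fetch hardware:
    // depth-compare the four surrounding texels, bilinearly blend the results.
    float samplePCF(float u, float v, float ref) const {
        float fx = u * width - 0.5f, fy = v * height - 0.5f;
        int x0 = (int)std::floor(fx), y0 = (int)std::floor(fy);
        float tx = fx - x0, ty = fy - y0;
        // Comparison: 1 = lit (ref not behind the occluder), 0 = shadowed.
        auto cmp = [&](int x, int y) { return ref <= fetch(x, y) ? 1.0f : 0.0f; };
        float top = cmp(x0, y0) * (1 - tx) + cmp(x0 + 1, y0) * tx;
        float bot = cmp(x0, y0 + 1) * (1 - tx) + cmp(x0 + 1, y0 + 1) * tx;
        return top * (1 - ty) + bot * ty;
    }
};
```

The point of doing this in hardware is that you get the four compares and the blend for the price of a single fetch, instead of four fetches plus arithmetic in the shader.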
User avatar
sparkprime
Ogre Magi
Posts: 1137
Joined: Mon May 07, 2007 3:43 am
Location: Ossining, New York
x 13

Re: Hardware PCF shadows

Post by sparkprime »

I had to work over the weekend but I may have some time before next weekend to look into this again.

It seems that a good first objective would be to allow explicit access to depth buffers in Ogre. These are some things that it would be useful to support:

* Rendering shadow maps to depth buffer only.
* Binding depth buffers in shadow cast targets to material texture units for the purposes of a) depth read in shader b) hardware depth comparison in shader
* Binding depth buffers from compositor RTT targets to material texture units for doing compositor techniques that can make use of depth (deferred shading, DoF, SSAO)

There is already an Ogre::DepthBuffer; to what extent does it already support any of this? Should it be extended or replaced? Should there be another class to represent depth buffers?

What is the first step?
User avatar
sparkprime
Ogre Magi
Posts: 1137
Joined: Mon May 07, 2007 3:43 am
Location: Ossining, New York
x 13

Re: Hardware PCF shadows

Post by sparkprime »

Bumping because I need some help with these questions before I begin :)
User avatar
dark_sylinc
OGRE Team Member
OGRE Team Member
Posts: 5509
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1378

Re: Hardware PCF shadows

Post by dark_sylinc »

Hi! Sorry, didn't see you replied. Good thing you bumped.

I am the one who wrote the Ogre::DepthBuffer class. It was part of a large patch that introduced depth sharing support.

Long story short: because the depth buffer is a delicate issue (and we share depth buffers between render targets as much as we can, to reduce memory waste and needless buffer changes), you "hint" Ogre by using a depth buffer pool ID.
It's similar to the RenderQueue: everything with the same ID goes together. In theory, all RTTs using the same depth buffer pool ID would share the same buffer. Because of some HW restrictions, among other things, this may not be 100% true, but you have deterministic control over it (which means you can anticipate at production time when a buffer will be shared and when it won't). Besides, the main advantage is that you're guaranteed that RTTs with different depth buffer pool IDs will never share the same depth buffer (unless RSC_RTT_SEPARATE_DEPTHBUFFER isn't present, which happens in arcane OpenGL implementations).

Now to your point: the purpose of DepthBuffer is to encapsulate API-dependent buffers. In D3D9 it's overloaded by D3D9DepthBuffer, in OpenGL by GLDepthBuffer.

Using a DepthBuffer directly for rendering won't be 100% straightforward in Ogre because it doesn't derive from the RenderTarget class. And it probably shouldn't. I can see two ways of implementing this:
  • Implement a special method in SceneManager that accepts a DepthBuffer instead of a RenderTarget. Probably a lot of code duplication, and it would imply modifying the RenderSystems too. Messy.
  • Use a dummy RenderTarget that contains a null colour target or something, and use its attached DepthBuffer.
I like the 2nd option best, as it's clean and easy. Note that:
  • You can get the DepthBuffer pointer from an RTT by calling RenderTarget::getDepthBuffer(). Beware: it may return null.
  • DepthBuffers are assigned and attached automatically in RenderSystem::setDepthBufferFor
  • The shadow textures' depth buffer pool IDs can be set in SceneManager::setShadowTextureConfig()
  • The default depth buffer pool ID for all RTTs is POOL_DEFAULT (POOL_DEFAULT = 1)
  • IDs are arbitrary; they don't have to be sequential. POOL_NO_DEPTH & POOL_MANUAL_USAGE are reserved (they're both 0)
  • Non-manual depth buffers may be destroyed on device loss, so never keep a pointer to them! Always use getDepthBuffer. Manual buffers are OK as long as you don't call _cleanupDepthBuffers(true)
For example, with the defaults, shadow maps will often share the depth buffer with the main render window.
But that mustn't happen if you want to use the depth buffer as a texture to read the shadow map data while rendering to the back buffer; in that case you'll need at least two depth buffers, or more if you're using PSSM (one per split).
Therefore each shadow map should have its own pool ID, which should also differ from the RTT's. You would need to ensure no other RT uses those IDs. (Note: Ogre doesn't randomly assign IDs; by default they're all POOL_DEFAULT unless specifically given a different number.)
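A toy model of the sharing rule described above (not Ogre code; FakeDepthPool and its names are made up for illustration, and real sharing also depends on HW restrictions): RTTs with the same pool ID may share a buffer, while different IDs never do:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

// Simplified stand-in for a depth buffer: only its dimensions matter here.
struct FakeDepthBuffer {
    int width, height;
};

// Toy model of depth buffer pooling: acquire() reuses a big-enough buffer
// from the requested pool, or creates a new one. Buffers are never handed
// out across pool IDs.
class FakeDepthPool {
    std::map<int, std::vector<std::shared_ptr<FakeDepthBuffer>>> pools;
public:
    std::shared_ptr<FakeDepthBuffer> acquire(int poolId, int w, int h) {
        for (auto& buf : pools[poolId])
            if (buf->width >= w && buf->height >= h)
                return buf; // share an existing, compatible buffer
        auto buf = std::make_shared<FakeDepthBuffer>(FakeDepthBuffer{w, h});
        pools[poolId].push_back(buf);
        return buf;
    }
};
```

This is why giving each shadow map split its own pool ID forces separate depth buffers: requests with distinct IDs can never land on the same buffer.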

Last but not least:
Sticking to the pool ID rules would be really cool. You can use very high numbers to ensure they're never accidentally used, or use an algorithm that picks IDs that haven't been chosen yet, add asserts, etc.
However, while I was designing the depth buffer class, I foresaw that someone might eventually need manual control of it, with great flexibility.

To switch to manual control, use the special (reserved) pool ID:

Code: Select all

//Assuming renderSystem = Root::getSingleton().getRenderSystem()
DepthBuffer *myDepthBuffer = renderSystem->_createDepthBufferFor( myRenderTarget );
myRenderTarget->setDepthBufferPool( DepthBuffer::POOL_MANUAL_USAGE );
//and the equivalent pool ID for SceneManager::setShadowTextureConfig

if( !myRenderTarget->attachDepthBuffer( myDepthBuffer ) )
   error(); //Handle this yourself

//Call this when you no longer need the depth buffer (i.e. on exit)
myRenderTarget->detachDepthBuffer();

//Otherwise the DepthBuffer will occupy memory until exit. You can force
//flushing, which will erase ALL depth buffers; they'll be recreated when
//needed. Make sure you don't hold a pointer to any DepthBuffer (manual
//or not!) or else it will become dangling.
renderSystem->_cleanupDepthBuffers( true );
Manual control is per RenderTarget, not a global setting, as a global switch would break the depth sharing system (which is very useful in compositors).

Hope this helps
Cheers
Dark Sylinc