Hi Assaf!
Sorry that I'm late to the party, but I don't check these forums as frequently as I used to.
Anyway, I have recently spent time debugging stalls and dropped frames, and would like to tell you and anyone else reading about some of my experiences.
I have a copy of VTune XA and frequently use it for profiling, but VTune isn't always the best tool for the job. A hotspot that only occurs in 1 out of 300 frames gets lost in the noise of the hotspots from the other 299 frames.
Adding manual profiling code throughout my code (as you seem to be doing now) has been mildly successful. The problem with it is that it produces lots of data that is difficult to analyse.
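One way to cut down the volume of manual profiling data is to keep only the frames that blow the frame budget, so the rare bad frame isn't buried under hundreds of good ones. This is just a rough sketch of that idea, not what I actually run: `FrameSpikeLog`, `SlowFrame` and `elapsedMs` are made-up names, and the budget is whatever you pass in.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical record of a single frame that exceeded the budget.
struct SlowFrame {
    std::size_t index;    // zero-based frame number
    double milliseconds;  // measured frame duration
};

// Hypothetical helper: stores only frames slower than a given budget.
class FrameSpikeLog {
public:
    explicit FrameSpikeLog(double budgetMs) : budgetMs_(budgetMs) {
        assert(budgetMs > 0.0);
    }

    // Call once per frame with the measured frame duration.
    void record(double frameMs) {
        if (frameMs > budgetMs_) {
            slow_.push_back({frameCount_, frameMs});
        }
        ++frameCount_;
    }

    const std::vector<SlowFrame>& slowFrames() const { return slow_; }
    std::size_t frameCount() const { return frameCount_; }

private:
    double budgetMs_;
    std::size_t frameCount_ = 0;
    std::vector<SlowFrame> slow_;
};

// Milliseconds elapsed since 'start' on a monotonic clock.
inline double elapsedMs(std::chrono::steady_clock::time_point start) {
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now() - start).count();
}
```

In a render loop you would take `std::chrono::steady_clock::now()` at the top of the frame and call `log.record(elapsedMs(start))` at the end. That tells you which frames were slow, but not why, which is where GPUView comes in.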
However, they both pale in comparison to GPUView. GPUView is provided as part of the Windows Performance Toolkit. You start it while your application is running and it captures HUGE volumes of information until you stop it. It manages to do this without noticeably impacting system performance, which is great. It can then display the information in a useful way and relate it all to the graphics pipeline. I've found that it's great for detecting multithreading-related rendering stalls, such as main thread context switches or locks waiting too long, but it can also help with things like hardware buffer allocations causing stalls. It is really amazing and anyone profiling stalls should certainly look at it.
Here are two problems I found while using it, which I hope will illustrate its usefulness.
1. Main thread context switches. I have N hardware threads on this machine (8 in my case). Ogre therefore creates N threads for background use. All N of these threads have the same priority as the main thread, and the Windows scheduler will give them all roughly equal CPU time. When all 8 threads were busy doing background loading, the main thread was starved of CPU time and therefore did not render for a brief time. This resulted in hard-to-catch dropped frames. The solution was to make Ogre create only N-1 threads and to increase the thread priority of the main thread. Huge difference.
2. I have a GTX 670 here. I began noticing dropped frames in windowed mode (Windows 7) again and wasn't sure what the cause was. VTune wasn't showing any issues and there wasn't much happening in terms of threading or heavy processing. So I opened GPUView. I found that I was rendering at 200 FPS, cleanly rendering multiple frames in one vsync interval. The command queue wasn't overloaded and everything looked fine, but the Windows Desktop Window Manager (DWM) was not flipping the buffer when it should. Sometimes half of my frames would not get flipped, resulting in my 200 FPS feeling like 30... Eventually I tried running on my second monitor. BAM! Totally smooth. It seems as though my particular card has one DVI port that runs optimally in windowed mode and one that doesn't. I didn't find a solution for that problem, but I was able to rule it out as being an issue with my code. (screenshot below)
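For the fix in point 1, the idea looks roughly like the sketch below. It is a sketch only, not my actual code: `backgroundWorkerCount`, `raiseCurrentThreadPriority` and `configureThreadsForRendering` are made-up helper names, the choice of `THREAD_PRIORITY_ABOVE_NORMAL` is an assumption (pick whatever level suits your app), and how you hand the worker count to Ogre depends on your Ogre version, so that part is left out.

```cpp
#include <cassert>
#include <thread>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: leave one hardware thread free for the main thread.
// std::thread::hardware_concurrency() may return 0 when the count is
// unknown, so anything up to 2 falls back to a single worker.
inline unsigned backgroundWorkerCount(unsigned hardwareThreads) {
    if (hardwareThreads <= 2) {
        return 1;
    }
    return hardwareThreads - 1;
}

// Hypothetical helper: raise the priority of the calling thread.
// Call it from the main (render) thread. Returns true on success.
inline bool raiseCurrentThreadPriority() {
#ifdef _WIN32
    return SetThreadPriority(GetCurrentThread(),
                             THREAD_PRIORITY_ABOVE_NORMAL) != 0;
#else
    return false;  // no-op on other platforms in this sketch
#endif
}

// Call once at startup from the main thread; returns the number of
// background workers to create.
inline unsigned configureThreadsForRendering() {
    unsigned workers =
        backgroundWorkerCount(std::thread::hardware_concurrency());
    assert(workers >= 1);
    raiseCurrentThreadPriority();
    return workers;
}
```

On an 8-thread machine this gives 7 background workers, so the main thread always has a core it doesn't have to fight for.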
I would strongly encourage you to check it out. It's a very handy tool to add to your performance measuring toolkit.
Until you know what you are looking at, it's hard to tell what's going on in the profile, but once you've read up on it, it's very handy.
http://graphics.stanford.edu/~mdfisher/GPUView.html
In the screenshot, the blue vertical lines mark the vsync intervals. You can clearly see that the flip queue is not flipping every frame. HTH
c0deface.
GPUVIEW.jpg