Ogre 2.0 doc (slides) - Updated 1st Dec 2012

Discussion area about developing with Ogre-Next (2.1, 2.2 and beyond)


TheSHEEEP
OGRE Retired Team Member
Posts: 972
Joined: Mon Jun 02, 2008 6:52 pm
Location: Berlin
x 65

Re: Ogre 2.0 doc (slides)

Post by TheSHEEEP »

syedhs wrote:Probably threading should be made into core and thus compulsory, not an option?
I agree.
Or is there a case/platform in which you would not want Ogre to thread internally if it can?
My site! - Have a look :)
Also on Twitter - extra fluffy
Klaim
Old One
Posts: 2565
Joined: Sun Sep 11, 2005 1:04 am
Location: Paris, France
x 56

Re: Ogre 2.0 doc (slides)

Post by Klaim »

There are two things I see:

1. the call to renderOneFrame()
2. resource loading

1) can currently only be done from one and the same thread every time (can that be fixed?).
Some parts of 1) can be spawned as asynchronous tasks (animation update, etc.).
2) can be done entirely as asynchronous tasks.
The user controls which thread calls 1), so it might be the main thread or another thread.

My understanding is that:
A. Ogre itself (I mean the core) doesn't need to spawn threads. 1) is controlled by user code; 2) should hand control to the user's task scheduler (to avoid oversubscription).
B. Ogre needs to provide potentially asynchronous tasks to be crunched by worker threads (which may mean linear execution if only the main thread is running).
C. Ogre can provide an implementation of a task scheduler (which spawns and manages worker thread(s)) IF the user doesn't explicitly provide his own. As it would be optional (but the default), it would be a component (much as now?).

As soon as you have an abstraction for a task scheduler, you don't need to spawn threads yourself, and you can assume (as long as you don't spawn infinite tasks...) that if the user doesn't want Ogre to spawn threads, it won't. The fact that there would be a default implementation (however it works) would only be to help get something running quickly and to simplify samples, like the default scene manager that is provided.

Is my understanding correct? So far that's what I thought Cabalistic was talking about.
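To make that concrete, here is a minimal sketch of what such an abstraction could look like. The names (TaskScheduler, addTask, etc.) are hypothetical and not an existing Ogre API; the only point is that the core would talk to this interface and never create threads directly:

Code: Select all

#include <functional>

// Hypothetical abstraction: Ogre core only ever enqueues work here and never
// creates threads itself. The user (or an optional default component) decides
// what actually runs the tasks.
class TaskScheduler
{
public:
    typedef std::function<void()> Task;

    virtual ~TaskScheduler() {}

    // Enqueue a potentially asynchronous task; the implementation decides
    // whether it runs on a worker thread or inline on the calling thread.
    virtual void addTask( Task task ) = 0;

    // Block until every task enqueued so far has finished.
    virtual void waitForAll() = 0;
};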
Herb wrote:Couple of thoughts... I like the idea in principle of separating Ogre into more, smaller components, but I also realize there can be more complexity with that model "if" you're using a majority of the components. More DLLs to load and register in order to use the components. I guess I'm speaking more towards a person who's new to Ogre, as there are so many other components to integrate already before even thinking about integrating components within Ogre. If nothing else, it's a thought to consider if that moves forward.
AFAIK most Ogre "Components" (not plugins) are static libs or additional code injected into OgreMain, right?
As for Boost, I agree with the comments. I actually like the fact that I can select what threading library to use; for example, I use POCO instead of Boost. Really, if Boost is a requirement, then we should actually "use" its features throughout the library.
Boost has never been a requirement for Ogre, and it was already confirmed that it will not become one.
But, as for threading, has anyone looked at the threading support in C++11? I thought threading support was baked into that and should be cross-platform, provided Visual Studio has it implemented (most things I find the GNU guys have already baked in).
I did try to provide a C++11 implementation of Ogre's current use of multithreading (there is a thread somewhere). It was a failure because:

- C++11 doesn't provide any task scheduling features (even std::async is flawed and not appropriate)
- C++11 doesn't provide a multiple-readers/single-writer mutex, which makes things a bit hard to handle with high efficiency (even if, in some cases, exploiting shared_ptr atomicity fixes this).

Now, as stated above, I don't think there is a need for direct manipulation of threads in Ogre, and C++11 mostly provides only thread basics, which are an excellent base for future work, but we're not quite there yet.

Following the approach I was talking about before (Ogre core always relying on a task scheduling abstraction for potentially asynchronous tasks), I would suggest that Ogre implement and provide by default the simplest task scheduler ever, which would just execute tasks sequentially. It would be a good test too. Then, optional implementations of the task scheduling interface could be provided for TBB and other popular frameworks. One based on C++11 std::async() would be easy to provide (but really far from ideal or performant, I think - also, there is a leak in the VS2012 implementation).
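As a rough sketch of that "simplest task scheduler ever" (assuming the hypothetical TaskScheduler interface sketched above), it would just run every task immediately on the calling thread:

Code: Select all

// Minimal sequential implementation: no threads are ever spawned, tasks run
// inline in the order they are added. Useful as a default and as a test.
class SequentialTaskScheduler : public TaskScheduler
{
public:
    virtual void addTask( Task task )   { task(); } // execute right away
    virtual void waitForAll()           {}          // nothing is ever pending
};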
User avatar
lunkhound
Gremlin
Posts: 169
Joined: Sun Apr 29, 2012 1:03 am
Location: Santa Monica, California
x 19

Re: Ogre 2.0 doc (slides)

Post by lunkhound »

Nice work on the slides. Some great ideas in there, and obviously a lot of thought has gone into them.
I agree with what some others have mentioned earlier about tackling these changes in manageable chunks, i.e. a series of smaller refactors rather than a big rewrite.

I have some concerns about the whole SoA thing though. I worry that it may be a lot of developer pain for very little gain. Considering that:

1. SoA isn't necessary to fix the cache misses. Cache misses can be fixed by reorganizing how data is laid out, but without interleaving the vector components of different vectors together.
2. SoA isn't necessary to improve the performance of vector math using SIMD. OK, maybe you don't get the full benefit of SIMD, and not in all cases, but you can probably get 70% of SoA performance simply by using a SIMD-ified vector library.
3. SoA is not easy to work with. Code is harder to read, harder to debug, and more effort to maintain going forward. Imagine inspecting Node structures in the debugger when the vector components are interleaved in memory with other Nodes...

I think SoA is best for a limited-scope, highly optimized, tight loop where every cycle counts, affecting only a small amount of code. Kind of like assembly language, SoA comes with a cost in developer time, and I'm just not sure it would be worth it.
Thanks again for all the work on those slides. I'm really glad to see these issues being raised.

Chris
dark_sylinc
OGRE Team Member
Posts: 5477
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1359

Re: Ogre 2.0 doc (slides)

Post by dark_sylinc »

Regarding my view on "separate components":
I'm always keen on abstracting, providing modularity. In fact that's what I tried to achieve with the proposed render flow algorithm.
However, this doesn't mean that each "component" can go into a DLL or static lib. There are multiple issues that can't be easily tackled. This level of modularization is something that sounds nice, but ends up being impractical.
Furthermore, the bigger the project, the bigger the chance that there's some dependency between seemingly independent modules. I could've written about these dependencies in the slides, but it's very implementation-specific, and it would mislead from the main topic.

When objects are highly modular, they're easier to maintain, refactor, or even replace. But that doesn't mean each has to go in its own DLL or lib.
Klaim wrote:There are two things I see:

1. the call to renderOneFrame()
2. resource loading

1) can currently only be done from one and the same thread every time (can that be fixed?).
Some parts of 1) can be spawned as asynchronous tasks (animation update, etc.).
2) can be done entirely as asynchronous tasks.
The user controls which thread calls 1), so it might be the main thread or another thread.

My understanding is that:
A. Ogre itself (I mean the core) doesn't need to spawn threads. 1) is controlled by user code; 2) should hand control to the user's task scheduler (to avoid oversubscription).
B. Ogre needs to provide potentially asynchronous tasks to be crunched by worker threads (which may mean linear execution if only the main thread is running).
C. Ogre can provide an implementation of a task scheduler (which spawns and manages worker thread(s)) IF the user doesn't explicitly provide his own. As it would be optional (but the default), it would be a component (much as now?).
My view is that renderOneFrame() is called from one thread. The user may want to update their logic & physics in the same thread, or in another one.

As for Ogre's management of threads:
  • The CompositorManager must have a high degree of control over its batch threads.
  • The animation & scene node transform updates may have their own threads. Because their jobs are fairly trivial (and there are many ways to split the work), the idea of a TaskScheduler provided by the user seems fine to me.
Note that all components (including, and especially, the CompositorManager) should accept a hint on the number of threads they can spawn, in order to prevent oversubscription (i.e. the user wants to run many threads for himself, unrelated to Ogre).
lunkhound wrote:I have some concerns about the whole SoA thing though. I worry that it may be a lot of developer pain for very little gain. Considering that:

1. SoA isn't necessary to fix the cache misses. Cache misses can be fixed by reorganizing how data is laid out, but without interleaving the vector components of different vectors together.
2. SoA isn't necessary to improve the performance of vector math using SIMD. OK, maybe you don't get the full benefit of SIMD, and not in all cases, but you can probably get 70% of SoA performance simply by using a SIMD-ified vector library.
3. SoA is not easy to work with. Code is harder to read, harder to debug, and more effort to maintain going forward. Imagine inspecting Node structures in the debugger when the vector components are interleaved in memory with other Nodes...

I think SoA is best for a limited-scope, highly optimized, tight loop where every cycle counts, affecting only a small amount of code. Kind of like assembly language, SoA comes with a cost in developer time, and I'm just not sure it would be worth it.
Thanks again for all the work on those slides. I'm really glad to see these issues being raised.
You're right about your concerns. So let me address them:

1. It is true that there are other ways to optimize the data. However, transformation and culling are actually fairly trivial operations which are done sequentially on a massive amount of elements. Note that the interleaving is for SIMD. An arrangement of "XYZXYZXYZ" is possible by specifying 1 float per object at compile time.
The performance gains of using SoA for critical elements such as positions & matrices are documented in SCEE's paper (reference 4)

2. We already do SIMD math and try to do our best. There are huge margins to gain using SoA + SIMD, because the access patterns and the massive number of operations to perform fit exactly the way SSE2 works. There's a lot of overhead in unpacking & packing.
DICE's Culling the Battlefield slides show the big gains of using SoA + SIMD (reference 3)

3. Without proper planning, it's harder to write. That's true. However, my idea is that SoA_Vector3 is encapsulated, including operators (+, -, /, *, etc., using _mm_add_ps & co.).
So, the code would roughly look like one of these two:

a. Keep derivedPos, derivedRot, derivedScale, and the WorldTransform matrix (like Ogre currently does):

Code: Select all

for( int i=0; i<mCount; i += 4 )
{
    /* prefetch() around here */

    //We're updating 4 elements per iteration; one SoA_Vector3/Quaternion/Matrix4
    //spans the 4 objects starting at index i.
    const SoA_Vector3 &parentPos    = mChunks[level+0].pos[i];
    SoA_Vector3 &localPos           = mChunks[level+1].pos[i];
    SoA_Vector3 &derivedPos         = mChunks[level+1].derivedPos[i];
    const SoA_Quaternion &parentRot = mChunks[level+0].rot[i];
    SoA_Quaternion &localRot        = mChunks[level+1].rot[i];
    SoA_Quaternion &derivedRot      = mChunks[level+1].derivedRot[i];
    const SoA_Vector3 &parentScale  = mChunks[level+0].scale[i];
    SoA_Vector3 &localScale         = mChunks[level+1].scale[i];
    SoA_Vector3 &derivedScale       = mChunks[level+1].derivedScale[i];

    SoA_Matrix4 &derivedTransform   = mChunks[level+1].transform[i];

    derivedPos   = parentPos + parentRot * (parentScale * localPos);
    derivedRot   = parentRot * localRot;        //fsel() to check whether parentRot should be treated as identity
    derivedScale = parentScale * localScale;    //fsel() here too

    derivedTransform = NonTemporal( SoA_Matrix4( derivedPos, derivedRot, derivedScale ) );
}
b. Discard derivedPos, derivedRot, derivedScale; always work with matrices (but harder to retrieve derived rotation & scale on demand):

Code: Select all

for( int i=0; i<mCount; i += 4 ) //Actually, it's not "+= 4", but rather += compile_time_number_of_simd_elements_macro
{
    /* prefetch() around here */

    //We're updating 4 elements per iteration.
    SoA_Vector3 &localPos       = mChunks[level+1].pos[i];
    SoA_Quaternion &localRot    = mChunks[level+1].rot[i];
    SoA_Vector3 &localScale     = mChunks[level+1].scale[i];

    const SoA_Matrix4 &parentTransform  = mChunks[level+0].transform[i];
    SoA_Matrix4 &derivedTransform       = mChunks[level+1].transform[i];

    SoA_Matrix4 localTransform = SoA_Matrix4( localPos, localRot, localScale ); //Use fsel() for rot & scale
    derivedTransform = NonTemporal( parentTransform * localTransform );
}
No intrinsics whatsoever; they're inside the operators. It's not harder to read either. And with good docs about how the render flow works (which we already have with the slides), it's not hard to "get" what's going on within the loop (i.e., why += 4).
If we expose the SoA nature to users and even Ogre devs, then we're probably doing something wrong. The idea is to work in SoA, not to make the users think in SoA (other than very high-level knowledge, like "I'm working on X objects at the same time"; note that X can be reduced to 1).

It's very true that debugging becomes much harder, especially when examining a single Entity or SceneNode.
I see two complementary solutions:
  • getPosition() would retrieve the scalar version, which can be called from the watch window (as long as we ensure it's fully const...); a sketch follows below.
  • There are a few MSVC features (I don't remember if they had to be installed, or if they were defined through pragmas) that tell MSVC how to read objects while debugging. gdb probably has something similar.
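For illustration, a scalar accessor over the interleaved storage could look roughly like this (the helper name and the 'chunkBase'/'slot' parameters are invented; it assumes blocks of 4 objects stored as XXXXYYYYZZZZ):

Code: Select all

#include <OgreVector3.h>

// Hypothetical helper: read one object's position out of an XXXXYYYYZZZZ block
// of 4 objects. 'chunkBase' points at the block, 'slot' is the object's index
// (0-3) within it (one cache fetch when the block is suitably aligned).
inline Ogre::Vector3 getScalarPosition( const float *chunkBase, unsigned slot )
{
    return Ogre::Vector3( chunkBase[0 * 4 + slot],   // X of this object
                          chunkBase[1 * 4 + slot],   // Y of this object
                          chunkBase[2 * 4 + slot] ); // Z of this object
}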
Last edited by dark_sylinc on Sun Nov 25, 2012 1:07 am, edited 1 time in total.
dark_sylinc
OGRE Team Member
Posts: 5477
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1359

Re: Ogre 2.0 doc (slides)

Post by dark_sylinc »

I found it!
Visual Studio allows customizing how it shows data in the debugger. All we have to do is mess with autoexp.dat (and, optionally, use a DLL that can be added using the ADDIN directive), which is usually located in C:\Program Files\Microsoft Visual Studio 9\Common7\Packages\Debugger\Autoexp.dat.
Klaim
Old One
Posts: 2565
Joined: Sun Sep 11, 2005 1:04 am
Location: Paris, France
x 56

Re: Ogre 2.0 doc (slides)

Post by Klaim »

dark_sylinc wrote: My view is that renderOneFrame() is called from one thread. The user may want to update their logic & physics in the same thread, or in another one.
Yes that's what I meant.
As for Ogre's management of threads:
  • The CompositorManager must have a high degree of control over its batch threads.
I don't understand this. To me, whatever the kind of parallel work, it should go through the task scheduler underneath, the same way parallel_for in TBB spawns tasks for each batch of iterations.
  • The animation & scene node transform updates may have their own threads. Because their jobs are fairly trivial (and there are many ways to split the work), the idea of a TaskScheduler provided by the user seems fine to me.
Note that all components (including, and especially, the CompositorManager) should accept a hint on the number of threads they can spawn, in order to prevent oversubscription (i.e. the user wants to run many threads for himself, unrelated to Ogre).
Well, I really don't understand why there is a need to spawn threads if you want to prevent oversubscription (as I said too), because the only way is to let the user control the task scheduler and make Ogre agnostic about this. I might misunderstand something, but to me, as soon as a library spawns its own threads, it becomes a candidate for oversubscription.
Threads are too low-level a resource.
_tommo_
Gnoll
Posts: 677
Joined: Tue Sep 19, 2006 6:09 pm
x 5

Re: Ogre 2.0 doc (slides)

Post by _tommo_ »

I'll drop this here, even though I haven't been an Ogre user for a long time (and the non-existence of said 2.0 is one of the reasons), because it strikes me that no one sees it this way:

To me, Ogre, as a pure graphics engine, needs to NOT expose any threading system.
I don't like at all the idea that a renderer will "take on a life of its own" and start spawning threads unless I apply some arcane form of control (i.e. subclassing the default task manager class).
The default should be simplicity.

Threading is something that is clearly orthogonal to all the systems in a complete game engine made of graphics, sound, physics, AI and whatever other library;
so relevant Ogre functions need to be clearly defined as thread-safe, but Ogre should not attempt to spawn threads or tasks itself.
It should leave the client application all the freedom over when, how, and how concurrently to run its code, while being clear about what can and cannot be parallelized.

This should be both much faster and more solid to develop, and easier to support as threading needs evolve, because they will; at the same time it eases the burden on those who need a lean system without bloat.

PS: IMO all of Ogre 2.0 should aim at being a pure graphics library, focusing on simplicity. And this IMO means dropping a lot of existing functionality, and becoming more passive about which role Ogre takes in a game engine architecture.
Basically everyone who approaches Ogre feels the urge to place it as the cornerstone of their engine (with no decoupling of maths, threading, and scene management between rendering & logic), and Ogre is responsible for this because of its current all-encompassing architecture.

PPS: the docs are great, but the biggest setback Ogre has with regard to the said engines (and Unity, which strangely was not mentioned even though it is the greatest Ogre-killer among AAs) is TOOLS. Lots of excellent tools for artists and designers.
So alongside a simplification of the graphics library itself, there should be a serious effort to make the engine useful, as in, in the real world. Thinking that devs are leaving Ogre because it is "not fast enough for AAAs" means completely missing the point.

PPPS: most of the proposed ways of "optimizing" by bruteforcing jumps or switching full-on to SoA + SIMD just ignore that Ogre today needs to run energy-efficiently on cheap ARMs much more than it needs to squeeze SSE2 archs, and are probably best ignored; they are indeed an ugly case of optimizing without even thinking about what the use case will be.
The DICE papers might be good for their very restrictive use cases (next gen consoles and PCs) but fail quite badly when you try to make, say, an Android game.
OverMindGames Blog
IndieVault.it: The new Italian portal for Game Dev & Indie Games
saejox
Goblin
Posts: 260
Joined: Tue Oct 25, 2011 1:07 am
x 36

Re: Ogre 2.0 doc (slides)

Post by saejox »

Does 2.0 aim for better performance or better usability?

If it is going to be thread-safe, it means hundreds of mutexes in every function.
Goodbye performance.

Ogre already has many shared_ptrs and locks, even though it is not thread-safe.
I think all those useless locks and shared_ptrs should be removed.
No need to wait for a big release for that.

There are many opportunities for SSE2+ and cache-friendly structures, as mentioned in the paper.
Ogre is already the most usable open source rendering engine; it just needs to be faster and less resource-hungry to be more competitive.

That's how I see it.
Nimet - Advanced Ogre3D Mesh/dotScene Viewer
asPEEK - Remote Angelscript debugger with html interface
ogreHTML - HTML5 user interfaces in Ogre
Xavyiy
OGRE Expert User
Posts: 847
Joined: Tue Apr 12, 2005 2:35 pm
Location: Albacete - Spain
x 87

Re: Ogre 2.0 doc (slides)

Post by Xavyiy »

and Unity, which strangely was not mentioned even though it is the greatest Ogre-killer among AAs) is TOOLS. Lots of excellent tools for artists and designers.
Well, actually I don't think it's fair to compare Unity to Ogre that way. Unity is a full game engine, very featureful, with an awesome editor perfectly married to the engine. Also quite optimized, especially the last year's versions. Ogre is a render engine, just that, which urgently needs a redesign focused on optimization and DX11/OGL4 architecture. It's not cool seeing that a complex scene runs twice as fast in UDK or even Unity. It's not very cool either how each compositor render_scene pass culls the whole scene again, etc.
the docs are great, but the biggest setback Ogre has with regard to the said engines (and Unity, which strangely was not mentioned even though it is the greatest Ogre-killer among AAs) is TOOLS. Lots of excellent tools for artists and designers.
What kind of tools do you want to see? A scene editor? Material editor? But that's again the same story: Ogre is just a render engine; it should not provide any kind of high-level tool. Just mesh/material importers/exporters and mesh optimization tools, not much more, IMHO.
Thinking that devs are leaving Ogre because it is "not fast enough for AAAs" means completely missing the point.
Actually this is my case. Of course I'm not leaving Ogre, but I'm quite concerned about 1.8.X/1.9.X performance. I think any Ogre developer using some compositors (especially ones involving render_scene passes), high quality shadows (cascaded shadow mapping with 3 or 4 shadow maps, for example) and any kind of water system (which will need at least 2 more render passes: reflection and refraction; the depth map may be shared with other depth-based effects, like the one used for DOF and similar) shares my concerns about Ogre performance.
The DICE papers might be good for their very restrictive use cases (next gen consoles and PCs) but fail quite badly when you try to make, say, an Android game.
IMHO PCs and next gen consoles are not a very restrictive use case. But indeed, it would be nice to put some attention on ARM, although mobile SoCs are evolving very fast. Anyway, I think development should be focused on "next-gen" PC and console architecture (aka DX11) rather than on limited mobile ones (GLES2 / 3?).

-----------------------------

I have the feeling that whatever the 2.0 roadmap turns out to be, it won't be ideal for the whole community. I would like to read concrete solutions rather than "general ideas", since I see a very low SNR in all Ogre redesign threads (of course! each person has their own interests, but things must move ahead!)

As I've said before, I hope my posts don't sound rude; that's not my intention at all.
There are many opportunities for SSE2+ and cache-friendly structures, as mentioned in the paper.
Ogre is already the most usable open source rendering engine; it just needs to be faster and less resource-hungry to be more competitive.
+1!

Xavier

Edit: Just to clarify: I'm not saying I don't consider mobile platforms important; in fact I think they are very important. What I want to say is that IMHO, Ogre development should be done around PC (as until now) and not around mobile.
DanielSefton
Ogre Magi
Posts: 1235
Joined: Fri Oct 26, 2007 12:36 am
Location: Mountain View, CA
x 10

Re: Ogre 2.0 doc (slides)

Post by DanielSefton »

Xavyiy wrote:Although I think the development should be focused on "next-gen" PC and console architecture (aka DX11) rather than on limited mobile ones (GLES2 / 3?).
Actually I believe that Ogre's mobile development userbase may even surpass that of traditional PC/console development in the near future.

Regardless, both should be the focus, just like Unreal and Unity are able to run fast on both PC and mobile. If Ogre becomes more lightweight and cache-friendly, then it will naturally benefit speed on mobile as well as PC, and various team members can make sure the balance of platform-specific optimisation is met (like David on iOS, Murat on Android, Assaf on DX11, etc.)
Xavyiy wrote:Unity is a full game engine, very featureful, with an awesome editor perfectly married to the engine. Also quite optimized, especially the last year's versions. Ogre is a render engine, just that, which urgently needs a redesign focused on optimization
Agreed. One of Ogre's main attractions is that it allows developers to create their own tools and engines around it. As we tell users all the time, Ogre is NOT a game engine, it's a graphics library. If you want tools, use/extend Ogitor, or something third party like Xavyiy's Paradise Engine ;)
saejox wrote:Does 2.0 aim for better performance or better usability?
You said it yourself:
saejox wrote:Ogre is already the most usable open source rendering engine; it just needs to be faster and less resource-hungry to be more competitive.
And to add my opinion to the threading discussion, parallel architecture should be up to third-party engines (Ogre itself shouldn't be using TBB or Boost threads). But you may perhaps expose multiple update loops, like scene graph and frame rendering, I guess. There would be nothing stopping us from providing a basic example framework which makes use of TBB.

However, my research stopped before I got the chance to check out the idea of turning render operations into tasks, like what Sony's Phyre Engine does.
Mako_energy
Greenskin
Posts: 125
Joined: Mon Feb 22, 2010 7:48 pm
x 9

Re: Ogre 2.0 doc (slides)

Post by Mako_energy »

_tommo_ wrote:To me, Ogre, as a pure graphics engine, needs to NOT expose any threading system.
I've been thinking about this more heavily as of late, and I am growing into that mentality myself. The more I think about the complications of working with a scheduler that Ogre is aware of and interfaces with, the more I think it'll just cause issues if anyone has a different idea of how threading should work in their game. Different projects have different needs, and it seems somewhat unrealistic to assume you can put a catch-all into Ogre that will work. Then there is my next point...
Xavyiy wrote:Well, actually I don't think it's fair to compare Unity to Ogre that way. Unity is a full game engine, very featureful, with an awesome editor perfectly married to the engine. Also quite optimized, especially the last year's versions. Ogre is a render engine, just that, which urgently needs a redesign focused on optimization and DX11/OGL4 architecture. It's not cool seeing that a complex scene runs twice as fast in UDK or even Unity. It's not very cool either how each compositor render_scene pass culls the whole scene again, etc.
Xavyiy wrote:What kind of tools do you want to see? A scene editor? Material editor? But that's again the same story: Ogre is just a render engine; it should not provide any kind of high-level tool. Just mesh/material importers/exporters and mesh optimization tools, not much more, IMHO.
I hear this often among Ogre users more experienced than I; however, I can't really see how this is true. Ogre does SO MUCH that I feel it is halfway to a game engine, and as I have stated in some other posts, the resource system is a large part of that. I completely agree that what you are saying is how Ogre should be... but I can't at all agree that's what it is. Breaking off more things into components or plugins is needed. Starting with the resource system, IMO.

In addition, I don't think more systems should be added to exacerbate the issue. If something lightweight and flexible can't be implemented, then don't try to put threading into Ogre at all. One possibility that comes to mind is the multi-threading in Bullet. It's a very simple class meant to be overridden that I have heard works well with a large number of multi-threading strategies. I personally haven't used it (yet) so I can't comment too much on it, but it's an idea I just wanted to throw out there. Has anyone here used it? Would a similar class be appropriate for Ogre?
Xavyiy wrote:I have the feeling that whatever the 2.0 roadmap turns out to be, it won't be ideal for the whole community. I would like to read concrete solutions rather than "general ideas", since I see a very low SNR in all Ogre redesign threads (of course! each person has their own interests, but things must move ahead!)
I think we all want to see this move ahead as fast as possible, but a lot of us have different use cases that must be made known if we hope to arrive at a solution that is the most ideal for the community. To that end, maybe expecting people to post in the development forums is asking too much of most of the people out there. If the Ogre team has the time, maybe it would be better to do another survey, one aimed more directly at all the subjects raised here. At least ask enough questions to get a start on the whole thing.
dark_sylinc
OGRE Team Member
Posts: 5477
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1359

Re: Ogre 2.0 doc (slides)

Post by dark_sylinc »

_tommo_ wrote:To me, Ogre, as a pure graphics engine, needs to NOT expose any threading system.
I don't like at all the idea that a renderer will "take on a life of its own" and start spawning threads unless I apply some arcane form of control (i.e. subclassing the default task manager class).
The default should be simplicity.
This is WHY I insist so much on allowing the user to specify how many threads they want Ogre to spawn. For example, Havok can spawn no threads, or spawn as many as it wants. This value is set during startup.

If you don't want Ogre to "take on a life of its own", just tell it not to while creating Ogre::Root.
BTW, some sound systems tend to take on a life of their own without notice, because that's what DirectSound needs, for example. The future is in multi-core, so it's about time we tackle that.
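As an illustration, such a startup hint could look roughly like the following. The struct and its members are invented for this sketch; this is not an existing Ogre API, only the shape of the idea:

Code: Select all

// Hypothetical startup hint (names invented, not an actual Ogre API): the
// application caps how many worker threads Ogre may spawn; 0 means "spawn
// none, run everything on the calling thread".
struct OgreThreadingHints
{
    unsigned compositorWorkerThreads;   // batch threads for the CompositorManager
    unsigned updateWorkerThreads;       // animation & scene node transform updates
};

// A game with its own job system could pass { 0, 0 } and keep full control of
// the cores; a small app could let Ogre use a couple of them:
OgreThreadingHints hints = { 2, 2 };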
As for Ogre's management of threads:

The CompositorManager must have a high degree of control over its batch threads.
I don't understand this. To me, whatever the kind of parallel work, it should go through the task scheduler underneath, the same way parallel_for in TBB spawns tasks for each batch of iterations.
What I meant is that the control needed over the batch (worker) threads is too advanced. Creating a generic task scheduler that it would run on is not a trivial issue at all. Maybe something for the far future, IF it seems to be viable.
To prevent oversubscription, tell Ogre at startup how many threads it can spawn at max.
Do you have a few links to similar implementations of what you have in mind? Because I think I'm not seeing what you see.
_tommo_ wrote:PS: IMO all of Ogre 2.0 should aim at being a pure graphics library, focusing on simplicity. And this IMO means dropping a lot of existing functionality, and becoming more passive about which role Ogre takes in a game engine architecture.
Basically everyone who approaches Ogre feels the urge to place it as the cornerstone of their engine (with no decoupling of maths, threading, and scene management between rendering & logic), and Ogre is responsible for this because of its current all-encompassing architecture.
You're describing converting Ogre into a set of utility libraries. A rendering engine is exactly composed of a math library, a render queue, a batch dispatcher & material manager, and a scene graph.
The urge IMO comes from all of this being in the same big chunk called "SceneManager" (except for the math part).
_tommo_ wrote:PPS: the docs are great, but the biggest setback Ogre has with regard to the said engines (and Unity, which strangely was not mentioned even though it is the greatest Ogre-killer among AAs) is TOOLS. Lots of excellent tools for artists and designers.
So alongside a simplification of the graphics library itself, there should be a serious effort to make the engine useful, as in, in the real world. Thinking that devs are leaving Ogre because it is "not fast enough for AAAs" means completely missing the point.
I agree on the tools. This is why I added a few slides about making RTSS more node-like. If we make a customizable node system, creating a graphical interactive tool for setting up materials would be very easy. As for the rest, I left them out because they demand a PDF of their own.
There are a few tools that are too game engine specific (rather than render engine) when compared to Unity & UDK (like DanielSefton said); however these are areas we can develop on:
  • Material editor. A real one. Preferably with node views for setting the relations (that artists would use), and a syntax highlighter to write the shader associated with each node (that programmers would write). Of course, WYSIWYG
  • Compositor editor. Preferably with node views; but a stack based implementation (like GIMP) may work too. WYSIWYG
  • Export integration. So far the biggest complaint I get from artists & indie companies is that the export pipeline sucks. UDK & Unity do this pretty well. There are too many steps involved in getting from 3DS Max/Maya/Blender into the actual game. This is because:
    1. It usually involves setting up a material file (by hand, using text! artists don't like text!), being careful not to overwrite a previous material file
    2. Exporting all the right submeshes, and placing the .mesh file into the right folder (or setting up resources.cfg)
    3. Setting up an additional file for the custom game engine to link the GameObject with a given Mesh
    4. Getting a preview means doing all the above (+ launching a custom preview tool, Ogre Meshy, or even loading the full game, depending on each case). It's a pain if something was wrong and the above steps must be followed all over again. It cuts the iteration process
That said, most of us don't have the time to work on tools, because tools involve GUI code (which many find boring & frustrating). Making a good GUI is an art, and requires a lot of co-development with artists & designers (after all, those are the users). This forum sadly lacks artists.
It's a chicken and egg problem; we can't make appealing tools because we don't have artists to work with, and we don't have artists because we have no appealing tools.
_tommo_ wrote:Thinking that devs are leaving Ogre because it is "not fast enough for AAAs" means completely missing the point.
It's not fast enough for AAA, nor for indies either. I'm not working for an AAA company, and Ogre's limitations are annoying me as well as other users. The big main problem is that it lacks scalability. Overclock your CPU from 3 GHz to 6 GHz and it will only speed up a little, because of the cache misses (you can overclock the RAM to increase the bandwidth, but then you'll increase latency...). Throw in a CPU with more cores or a faster GPU and it will run as slow as it did before. In other words, we're doomed if we don't change this scenario. Especially since AAA companies are lending their engines to the average Joe (thus competing with Ogre & game engines relying on Ogre).

Faster means more flexibility. Even small games have to look out for the number of bones & entities they spawn, whereas it is much more flexible if they don't have to worry about that at all, and leave the problem to the experts making games who need all the juice they can squeeze.

Also, appealing to AAA companies may have its perks. If it's good enough, there's the potential for Ogre to start getting sponsors like LuaJIT & Bullet do, because many game companies find it easier to fund an open source project they find useful than to pay >$50,000 per title license.
_tommo_ wrote:PPPS: most of the proposed ways of "optimizing" by bruteforcing jumps or switching full-on to SoA + SIMD just ignore that Ogre today needs to run energy-efficiently on cheap ARMs much more than it needs to squeeze SSE2 archs, and are probably best ignored; they are indeed an ugly case of optimizing without even thinking about what the use case will be.
The DICE papers might be good for their very restrictive use cases (next gen consoles and PCs) but fail quite badly when you try to make, say, an Android game.
It's true that we're bruteforcing. But the current implementation is trying to be smart and fails miserably. Android phones are going multicore, and NEON is the SSE of ARM.
In fact, the most power-hungry element in a phone is the RAM. More bandwidth usage = more battery wear. And since we're doing lots of cache misses, wasting lots of RAM on needless variables, and running ultra slow, it's safe to say we're draining energy like a nuclear submarine.

Optimizing means finishing the frame faster. If the Android game is updated at 5 Hz, a faster frame update means sleeping more between each frame. Most phone optimizations revolve around lazy frame updating (updating only when necessary, or just the parts that need to) or updating elements at different frequencies depending on the object's nature (i.e. a tree vs the main player).
Lazy frame updating 90% of the time falls outside the scope of a render engine: it has to be done at a higher level, and with a proper Ogre setup (i.e. compositors, multiple visibility layers).
As for updating elements at different frequencies, it should be much easier, because we plan to separate elements into sets of chunks. Just place the elements into different render queue IDs, and slightly modify Ogre (a design feature that can be evaluated) to update the chunks at different intervals.
saejox wrote:If it is going to be thread-safe, it means hundreds of mutexes in every function.
Goodbye performance.
God no! The threading model is about splitting work across objects that aren't being touched at the same time (hence no need for locking except when the job is done), that's all. :D
saejox wrote: Ogre already has many shared_ptrs and locks, even though it is not thread-safe.
I think all those useless locks and shared_ptrs should be removed.
No need to wait for a big release for that.
I cannot agree more! :)
Mako_energy wrote:I hear this often among Ogre users more experienced than I; however, I can't really see how this is true. Ogre does SO MUCH that I feel it is halfway to a game engine, and as I have stated in some other posts, the resource system is a large part of that. I completely agree that what you are saying is how Ogre should be... but I can't at all agree that's what it is. Breaking off more things into components or plugins is needed. Starting with the resource system, IMO.
Part of this "Ogre does SO MUCH" comes (as a downside?) from Open Source. Some programmer pops up, decides he needs X implemented to render his stuff the way he wants, without investing much time in checking whether there was already a way of achieving the same result; then he submits his change and it gets into core. This programmer probably doesn't show up again after that.
Therefore we end up with redundant ways of doing the same thing.
Take the variable mVisible, for example. It's unnecessary. Why? Because we have visibility masks. Just reserve one layer for making stuff invisible, and problem solved. Instead, we add an extra byte (up to 4 if bad alignment packing happens) where using 1 bit out of the 32 in mVisibilityMask would have been enough.
I remember that the main reason one of the many Render Listeners was added (I don't recall exactly which one of the listeners) was because someone in the community wanted to control whether some objects were visible in particular passes. Oh wait, that's what mVisibilityMask is for: to selectively render & filter between passes...
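As an illustration of that point, hiding an object through the visibility flags instead of a dedicated boolean could look like this (the reserved bit is an arbitrary choice for this sketch; the calls are the Ogre 1.x-style flag/mask API):

Code: Select all

#include <OgreEntity.h>
#include <OgreViewport.h>

// Reserve one visibility-flag bit as a "hidden" layer instead of a separate
// mVisible bool per object (the bit chosen here is arbitrary).
static const Ogre::uint32 HIDDEN_LAYER = 1u << 31;

void hide( Ogre::Entity *entity )
{
    entity->setVisibilityFlags( entity->getVisibilityFlags() | HIDDEN_LAYER );
}

void excludeHiddenLayer( Ogre::Viewport *viewport )
{
    // Anything flagged with HIDDEN_LAYER is filtered out for this viewport.
    viewport->setVisibilityMask( viewport->getVisibilityMask() & ~HIDDEN_LAYER );
}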
Making Node::getPosition virtual is another example of this habit (I'M JUST GUESSING, but it probably became virtual because someone at some point needed it).

It's not bad per se, as someone's work may prove very useful for something totally unrelated to the original intention (that couldn't be achieved with the other preexisting method); but because these programmers tend to implement their contribution in a rush, it leaves little time for thinking about how it all fits together in the grand design. When these contributions start piling up, we end up with half render engine, half game engine; and the "Ogre does SO MUCH" phrase.

Like I said, it's not necessarily bad. And I don't want to disrespect the Open Source community at all! :)
But every once in a while we need to clear the mess, tie up the loose ends, and remove what's totally unnecessary. It's not easy though. What looks totally unnecessary to someone may actually not be for a few niches.
Sqeaky
Gnoblar
Posts: 24
Joined: Sat Jun 19, 2010 1:44 am

Re: Ogre 2.0 doc (slides)

Post by Sqeaky »

I agree that a lighter, more focused Ogre would be nice. Some features are easy enough to avoid, but others, like the resource system, seem very core and hard to avoid if unwanted. Despite this, I think that threading should be a core component of Ogre. I think that trying to add an abstraction layer such that Ogre takes advantage of external threading systems is not useful without adding copious complexity. Going the other way and allowing customization by game developers is a good idea, but there is a limit to what can be accomplished. Even if it were just a renderer, because of the renderer's importance to a game, Ogre will either be providing the threading model for the game or will be run separately from the rest of the frame, and in practice only simple configuration changes (like thread count) will be made to its threading. If the Ogre threading model is well thought out, the former will seem a reasonable solution for many. If the Ogre threading model is lightweight enough to be ignored in the microseconds before or after it runs, and the threading can be disabled, allowing the game logic to ignore it and run in other threads, then everyone else can be happy.
Xavyiy wrote:I would like to read concrete solutions rather than "general ideas", since I see a very low SNR in all ogre redesign threads
I don't know a huge amount about cache misses, but I am writing a threading library that I will gladly re-license to zlib (currently it is GPL3) for Ogre use. It is not ready for prime time yet, but it has some of the features I want it to have when it is done. I also want to tune it extensively for performance.

It is a variation on a conventional WorkQueue/Threadpool that moves where synchronization occurs to minimize contention. Currently it is lockless and uses atomic CAS operations to prevent race conditions internally, while exposing a concept of dependencies to make writing workunits easier. In WorkUnit code there is no need to use or know about conventional synchronization primitives, but they can optionally be used if desired. I can describe it in as much detail as anyone would like, and you can also see what I have so far at https://github.com/BlackToppStudios/DAGFrameScheduler/ . It also has fairly comprehensive doxygen docs, including a description of the algorithm, located at doc/html/index.html relative to the root of the downloaded repo.
Need an alternative to a single threaded main loop for a game: https://github.com/BlackToppStudios/DAGFrameScheduler/
--Sqeaky
lunkhound
Gremlin
Posts: 169
Joined: Sun Apr 29, 2012 1:03 am
Location: Santa Monica, California
x 19

Re: Ogre 2.0 doc (slides)

Post by lunkhound »

dark_sylinc wrote:
lunkhound wrote:I have some concerns about the whole SoA thing though. I worry that it may be a lot of developer pain for very little gain. Considering that:

1. SoA isn't necessary to fix the cache misses. Cache misses can be fixed by reorganizing how data is laid out, but without interleaving the vector components of different vectors together.
2. SoA isn't necessary to improve the performance of vector math using SIMD. OK, maybe you don't get the full benefit of SIMD, and not in all cases, but you can probably get 70% of SoA performance simply by using a SIMD-ified vector library.
3. SoA is not easy to work with. Code is harder to read, harder to debug, and more effort to maintain going forward. Imagine inspecting Node structures in the debugger when the vector components are interleaved in memory with other Nodes...

I think SoA is best for a limited-scope, highly optimized, tight loop where every cycle counts, affecting only a small amount of code. Kind of like assembly language, SoA comes with a cost in developer time, and I'm just not sure it would be worth it.
Thanks again for all the work on those slides. I'm really glad to see these issues being raised.
You're right about your concerns. So let me address them:

1. It is true that there are other ways to optimize the data. However, transformation and culling are actually fairly trivial operations which are done sequentially on a massive amount of elements. Note that the interleaving is for SIMD. An arrangement of "XYZXYZXYZ" is possible by specifying 1 float per object at compile time.
The performance gains of using SoA for critical elements such as positions & matrices are documented in SCEE's paper (reference 4)
Sorry, I didn't make myself clear. I agree that the SCEE paper is exactly the sort of thing that we ought to be doing, but it doesn't mention SoA as I understand it. When I see "SoA" I think of this: http://software.intel.com/en-us/article ... chitecture

Code: Select all

struct StructureOfArrays
{
   float x[numVertices];
   float y[numVertices];
   float z[numVertices];
...
};
Intel has been telling everyone to swizzle their data like this ever since they came out with MMX. My comments were ONLY directed at this Intel-style swizzling, and not at the sort of grouping of homogeneous data structures featured in the SCEE reference. I will refer to it as "swizzling" and not "SoA" for clarity.
dark_sylinc wrote: 2. We already do SIMD math and try to do our best. There are huge margins to gain using SoA + SIMD, because the access patterns and the massive number of operations to perform fit exactly the way SSE2 works. There's a lot of overhead in unpacking & packing.
DICE's Culling the Battlefield slides show the big gains of using SoA + SIMD (reference 3)
You'll notice, however, that in those DICE slides they are not actually storing any of their data structures swizzled in memory. They swizzled the frustum planes (on the fly, presumably), and then loop over un-swizzled bounding spheres. That's a great use of SoA/swizzling, because no user-facing data structures are swizzled.

<snipsnip> code looks OK...
dark_sylinc wrote: It's very true that debugging becomes much harder, especially when examining a single Entity or SceneNode.
I see two complementary solutions:
  • getPosition() would retrieve the scalar version, which can be called from the watch window (as long as we ensure it's fully const...)
  • There are a few MSVC features (I don't remember if they had to be installed, or if they were defined through pragmas) that tell MSVC how to read objects while debugging. gdb probably has something similar.
I think looking at those DICE slides again actually convinced me that there is very little to gain from keeping stuff in swizzled format in memory. Just swizzle the frustum planes on the fly and a bit of optimized SIMD code will yield great performance.
If there are any performance gains to be had from swizzling the SceneNodes in memory, I would expect them to be tiny and not at all worth the trouble it would cause every user who has to examine a SceneNode in the debugger.
However, I'm sure there are cases where it would make sense, like a particle-system.
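For reference, the "swizzle on the fly" approach could look roughly like the sketch below: the bounding spheres stay un-swizzled in memory, only the plane is splatted into SSE registers and the four spheres' components are gathered per iteration (the function and type names are invented; SSE2 only):

Code: Select all

#include <xmmintrin.h>

struct Sphere { float x, y, z, radius; };   // stored plainly, not swizzled

// Test 4 spheres against one plane (n.x, n.y, n.z, d). Returns a 4-bit mask,
// bit i set when sphere i is completely on the negative side of the plane.
int spheresOutsidePlane( const Sphere s[4],
                         float nx, float ny, float nz, float d )
{
    // Splat ("swizzle") the plane across the registers once.
    const __m128 plx = _mm_set1_ps( nx ), ply = _mm_set1_ps( ny );
    const __m128 plz = _mm_set1_ps( nz ), pld = _mm_set1_ps( d );

    // Gather the four spheres' components on the fly.
    const __m128 sx = _mm_set_ps( s[3].x, s[2].x, s[1].x, s[0].x );
    const __m128 sy = _mm_set_ps( s[3].y, s[2].y, s[1].y, s[0].y );
    const __m128 sz = _mm_set_ps( s[3].z, s[2].z, s[1].z, s[0].z );
    const __m128 sr = _mm_set_ps( s[3].radius, s[2].radius,
                                  s[1].radius, s[0].radius );

    // signed distance = dot( n, center ) + d, for all 4 spheres at once
    const __m128 dist = _mm_add_ps( _mm_add_ps( _mm_mul_ps( plx, sx ),
                                                _mm_mul_ps( ply, sy ) ),
                                    _mm_add_ps( _mm_mul_ps( plz, sz ), pld ) );

    // outside when dist < -radius
    const __m128 outside =
        _mm_cmplt_ps( dist, _mm_sub_ps( _mm_setzero_ps(), sr ) );
    return _mm_movemask_ps( outside );
}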
spookyboo
Silver Sponsor
Posts: 1141
Joined: Tue Jul 06, 2004 5:57 am
x 151

Re: Ogre 2.0 doc (slides)

Post by spookyboo »

Some programmer pops up, decides he needs X implemented to render his stuff the way he wants, without investing much time in checking whether there was already a way of achieving the same result; then he submits his change and it gets into core
This is indeed the disadvantage of open source. If you want to redesign Ogre, you need a dedicated team that sticks with it 'till the end' and has a clear vision. Every change to the core is validated by that team. The problem is that such a team needs time and an incentive (money, no personal life) to stick to the project. That is the difference between Ogre development and companies like Epic and Crytek. Ogre can survive when combined with some kind of commercial activity. This has been tried before (by Steve), but I am the first to admit that this is no easy task. Ogre needs at least some substantial gifts from large companies (you know who you are!). Maybe these companies want something in return, but as long as this fits into the team's vision, I don't see a problem.
dark_sylinc
OGRE Team Member
Posts: 5477
Joined: Sat Jul 21, 2007 4:55 pm
Location: Buenos Aires, Argentina
x 1359

Re: Ogre 2.0 doc (slides)

Post by dark_sylinc »

lunkhound wrote:Sorry, I didn't make myself clear. I agree that the SCEE paper is exactly the sort of thing that we ought to be doing, but it doesn't mention SoA as I understand it. When I see "SoA" I think of this: http://software.intel.com/en-us/article ... chitecture

Code: Select all

struct StructureOfArrays
{
   float x[numVertices];
   float y[numVertices];
   float z[numVertices];
...
};
Intel has been telling everyone to swizzle their data like this ever since they came out with MMX. My comments were ONLY directed at this Intel-style swizzling, and not at the sort of grouping of homogeneous data structures featured in the SCEE reference. I will refer to it as "swizzling" and not "SoA" for clarity.
Oh I see. SCEE's is technically "SoA", which stands for Structure of Arrays (or pointers). If we look at the SceneNode declaration from SCEE's paper, it is:

Code: Select all

class SceneNode
{
   Vector3    *Position; //ptr
   Quaternion *qRot;     //ptr
   Matrix4    *matrix;   //ptr
};
Indeed, Intel's proposal since the introduction of MMX sucked hard, because when we need to go scalar (and we know that happens sooner or later) reading X, Y & Z means three cache fetches, since they're too far apart. It's horrible. Not to mention very inflexible.
That's why I came up with the idea of interleaving the data as XXXXYYYYZZZZ: when we go scalar, it is still one fetch (on systems that fetch 64-byte lines).
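A rough sketch of that hybrid layout, with the intrinsics hidden behind operators as described earlier (the names and member layout are invented for illustration):

Code: Select all

#include <xmmintrin.h>

// One SoA_Vector3 spans 4 objects: x holds XXXX, y holds YYYY, z holds ZZZZ,
// so the block sits as XXXXYYYYZZZZ in memory. Operators wrap the intrinsics,
// so the transform/culling loops never touch _mm_* directly.
struct SoA_Vector3
{
    __m128 x, y, z;

    SoA_Vector3 operator + ( const SoA_Vector3 &o ) const
    {
        SoA_Vector3 r;
        r.x = _mm_add_ps( x, o.x );   // 4 X components at once
        r.y = _mm_add_ps( y, o.y );   // 4 Y components at once
        r.z = _mm_add_ps( z, o.z );   // 4 Z components at once
        return r;
    }
};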
lunkhound wrote:I think looking at those DICE slides again actually convinced me that there is very little to gain from keeping stuff in swizzled format in memory. Just swizzle the frustum planes on the fly and a bit of optimized SIMD code will yield great performance.
If there are any performance gains to be had from swizzling the SceneNodes in memory, I would expect them to be tiny and not at all worth the trouble it would cause every user who has to examine a SceneNode in the debugger.
However, I'm sure there are cases where it would make sense, like a particle-system.
Actually, we have nothing to lose and possibly something to gain (performance). And I'll tell you why:
Regardless of whether you want to swizzle in memory or swizzle using instructions, we still have to write the code that ensures all memory is contiguous. Even if we don't use SSE at all (we would use the XYZXYZ model, that is, specifying one float instead of four at compile time), we need contiguity, and the ability to load from memory without data dependencies.

My idea is that on PC systems we default to four floats and use SSE. However, if you really, really think debugging is going to be a big problem (even with MSVC's custom data display; I admit not everyone uses MSVC), then compile using one float; and there can also be a "SoA_Vector3" implementation that uses packing instructions to swizzle the memory into the registers on the fly.
After all, SoA_Vector3 & co. are platform-dependent. On PCs with 4 floats per object, it will use SSE intrinsics. On ARM with 2 or 4 floats per object, it will use NEON.
On PCs with 1 float per object, it can use scalar operations... or packing+shuffling SSE intrinsics and still operate on 4 objects at a time, like you suggest.

So, it is a win-win situation. We can have it my way and your way too, with minimal effort (other than writing multiple versions of SoA_Vector3, SoA_Quaternion & SoA_Matrix4). The magic happens in the memory manager, which will dictate how the SoA memory gets allocated & arranged. The rest of the systems are totally abstracted from the number of floats interleaved.
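The compile-time switch could be as simple as the following sketch (the macro and typedef names are invented); only the memory manager and the SoA math types would ever look at it:

Code: Select all

// Hypothetical compile-time width selection: 4 floats per object on PC (SSE,
// XXXXYYYYZZZZ blocks), 1 float per object for a debug-friendly scalar build
// (plain XYZXYZ). The rest of the code is written once against SoA_Vector3.
#ifndef OGRE_SOA_FLOATS_PER_OBJECT
    #define OGRE_SOA_FLOATS_PER_OBJECT 4
#endif

#if OGRE_SOA_FLOATS_PER_OBJECT == 4
    #include <xmmintrin.h>
    typedef __m128 ArrayReal;   // one register = one component of 4 objects
#else
    typedef float ArrayReal;    // scalar fallback
#endif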
Sqeaky
Gnoblar
Posts: 24
Joined: Sat Jun 19, 2010 1:44 am

Re: Ogre 2.0 doc (slides)

Post by Sqeaky »

spookyboo wrote:
Some programmer pops up, decides he needs X implemented to render his stuff the way he wants, without investing much time in checking whether there was already a way of achieving the same result; then he submits his change and it gets into core
This is indeed the disadvantage of open source. If you want to redesign Ogre, you need a dedicated team that sticks 'till the end' and has a clear vision.
I disagree that this is a disadvantage; free labor is rarely bad, particularly if you do have a core team, as Ogre appears to. Many open source projects have become very successful exactly because of this kind of free labor. But this is clearly off-topic.
Klaim wrote:
As for Ogre's management of threads:
  • The CompositorManager must have a high degree of control over its batch threads.
I don't understand this. To me, whatever the kind of parallel work, it should go through the task scheduler underneath, the same way parallel_for in TBB spawns tasks for each batch of iterations.
It may be easier to understand if you think of all multithreaded code as providing and expecting guarantees. Different task/workunit scheduling algorithms expect different amounts of thread safety from their workunits and interact with their workunits based on those assumptions. Some schedulers require no thread safety, some require just re-entrancy, some require full data-write isolation, and other, more esoteric requirements are possible. Tasks/WorkUnits will also implicitly make assumptions about their schedulers. They are written differently depending on whether workunits finish in a known order, whether two workunits are guaranteed not to access the same resources, whether every data access needs to be wrapped in a mutex/atomic CAS, and on what information the scheduler provides to the workunit.

If the default Ogre task scheduler provides certain guarantees and the game developer provides a task scheduler of his own, it must provide at least the same guarantees. If the new scheduler provides more, the Ogre tasks/workunits will not be able to take full advantage of them because they are already written. If it provides fewer guarantees, it will likely introduce race conditions or deadlocks.

For a more concrete example, please consider Apple's libdispatch ( http://libdispatch.macosforge.org/ ), which uses a custom barrier primitive, custom semaphores and communication with the scheduler to ensure data consistency, and an Apache WorkQueue ( http://cxf.apache.org/javadoc/latest/or ... Queue.html ), which implicitly assumes that the work unit will provide its own consistency mechanism without any communication from the Queue, and implements a timeout to provide other guarantees. It would be very difficult to write a work unit that would work ideally in both places.

The parallel_for construct in Threading Building Blocks really is a different class of construct than a scheduler. It is a threading primitive designed to parallelize obviously parallelizable problems. I do not know, but I suspect that many parts of Ogre are not obviously parallelizable, and I suspect that any threading algorithm used in Ogre must be carefully designed to get maximum performance.
dark_sylinc wrote:To prevent oversubscription, tell Ogre at startup how many threads it can spawn at max.
There are likely a few other configuration options that could be used when Ogre starts to adjust this, but I agree the thread count is the obvious one. IMHO a good threading design will allow the game developer to interact with the Ogre threading system in at least three ways:
  • Tight Integration with Ogre's threading - Using the same system would clearly be beneficial for small projects or projects that require similar performance characteristics. This could be supported by exposing whatever classes and functions implement the threading model.
  • Ignore Ogre by letting it do its work all at once - This would use the thread count to give Ogre full control of all the system resources so it could finish its work swiftly. The game would use its own threading system and simply ignore Ogre's during the rest of the frame while it performed the required game tasks. This can be supported by allowing configuration on the threading models classes (thread count, and maybe others).
  • Ignore Ogre by letting it do its work on a limited number of threads - Some games may already take full advantage of threading or may require a long time for a single-threaded task. This allows the game logic to run for the full duration of a frame while Ogre works on its mostly separate task. This would be the hardest of these use cases to support, but I think it could be managed by providing a limited, thread-isolated way to push updates to Ogre through a small number of functions.
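Just to make those three modes concrete, here is a purely hypothetical configuration sketch - none of this is an existing Ogre API, the names are invented only to illustrate where the thread count and an external scheduler would plug in.

Code: Select all

#include <cstddef>

// Hypothetical sketch only -- not an existing Ogre interface.
namespace HypotheticalOgre
{
    class TaskScheduler; // user-replaceable scheduler (mode 1: tight integration)

    struct ThreadingConfig
    {
        // 0 = "use every core" (mode 2: the game waits while Ogre works),
        // a small fixed number = mode 3 (Ogre runs beside the game's threads).
        std::size_t maxWorkerThreads;

        // Non-null = mode 1: Ogre submits its tasks to the game's scheduler
        // instead of spawning any threads itself.
        TaskScheduler* externalScheduler;

        ThreadingConfig() : maxWorkerThreads(0), externalScheduler(0) {}
    };

    void configureThreading(const ThreadingConfig& cfg); // applied once at startup
}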
Need an alternative to a single threaded main loop for a game: https://github.com/BlackToppStudios/DAGFrameScheduler/
--Sqeaky
User avatar
lunkhound
Gremlin
Posts: 169
Joined: Sun Apr 29, 2012 1:03 am
Location: Santa Monica, California
x 19

Re: Ogre 2.0 doc (slides)

Post by lunkhound »

dark_sylinc wrote:
lunkhound wrote:Sorry, I didn't make myself clear. I agree that the SCEE paper is exactly the sort of thing that we ought to be doing, but it doesn't mention SoA as I understand it. When I see "SoA" I think of this: http://software.intel.com/en-us/article ... chitecture

Code: Select all

struct StructureOfArrays
{
   float x[numVertices];
   float y[numVertices];
   float z[numVertices];
...
};
Intel has been telling everyone to swizzle their data like this ever since they came out with MMX. My comments were ONLY directed at this Intel-style swizzling, and not at the sort of grouping of homogeneous data structures featured in the SCEE reference. I will refer to it as "swizzling" and not "SoA" for clarity.
Oh I see. SCEE's is technically "SoA" which stands for Structure of Arrays (or ptrs). If we look at the SceneNode declaration from SCEE's, it is:

Code: Select all

class SceneNode
{
   Vector3 *Position; //ptr
   Quaternion *qRot;  //ptr
   Matrix4 *matrix;   //ptr
};
I've never seen that called SoA before. That's a structure of pointers to structures (or a structure of pointers into arrays of structures). I'm not sure if there is an "official" definition for SoA (nothing on Wikipedia). But I've always seen it mentioned in conjunction with SIMD.
dark_sylinc wrote:Indeed, Intel's proposal since the introduction of MMX sucked hard. Because when we need to go scalar (we know that happens sooner or later), reading X, Y & Z is three cache fetches, because they're too far apart. It's horrible. Not to mention very inflexible.
That's why I came out with the idea of interleaving the data as XXXXYYYYZZZZ: When we go scalar, it is still one fetch (in systems that fetch 64-byte lines).
lunkhound wrote:I think looking at those DICE slides again actually convinced me that there is very little to gain from keeping stuff in swizzled format in memory. Just swizzle the frustum planes on the fly and a bit of optimized SIMD code will yield great performance.
If there are any performance gains to be had from swizzling the SceneNodes in memory, I would expect them to be tiny and not at all worth the trouble it would cause every user who has to examine a SceneNode in the debugger.
However, I'm sure there are cases where it would make sense, like a particle-system.
Actually, we have nothing to lose and possibly something to gain (performance). And I'll tell you why:
Regardless of whether you want to swizzle in memory or swizzle using instructions, we still have to write the code that ensures all memory is contiguous. Even if we don't use SSE at all (we would use the XYZXYZ model, that is, specifying one float instead of four at compile time), we need contiguity and the ability to load from memory without data dependencies.

My idea is that on PC systems we default to four floats and use SSE. However, if you really, really think debugging is going to be a big problem (even with MSVC's custom data display; I admit not everyone uses MSVC), then compile using one float; and there can also be a "SoA_Vector3" implementation that uses packing instructions to swizzle the memory onto the registers on the fly.
After all, SoA_Vector3 & co. are platform dependent. On PCs with 4 floats per object, it will use SSE intrinsics. On ARM with 2 & 4 floats per object, it will use NEON.
On PCs with 1 float per object, it can use scalar operations... or packing+shuffling SSE intrinsics and still operate on 4 objects at a time, like you suggest.

So, it is a win-win situation. We can have it my way and your way too, with minimal effort (other than writing multiple versions of SoA_Vector3, SoA_Quaternion & SoA_Matrix4). The magic happens in the memory manager that will dictate how the SoA memory gets allocated & arranged. The rest of the systems are totally abstracted from the number of floats interleaved.
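To make the layout concrete, a minimal sketch of the four-floats-per-object interleaving on PC/SSE could look like the following. The exact members and operators are illustrative only, not a final design for SoA_Vector3.

Code: Select all

#include <xmmintrin.h> // SSE intrinsics

// Sketch of the XXXXYYYYZZZZ interleaving, assuming 4 floats per object.
struct SoA_Vector3
{
    __m128 x, y, z; // the X, Y and Z of four objects, packed side by side

    // e.g. add four positions to four other positions in three instructions
    SoA_Vector3 operator+(const SoA_Vector3& o) const
    {
        SoA_Vector3 r;
        r.x = _mm_add_ps(x, o.x);
        r.y = _mm_add_ps(y, o.y);
        r.z = _mm_add_ps(z, o.z);
        return r;
    }
};

// A SceneNode would then hold an index into one big contiguous array of
// these blocks instead of its own Vector3, so going "scalar" for one node
// still touches a single 64-byte cache line (the i-th lane of x, y and z
// sit only 16 bytes apart).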
I've used the MSVC autoexp.dat stuff before, and it works OK, but it is an extra hassle. For one thing it's global, so if you have different projects with different needs you'll have to merge it all into the global autoexp.dat file somewhere in your "Program Files" directories. Also the syntax of it may vary with different versions of MSVC (see warning here). We'd probably need to put up a wiki page to help people configure their debuggers. My point is simply that this swizzling of data inside user-facing data structures DOES have a cost. And it's a cost that will be paid by everyone who tries to debug their Ogre-based application (assuming the default is 4 floats per object). If there is no measurable performance gain to be had from it, then it is a net loss.
As you say, we still have to write the code to ensure the memory is contiguous; I would suggest we start with that part. Once that is done, SceneNode memory will be abstracted behind a manager of some kind, and it should be pretty easy to try out the swizzling to see if it is a net gain or not. Only then would we be able to say whether the swizzling of SceneNodes is worthwhile. I just think that the "bang for the buck" on this is low if the DICE folks aren't bothering with it.
User avatar
Klaim
Old One
Posts: 2565
Joined: Sun Sep 11, 2005 1:04 am
Location: Paris, France
x 56

Re: Ogre 2.0 doc (slides)

Post by Klaim »

Sqeaky wrote: It may be easier to understand if you think of all multithreaded code as providing and expecting guarantees. Different task/workunit scheduling algorithms expect different amounts of thread-safety from their workunits and interact with their workunits based on these assumptions. Some schedulers require no thread safety, some require just re-entrancy, some require full data write isolation, and there are other, more esoteric requirements that are possible. Tasks/WorkUnits will also implicitly make assumptions about their schedulers. They are written differently if workunits finish in a known order, if two workunits are guaranteed not to access the same resources, if every data access needs to be wrapped in a mutex/atomic CAS, and based on what information the scheduler provides to the workunit.
So far, my understanding is that all task scheduler implementations (even a synchronous one) provide at least an "at some point in the future, the provided task will be executed" guarantee.
If the default Ogre task scheduler provides certain guarantees and the game developer provides a task scheduler of his own, the replacement must provide at least the same guarantees. If the new scheduler provides more, the Ogre tasks/workunits will not be able to take full advantage of them because they are already written. If it provides fewer guarantees, he will likely introduce race conditions or deadlocks.
The default task scheduler should do exactly that: execute the task now, synchronously (call the task execution immediately in the same thread). It's the simplest one and doesn't need any dependency.
I don't see the relationship between race conditions/deadlocks and task schedulers, because to me it's the user code that has to protect shared data (or not share it).
For a more concrete example, please consider Apple's libdispatch ( http://libdispatch.macosforge.org/ ), which uses a custom barrier primitive, custom semaphores and communication with the scheduler to ensure data consistency, and an Apache WorkQueue ( http://cxf.apache.org/javadoc/latest/or ... Queue.html ), which implicitly assumes that the work unit will provide its own consistency mechanism without any communication from the Queue, and implements a timeout to provide other guarantees. It would be very difficult to write a work unit that would work ideally in both places.
My understanding is that LibDispatch is not what I mean by "task scheduler".
The parallel_for construct in Threading Building Blocks really is a different class of construct than a scheduler. It is a threading primitive designed to parallelize obviously parallelizable problems. I do not know for certain, but I suspect that many parts of Ogre are not obviously parallelizable, and that any threading algorithm used in Ogre must be carefully designed to get maximum performance.
parallel_for from TBB does exactly that (I just checked the code again to be sure I'm correct): it creates a hierarchy of tasks and spawns them (in the global task scheduler). The fact that it's a hierarchy of tasks helps the scheduler manage the tasks' real execution time (and, more importantly, will force each child task to be allocated at sufficiently separate memory addresses to avoid false sharing and other related performance problems), but it's still tasks pushed into the task scheduler.

Also, what I don't understand is why Ogre should control this aspect of performance. It will be different between platforms anyway, and that's why tbb and the like are used: because they have algorithms that know how to exploit different contexts efficiently. How would any library be able to do the same without relying on such a library?

I kind of agree with others around here that Ogre shouldn't do anything related to threading itself, only help or provide the different parts that CAN be parallelized, optionally. At this point, other than by decomposing renderOneFrame() into different functions, there is no other way to let the user decide how to manage thread resources than to let him provide some kind of function or interface implementation which would decide (or not) to spawn tasks in worker threads.
dark_sylinc wrote: What I meant is that the control over the batch (worker) threads is too advanced. Creating a generic task scheduler that would run on is not a trivial issue at all. May be something for the far future, IF it seems to be viable.
Which is why I think Ogre shouldn't provide an asynchronous task scheduler, only an interface and a synchronous implementation. Let the user plug his solution in.
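Roughly something like the sketch below; the names are hypothetical and only illustrate the interface-plus-synchronous-default idea, not a proposed Ogre API.

Code: Select all

#include <functional>

// Hypothetical sketch -- not an existing Ogre class.
class TaskScheduler
{
public:
    typedef std::function<void()> Task;
    virtual ~TaskScheduler() {}

    // Only guarantee: the task will be executed at some point in the future.
    virtual void enqueue(Task task) = 0;

    // Block until every task enqueued so far has finished.
    virtual void waitForAll() = 0;
};

// Default implementation: run everything immediately on the calling thread.
// No threads spawned, no dependencies, trivially correct.
class SynchronousScheduler : public TaskScheduler
{
public:
    virtual void enqueue(Task task) { task(); }
    virtual void waitForAll() {}     // nothing is ever pending
};

Ogre would call enqueue() for each parallelizable chunk of work; a user who already has TBB, PPL, libdispatch or his own worker pool just wraps it behind this interface.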
To prevent oversubscription, tell Ogre at startup how many threads it can spawn at max.
I disagree, because it is definitely hardcore to define an algorithm that would decide how many threads to use depending on other factors, like the hardware resources. tbb does that, though, but it assumes that it's the only task scheduler running.
Do you have a few links to similar implementations of what you have in mind? Because I think I'm not seeing what you see.
I agree that there might be bad communication here (I might not be using the right words; in fact, I'm not an academic in this domain - or any, actually). I don't have a public or well-known example, but I see a "simple" (maybe simplistic) way to do it that could be a good starting point.
I'm currently travelling and will be back home in a few days. I'll try to set up some kind of proposal (basically an interface and some explanation of its use) to explain what I meant. Even if it's rejected as a solution for Ogre, it would help to point out why it would not be a good one.
User avatar
masterfalcon
OGRE Retired Team Member
OGRE Retired Team Member
Posts: 4270
Joined: Sun Feb 25, 2007 4:56 am
Location: Bloomington, MN
x 126

Re: Ogre 2.0 doc (slides)

Post by masterfalcon »

From what I understand of what our requirements would be, libdispatch is nearly perfect. But I don't know about its platform support at this time.

I like the schedule put forth on the previous page. The next step is really to break it down into individual tasks and assign them.
Sqeaky
Gnoblar
Posts: 24
Joined: Sat Jun 19, 2010 1:44 am

Re: Ogre 2.0 doc (slides)

Post by Sqeaky »

Klaim wrote: So far, my understanding is that all task scheduler implementations (even a synchronous one) provide at least an "at some point in the future, the provided task will be executed" guarantee.
The whole concept of 'task scheduler' is still under heavy research. Just for example, on arXiv, noted as recent in the section "Computer Science / Distributed, Parallel, and Cluster Computing", there is at least one paper clearly talking about work scheduling and about a dozen others covering tangentially related topics. There are many kinds of possible scheduling algorithms, like the two I posited in my earlier post. I intentionally picked two similar, in-production constructs to demonstrate that even when similar, achieving optimal performance with different queues/schedulers would be hard.
Klaim wrote: The default task scheduler should do exactly that: execute the task now, synchronously (call the task execution immediately in the same thread). It's the simplest one and doesn't need any dependency.
I don't see the relationship between race conditions/deadlocks and task schedulers, because to me it's the user code that has to protect shared data (or not share it).
This could cost a good deal of performance without tangible benefit. I think evidence/data, theory, or at least use cases should be looked at before asserting how Ogre or any piece of software should be designed. Synchronization mechanisms (mutexes/semaphores, even atomic CAS) are many orders of magnitude slower than single-threaded code. If a task scheduler provides an environment where synchronization mechanisms can be avoided or minimized, it provides a performance boost as well as allowing code to be simpler.
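To illustrate the point (illustrative code only, not Ogre code): both functions below compute the same sum, but the first one synchronizes on every element while the second relies on an "each worker owns a disjoint range" guarantee and synchronizes only once, at the join.

Code: Select all

#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// A: a shared atomic touched per element -- correct, but heavily contended.
long long sumContended(const std::vector<int>& data, unsigned threads)
{
    std::atomic<long long> total(0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.push_back(std::thread([&data, &total, t, threads] {
            for (std::size_t i = t; i < data.size(); i += threads)
                total.fetch_add(data[i]);          // synchronization per element
        }));
    for (std::size_t i = 0; i < pool.size(); ++i) pool[i].join();
    return total.load();
}

// B: each worker writes only its own slot; the join is the only sync point.
long long sumPartitioned(const std::vector<int>& data, unsigned threads)
{
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.push_back(std::thread([&data, &partial, t, threads] {
            long long local = 0;
            for (std::size_t i = t; i < data.size(); i += threads)
                local += data[i];                  // no synchronization here
            partial[t] = local;                    // disjoint slot per worker
        }));
    for (std::size_t i = 0; i < pool.size(); ++i) pool[i].join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}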

I am trying to build a scheduler that provides these kinds of guarantees, in part to produce usable data. Based on my experience that mutexes are slow, and on theory that says many points of synchronization slow code execution (many simple benchmarks and a few sophisticated ones back this up), it seems logical to move this synchronization to various places in the scheduling algorithm to minimize its impact. I am intending it for use in games, and I have structured it very carefully (at least I think so) to make writing code easy and allow good performance. Despite this effort, it might not be ideal for Ogre, or even my own use, but I am willing to update and modify the design until it is ideal for my use case.
masterfalcon wrote:From what I understand of what our requirements would be, libdispatch is nearly perfect. But I don't know about its platform support at this time.
Even though libdispatch is an ideal example of a modern task scheduler, I don't think it would work on most platforms Ogre supports. Apple provides implementations on OS X and iOS, and it has been ported to FreeBSD, but "C compiler support for blocks is required" according to http://libdispatch.macosforge.org/ . As far as I know this means Clang, which leaves Windows completely unsupported and would make it difficult to use even on Linux.

I am not aware of a better choice which is complete (which is why I decided to try making one). I know what platforms Ogre works on, and I am familiar with some parts of Ogre, but I only have a rough idea of the data structures inside Ogre that would need synchronization. MasterFalcon, could you fill in some of the key details and design goals for threading?

*Edit* added 'slower'
Need an alternative to a single threaded main loop for a game: https://github.com/BlackToppStudios/DAGFrameScheduler/
--Sqeaky
User avatar
masterfalcon
OGRE Retired Team Member
OGRE Retired Team Member
Posts: 4270
Joined: Sun Feb 25, 2007 4:56 am
Location: Bloomington, MN
x 126

Re: Ogre 2.0 doc (slides)

Post by masterfalcon »

My understanding of some of the plans is a little sketchy. I haven't been keeping up with it as much as I should have. But I believe one of the main goals is to speed up scene graph updates via threads. Really, the word task is a better choice and has been used quite a bit, because we're not talking about really long-lived processes, but instead about firing off a task to aid in things like animation, node or compositor updates. dispatch_once immediately comes to mind, but as you noted libdispatch does not have good platform support.
Sqeaky
Gnoblar
Posts: 24
Joined: Sat Jun 19, 2010 1:44 am

Re: Ogre 2.0 doc (slides)

Post by Sqeaky »

Judging by pages 55/56 of the slides, determining the kinds/amounts of Tasks/WorkUnits up front should be straightforward. It seems that some pieces of code lend themselves to being put into tasks, others into a parallel_for of some kind, and others just need to be run after others finish.

I think that UpdateAllTransforms() and UpdateAllAnimations() should monopolize CPU resources for a brief period; it seems they could be put into an iterable collection of some kind and knocked out with a single parallel_for or similar construct. Then each of the boxes with rounded corners would be one or more Tasks/WorkUnits with data dependencies to keep them from stepping on each other's data (even though it is marked as read-only, the arrows and cylinders represent something).

My frame scheduler was designed with the assumption that all the Tasks/WorkUnits would be known at the beginning of a frame. So if I am wrong about being able to know the number of Tasks/WorkUnits up front, then my library stops looking like a custom-built solution. As for handling the parallel_for, I have no idea. I might implement a simple one in my scheduler, because it is such a commonly used primitive. However, it might even be better to construct a simple threading routine based on the data in those systems and the foreknowledge of what they need to do. At this point I do not know enough about Ogre to say which way would be best.
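As a rough sketch of that assumption (hypothetical types, and deliberately single-threaded to keep it short): the frame is described as a fixed set of WorkUnits plus their dependencies, and a unit runs once everything it depends on has finished. A real scheduler would hand ready units to worker threads instead of running them inline.

Code: Select all

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct WorkUnit
{
    std::string name;                     // e.g. "UpdateAllTransforms (chunk 3)"
    std::function<void()> work;
    std::vector<std::size_t> dependsOn;   // indices of prerequisite units
};

// Runs every unit exactly once, respecting dependencies.
// Assumes the dependency graph is acyclic.
void runFrame(std::vector<WorkUnit>& units)
{
    std::vector<bool> done(units.size(), false);
    std::size_t finished = 0;
    while (finished < units.size())
    {
        for (std::size_t i = 0; i < units.size(); ++i)
        {
            if (done[i])
                continue;
            bool ready = true;
            for (std::size_t d = 0; d < units[i].dependsOn.size(); ++d)
                if (!done[units[i].dependsOn[d]]) { ready = false; break; }
            if (ready)
            {
                units[i].work();   // a worker thread would do this in parallel
                done[i] = true;
                ++finished;
            }
        }
    }
}

With the slides' pipeline, the transform and animation updates would be a handful of units with no dependencies, the per-viewport culling units would depend on them, and so on.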

In case anybody here cares, since I posted the link to my scheduler I have tested and fixed bugs with it on Mac OS X. So now it compiles and passes some basic tests on Ubuntu x64 with GCC 4.6.3 or Clang 3.0-6, Mac OS X 10.6.8 with GCC 4.2.1, and on Windows XP 32-bit with MinGW or VS10.

Whatever scheduling system we go with, we are going to need to break Ogre into Tasks/WorkUnits of some kind. I did try to research what converting the current system into Tasks/WorkUnits would roughly look like:
from http://www.ogre3d.org/forums/viewtopic.php?f=4&t=30250
jwatte wrote:1) manage mesh->entity objects
2) implement spatial hierarchy
3) do visibility culling
4) do state management/sorting
5) do pass management/sorting
6) load terrain data
7) generate terrain/heightmaps
8) manage sets of (lights,cameras,billboards,animations,entities)
9) do sky rendering
From http://www.ogre3d.org/forums/viewtopic. ... 91#p453845
tuan kuranes wrote: Transform Stage -> Ogre transforms the buffers (handling page/locality/etc) filling an Cull buffers
Cull Stage -> Culling per render target fills Ogre renderqueues
Shading Stage -> Shade/Render each renderqueue according to its "shading/rendering" type into a dx/gl/etc. command buffer
Execute Stage -> merge (or dispatch between GPU/Tiles/etc) all command buffer and execute them (asynchronously)
It seems everybody agrees on Transforms -> Cull -> Complex Visuals -> Render, but which stages depend on each other for data and which connections are essential? How many of these become 1 Task/WorkUnit and how many become some data-driven number of Tasks/WorkUnits?
Need an alternative to a single threaded main loop for a game: https://github.com/BlackToppStudios/DAGFrameScheduler/
--Sqeaky
User avatar
tuan kuranes
OGRE Retired Moderator
OGRE Retired Moderator
Posts: 2653
Joined: Wed Sep 24, 2003 8:07 am
Location: Haute Garonne, France
x 4

Re: Ogre 2.0 doc (slides)

Post by tuan kuranes »

My above post about more, smaller components and separated stages is about the Data Oriented Design part, which is not about performance but about design, as in leading to simpler, smaller code, each piece directly and only related to the data it controls - which I didn't find in the slides.
wrote:
Couple of thoughts.... I like the idea in principle of separating Ogre into more smaller components, but I also realize there can be more complexity with that model "if" you're using a majority of the components. More DLL's to load and register to use for the components. I guess I'm speaking more towards a person who's new to Ogre as there are so many other components to integrate already before even thinking about integrating components within Ogre. If nothing else, it's a thought to consider if that moves forward.
Regarding my view on "separate components":
However, this doesn't mean that each "component" can go into a DLL or static lib. There are multiple issues that can't be easily tackled. This level of modularization is something that sounds nice, but ends up being impractical. Furthermore, the bigger the project, the bigger the chances there's some dependency between seemingly independent modules. I could've written about these dependencies in the slides, but it's very implementation-specific, and it would mislead from the main topic. When objects are highly modular, they're easier to maintain, refactor, or even replace. But that doesn't mean each has to go in its own dll or lib.
I was just advocating a first step, as in just making sub-libs out of the one huge Ogre core lib - more a code reorganisation than a refactoring, really. It could start with just some folder reorganisation. And only then would the next steps become easier & faster, as spotting candidates for dependency minimization, components' ins & outs, and therefore DOD refactoring would be much more obvious. (The "switchable/pluggable" sub-libs are just a bonus side-effect of that and will be possible after the DOD refactoring is done - really not a goal in itself.)

Just have a look at https://github.com/openscenegraph/osg/tree/master/src and then at https://bitbucket.org/sinbad/ogre/src/3 ... reMain/src ?at=default and try to find how Ogre does animation versus how OpenSceneGraph does it (note that it's easier if you already know some class names in Ogre, but for a beginner...).

It seems everybody agrees on Transforms -> Cull -> Complex Visuals -> Render, but which stages depend on each other for data and which connections are essential? How many of these become 1 Task/WorkUnit and how many become some data-driven number of Tasks/WorkUnits?
The idea is to make it much simpler than the current structures, which are shared by too many stages, by copying only relevant data between stages using buffers.

For Each scene: const shared Nodes Buffer -> Transformed Nodes Buffer
For Each Viewport: const shared Transformed Nodes Buffer -> Culled Transformed Node Buffer (threadable)
For Each renderTarget: const shared Culled Transformed Node Buffer -> RenderQueue (threadable)
For Each RenderQueue: const shared Render Queue -> Command Buffer

It seems like much more memory usage, but using clever structures and memory operations it leads to much better locality, better memory access and simpler code.
It gives a much more data-oriented design and easier testing/recording/debugging of each stage (buffers being serialisable, you can debug each stage independently).
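A rough sketch of that shape is below; all the types are hypothetical placeholders with the real work stubbed out, only the const-shared-input / owned-output flow of each stage matters.

Code: Select all

#include <vector>

struct Node            {}; // local transform, parent index, ...
struct TransformedNode {}; // world matrix, world bounds, ...
struct Frustum         {}; // viewport frustum planes
struct RenderOp        {}; // mesh + material + sort key
struct CommandBuffer   {}; // API-level draw commands

// For each scene: const shared nodes -> transformed nodes
std::vector<TransformedNode> transformStage(const std::vector<Node>& nodes)
{ std::vector<TransformedNode> out(nodes.size()); /* transform each node */ return out; }

// For each viewport: const shared transformed nodes -> visible nodes (threadable)
std::vector<TransformedNode> cullStage(const std::vector<TransformedNode>& world,
                                       const Frustum& /*frustum*/)
{ std::vector<TransformedNode> visible; /* keep nodes inside the frustum */ return visible; }

// For each render target: const shared visible nodes -> render queue (threadable)
std::vector<RenderOp> queueStage(const std::vector<TransformedNode>& visible)
{ std::vector<RenderOp> queue(visible.size()); /* build and sort render ops */ return queue; }

// For each render queue: const shared queue -> command buffer
CommandBuffer shadeStage(const std::vector<RenderOp>& /*queue*/)
{ CommandBuffer cb; /* translate into GL/D3D commands */ return cb; }

Since each stage only reads its input buffer, different viewports or targets can run their cull/queue stages on different threads without touching each other's data.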

So the Component/Stage idea is really about minimizing dependencies between stages/components, with clear data in & out at each stage/component, as in data-oriented code.
User avatar
sparkprime
Ogre Magi
Posts: 1137
Joined: Mon May 07, 2007 3:43 am
Location: Ossining, New York
x 13

Re: Ogre 2.0 doc (slides)

Post by sparkprime »

Wow, what a thread. The elephant in the room (Ogre's performance) has been given a good kick in the balls. I can report the same kinds of problems that dark_sylinc has reported, in the Grit Engine (http://www.gritengine.com) -- a very high per-batch cost and a near-idle GPU. And thanks to him for also putting together those slides, must have taken an age. I have also realised that Ogre's design is the limiting factor, and in fact if I had known 5 years ago what I know now, I probably would have written my own GL-based renderer from scratch instead of using Ogre. Over the years, I have dropped more and more functionality, starting with the background loading system, the scripting framework, particles, billboards, the compositor framework, and now, the overlay component. I'm actually anticipating dropping or heavily altering scene management and renderqueues (not with a smile on my face, I can assure you) because of these issues under discussion.

So, I have spent literally hours reading those slides and this thread and I surely haven't taken it all 100% so please bear with me.

We certainly want optimised culling algorithms; whether that is done using this or that parallel programming framework, or SIMD, doesn't really matter, as long as it is state-of-the-art. We want a render queue that can keep the driver busy while doing sorting, and whatever other work it needs to do per batch. Broadly all of this sounds like a good idea. But this is years of work.

What I would propose is doing the work from back to front. First fix the rendersystem API -- make it the perfect render system API for the modern world. If we have a rendersystem abstraction that is cross-platform, uses d3d/gl completely transparently, and has the right performance for modern workloads, that is already immensely useful. At the moment, as far as I can tell, it is not possible to get the right performance even if you are doing explicit draws using the RenderSystem API, because of the way it handles shaders, for instance, and the lack of key functionality like hardware PCF and normal map compression. Also there are irritations with leaky abstractions from the two render systems, such as render target flipping and projection matrices. I can't help thinking some of these could be hidden without losing appreciable performance. It needs cleaning up too, with the removal of fixed-function functionality. It seems we're coming close to this with the GL4 and D3D11 work nearing completion. But I'd like to make sure it is 'all the way' and not just 'what is useful now given the current infrastructure on top'. There should be performance tests demonstrating that rendering explicit draws is fast enough, at this level.

Following on from that, some sort of render queue that can offload the actual draw calls into a background thread (pipelining them) seems to be appropriate. Then at least you can use 2 cores for Ogre -- one for doing all the waste during scene management, and one for pushing draws out at max speed. The render system is essentially isolated from the user threads at this point. Actual loads of resources (as opposed to prepares) would have to be performed by the hidden thread. I don't see why this thread cannot be created with a simple abstraction around the win32/posix thread library. It's not a short-lived task -- it is continual and contains synchronisation points.
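A bare-bones sketch of that kind of long-lived render thread, using only plain C++11 threading, is below. FrameCommands and submitToDriver are made-up placeholders; resource loads and GPU synchronisation are left out, the only point is the continual consumer with a per-frame hand-off.

Code: Select all

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct FrameCommands { std::vector<int> draws; }; // placeholder payload

class RenderThread
{
public:
    RenderThread() : quit_(false), worker_(&RenderThread::run, this) {}

    ~RenderThread()
    {
        { std::lock_guard<std::mutex> lock(m_); quit_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Called once per frame by the scene-management thread.
    void submitFrame(FrameCommands frame)
    {
        { std::lock_guard<std::mutex> lock(m_); frames_.push(frame); }
        cv_.notify_one();
    }

private:
    void run() // lives for the whole application, not a short-lived task
    {
        for (;;)
        {
            FrameCommands frame;
            {
                std::unique_lock<std::mutex> lock(m_);
                while (!quit_ && frames_.empty())
                    cv_.wait(lock);
                if (quit_ && frames_.empty())
                    return;
                frame = frames_.front();
                frames_.pop();
            }
            submitToDriver(frame); // the only code that talks to GL/D3D
        }
    }

    void submitToDriver(const FrameCommands&) { /* issue the draw calls */ }

    std::mutex                m_;
    std::condition_variable   cv_;
    std::queue<FrameCommands> frames_;
    bool                      quit_;
    std::thread               worker_;
};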

Optimising the actual culling and construction of the list of draws is then the final task. Of course, if you have a SIMDised octree cull or whatever, you'd also want the naive version ifdeffed out alongside it for testing. That way if people get weird behaviour they can just switch to the naive versions for debugging. This is good practice with any highly performant code, because you always sacrifice clarity for performance. I think anything particularly advanced, using microthreading programming frameworks for instance, should maybe be kept out of the default compile. Seems like tuning it will be a pain in the arse anyway, for unbalanced workloads. I've seen things like (naive) Fibonacci implementations with hardcoded constants as to how much of a subtree to do in a single task. SIMD will have to assume SSE, I presume, so that can't be on by default either. I say, do relatively clean cache-friendly algorithms for these kernels (principally culling, it seems) and leave more hardcore implementations as optional extras.


Perhaps there are some lower hanging fruit in these front end stages of the pipe, but I'd rather have a solid backend that gives people the opportunity to build their own higher level layers on top. That way, if the work stalls, it is always possible to hack up an application-specific layer on top of the rendersystems and get AAA performance while Ogre remains incomplete. This may even *inform* the design of the upper layers.


It is good to see these issues being discussed so intensely, it gives me some hope that things could get fixed. :)


Now, a couple of rants:

I see a big divide in programmers, tools, applications, and approaches between cellphone graphics programming and PC/console. I am unsure why anyone would want to do both with the same library. I'd take a mansion or a yacht over a houseboat, any day of the week. Some code / disk-format sharing would be good, but let's not overdesign the core APIs for compatibility between cellphones and PCs! No one is porting an optimised PC game to the cellphone by simply recompiling it with a different rendersystem. All the rendering techniques, materials, and assets will have to change. How about two different rendersystem APIs, two different scene management APIs, a shared resource system, and shared math/scene culling APIs? Bring all the appropriate bits together to make a complete library for your given system.

Nobody wants DLLs. Splitting a huge binary into pieces is not a solution. They just make a mess on the filesystem, and a configuration headache. Stuff should be if-deffed out and that's all you need. Ignore pressure from package maintainers trying to create canned versions of everything. This is not libjpeg. Static linking creates a little wasted disk space. So what? The alternative is a complete mess and potential performance hazards. If it's easy to drop unused functionality in the build, the binary sizes will be reasonable anyway. And people doing serious apps on Ogre will probably be using their own fork, or an unreleased trunk. Static linking is the answer for libraries like Ogre.