syedhs wrote: Probably threading should be made into core and thus, compulsory not an option?
I agree.
Or is there a case/platform in which you would not want Ogre to thread internally if it can?
AFAIK most Ogre "Components" (not plugins) are static libs or additional code injected into OgreMain, right?
Herb wrote: Couple of thoughts.... I like the idea in principle of separating Ogre into more smaller components, but I also realize there can be more complexity with that model "if" you're using a majority of the components. More DLLs to load and register to use for the components. I guess I'm speaking more towards a person who's new to Ogre, as there are so many other components to integrate already before even thinking about integrating components within Ogre. If nothing else, it's a thought to consider if that moves forward.
Boost has never been a requirement for Ogre, and it was already confirmed that it will not become one.
Herb wrote: As for Boost, I agree with the comments. I actually like the fact that I can select what threading library to use; for example, I use POCO instead of Boost. Really, if Boost is a requirement, then we should actually "use" its features throughout the library.
I did try to provide a C++11 implementation of the current use of multithreading in Ogre (there is a thread somewhere). It was a failure because:
Herb wrote: But, as for threading, has anyone looked at the threading support in C++11? I thought threading support was baked into that and should be cross-platform, provided Visual Studio has it implemented (most things I find the GNU guys have already baked in).
My view is that renderOneFrame() is called from one thread. The user may want to update its logic & physics in the same thread, or in another one.
Klaim wrote: There are two things I see:
1. the call to renderOneFrame()
2. resource loading
1) can currently only be done from the same thread every time (can that be fixed?)
Some parts of 1) can be spawned as asynchronous tasks (animation update? etc.).
2) can be all asynchronous tasks.
The user controls which thread calls 1), so it might be the main thread or another thread.
My understanding is that:
A. Ogre itself (I mean the core) doesn't need to spawn threads. 1) is controlled by the user code; 2) should pass control to the user's task scheduler (to avoid oversubscription).
B. Ogre needs to provide potentially asynchronous tasks to be crunched by worker threads (which means they would potentially run linearly if only the main thread is running).
C. Ogre can provide an implementation of a task scheduler (which spawns and manages worker thread(s)) IF the user doesn't explicitly provide his own. As it would be optional (but the default), it would be a component (almost as it is now?).
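A minimal sketch of what such a pluggable scheduler could look like (the names TaskScheduler, addTask and setTaskScheduler are hypothetical, not an existing Ogre API): Ogre would only submit work through the interface, the default implementation runs everything synchronously on the calling thread, and the user can plug in a TBB-, POCO- or boost-based implementation instead.
Code:
// Hypothetical sketch of a pluggable task scheduler, as described above.
// Ogre would depend only on the interface; the default spawns no threads
// and has no external dependencies.
#include <functional>

class TaskScheduler
{
public:
    virtual ~TaskScheduler() {}
    // Submit a unit of work; it may run now or later, on any thread.
    virtual void addTask( const std::function<void()> &task ) = 0;
    // Block until every task submitted so far has finished.
    virtual void waitForAll() = 0;
};

// Default: execute each task immediately, in the calling thread.
class SynchronousScheduler : public TaskScheduler
{
public:
    virtual void addTask( const std::function<void()> &task ) { task(); }
    virtual void waitForAll() {} // nothing is pending, tasks already ran
};

// A user-provided scheduler (wrapping TBB, POCO, etc.) would be registered once
// at startup, e.g. root->setTaskScheduler( &mySchedulerWrappingTbb ); (hypothetical call)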
You're right about your concerns. So let me address them:
lunkhound wrote: I have some concerns about the whole SoA thing though. I worry that it may be a lot of developer-pain for very little gain. Considering that:
1. SoA isn't necessary to fix the cache-misses. Cache misses can be fixed by reorganizing how data is laid out but without interleaving vector components of different vectors together.
2. SoA isn't necessary to improve performance of vector math using SIMD. OK maybe you don't get the full benefit of SIMD and not in all cases but you can probably get 70% of SoA performance simply by using a SIMD-ified vector library.
3. SoA is not easy to work with. Code is harder to read, harder to debug, more effort to maintain going forward. Imagine inspecting Node structures in the debugger when the vector components are interleaved in memory with other Nodes...
I think SoA is best for a limited-scope, highly optimized, tight loop where every cycle counts, and only affecting a small amount of code. Kind of like assembly language, SoA comes with a cost in developer time and I'm just not sure it would be worth it.
Thanks again for all the work on those slides. I'm really glad to see these issues being raised.
Code:
for( int i=0; i<mCount; i += 4 )
{
    /* prefetch() around here */
    //We're updating 4 elements per iteration.
    const SoA_Vector3 &parentPos    = mChunks[level+0].pos[i];
    const SoA_Vector3 &localPos     = mChunks[level+1].pos[i];
    SoA_Vector3 &derivedPos         = mChunks[level+1].derivedPos[i];
    const SoA_Quaternion &parentRot = mChunks[level+0].rot[i];
    const SoA_Quaternion &localRot  = mChunks[level+1].rot[i];
    SoA_Quaternion &derivedRot      = mChunks[level+1].derivedRot[i];
    const SoA_Vector3 &parentScale  = mChunks[level+0].scale[i];
    const SoA_Vector3 &localScale   = mChunks[level+1].scale[i];
    SoA_Vector3 &derivedScale       = mChunks[level+1].derivedScale[i];
    SoA_Matrix4 &derivedTransform   = mChunks[level+1].transform[i];

    derivedPos   = parentPos + parentRot * (parentScale * localPos);
    derivedRot   = parentRot * localRot;     //fsel() to check whether parentRot should be treated as the identity rotation.
    derivedScale = parentScale * localScale; //fsel() here too.
    derivedTransform = NonTemporal( SoA_Matrix4( derivedPos, derivedRot, derivedScale ) );
}
Code:
for( int i=0; i<mCount; i += 4 ) //Actually, it's not "+= 4", but rather += compile_time_number_of_simd_elements_macro
{
    /* prefetch() around here */
    //We're updating 4 elements per iteration.
    const SoA_Vector3 &localPos        = mChunks[level+1].pos[i];
    const SoA_Quaternion &localRot     = mChunks[level+1].rot[i];
    const SoA_Vector3 &localScale      = mChunks[level+1].scale[i];
    const SoA_Matrix4 &parentTransform = mChunks[level+0].transform[i];
    SoA_Matrix4 &derivedTransform      = mChunks[level+1].transform[i];

    SoA_Matrix4 localTransform = SoA_Matrix4( localPos, localRot, localScale ); //Use fsel() for rot & scale
    derivedTransform = NonTemporal( parentTransform * localTransform );
}
Yes, that's what I meant.
dark_sylinc wrote: My view is that renderOneFrame() is called from one thread. The user may want to update its logic & physics in the same thread, or in another one.
I don't understand this. To me, whatever the kind of parallel work, it should work with the task scheduler underneath, the same way parallel_for in TBB spawns tasks for each batch of iterations.
As for Ogre's management of threads:
- The CompositorManager must have a high degree of control over its batch threads.
Well, I really don't understand why there is a need to spawn threads if you want to prevent oversubscription (as I said too), because the only way is to let the user control the task scheduler and make Ogre agnostic on this. I might misunderstand something, but to me, as soon as a library spawns its own threads, it becomes a candidate for oversubscription.
- The animation & scene node transform update may have its own threads. Because their jobs are fairly trivial (and there are many ways to split the work), the idea of a TaskScheduler provided by the user seems fine to me.
Note that all components (including, and especially, the CompositorManager) should accept a hint on the number of threads they can spawn, in order to prevent oversubscription (i.e. the user wants to run many threads for himself, unrelated to Ogre).
_tommo_ wrote: the docs are great but the biggest setback Ogre has in regard of the said engines (and Unity, which strangely was not mentioned even if it is the greatest Ogre-killer between AAs) are TOOLS. Lots of excellent tools for artists and designers.
Well, actually I don't think it's fair to compare Unity to Ogre that way. Unity is a full game engine, very featured, with an awesome editor perfectly married with the engine. Also quite optimized, especially the latest versions. Ogre is a render engine, just that, which urgently needs a redesign focused on optimization and DX11/OGL4 architecture. It's not cool seeing that a complex scene runs twice as fast in UDK or even Unity. It's not very cool either how each compositor render_scene pass culls the whole scene again, etc.
What kind of tools do you want to see? A scene editor? A material editor? But that's again the same story: Ogre is just a render engine and should not provide any kind of high-level tool. Just mesh/material importers/exporters and mesh optimization tools, not much more, IMHO.
Actually this is my case. Of course I'm not leaving Ogre, but I'm quite concerned about 1.8.X/1.9.X performance. I think any Ogre developer using some compositors (especially involving render_scene passes), high-quality shadows (cascaded shadow mapping with 3 or 4 shadow maps, for example) and any kind of water system (which will need at least 2 more render passes: reflection and refraction; the depth map may be shared with other depth-based effects, like the one used for DOF and similar) shares my concerns about Ogre performance.
_tommo_ wrote: Thinking that devs are leaving Ogre because it is "not fast enough for AAAs" means completely missing the point.
IMHO PCs and next-gen consoles are not a very restrictive use case. But indeed, it would be nice to put some attention on ARM, although mobile SoCs are evolving very fast. Anyway, I think the development should be focused on "next-gen" PC and console architecture (aka DX11) rather than on limited mobile ones (GLES2 / 3?).
_tommo_ wrote: The DICE papers might be good for their very restrictive use cases (next gen consoles and PCs) but fail quite badly when you try to make, say, an Android game.
+1!
There are many opportunities for SSE2+ and cache-friendly structures as mentioned in the paper.
Ogre is already the most usable open source rendering engine; it just needs to be faster and less resource-hungry to be more competitive.
Actually I believe that Ogre's mobile development userbase may even surpass that of traditional PC/console development in the near future.
Xavyiy wrote: Although I think the development should be focused on "next-gen" PC and console architecture (aka DX11) rather than on limited mobile ones (GLES2 / 3?).
Agreed. One of Ogre's main attractions is that it allows developers to create their own tools and engines around it. As we tell users all the time, Ogre is NOT a game engine, it's a graphics library. If you want tools, use/extend Ogitor, or something third party like Xavyiy's Paradise Engine.
Xavyiy wrote: Unity is a full game engine, very featured, with an awesome editor perfectly married with the engine. Also quite optimized, especially the latest versions. Ogre is a render engine, just that, which urgently needs a redesign focused on optimization
You said it yourself:
saejox wrote: Does 2.0 aim for better performance or better usability?
And to add my opinion to the threading discussion: parallel architecture should be up to third-party engines (Ogre itself shouldn't be using TBB or boost threads). But you may perhaps expose multiple update loops, like scene graph and frame rendering, I guess. There would be nothing stopping us from providing a basic example framework which makes use of TBB.
saejox wrote: Ogre is already the most usable open source rendering engine, it just needs to be faster and less resource hungry to be more competitive.
I've been thinking about this more heavily as of late, and I am growing into that mentality myself. The more I think about the complications of working with a scheduler that Ogre is aware of and interfaces with, the more I think it'll just cause issues if anyone has a different idea of how threading should work in their game. Different projects have different needs, and it seems somewhat unrealistic to assume you can put a catch-all into Ogre that will work. Then there is my next point...
_tommo_ wrote: to me, Ogre, as a pure graphics engine, needs to NOT expose any threading system.
Xavyiy wrote: Well, actually I don't think it's fair to compare Unity to Ogre that way. Unity is a full game engine, very featured, with an awesome editor perfectly married with the engine. Also quite optimized, especially the latest versions. Ogre is a render engine, just that, which urgently needs a redesign focused on optimization and DX11/OGL4 architecture. It's not cool seeing that a complex scene runs twice as fast in UDK or even Unity. It's not very cool either how each compositor render_scene pass culls the whole scene again, etc.
I hear this often among Ogre users more experienced than I; however, I can't really see how this is true. Ogre does SO MUCH, I feel it is half-way to a game engine, and as I have stated in some other posts, the resource system is a large part of that. I completely agree that what you are saying is how Ogre should be...but I can't at all agree that's what it is. Breaking off more things into components or plugins is needed. Starting with the resource system, imo.
Xavyiy wrote: What kind of tools do you want to see? A scene editor? A material editor? But that's again the same story: Ogre is just a render engine and should not provide any kind of high-level tool. Just mesh/material importers/exporters and mesh optimization tools, not much more, IMHO.
I think we all want to see this move ahead as fast as possible, but a lot of us have different use cases that must be made known if we are to hope to arrive at a solution that is the most ideal for the community. To that end, maybe expecting people to post in the development forums is asking too much of most of the people out there. If the Ogre team has the time, maybe it would be better to do another survey, one aimed more directly at all the subjects raised here. At least ask enough questions to get a start on the whole thing.
Xavyiy wrote: I've the feeling that whatever the 2.0 roadmap will be, it'll not be ideal for the whole community. I would like to read concrete solutions rather than "general ideas", since I see a very low SNR in all ogre redesign threads (of course! each person has its own interests, but things must move ahead!)
This is WHY I insist so much on allowing the user to specify how many threads they want Ogre to spawn. For example, Havok can spawn no threads, or spawn as many as it wants. This value is set during startup.
_tommo_ wrote: to me, Ogre, as a pure graphics engine, needs to NOT expose any threading system.
I don't like at all the idea that a renderer will "take life of its own" and start spawning threads unless I do some arcane forms of control (ie. subclassing the default task manager class).
The default should be simplicity.
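Following dark_sylinc's point above about a startup thread budget, here is a minimal sketch of how such a hint could look (ThreadingHints and the extra initialise parameter are hypothetical names, not an existing Ogre API):
Code:
#include <thread>

// Hypothetical startup hint: the user states the thread budget once,
// and every Ogre component clamps itself to it (0 = spawn no threads at all,
// fully synchronous, similar to how Havok can be run without worker threads).
struct ThreadingHints
{
    unsigned maxWorkerThreads;
};

void initOgreThreading( unsigned threadsReservedByTheGame )
{
    // Leave some hardware threads to the game's own systems to avoid oversubscription.
    unsigned hw = std::thread::hardware_concurrency(); // may return 0 if unknown
    ThreadingHints hints;
    hints.maxWorkerThreads = (hw > threadsReservedByTheGame) ? hw - threadsReservedByTheGame : 0;
    // root->initialise( ..., hints ); // hypothetical extra parameter
}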
What I meant is that the control over the batch (worker) threads is too advanced. Creating a generic task scheduler for them to run on is not a trivial issue at all. It may be something for the far future, IF it seems to be viable.
Klaim wrote: I don't understand this. To me, whatever the kind of parallel work, it should work with the task scheduler underneath, the same way parallel_for in TBB spawns tasks for each batch of iterations.
As for Ogre's management of threads:
- The CompositorManager must have a high degree of control over its batch threads.
You're describing converting Ogre into a set of utility libraries. A rendering engine is exactly composed of a math library, a render queue, a batch dispatcher & material manager, and a scene graph.
_tommo_ wrote: PS: imo all of Ogre 2.0 should aim at being a pure graphics library, focusing on simplicity. And this imo means dropping a lot of existing functionality, and becoming more passive on which role Ogre takes in a game engine architecture.
Basically everyone that approaches Ogre feels the urge to place it at the cornerstone of their engine (with no decoupling between maths, threading, and scene management between rendering & logic), and Ogre is responsible for this because of the current all-encompassing architecture.
I agree on the tools. This is why I added a few slides about making RTSS more node-like. If we make a customizable node system, creating a graphical interactive tool for setting up materials would be very easy. As for the rest, I left them out because they demand a PDF of their own.
_tommo_ wrote: PPS: the docs are great but the biggest setback Ogre has in regard of the said engines (and Unity, which strangely was not mentioned even if it is the greatest Ogre-killer between AAs) are TOOLS. Lots of excellent tools for artists and designers.
So along with a simplification of the graphics library itself, there should be a serious effort in making the engine useful, as in, in the real world. Thinking that devs are leaving Ogre because it is "not fast enough for AAAs" means completely missing the point.
It's not fast for AAA, nor for indies either. I'm not working for an AAA company and Ogre's limitations are annoying me, as well as other users. The main problem is that it's lacking scalability. Overclock your CPU from 3 GHz to 6 GHz and it will only speed up a little because of the cache misses (you can overclock the RAM to increase the bandwidth, but then you'll increase latency...). Throw in a CPU with more cores or a faster GPU and it will run as slow as it did before. In other words, we're doomed if we don't change this scenario. Especially since AAA companies are lending their engines to the average Joe (thus competing with Ogre & game engines relying on Ogre).
_tommo_ wrote: Thinking that devs are leaving Ogre because it is "not fast enough for AAAs" means completely missing the point.
It's true that we're bruteforcing. But the current implementation tries to be smart and fails miserably. Android phones are going multicore, and NEON is the SSE of ARM.
_tommo_ wrote: PPPS: most of the proposed ways of "optimizing" by bruteforcing jumps or switching full-on to SoA + SIMD just ignore that Ogre today needs to run energy-efficiently on cheap ARMs much more than squeezing SSE2 archs, and are probably best ignored; they are indeed an ugly case of optimizing without even thinking what the use case will be.
The DICE papers might be good for their very restrictive use cases (next gen consoles and PCs) but fail quite badly when you try to make, say, an Android game.
God no! The threading model is about splitting work on objects that aren't being touched at the same time (hence no need for locking except when the job is done), that's all.
saejox wrote: If it is going to be thread-safe it means hundreds of mutexes in every function.
Goodbye performance.
I cannot agree more!
saejox wrote: Ogre already has many shared_ptrs and locks, even though it is not thread-safe.
I think all those useless locks and shared_ptr should be removed.
No need to wait for a big release for that.
Part of this "Ogre does SO MUCH" comes (as downside?) from Open Source. Some programmer pops up, decides he needs X thing implemented to render his stuff the way he wants, without investing much time if there was already a way of achieving the same result; then he submit his change and gets into core. This programmer probably doesn't show up again after that.Mako_energy wrote:I hear this often among Ogre users more experienced than I, however I can't really see how this is true. Ogre does SO MUCH, I feel it is half-way to a game engine and as I have stated in some other posts the resource system is a large part of that. I completely agree that what you are saying is how Ogre should be...but I can't at all agree that's what it is. Breaking off more things into components or plugins is needed. Starting with the resource system, imo.
I don't know a huge amount about cache misses, but I am writing a threading library that I will gladly re-license to zlib (currently it is GPL3) for Ogre's use. It is not ready for prime time yet, but it has some of the features I want it to have when it is done. I also want to tune it extensively for performance.
Xavyiy wrote: I would like to read concrete solutions rather than "general ideas", since I see a very low SNR in all ogre redesign threads
Sorry, I didn't make myself clear. I agree that the SCEE paper is exactly the sort of thing we ought to be doing, but it doesn't mention SoA as I understand it. When I see "SoA" I think of this: http://software.intel.com/en-us/article ... chitecture
dark_sylinc wrote: You're right about your concerns. So let me address them:
lunkhound wrote: I have some concerns about the whole SoA thing though. I worry that it may be a lot of developer-pain for very little gain. Considering that:
1. SoA isn't necessary to fix the cache-misses. Cache misses can be fixed by reorganizing how data is laid out but without interleaving vector components of different vectors together.
2. SoA isn't necessary to improve performance of vector math using SIMD. OK maybe you don't get the full benefit of SIMD and not in all cases but you can probably get 70% of SoA performance simply by using a SIMD-ified vector library.
3. SoA is not easy to work with. Code is harder to read, harder to debug, more effort to maintain going forward. Imagine inspecting Node structures in the debugger when the vector components are interleaved in memory with other Nodes...
I think SoA is best for a limited-scope, highly optimized, tight loop where every cycle counts, and only affecting a small amount of code. Kind of like assembly language, SoA comes with a cost in developer time and I'm just not sure it would be worth it.
Thanks again for all the work on those slides. I'm really glad to see these issues being raised.
1. It is true that there are other ways to optimize the data. However, transformation and culling are actually fairly trivial operations, which are done sequentially on a massive amount of elements. Note that the interleaving is for SIMD. An arrangement of "XYZXYZXYZ" is possible by specifying 1 float per object at compile time.
The performance gains of using SoA for critical elements such as positions & matrices are documented in SCEE's paper (reference 4).
Code:
struct StructureOfArrays
{
float x[numVertices];
float y[numVertices];
float z[numVertices];
...
};
You'll notice, however, that in those DICE slides they are not actually storing any of their data structures swizzled in memory. They swizzled the frustum planes (on the fly, presumably), and then loop over un-swizzled bounding spheres. That's a great use of SoA/swizzling because no user-facing data structures are swizzled.
dark_sylinc wrote: 2. We already do SIMD math and try to do our best. There are huge margins to gain using SoA + SIMD because the access patterns and the massive number of operations to perform fit exactly the way SSE2 works. There's a lot of overhead in unpacking & packing.
DICE's Culling the Battlefield slides show the big gains of using SoA + SIMD (reference 3)
I think looking at those DICE slides again actually convinced me that there is very little to gain from keeping stuff in swizzled format in memory. Just swizzle the frustum planes on the fly, and a bit of optimized SIMD code will yield great performance.
dark_sylinc wrote: It's very true that debugging becomes much harder, especially when examining a single Entity or SceneNode.
I see two complementary solutions:
- Using getPosition() would retrieve the scalar version, which can be called from the watch window (as long as we ensure it's fully const...)
- There are a few MSVC features (I don't remember if they had to be installed, or if they were defined through pragmas) that tell MSVC how to read objects while debugging. I'm sure gdb probably has something similar.
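A minimal sketch of the first idea (a scalar, debugger-friendly getter that gathers the value out of the interleaved arrays; mChunkBase and mIndex are hypothetical member names, and Vector3 stands in for Ogre's class):
Code:
#include <cstddef>

// Hypothetical sketch: a SceneNode keeps an index into interleaved SoA arrays
// laid out as XXXX YYYY ZZZZ (4 floats per component), and exposes a plain
// scalar getter that is easy to inspect from a debugger watch window.
class SceneNode
{
    float  *mChunkBase; // start of this node's 4-node block: x0..x3 y0..y3 z0..z3
    size_t  mIndex;     // which of the 4 slots belongs to this node

public:
    Vector3 getPosition() const
    {
        return Vector3( mChunkBase[mIndex + 0],   // x
                        mChunkBase[mIndex + 4],   // y
                        mChunkBase[mIndex + 8] ); // z
    }
};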
This is indeed the disadvantage of open source. If you want to redesign Ogre, you need a dedicated team that sticks with it 'till the end' and has a clear vision. Every change to the core is validated by that team. The problem is that such a team needs time and an incentive (money, no personal life) to stick to the project. That is the difference between Ogre development and companies like Epic and Crytek. Ogre can survive when combined with some kind of commercial activity. This has been tried before (by Steve), but I am the first to admit that this is no easy task. Ogre needs at least some substantial gifts from large companies (you know who you are!). Maybe these companies want something in return, but as long as this fits into the team's vision, I don't see a problem.
Some programmer pops up, decides he needs X thing implemented to render his stuff the way he wants, without investing much time in checking whether there was already a way of achieving the same result; then he submits his change and it gets into core.
Oh I see. SCEE's is technically "SoA", which stands for Structure of Arrays (or pointers). If we look at the SceneNode declaration from SCEE's paper, it is:
lunkhound wrote: Sorry, I didn't make myself clear. I agree that the SCEE paper is exactly the sort of thing we ought to be doing, but it doesn't mention SoA as I understand it. When I see "SoA" I think of this: http://software.intel.com/en-us/article ... chitecture
Intel has been telling everyone to swizzle their data like this ever since they came out with MMX. My comments were ONLY directed at this Intel-style swizzling, and not at the sort of grouping of homogeneous data structures featured in the SCEE reference. I will refer to it as "swizzling" and not "SoA" for clarity.
Code:
struct StructureOfArrays
{
    float x[numVertices];
    float y[numVertices];
    float z[numVertices];
    ...
};
Code:
class SceneNode
{
    Vector3    *Position; //ptr
    Quaternion *qRot;     //ptr
    Vector3    *vScale;   //ptr
    Matrix4    *matrix;   //ptr
};
I disagree that this is a disadvantage; free labor is rarely bad, particularly if you do have a core team, as Ogre appears to. Many open source projects have become very successful exactly because of this kind of free labor. But this is clearly off-topic.
spookyboo wrote: This is indeed the disadvantage of open source. If you want to redesign Ogre, you need a dedicated team that sticks with it 'till the end' and has a clear vision.
Some programmer pops up, decides he needs X thing implemented to render his stuff the way he wants, without investing much time in checking whether there was already a way of achieving the same result; then he submits his change and it gets into core.
It may be easier to understand if you think of all multithreaded code as providing and expecting guarantees. Different task/workunit scheduling algorithms expect different amounts of thread-safety from their workunits and interact with their workunits based on these assumptions. Some schedulers require no thread safety, some require just re-entrancy, some require full data write isolation, and there are other, more esoteric requirements that are possible. Tasks/WorkUnits will also implicitly make assumptions about their schedulers. They are written differently if workunits finish in a known order, if two workunits are guaranteed to not access the same resources, if every data access needs to be wrapped in a mutex/atomic CAS, and based on what information the scheduler provides the workunit.
Klaim wrote: I don't understand this. To me, whatever the kind of parallel work, it should work with the task scheduler underneath, the same way parallel_for in TBB spawns tasks for each batch of iterations.
As for Ogre's management of threads:
- The CompositorManager must have a high degree of control over its batch threads.
There are likely a few other configurations that can be used when Ogre starts to adjust it, but I agree the thread count is the obvious one. IMHO a good threading design will allow the game developer to interact with the Ogre threading system in at least three ways.
dark_sylinc wrote: To prevent oversubscription, tell Ogre at startup how many threads it can spawn at max.
I've never seen that called SoA before. That's a structure of pointers to structures (or a structure of pointers into arrays of structures). I'm not sure if there is an "official" definition for SoA (nothing on Wikipedia), but I've always seen it mentioned in conjunction with SIMD.
dark_sylinc wrote: Oh I see. SCEE's is technically "SoA", which stands for Structure of Arrays (or pointers). If we look at the SceneNode declaration from SCEE's paper, it is:
lunkhound wrote: Sorry, I didn't make myself clear. I agree that the SCEE paper is exactly the sort of thing we ought to be doing, but it doesn't mention SoA as I understand it. When I see "SoA" I think of this: http://software.intel.com/en-us/article ... chitecture
Intel has been telling everyone to swizzle their data like this ever since they came out with MMX. My comments were ONLY directed at this Intel-style swizzling, and not at the sort of grouping of homogeneous data structures featured in the SCEE reference. I will refer to it as "swizzling" and not "SoA" for clarity.
Code:
struct StructureOfArrays
{
    float x[numVertices];
    float y[numVertices];
    float z[numVertices];
    ...
};
Code:
class SceneNode
{
    Vector3    *Position; //ptr
    Quaternion *qRot;     //ptr
    Vector3    *vScale;   //ptr
    Matrix4    *matrix;   //ptr
};
I've used the MSVC autoexp.dat stuff before, and it works OK, but it is an extra hassle. For one thing, it's global, so if you have different projects with different needs you'll have to merge it all into the global autoexp.dat file somewhere in your "Program Files" directories. Also, the syntax of it may vary with different versions of MSVC (see warning here). We'd probably need to put up a wiki page to help people configure their debuggers. My point is simply that this swizzling of data inside user-facing data structures DOES have a cost. And it's a cost that will be paid by everyone who tries to debug their Ogre-based application (assuming the default is 4 floats per object). If there is no measurable performance gain to be had from it, then it is a net loss.
dark_sylinc wrote: Indeed, Intel's proposal since the introduction of MMX sucked hard. Because when we need to go scalar (we know that happens sooner or later), reading X, Y & Z means three cache fetches, because they're too far apart. It's horrible. Not to mention very inflexible.
That's why I came out with the idea of interleaving the data as XXXXYYYYZZZZ: When we go scalar, it is still one fetch (in systems that fetch 64-byte lines).
Actually, we have nothing to lose and possibly something to gain (performance). And I'll tell you why:
lunkhound wrote: I think looking at those DICE slides again actually convinced me that there is very little to gain from keeping stuff in swizzled format in memory. Just swizzle the frustum planes on the fly and a bit of optimized SIMD code will yield great performance.
If there are any performance gains to be had from swizzling the SceneNodes in memory, I would expect them to be tiny and not at all worth the trouble it would cause every user who has to examine a SceneNode in the debugger.
However, I'm sure there are cases where it would make sense, like a particle-system.
Regardless of whether you want to swizzle in memory or swizzle using instructions, we still have to write the code that ensures all memory is contiguous. Even if we don't use SSE at all (we would use the XYZXYZ model, that is, specifying one float instead of four at compile time), we need continuity, and to be able to load from memory without data dependencies.
My idea is that on PC systems we default to four floats and use SSE. However, if you really, really think debugging is going to be a big problem (even with MSVC's custom data display; I admit not everyone uses MSVC), then compile using one float; and there can also be a "SoA_Vector3" implementation that uses packing instructions to swizzle the memory into the registers on the fly.
After all, SoA_Vector3 & co. are platform-dependent. On PCs with 4 floats per object, it will use SSE intrinsics. On ARM with 2 & 4 floats per object, it will use NEON.
On PCs with 1 float per object, it can use scalar operations... or packing+shuffling SSE intrinsics and still operate on 4 objects at a time, like you suggest.
So, it is a win-win situation. We can have it my way and your way too, with minimal effort (other than writing multiple versions of SoA_Vector3, SoA_Quaternion & SoA_Matrix4). The magic happens in the memory manager that will dictate how the SoA memory gets allocated & arranged. The rest of the systems are totally abstracted from the number of floats interleaved.
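A minimal sketch of how that compile-time "floats per object" choice could be abstracted (OGRE_PACKED_REALS and basePtrOf are hypothetical names; the real memory manager and the SoA_Vector3 variants would be built on top of something like this):
Code:
#include <cstddef>

// Hypothetical sketch: the number of interleaved floats per component is a
// compile-time constant, so the same indexing code yields XYZXYZ (1 float,
// scalar/ARM-friendly) or XXXXYYYYZZZZ (4 floats, SSE-friendly) without
// changing any of the calling code.
#ifndef OGRE_PACKED_REALS
    #define OGRE_PACKED_REALS 4   // 1 = plain scalar layout, 4 = SSE layout
#endif

inline float* basePtrOf( float *chunk, size_t objectIdx, size_t component )
{
    // Which block of OGRE_PACKED_REALS objects, and which slot inside it.
    const size_t block = objectIdx / OGRE_PACKED_REALS;
    const size_t slot  = objectIdx % OGRE_PACKED_REALS;
    // Each block stores 3 components (x, y, z), each OGRE_PACKED_REALS wide.
    return chunk + block * 3 * OGRE_PACKED_REALS + component * OGRE_PACKED_REALS + slot;
}
// With OGRE_PACKED_REALS == 1 this degenerates to the familiar XYZXYZ layout.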
So far, my understanding is that all task scheduler implementations (even a synchronous one) provide at least an "at some point in the future, the provided task will be executed" guarantee.
Sqeaky wrote: It may be easier to understand if you think of all multithreaded code as providing and expecting guarantees. Different task/workunit scheduling algorithms expect different amounts of thread-safety from their workunits and interact with their workunits based on these assumptions. Some schedulers require no thread safety, some require just re-entrancy, some require full data write isolation, and there are other, more esoteric requirements that are possible. Tasks/WorkUnits will also implicitly make assumptions about their schedulers. They are written differently if workunits finish in a known order, if two workunits are guaranteed to not access the same resources, if every data access needs to be wrapped in a mutex/atomic CAS, and based on what information the scheduler provides the workunit.
The default task scheduler should do exactly this: execute the task now, synchronously (call the task execution immediately in the same thread). It's the simplest one and doesn't need any dependency.
If the default Ogre task scheduler provides certain guarantees and the game developer provides a task scheduler of his own, it must provide at least the same guarantees. If the new scheduler provides more, the Ogre tasks/workunits will not be able to take full advantage of them because they are already written. If he provides fewer guarantees he will likely introduce race conditions or deadlocks.
My understanding is that LibDispatch is not what I mean by "task scheduler".
For a more concrete example, please consider Apple's LibDispatch ( http://libdispatch.macosforge.org/ ), which uses a custom barrier primitive, custom semaphores and communication with the scheduler to ensure data consistency, and an Apache WorkQueue ( http://cxf.apache.org/javadoc/latest/or ... Queue.html ), which implicitly assumes that the work unit will provide its own consistency mechanism without any communication from the Queue, and implements a timeout to provide other guarantees. It would be very difficult to write a work unit that would work ideally in both places.
parallel_for from TBB does exactly that (I just checked the code again to be sure I'm correct): it creates a hierarchy of tasks and spawns them (in the global task scheduler). The fact that it's a hierarchy of tasks helps the scheduler manage the tasks' real execution time (and, more importantly, forces each child task to be allocated at memory addresses separate enough to avoid false sharing and other related performance problems), but it's still tasks pushed into the task scheduler.
The parallel_for construct in Threading Building Blocks really is a different class of construct than a scheduler. It is a threading primitive designed to parallelize obviously parallelizable problems. I do not know, but I suspect that many parts of Ogre are not obviously parallelizable, and I suspect that any threading algorithm used in Ogre must be carefully designed to get maximum performance.
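For reference, a minimal tbb::parallel_for sketch of the per-batch task spawning described above (updateDerivedTransforms and updateAllNodes are hypothetical stand-ins, not Ogre functions):
Code:
#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Hypothetical stand-in for the per-range work (e.g. derived transform updates).
void updateDerivedTransforms( size_t first, size_t last );

void updateAllNodes( size_t nodeCount )
{
    // parallel_for splits [0, nodeCount) into ranges and pushes each range as a
    // task into TBB's global scheduler; with a single thread it degrades to a loop.
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, nodeCount ),
        []( const tbb::blocked_range<size_t> &r )
        {
            updateDerivedTransforms( r.begin(), r.end() );
        } );
}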
Which is why I think Ogre shouldn't provide an asynchronous task scheduler, only an interface and a synchronous implementation. Let the user plug his own solution in.
dark_sylinc wrote: What I meant is that the control over the batch (worker) threads is too advanced. Creating a generic task scheduler for them to run on is not a trivial issue at all. It may be something for the far future, IF it seems to be viable.
I disagree, because it is definitely hard to define an algorithm that would decide how many threads to use depending on other factors, like the hardware resources. TBB does that, though, but it assumes that it's the only task scheduler running.
dark_sylinc wrote: To prevent oversubscription, tell Ogre at startup how many threads it can spawn at max.
I agree that there might be bad communication here (I might not be using the right words; in fact, I'm not an academic in this domain, or any, actually). I don't have a public or well-known example, but I see a "simple" (maybe simplistic) way to do it that could be a good starting point.
Do you have a few links to similar implementations of what you have in mind? Because I think I'm not seeing what you see.
The whole concept of 'task scheduler' is still under heavy research. Just for example, on arXiv, noted as recent in the section "computer science/Distributed, Parallel, and Cluster Computing", there is at least one paper clearly talking about work scheduling and about a dozen others covering tangentially related topics. There are many kinds of possible scheduling algorithms, like the two I posited in my earlier post. I intentionally picked two similar, in-production constructs to demonstrate that even when similar, achieving optimal performance with different queues/schedulers would be hard.
Klaim wrote: So far, my understanding is that all task scheduler implementations (even a synchronous one) provide at least an "at some point in the future, the provided task will be executed" guarantee.
This could cost a good deal of performance without tangible benefit. I think evidence/data, theory, or at least use cases should be looked at before asserting how Ogre or any piece of software should be designed. Synchronization mechanisms, mutexes/semaphores, even atomic CAS, are many orders of magnitude slower than single-threaded code. If a task scheduler provides an environment where synchronization mechanisms can be avoided/minimized, it provides a performance boost as well as allowing code to be simpler.
Klaim wrote: The default task scheduler should do exactly this: execute the task now, synchronously (call the task execution immediately in the same thread). It's the simplest one and doesn't need any dependency.
I don't see the relationship between race conditions/deadlocks and task schedulers, because to me it's the user code that has to protect shared data (or not share it).
Even though libdispatch is an ideal example of a modern task scheduler, I don't think it would work on most platforms Ogre runs on. Apple provides implementations on OS X and iOS, and it has been ported to FreeBSD, but "C compiler support for blocks is required" according to http://libdispatch.macosforge.org/ . As far as I know this means Clang, which leaves Windows completely unsupported and would be difficult to use even on Linux.
masterfalcon wrote: From what I understand of our requirements, libdispatch would be nearly perfect. But I don't know about its platform support at this time.
From http://www.ogre3d.org/forums/viewtopic. ... 91#p453845
jwatte wrote: 1) manage mesh->entity objects
2) implement spatial hierarchy
3) do visibility culling
4) do state management/sorting
5) do pass management/sorting
6) load terrain data
7) generate terrain/heightmaps
8) manage sets of (lights, cameras, billboards, animations, entities)
9) do sky rendering
It seems everybody agrees on Transforms -> Cull -> Complex Visuals -> Render, but which stages depend on each other for data and which connections are essential? How many of these become 1 Task/WorkUnit and how many become some data-driven number of Tasks/WorkUnits?
tuan kuranes wrote: Transform Stage -> Ogre transforms the buffers (handling page/locality/etc.), filling the Cull buffers
Cull Stage -> Culling per render target fills Ogre renderqueues
Shading Stage -> Shade/Render each renderqueue according to its "shading/rendering" type into a dx/gl/etc. command buffer
Execute Stage -> merge (or dispatch between GPU/tiles/etc.) all command buffers and execute them (asynchronously)
Herb wrote:
Couple of thoughts.... I like the idea in principle of separating Ogre into more smaller components, but I also realize there can be more complexity with that model "if" you're using a majority of the components. More DLL's to load and register to use for the components. I guess I'm speaking more towards a person who's new to Ogre as there are so many other components to integrate already before even thinking about integrating components within Ogre. If nothing else, it's a thought to consider if that moves forward.
Was just advocating a first step, as in just making sub-libs out of the huge Ogre core lib, more like code reorganisation than refactoring really. Could start with just some folder reorganisation. And only then, the next steps would be easier & faster, as spotting candidates for dependency minimization, components in & out, and therefore DOD refactoring, would be much more obvious. (The "switchable/pluggable" sub-libs are just a bonus side-effect of that and will be possible after the DOD refactoring is done; really not a goal in itself.)
Regarding my view on "separate components":
However, this doesn't mean that each "component" can go into a DLL or static lib. There are multiple issues that can't be easily tackled. This level of modularization is something that sounds nice, but ends up being impractical. Furthermore, the bigger the project, the bigger the chances there's some dependency between seemingly independent modules. I could've written about these dependencies in the slides, but it's very implementation-specific, and it would mislead from the main topic. When objects are highly modular, they're easier to maintain, refactor, or even replace. But that doesn't mean each has to go in its own DLL or lib.
The idea is to make it much simpler than the current structures that are shared by too many stages, by copying only the relevant data between stages using buffers.
It seems everybody agrees on Transforms -> Cull -> Complex Visuals -> Render, but which stages depend on each other for data and which connections are essential? How many of these become 1 Task/WorkUnit and how many become some data-driven number of Tasks/WorkUnits?
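A minimal sketch of that stage-to-stage buffer idea (all type and function names are hypothetical): each stage reads only the buffer the previous stage filled, so the stages can run as independent tasks without sharing the full SceneNode/Entity structures.
Code:
#include <vector>
#include <cstdint>

// Hypothetical per-stage buffers: each stage copies out only the data
// the next stage needs, instead of sharing the full scene structures.
struct CullInput        { std::vector<float> worldAabbs; };          // filled by the Transform stage
struct RenderQueueEntry { std::uint32_t renderableId; std::uint32_t sortKey; };
struct ShadeInput       { std::vector<RenderQueueEntry> visible; };  // filled by the Cull stage
struct CommandBuffer    { std::vector<std::uint8_t> bytes; };        // filled by the Shading stage

void transformStage( CullInput &out );                       // Transforms -> Cull buffers
void cullStage( const CullInput &in, ShadeInput &out );      // Cull -> render queues
void shadeStage( const ShadeInput &in, CommandBuffer &out ); // Render queues -> API command buffer
void executeStage( const CommandBuffer &in );                // dispatch/execute the command buffer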