Advanced compilers for Ogre's scripts

Threads related to Google Summer of Code
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain
x 2

Post by Kencho »

TUT is template based, same as CppUnit and the like... That's what I most hated about them. Getting to grips with them was quite ugly and slow in my opinion. CxxTest simply uses small header files containing the test suites, which you pass to a Perl program that builds the runners. Compile the runners and you're done :) This is the kind of thing you can add to your script/makefile/whatever to make unit testing as automated as possible. From my own experience, it works incredibly well :)
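
To make that concrete, here is a minimal sketch of the CxxTest workflow; the suite name, file name and generator flags are illustrative, not an actual setup:

Code:

// MyTestSuite.h -- a hypothetical CxxTest suite: just a header, no main() needed
#include <cxxtest/TestSuite.h>

class MyTestSuite : public CxxTest::TestSuite
{
public:
    void testAddition()
    {
        TS_ASSERT_EQUALS(1 + 1, 2); // passes
        TS_ASSERT(true);
    }
};

// The Perl generator then builds the runner, which you compile and run, e.g.:
//   cxxtestgen.pl --error-printer -o runner.cpp MyTestSuite.h
//   g++ -o runner runner.cpp && ./runner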

Praetor, I'm running Ubuntu-only right now, and have found Premake a great ally compared to the ugly autotools. Half an hour playing with it and I already had a build system :) Now my script does everything (indeed, I have a --all option that builds the tests, parsers and documentation, installs data and media... ready to compile the regression tests and close a task if the tests report 100% success :)). If you have problems setting it up, let me know :)

Of course, these are all choices you have. I can tell you from my experience that the Premake+CxxTest pair is the most powerful I've tried to date.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

I need to bite the bullet here and get my Ubuntu install back in action. Once I do I'll definitely give Premake a look. I'm sure I'll have a lot of questions about it. The code I plan on making is pretty generic, and I don't think there is a lot of danger of hitting platform-specific problems, but I'll make the attempt to give my stuff a fair amount of testing on Linux throughout the project.

Since I am unabashedly in love with templates, I don't mind them in TUT. I'm the kind of guy who tries to find the unnecessarily complex template meta-programming route just for fun. Don't worry, I can restrain myself for this project though. The most important things are ease of integration and quick turn-around. I don't need fancy analysis or output; I just want to maybe scan a text file and see passes and failures. Of course, being able to have the tests run automatically upon building will be nice.
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain
x 2

Post by Kencho »

Well, I've just put up a "small" guide on the development process based on my experience. It's surely no big deal for you, but maybe you can pick up some ideas about Premake ;)
neocryptek
Gnome
Posts: 335
Joined: Sat Mar 01, 2003 11:21 pm
Location: Idaho, USA

Post by neocryptek »

Not to go too far off topic, but I am rather fond of UnitTest++. It hides pretty much all complexity and "house cleaning" for tests in macros, so you have very little friction in writing tests. It is also very small so it is easy to port to consoles and the like. It was created in a production environment at High Moon Studios for creating games.

The purists might cringe at the use of macros, but personally this was the only way I could get into unit testing. All the code required just to declare a test in other frameworks seemed ridiculous to me; why make testing that painful? This is of course IMO; there are a billion other frameworks out there, so find one that suits your fancy.

It really is super simple to use:

Code:

#include "UnitTest++.h"

TEST(MyFirstTest)
{
	CHECK(false); // a deliberately failing check, just to show the failure report
}
All tests can be automatically run with a single call, returning the number of failures (for build scripts, etc):

Code:

#include "UnitTest++.h"

int main(int, char *[])
{
	return UnitTest::RunAllTests();
}
And that's it. Supports all the usual advanced things like fixtures, suites, etc too.
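
For completeness, a small sketch of what those fixtures and suites look like (the fixture and suite names here are made up):

Code:

#include "UnitTest++.h"

// A hypothetical fixture: constructed before each TEST_FIXTURE body, destroyed after it.
struct CounterFixture
{
    CounterFixture() : count(0) {}
    int count;
};

SUITE(CounterTests)
{
    TEST_FIXTURE(CounterFixture, StartsAtZero)
    {
        CHECK_EQUAL(0, count); // fixture members are directly visible in the test body
    }
}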

But enough of my rambling; looking forward to this Summer of Code. Good luck on your script improvements, Praetor! Definitely looking forward to seeing this thing in action.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

Here is a quick diagram to show a few relationships in my current design. As you can see there are three distinct concepts in play.

[Diagram: the three concepts in the current design and their relationships]

I'll probably be using TUT (I haven't played with it yet, but unless something incredibly bad happens, it looks like it fits my style nicely). Logistically it makes sense for this to be in a branch which can be merged with Shoggoth before release, but perhaps I should also make the tests available. Should we put those in addons, or maybe in the same branch? A ScriptCompilerTests folder in addons seems fine to me.

[EDIT] Omission in the diagram: there should be a link between ScriptNode and ScriptToken, as ScriptNode contains a reference to a specific ScriptToken. I'll make the change later today.
Last edited by Praetor on Sun Apr 15, 2007 8:24 pm, edited 1 time in total.
sinbad
OGRE Retired Team Member
Posts: 19269
Joined: Sun Oct 06, 2002 11:19 pm
Location: Guernsey, Channel Islands
x 66

Post by sinbad »

It's quite alright for this to go directly into HEAD if you can switch it on/off with a simple #define, since most of this will be new classes. You could keep the switch set locally to use the new compilers; this is how nfz worked on his compiler system without altering anyone else's work.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

I suppose that works out even easier than branching. Do the SoC people have commit access yet, or is that still being set up?
sinbad
OGRE Retired Team Member
Posts: 19269
Joined: Sun Oct 06, 2002 11:19 pm
Location: Guernsey, Channel Islands
x 66

Post by sinbad »

I was waiting to get initial high-level plans from all projects first, so I could accompany the initial CVS intro with specifics about whether each project is using a branch, an addon or HEAD. I'm still waiting for one contributor license agreement too, which I'll chase up.

Since it sounds like you're all ready to go right now I'll add you, but you just need to confirm your SourceForge account (your licensing mail didn't include that).

On the testing tools side, since we're going to be using Boost in v1.6 it would be easiest for you to use boost::test rather than another testing tool (although we've used CppUnit in ogrenew/tests so far), unless there's a really significant reason not to.
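
For reference, a minimal boost::test case looks roughly like this (the module and test names are invented, and this uses the header-only runner):

Code:

#define BOOST_TEST_MODULE ScriptCompilerTests
#include <boost/test/included/unit_test.hpp>

BOOST_AUTO_TEST_CASE(lexer_smoke_test)
{
    BOOST_CHECK_EQUAL(2 + 2, 4); // the generated main() runs every registered case
}
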
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

I'm ready to fire up some more discussion. I'll start with a question:

How reliable is the mechanism of assigning an enum member a specific value and then counting on the compiler to assign incremented values from there?

Code:

enum{Test1=0, Test2, Test3};
Can I be fairly certain Test2=1 and Test3 = 2?

As an update, I have checked out Ogre HEAD and started to work in the ScriptLexer class. Once I've got a few methods stubbed out I'll write the unit tests. I'm committing code right into HEAD. The flag that has been added to OgreConfig.h is OGRE_NEW_SCRIPT_COMPILERS; setting it to 1 enables them. Even set to 0, all the classes will be defined, but the MaterialManager and such will use the old compilers. For now, I'll host the unit tests I use on my uni web space.
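
As a rough illustration of how such a flag is typically consumed (the function below is invented; only the flag name comes from the post):

Code:

// A sketch (not actual Ogre code) of how a manager class might branch on the flag.
#include "OgreConfig.h"

void parseMaterialScript(/* DataStream, group name, ... */)
{
#if OGRE_NEW_SCRIPT_COMPILERS
    // route the script through the new ScriptLexer and compiler classes
#else
    // fall back to the old material script compiler path
#endif
}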

I'll post back in a little while with a topic to kick off some design and architecture discussion.
Lioric
Google Summer of Code Mentor
Posts: 295
Joined: Fri Aug 06, 2004 10:25 pm

Post by Lioric »

Absolutely. If an enumerator value is not defined, then it will be enumN == prevEnum + 1 (the value increases by 1 for each succeeding identifier), or, if it's the initial enumerator, it will be 0.
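
A tiny illustration of that rule (the names are made up); an explicit value simply restarts the counting:

Code:

enum TokenType
{
    Test1 = 0,    // explicitly 0
    Test2,        // previous + 1 == 1
    Test3,        // 2
    Special = 10, // explicit again
    Test4         // continues from the explicit value: 11
};
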
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain
x 2

Post by Kencho »

Praetor wrote:

Code:

enum{Test1=0, Test2, Test3};
Can I be fairly certain Test2=1 and Test3 = 2?
AFAIK, yes. However, I think you can explicitly assign them numbers (though I'm not sure whether this is ANSI C or something I've only seen in code for particular compilers).
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

Well, I only got as far as stubbing out the interface for the lexer and writing a few basic unit tests. The lexer's state machine is already written; I just need to port it into the core and clean up the error handling. I'll be back this week to go in depth on the lexer, get some discussion going about its design, and maybe cover some performance concerns.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

Let's talk about the lexer! Yeah!

I'll start the discussion with the difference between this lexer and a tokenizer, like the C library's strtok, Boost's tokenizer, etc. Just skip to the next section if you already get it, or don't want to, or whatever. A tokenizer splits input into a series of substrings based on separators. Tokenizers are often generic in that they can accept their separators as an argument. A lexer does a little bit more than that. It splits the input into a series of substrings called lexemes. These lexemes are also assigned values, which turns each lexeme into a token. Some lexers might be generic as well, driven by user-supplied token definitions rather than a fixed language.

This lexer will not be that generic. It is designed for Ogre's scripting language alone. Dropping this genericity gives us some room for optimizations. The lexer will split the input into lexemes and assign values to them. Individual compilers sitting on top of the lexer can map values to specific lexemes (for instance, the material compiler can assign a special value to the lexeme "material"). Why assign values? The short answer is that the string comparisons happen only once. Once a value is assigned, that value can be used from then on to compare tokens more clearly and quickly.

The lexer itself is a simple state machine. This is what I've worked up while thinking through the possibilities, trying to find holes, etc.

[Diagram: the lexer's state machine]

Originally, I had the lexer storing some extra state. As it is now, only a few pieces of data need to be stored in case of error. The state machine doesn't need to remember old input data. There is one thing to clarify: there are three paths to parsing a lexeme: InToken, InSingleQuote, InDoubleQuote. The quote paths are there to allow some more flexible input in Ogre's scripting system, for instance names with spaces (e.g. material "This Name has Spaces"{...}). The quoted paths create tokens without their quotes included (i.e. the above is returned as: This Name has Spaces). I've been wondering whether adding the ability to accept quoted strings with embedded quotes is necessary. The embedded quotes would need to be escaped C-style with a backslash ("\'"). This can be achieved in the above state machine with two extra states added to the two quoted paths.
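
To make the quoted paths concrete, here is a minimal sketch of that part of the machine (this is not the committed ScriptLexer; the state and function names are illustrative). Quotes are dropped from the emitted lexeme and a backslash escapes an embedded quote:

Code:

#include <cctype>
#include <string>
#include <vector>

enum LexState { READY, IN_TOKEN, IN_DOUBLE_QUOTE };

std::vector<std::string> splitLexemes(const std::string& input)
{
    std::vector<std::string> lexemes;
    std::string current;
    LexState state = READY;

    for (std::string::size_type i = 0; i < input.size(); ++i)
    {
        char c = input[i];
        if (state == READY)
        {
            if (c == '"') state = IN_DOUBLE_QUOTE; // opening quote is not stored
            else if (!std::isspace(static_cast<unsigned char>(c))) { current += c; state = IN_TOKEN; }
        }
        else if (state == IN_TOKEN)
        {
            if (std::isspace(static_cast<unsigned char>(c))) { lexemes.push_back(current); current.clear(); state = READY; }
            else current += c;
        }
        else // IN_DOUBLE_QUOTE
        {
            if (c == '\\' && i + 1 < input.size() && input[i + 1] == '"') { current += '"'; ++i; } // escaped quote
            else if (c == '"') { lexemes.push_back(current); current.clear(); state = READY; } // closing quote is not stored
            else current += c;
        }
    }
    if (!current.empty()) lexemes.push_back(current); // flush a trailing unquoted token
    return lexemes;
}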

I'm going to be checking in the first version of the lexer soon, and hosting the test files which I'll link to. I just want to make sure the files are in a clean format first.

Questions, comments, suggestions?
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain
x 2

Post by Kencho »

Hmm, so are you using the string content as the token name? I assume not, but if so, then you're wrong. The idea would be to create a "String" token with an embedded "value" (or "image", as some lexers call it) with the string contents, number values... Anyways I guess you know about that.

Regarding your diagram, I think you should stick to BNF notation to "standardize" the documentation.
I was about to link you to a tool you would find extremely useful, but the site seems to be down right now! :(
[Update]
Here it is: http://pisuerga.inf.ubu.es/cgosorio/THOTH/
That's the graduation project of Andrés Arnáiz (ShingoHill).
The website is in Spanish, though you can download the application (English and Spanish supported). You have a zipped version here.
You might find it useful for some finite state machine and lexer testing.
Hope this helps.

One thing: the parser will be backwards-compatible, right? Do you have the grammar available? I would like to take a look at it if possible.

You're doing good work, mate. Keep it up! :)
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

Kencho wrote:Hmm, so are you using the string content as the token name? I assume not, but if so, then you're wrong. The idea would be to create a "String" token with an embedded "value" (or "image", as some lexers call it) with the string contents, number values... Anyways I guess you know about that.
I'm not exactly sure what you mean by this. The string value is the lexeme; an integer value is the "image." My super-simple ScriptToken struct just looks like this:

Code:

struct ScriptToken
{
    String lexeme; // the raw text as it appears in the script
    uint32 type;   // numeric token id assigned during lexing/compilation
};
And yes, the parsers will be fully backwards compatible. I think the step happening here that hasn't formally been done before is freezing the script "language" used across all scripts in Ogre. It'll all be the same format, just that the phrases will be different. I'll work up a BNF grammar, but the state machine shows the exact workings of the lexer. I'll check out your link later today.
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain
x 2

Post by Kencho »

Ah, okay, I was mixing up lexer and parser theory. However, the term "image" relates to the lexeme, not the token type. If you later read about other parsers like JavaCC this might confuse you (though you probably won't need to take a look at them :))
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

Another facet to this discussion:

The above state machine does not specify what happens when a token is actually found. Once a lexeme is split from the input, it must be categorized and turned into a token. There are really two stages to this: 1) a map of string to integer is checked, and if an entry for that lexeme exists, the token is assigned the mapped integer value; 2) if a user-defined type is not found, the built-in types (open bracket, close bracket, colon, import, general phrase, newline, or unknown) are checked and assigned to the token.
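
A sketch of those two stages, with invented names and only a couple of the built-in types, just to pin the idea down:

Code:

#include <map>
#include <string>

typedef unsigned int uint32; // stand-in for Ogre::uint32

enum { TID_UNKNOWN = 0, TID_LBRACKET, TID_RBRACKET, TID_NEWLINE, TID_PHRASE };

uint32 assignTokenType(const std::map<std::string, uint32>& wordMap, const std::string& lexeme)
{
    // Stage 1: compiler-supplied keywords ("material", "pass", ...) win if present.
    std::map<std::string, uint32>::const_iterator it = wordMap.find(lexeme);
    if (it != wordMap.end())
        return it->second;

    // Stage 2: otherwise fall back to the built-in types.
    if (lexeme == "{") return TID_LBRACKET;
    if (lexeme == "}") return TID_RBRACKET;
    if (lexeme == "\n") return TID_NEWLINE;
    return TID_PHRASE; // a general phrase
}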

The map brings up some points about performance. Compilers like the material compiler or particle compiler will have quite a lot of custom token assignments (material, technique, pass, diffuse, specular, etc.) causing long map lookups during lexing. During testing, I noticed no slowdown, but I wouldn't be surprised if more rigorous performance tests reveal this to be one of the bottlenecks. Any real thoughts on this?
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain
x 2

Post by Kencho »

Hmm... some thoughts on this:
First, I would check keywords and key symbols ("material", "{" ...). Those are reserved, and checking them in a different order will cause a lot of problems. For instance, you could have "material material {...}" treated as "<str> <str> <openbr> ... <closebr>", while it should read as "<tkmat> <tkmat> <openbr> ... <closebr>", with the consequent syntax error.

I would just keep a symbol table for the defined symbols, not for all tokens. That saves memory and time, and is the usual way to parse grammars that allow variables and need some kind of processing. Then build a syntax analysis tree and walk it in the parser. There, certain leaves can reference the symbol table, which would in turn reference other tree branches (something like #include at parsing time).

What do you think?
Lioric
Google Summer of Code Mentor
Posts: 295
Joined: Fri Aug 06, 2004 10:25 pm

Post by Lioric »

The map performance is not an issue (remember that this is part of the "80%" part of the code)

Later, after profiling, you can switch (with a define or typedef) to a hash_map, which in theory can provide O(1) access to the values (depending on the hash function used).
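
For illustration, that switch can be as small as a single typedef (the names here are invented):

Code:

#include <map>
#include <string>

typedef unsigned int uint32; // stand-in for Ogre::uint32

// Today: a plain std::map (O(log n) lookups). After profiling, this one typedef could be
// retargeted at a hash map (e.g. stdext::hash_map on MSVC or __gnu_cxx::hash_map on GCC)
// without touching any of the code that uses WordIDMap.
typedef std::map<std::string, uint32> WordIDMap;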

But here we are handling only a few container values; there are probably no more than 20 to 50 keywords (tokens) total. At this level there is no point in spending too much time on this initially. Later, if a bottleneck is found, the "algorithm level" is what should be revised first, as the tokenizer or the string functions and code sections are better candidates for performance optimization.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

Lioric wrote: But here we are handling only a few container values; there are probably no more than 20 to 50 keywords (tokens) total
My thoughts exactly.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

I have finally gotten around to polishing and committing the lexer. It is in Ogre's HEAD in OgreScriptLexer.h/cpp.

I did not commit any updated projects. I can only keep the VS8 projects up-to-date at this point, and I wasn't sure if I should do that. As it stands, just add those two files to any of the projects and you can compile away.

I've also uploaded the tests I used for the lexer so far. You can get a zip of it here: http://www.rit.edu/~bjj1478/files/SoCTests.zip.
The zip archive contains the test cases, the required TUT .h file and a VS8 project for the whole thing.

I feel like the lexer has gone through enough paces to be ready to move on. I'm going to start on the parser, which creates and does some processing on the abstract syntax trees. I'll post a little later with design and implementation details on the parser. I'll also probably be a little more granular with my commits in the future. This time I committed the lexer only once it was practically finished (minus some more detailed comments).
PolyVox
OGRE Contributor
Posts: 1316
Joined: Tue Nov 21, 2006 11:28 am
Location: Groningen, The Netherlands
x 18

Post by PolyVox »

Praetor wrote:I have finally gotten around to polishing and committing the lexer. It is in Ogre's HEAD in OgreScriptLexer.h/cpp.
What does this mean in terms of eventual integration into Ogre? Is it anticipated for 1.6, or for a later release of 1.4.x?
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

I think the plan was to release it in 1.6.

Though I'll leave that decision entirely up to Sinbad. I don't foresee any current API changes needed, only API additions.

If Ogre does indeed move to a Boost dependency, then this code will probably no longer be compatible with 1.4 without a bit of work on your part.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US
x 3

Post by Praetor »

I've done a little more work beefing up error handling in the lexer. I've turned my attention to the parser now, but I have an idea to do a quick test/prototype.

I originally rejected the use of flex/bison/yacc for all the reasons people normally reject parser generators. I've done some more research, and decided to do a prototype using ANTLR. ANTLR produces LL parsers as opposed to YACC's LALR parsers. LL parsers are more human-readable, and therefore more human-editable. I can tweak the parser that gets generated much more easily than one produced by YACC (which ends up being a table of ints, yuck!). ANTLR appears to have decent error messages, but I can add custom messages myself later. The issue is that Ogre's scripts aren't trivial, and I don't want to waste too much time playing with ANTLR's language. Even after a parser is generated I need to write the actual compilers for each resource type, which should be the main focus of this SoC project. So, hopefully a prototype from ANTLR can be made quickly. For the time being I will only commit the generated parsers and not the grammar file (.g) that produced them. I'll make that available with the tests.

So, the overall plan is the same: a parser creates an AST (this is what ANTLR will do), and compilers take the AST and create specific Ogre resources.

The issue I want feedback on is a design one. The AST will need special processing. This processing needs to accomplish two things *before* a compiler handles it: 1) handle import commands, to explicitly handle script dependencies, and 2) substitute variables within scripts (so $var becomes the actual text it should be). This can be done by wrapping the generated parser with my own class, which invokes the parsing, does the processing on the AST, and then returns the processed AST. Or... I could provide a ScriptCompiler base class which defines the compile function taking in the AST. This base class does the processing and then delegates further AST processing to a virtual function, so derived classes can do the actual compilation. Anyone have any ideas?
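
A rough sketch of the second option, a template-method style base class; all names and signatures here are illustrative, not the actual Ogre API:

Code:

// Hypothetical AST node type, just for the sketch.
struct AbstractNode;

class ScriptCompiler
{
public:
    virtual ~ScriptCompiler() {}

    // Non-virtual entry point: the shared pre-processing happens once, here.
    bool compile(AbstractNode* ast)
    {
        processImports(ast);      // resolve "import" commands into the tree
        substituteVariables(ast); // replace $var occurrences with their values
        return compileImpl(ast);  // hand the processed AST to the concrete compiler
    }

protected:
    // Each resource compiler (material, particle, ...) implements only this part.
    virtual bool compileImpl(AbstractNode* ast) = 0;

private:
    void processImports(AbstractNode* /*ast*/) { /* ... */ }
    void substituteVariables(AbstractNode* /*ast*/) { /* ... */ }
};

class MaterialScriptCompiler : public ScriptCompiler
{
protected:
    virtual bool compileImpl(AbstractNode* /*ast*/) { /* walk the AST, build Materials */ return true; }
};
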
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain
x 2

Post by Kencho »

I'm using ANTLR in my project. A good pick ;)

ANTLR has a higher customisation level than I originally thought, so don't underestimate its capabilities! I'm quite sure you can handle all the errors from the grammar file directly instead of tweaking the generated code ;)

About your question... I haven't played enough with ANTLR, but I guess you could invoke the lexer/parser recursively from inside the parser, and thus parse the base script and all its imports. Regarding variables, I still vote for some kind of symbol table, doing the substitution at parse time (adding the key-value pair when a variable is declared, reading its value when it's used).
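
A tiny sketch of that parse-time substitution idea, with invented names (not tied to ANTLR or to the actual implementation):

Code:

#include <map>
#include <string>

// Symbol table built while parsing: variable declarations add entries, later uses read them.
typedef std::map<std::string, std::string> SymbolTable;

std::string resolve(const SymbolTable& symbols, const std::string& lexeme)
{
    if (!lexeme.empty() && lexeme[0] == '$')
    {
        SymbolTable::const_iterator it = symbols.find(lexeme);
        if (it != symbols.end())
            return it->second; // substitute the stored value
    }
    return lexeme; // not a variable (or undefined): leave it alone
}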

What do you think?