Advanced compilers for Ogre's scripts

Threads related to Google Summer of Code
sinbad
OGRE Retired Team Member
Posts: 19269
Joined: Sun Oct 06, 2002 11:19 pm
Location: Guernsey, Channel Islands

Post by sinbad »

I was waiting to get initial high-level plans from all projects first, so I could accompany the initial CVS intro with specifics about whether each project is using a branch, an addon, or HEAD. I'm still waiting on one contributor license agreement too, which I'll chase up.

Since it sounds like you're all ready to go right now, I'll add you, but you just need to confirm your Sourceforge account (your licensing mail didn't include that).

On the testing tools side, since we're going to be using Boost in v1.6 it would be easiest for you to use boost::test rather than another testing tool (although we've used CppUnit in ogrenew/tests so far), unless there's a really significant reason not to.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

I'm ready to fire up some more discussion. I'll start with a question:

How reliable is the mechanism of assigning an enum member a specific value and then counting on the compiler to assign incremented values from there?

Code: Select all

enum{Test1=0, Test2, Test3};
Can I be fairly certain that Test2 == 1 and Test3 == 2?

As an update, I have checked out Ogre HEAD and started threading in the ScriptLexer class. Once I've got a few methods stubbed out I'll write the unit tests. I'm committing code right into HEAD. The flag that has been added to OgreConfig.h is OGRE_NEW_SCRIPT_COMPILERS; setting it to 1 enables the new compilers. Even when set to 0, all the classes will be defined, but the MaterialManager and such will use the old compilers. For now, I'll host the unit tests I use on my uni web space.

I'll post back in a little while with a topic to kick off some design and architecture discussion.
Lioric
Google Summer of Code Mentor
Posts: 295
Joined: Fri Aug 06, 2004 10:25 pm

Post by Lioric »

Absolutely. If an enumerator value is not defined, then it will be "enumN == prevEnum + 1" (the value increases by 1 for each succeeding identifier), or, if it's the initial enumerator, it will be 0.
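
That guarantee can be sanity-checked at compile time. A minimal sketch (static_assert is just a modern C++ convenience here):

Code: Select all

enum { Test1 = 0, Test2, Test3 };

// Each enumerator without an explicit value is the previous one plus 1;
// the first one defaults to 0.
static_assert(Test2 == 1, "Test2 follows Test1");
static_assert(Test3 == 2, "Test3 follows Test2");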
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

Praetor wrote:

Code: Select all

enum{Test1=0, Test2, Test3};
Can I be fairly certain that Test2 == 1 and Test3 == 2?
AFAIK, yes. However, I think you can also explicitly assign them numbers (though I'm not sure if this is ANSI C or something I've only seen in compiler-specific code).
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

Well, I only got as far as stubbing out the interface for the lexer and writing a few basic unit tests. The lexer's state machine is already written; I just need to port it into the core and clean up the error handling. I'll be back this week to go in-depth on the lexer, get some discussion going about its design, and maybe raise some performance concerns.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

Let's talk about the lexer! Yeah!

I'll start the discussion with the difference between this lexer and a tokenizer, like the C library's strtok, Boost's tokenizer, etc. Just skip ahead to the next section if you already get it, or don't want to, or whatever. A tokenizer splits input into a series of substrings based on separators. Tokenizers are often generic in that they can accept their separators as an argument. A lexer does a little bit more than that. It splits the input into a series of substrings called lexemes, and assigns each lexeme a value, which turns the lexeme into a token. Some lexers are generic too, e.g. able to recognize the tokens of a general grammar.
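
To make the distinction concrete, here's a toy comparison in plain C++ (the token ids in the comment are invented for illustration):

Code: Select all

#include <cstdio>
#include <cstring>

int main()
{
    // A tokenizer only splits on separators and returns raw substrings.
    char buf[] = "material Simple { diffuse 1 0 0 }";
    for (char* t = std::strtok(buf, " "); t != 0; t = std::strtok(0, " "))
        std::printf("substring: %s\n", t);

    // A lexer would additionally tag each substring with a value, e.g.
    // ("material", ID_MATERIAL), ("{", ID_LBRACE), ("1", ID_NUMBER).
    return 0;
}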

This lexer will not be that generic. It is designed for Ogre's scripting language alone. Dropping the genericity gives us some room for optimizations. The lexer will split the input into lexemes and assign values to them. Individual compilers sitting on top of the lexer can map specific lexemes to values (for instance, the material compiler can assign a special value to the lexeme "material"). Why assign values? The short answer is that the string comparisons happen only once. Once a value is assigned, that value can be used from then on to compare tokens more clearly and quickly.
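
The payoff looks something like this (the id and names here are hypothetical, not the committed API):

Code: Select all

#include <string>

typedef unsigned int uint32;

// Hypothetical id the material compiler registers for the lexeme "material".
static const uint32 ID_MATERIAL = 100;

struct Token { std::string lexeme; uint32 type; };

bool isMaterialKeyword(const Token& t)
{
    // One integer compare per use, instead of re-comparing the string
    // "material" everywhere the parser needs to check for it.
    return t.type == ID_MATERIAL;
}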

The lexer itself is a simple state machine. This is what I've worked up while thinking through the possibilities, trying to find holes, etc.

[Image: state machine diagram for the lexer]

Originally, I had the lexer storing some extra state. As it is now, only a few pieces of data need to be stored in case of error. The state machine doesn't need to remember old input data. There is one thing to clarify: there are three paths to parsing a lexeme: InToken, InSingleQuote, and InDoubleQuote. The quoted paths are there to allow some more flexible input in Ogre's scripting system, for instance names with spaces (e.g. material "This Name has Spaces"{...}). The quoted paths create tokens without their quotes included (i.e. the above is returned as: This Name has Spaces). I've been wondering whether the ability to accept quoted strings with embedded quotes is necessary. The embedded quotes would need to be escaped C-style with a backslash ("\'"). This can be achieved in the above state machine with two extra states added to the two quoted paths.
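
For instance (hypothetical input; this is the case the two extra states would cover), the lexer would strip the quotes and unescape the backslashes, emitting the lexeme The "Shiny" Name:

Code: Select all

material "The \"Shiny\" Name"
{
}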

I'm going to be checking in the first version of the lexer soon, and hosting the test files which I'll link to. I just want to make sure the files are in a clean format first.

Questions, comments, suggestions?
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

Hmm, so are you using the string content as the token name? I assume not, but if so, then you're wrong. The idea would be to create a "String" token with an embedded "value" (or "image", as some lexers call it) holding the string contents, number values, etc. Anyway, I guess you know about that.

Regarding your diagram, I think you should stick to BNF notation to "standardize" the documentation.
I was about to link you to a tool you would find extremely useful, but the site seems to be down right now! :(
[Update]
Here it is: http://pisuerga.inf.ubu.es/cgosorio/THOTH/
That's the graduation project of Andrés Arnáiz (ShingoHill).
The website is in Spanish, though you can download the application (English and Spanish supported). You have a zipped version here.
You might find it useful for some finite state machine and lexer testing.
Hope this helps.

One thing: the parser will be backwards-compatible, right? Do you have the grammar available? I'd like to take a look at it if possible.

You're doing good work, mate. Keep it up! :)
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

Kencho wrote:Hmm, so are you using the string content as the token name? I assume not, but if so, then you're wrong. The idea would be to create a "String" token with an embedded "value" (or "image", as some lexers call it) holding the string contents, number values, etc. Anyway, I guess you know about that.
I'm not exactly sure what you mean by this. The string value is the lexeme; an integer value is the "image". My super-simple ScriptToken struct looks like this:

Code: Select all

struct ScriptToken
{
    String lexeme; // the raw text split out of the input
    uint32 type;   // the numeric token id assigned during lexing
};
And yes, the parsers will be fully backwards compatible. I think the step happening here that hasn't formally been done before is freezing the script "language" used across all of Ogre's scripts. They will all share the same format; only the phrases will differ. I'll work up a BNF grammar, but the state machine shows the exact workings of the lexer. I'll check out your link later today.
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

Ah, okay, I was mixing up lexer and parser theory. However, the term "image" refers to the lexeme, not the token type. If you later read about other parser generators like JavaCC this might confuse you (though you probably won't need to take a look at them :))
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

Another facet to this discussion:

The above state machine does not specify what happens when a token is actually found. Once a lexeme is split from the input, it must be categorized and assigned to a token. There are really two stages to this: 1) a map from string to integer is checked, and if an entry for that lexeme exists, the token is assigned the mapped integer value; 2) if a user-defined type is not found, the built-in types (open bracket, close bracket, colon, import, general phrase, newline, or unknown) are checked and assigned to the token.
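
A minimal sketch of that two-stage lookup (every name here is a stand-in, not the committed API):

Code: Select all

#include <map>
#include <string>

typedef std::string String;
typedef unsigned int uint32;

// Hypothetical built-in token ids.
enum { TID_LBRACE, TID_RBRACE, TID_COLON, TID_IMPORT, TID_NEWLINE, TID_WORD };

uint32 assignType(const String& lexeme, const std::map<String, uint32>& userTypes)
{
    // 1) user-defined lexemes first (e.g. "material" -> a compiler-specific id)
    std::map<String, uint32>::const_iterator it = userTypes.find(lexeme);
    if (it != userTypes.end())
        return it->second;

    // 2) otherwise fall back to the built-in categories
    if (lexeme == "{")      return TID_LBRACE;
    if (lexeme == "}")      return TID_RBRACE;
    if (lexeme == ":")      return TID_COLON;
    if (lexeme == "import") return TID_IMPORT;
    if (lexeme == "\n")     return TID_NEWLINE;
    return TID_WORD; // general phrase / unknown
}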

The map brings up some points about performance. Compilers like the material compiler or particle compiler will have quite a lot of custom token assignments (material, technique, pass, diffuse, specular, etc.), causing long map lookups during lexing. During testing I noticed no slowdown, but I wouldn't be surprised if more rigorous performance tests reveal this to be one of the bottlenecks. Any real thoughts on this?
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

Hmm... some thoughts on this:
First, I would check for keywords and key symbols ("material", "{", ...) first. Those are reserved, and checking them in a different order will cause a lot of problems. For instance, you could have "material material {...}" treated as "<str> <str> <openbr> ... <closebr>", while it should read as "<tkmat> <tkmat> <openbr> ... <closebr>", with the consequent syntax error.

I would keep a symbol table only for the user-defined symbols, not for all tokens. That saves memory and time, and it is the usual way to parse grammars that allow variables and need some kind of processing. Then build a syntax analysis tree and walk it in the parser. There, certain leaves can reference the symbol table, which would in turn reference other tree branches (something like #include at parse time).

What do you think?
Lioric
Google Summer of Code Mentor
Posts: 295
Joined: Fri Aug 06, 2004 10:25 pm

Post by Lioric »

The map performance is not an issue (remember that this is in the "80%" part of the code).

Later, after profiling, you can switch (with a define or typedef) to a hash_map, which in theory can provide O(1) access to the values (depending on the hash function used).

But here we are handling only a few container values; there are probably no more than 20 to 50 keywords (tokens) total. At this level there is no point wasting too much time initially. Later, if a bottleneck is found, the "algorithm level" should be revised first, as the tokenizer or the string functions and related code sections are better candidates for performance optimization.
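
The define-controlled switch could look like this (OGRE_USE_HASH_MAP is an invented flag, and pre-C++11 hash maps were compiler-specific; stdext::hash_map is the VC8 flavour):

Code: Select all

#include <string>

#if defined(OGRE_USE_HASH_MAP) && defined(_MSC_VER)
#  include <hash_map>
   typedef stdext::hash_map<std::string, unsigned int> TokenIdMap; // ~O(1) lookups
#else
#  include <map>
   typedef std::map<std::string, unsigned int> TokenIdMap;         // O(log n) lookups
#endif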
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

Lioric wrote: But here we are handling only a few container values; there are probably no more than 20 to 50 keywords (tokens) total.
My thoughts exactly.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

I have finally gotten around to polishing and committing the lexer. It is in Ogre's HEAD in OgreScriptLexer.h/cpp.

I did not commit any updated project files. I can only keep the VS8 projects up-to-date at this point, and I wasn't sure if I should do that. As it stands, just add those two files to any of the projects and you can compile away.

I've also uploaded the tests I used for the lexer so far. You can get a zip of it here: http://www.rit.edu/~bjj1478/files/SoCTests.zip.
The zip archive contains the test cases, the required TUT header file, and a VS8 project for the whole thing.

I feel like the lexer has been through enough paces to be ready to move on. I'm going to start on the parser, which creates and does some processing on the abstract syntax trees (ASTs). I'll post a little later with design and implementation details for the parser. I'll also probably be more granular with my commits in the future; this time I committed the lexer only once it was practically finished (minus some more detailed comments).
PolyVox
OGRE Contributor
Posts: 1316
Joined: Tue Nov 21, 2006 11:28 am
Location: Groningen, The Netherlands

Post by PolyVox »

Praetor wrote:I have finally gotten around to polishing and committing the lexer. It is in Ogre's HEAD in OgreScriptLexer.h/cpp.
What does this mean in terms of eventual integration into Ogre? Is it anticipated for 1.6, or for a later release of 1.4.x?
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

I think the plan was to release it in 1.6.

Though I'll leave that decision entirely up to Sinbad. I don't foresee any API changes being needed, only API additions.

If Ogre does indeed move to a Boost dependency, then this code will probably no longer be compatible with 1.4 without a bit of work on your part.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

I've done a little more work beefing up error handling in the lexer. I've turned my attention to the parser now, but I have an idea to do a quick test/prototype.

I originally rejected the use of flex/bison/yacc for all the reasons people normally reject parser generators. I've done some more research, and decided to do a prototype using ANTLR. ANTLR produces LL parsers, as opposed to YACC's LALR parsers. LL parsers are more human-readable, and therefore more human-editable. I can tweak the generated parser much more easily than one produced by YACC (which ends up being a table of ints, yuck!). ANTLR appears to have decent error messages, and I can add custom messages myself later. The issue is that Ogre's scripts aren't trivial, and I don't want to waste too much time playing with ANTLR's language. Even after a parser is generated I need to write the actual compilers for each resource type, which should be the main focus of this SoC project. So, hopefully a prototype from ANTLR can be made quickly. For the time being I will only commit the generated parsers and not the grammar file (.g) that produced them. I'll make that available with the tests.

So, the overall plan is the same: a parser creates an AST (this is what ANTLR will do), and compilers take the AST and create specific Ogre resources.

The issue I want feedback on is a design one. The AST will need special processing. This processing needs to accomplish two things *before* a compiler handles it: 1) handle import commands, to explicitly handle script dependencies; 2) substitute variables within scripts (so $var becomes the actual text it should be). This can be done by wrapping the generated parser with my own class, which invokes the parsing, then does processing on the AST and returns the processed AST. Or... I could provide a ScriptCompiler base class which defines the compile function taking in the AST. This base class does the processing and then delegates further AST processing to a virtual function, so derived classes can do the actual compilation. Anyone have any ideas?
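
A minimal sketch of the second option, a template method (every name here is illustrative, not the committed API):

Code: Select all

#include <memory>

struct AbstractNode { /* AST node, details omitted */ };
typedef std::shared_ptr<AbstractNode> AbstractNodePtr;

class ScriptCompiler
{
public:
    virtual ~ScriptCompiler() {}

    bool compile(AbstractNodePtr ast)
    {
        processImports(ast);     // 1) splice imported scripts into the tree
        processVariables(ast);   // 2) expand $variables into real nodes
        return compileImpl(ast); // 3) resource-specific work in the subclass
    }

protected:
    virtual bool compileImpl(AbstractNodePtr ast) = 0; // e.g. a material compiler

private:
    void processImports(AbstractNodePtr ast) { /* shared pre-pass */ }
    void processVariables(AbstractNodePtr ast) { /* shared pre-pass */ }
};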
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

I'm using ANTLR in my project. A good pick ;)

ANTLR has a higher customisation level than I originally thought, so don't underestimate its capabilities! I'm quite sure you can handle all the errors from the grammar file directly instead of tweaking the generated code ;)

About your question... I haven't played enough with ANTLR, but I guess you could invoke the lexer/parser recursively from inside the parser, and thus parse the base script and all its imports. Regarding variables, I still vote for some kind of symbol table, doing the substitution at parse time (adding the key-value pair when declared, reading the value when used).

What do you think?
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

Once the AST is generated, the further processing is up to me. The import thing can be handled fairly easily. The requested script is loaded (I'll provide a listener for "hooking" this, if you want to customize loading) and fed into the parser to generate another AST. This new AST is then substituted for the original script's import tree node. Further processing continues, oblivious to the fact that the new elements never actually existed in the original script.
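
In sketch form (all names here are hypothetical stand-ins):

Code: Select all

#include <memory>
#include <string>

struct AbstractNode { /* AST node, details omitted */ };
typedef std::shared_ptr<AbstractNode> AbstractNodePtr;

std::string loadScript(const std::string& name);     // customizable via the listener
AbstractNodePtr parseScript(const std::string& src); // the normal parse entry point

// Resolve one import: parse the imported script and return its AST; the
// caller swaps it in for the import node, so later passes never know the
// nodes came from another file.
AbstractNodePtr resolveImport(const std::string& scriptName)
{
    return parseScript(loadScript(scriptName));
}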

For the variables: handling them at parse time seems rather difficult. My plan was to handle them between parsing and compiling. It would be one quick pass through the tree to process the variables. And yes, this pass would use a scoped symbol table to do it (I'm thinking something like a stack of symbol tables). The one thing that's tricky, and nice, is that these variables aren't typed. They should be able to be set to any arbitrary Ogre script construct. So, upon replacement, the variable values will need to be expanded and transformed into ASTs themselves. Not worrying about type information, though, will make it simpler.
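
A stack-of-maps sketch of that idea (hypothetical names; values stay raw strings since the variables are untyped):

Code: Select all

#include <map>
#include <string>
#include <vector>

class ScopeStack
{
public:
    void push() { mScopes.push_back(Scope()); } // entering a block ({)
    void pop()  { mScopes.pop_back(); }         // leaving a block (})

    void set(const std::string& name, const std::string& value)
    {
        mScopes.back()[name] = value;
    }

    // Search from the innermost scope outwards; inner definitions shadow outer ones.
    bool lookup(const std::string& name, std::string& valueOut) const
    {
        for (std::size_t i = mScopes.size(); i-- > 0; )
        {
            Scope::const_iterator it = mScopes[i].find(name);
            if (it != mScopes[i].end()) { valueOut = it->second; return true; }
        }
        return false;
    }

private:
    typedef std::map<std::string, std::string> Scope;
    std::vector<Scope> mScopes;
};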
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

Praetor wrote:For the variables: handling them at parse time seems rather difficult. My plan was to handle them between parsing and compiling. It would be one quick pass through the tree to process the variables. And yes, this pass would use a scoped symbol table to do it (I'm thinking something like a stack of symbol tables). The one thing that's tricky, and nice, is that these variables aren't typed. They should be able to be set to any arbitrary Ogre script construct. So, upon replacement, the variable values will need to be expanded and transformed into ASTs themselves. Not worrying about type information, though, will make it simpler.
It's just the opposite: it's quite easy to handle at parse time, or at least it has been for me. Instead of retrieving scene nodes created earlier during the parse (my case), you would retrieve values from a table/map. Scoping is quite easy as well; you can create scope strings, and store the variables along with their scope string. Then, when retrieving a value, you take the variable whose scope string is the longest one that still fits the scope at the call point. For instance:

Code: Select all

material mat1 {
  technique {
    define red 1 0 0  // Stored as red@technique0@mat1 = 1 0 0
    pass {
      define red 0.8 0 0  // red@pass0@technique0@mat1 = 0.8 0 0
      diffuse $red  // uses red@pass0@technique0@mat1 instead of red@technique0@mat1
    }
    pass {
      diffuse $red  // uses red@technique0@mat1 as red@pass0@technique0@mat1 is out of scope (the current scope is red@pass1@technique0@mat1)
    }
  }
}
Not very clear, but I hope that explains my idea :P
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

It gets the idea across, but there are other issues. I'll have to mull it over a bit.

Anyway, I got somewhere on the ANTLR front. I generated a basic lexer. However, I quickly realized it is not stand-alone: it depends on ANTLR's C++ runtime. I was hoping not to bring in any extra dependencies with this project. If we are going to accept extra dependencies, I'd just as soon use Boost's Spirit instead of ANTLR. It generates the parser in-line (you define the grammar in C++, and template meta-programming creates the parser as it compiles), I'm familiar with it, like most of Boost it is a header-only library, and sinbad already plans on folding in some Boost dependency in the future.
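
For flavor, the canonical Spirit starter example, parsing a comma-separated list of real numbers (header and namespace as in the Boost 1.34-era "classic" Spirit; exact paths vary by version):

Code: Select all

#include <boost/spirit.hpp> // "classic" Spirit: header-only, no link step

using namespace boost::spirit;

// The grammar is written inline in C++: one real number, followed by any
// number of ",real" pairs, skipping whitespace between tokens.
bool parseNumberList(char const* str)
{
    return parse(str, real_p >> *(',' >> real_p), space_p).full;
}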

That's where I am now. I feel like I should probably work on my hand-crafted lexer and parser some more. I will need definitive answers about dependencies soon, though. Using Spirit could really accelerate this development.
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

Well, as a mentor I can't help but tell you to think carefully about this (I completely forgot about that dependency!).

Though you're allowed to add dependencies if that will accelerate development and make things cleaner, you know that almost everything possible will be ported to Boost so as to trim other dependencies. That would make this project a candidate for a rework just to remove one dependency, and that means rewriting code and wasting effort.

Though I like ANTLR probably above all the other parser generators, this is a delicate issue, and you must weigh the pros and cons in depth to find a good, satisfying answer.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

Precisely my point. I don't think ANTLR is a good way to go if only for that reason. It is a shame, but that is just the way things happen sometimes.

I've converted the prototype to see how quickly I can get a Spirit parser generating an AST for any arbitrary Ogre script (any old .material or .particle, etc.). Depending on the scope of the Boost integration into Ogre, Spirit might be a negligible dependency to add.

I'll also think more about handling variables during parsing. The main issue is that base elements will define variables which are then overridden by child elements. It is this inheritance behavior, coupled with the variables, that makes it a little more complex. Anyway, I'll do some more prototyping and find the best approach.
Praetor
OGRE Retired Team Member
Posts: 3335
Joined: Tue Jun 21, 2005 8:26 pm
Location: Rochester, New York, US

Post by Praetor »

@everyone I noticed an inconsistency. One of the goals of these new compilers is to allow for more flexible input. That means newlines and tabs wherever the author wants, with one exception: properties. Properties still need to be in the form

Code: Select all

name value1 ... valueN
with a newline at the end. There can be no newlines in the middle of this sequence. So, while formatting is free-form everywhere else, in this instance it isn't. Now, I can make my parser handle this of course, and I will, but I think that since we are using a C-like syntax it isn't out of the question to ask that these constructs end in a ';'.

Code: Select all

diffuse 0 1 1 1;
specular 1 1 1 1;
Now, it is unacceptable right now for me to *require* the semicolon, since I want all current scripts to compile under the new system. However, I propose that I allow the semicolons to be there. The old format without semicolons becomes deprecated but acceptable for one release cycle, and then the whole system moves to semicolons only. What do you think?

@kencho To continue our banter about symbols...
The issue that muddies symbol handling is inheritance. One material definition can inherit from another. The child material should then be able to override variables placed in the parent. I'm mulling over the mechanisms I'll use to compile the inheritance, but one thing is certain: it will have to happen after parsing is completed. To allow for the overwriting of parent variables I'll have to do variable processing after that, which means I can't do it during parsing. Does that make sense? The sketch below shows the problem case.
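
To illustrate (hypothetical variable syntax, since the final form isn't settled):

Code: Select all

material Base
{
    set $diffuse 1 1 1 1
    technique { pass { diffuse $diffuse } }
}

material Child : Base
{
    // Overrides the parent's value, but the parent's tree is only merged
    // in after parsing, so $diffuse can't be resolved at parse time.
    set $diffuse 0.5 0.5 0.5 1
}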
Kencho
OGRE Retired Moderator
Posts: 4011
Joined: Fri Sep 19, 2003 6:28 pm
Location: Burgos, Spain

Post by Kencho »

First, my opinion regarding the newlines... I might be wrong, but I think that's only a problem where the values are strings themselves (for instance, cube maps or multiframe animations). In all the other cases you have sentences in the form "keyword value(s)", so newlines shouldn't be a problem, as values are either numbers or variables starting with $ (plus the previously mentioned string values).

I think this is a very problematic issue, as you're breaking a lot of the script API by requiring extra characters at the end of lines. None of the existing resource tools would work, and you know how appreciated the few existing Ogre tools are ;) I think we should put a lot more thought into this issue given its impact and importance.

On to the symbols: you've got a good point with inheritance, though it would still be possible to do at parse time. As long as you import the ASTs the moment you find the import statement, you could attach the values to a structure other than a map, say a non-cyclic (inheritance) graph.
That's my guess though, and I still have faith in parse time ;) Theoretically, if you require variables to be declared before being used, you can do it at parse time, am I wrong?

PS: This is a very nice and instructive discussion :) Thanks!