Let's talk about the lexer! Yeah!
I'll start with the difference between this lexer and a tokenizer, such as the C library's strtok (declared in <cstring>) or Boost's tokenizer. Skip to the next section if you already know this or aren't interested. A tokenizer splits input into a series of substrings based on separators. Tokenizers are often generic in that they accept their separators as an argument. A lexer does a little more than that. It splits the input into a series of substrings called lexemes, and each lexeme is assigned a value, which turns it into a token. Some lexers are generic too: lexer generators such as lex and flex build a lexer from a user-supplied set of regular-expression rules.
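To make the tokenizer half of that comparison concrete, here is a minimal sketch of a generic tokenizer that takes its separators as an argument. The function name tokenize is made up for illustration; it is not strtok, Boost's tokenizer, or anything in Ogre.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `input` on any character found in `separators`.
// Runs of separators are collapsed, so no empty substrings are produced.
std::vector<std::string> tokenize(const std::string& input,
                                  const std::string& separators) {
    std::vector<std::string> pieces;
    std::size_t start = input.find_first_not_of(separators);
    while (start != std::string::npos) {
        // `end` is npos when the last piece runs to the end of the input;
        // substr clamps the length in that case.
        std::size_t end = input.find_first_of(separators, start);
        pieces.push_back(input.substr(start, end - start));
        start = input.find_first_not_of(separators, end);
    }
    return pieces;
}
```

Note that the result is just a list of strings. Nothing says what any piece means, which is the part a lexer adds.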
This lexer will not be as generic. It is designed for Ogre's scripting language alone. Dropping this genericity gives us some room for optimizations. The lexer will split the input into lexemes and assign values to them. Individual compilers sitting on top of the lexer can map values to specific lexemes (for instance, the material compiler can assign a special value to the lexeme "material"). Why assign values? The short answer is that the string comparisons happen only once. Once a value is assigned, it can be used from then on to compare tokens more clearly and more quickly than comparing strings.
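A rough sketch of the value-assignment idea, assuming a simple lookup table that a compiler fills in before lexing. The names IdTable, Token, registerWord, classify and the ID constants are all hypothetical and are not the lexer's actual interface.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical IDs a material compiler might register.
enum TokenId { ID_UNKNOWN = 0, ID_MATERIAL = 1, ID_TECHNIQUE = 2 };

// A token is a lexeme plus the value assigned to it.
struct Token {
    std::string lexeme;
    int id;
};

// Hypothetical table mapping lexemes to compiler-chosen values.
class IdTable {
public:
    void registerWord(const std::string& lexeme, int id) {
        ids_[lexeme] = id;
    }

    // The only place a string comparison (here, a hash lookup) happens.
    Token classify(const std::string& lexeme) const {
        int id = ID_UNKNOWN;
        auto it = ids_.find(lexeme);
        if (it != ids_.end()) {
            id = it->second;
        }
        return Token{lexeme, id};
    }

private:
    std::unordered_map<std::string, int> ids_;
};
```

After classify has run, later stages can write something like token.id == ID_MATERIAL, an integer comparison, instead of comparing against the string "material" again.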
The lexer itself is a simple state machine. This is what I've worked up while thinking through the possibilities, trying to find holes, etc.
Originally, I had the lexer storing some extra state. As it is now, only a few pieces of data need to be stored, for use in error reporting. The state machine doesn't need to remember old input data. There is one thing to clarify: there are three paths to parsing a lexeme: InToken, InSingleQuote, and InDoubleQuote. The quoted paths are there to allow more flexible input in Ogre's scripting system, for instance names with spaces (e.g., material "This Name has Spaces"{...}). The quoted paths create tokens without their quotes included (i.e., the above is returned as: This Name has Spaces). I've been wondering whether it is necessary to accept quoted strings with embedded quotes. The embedded quotes would need to be escaped C-style with a backslash ("\'"). This can be achieved in the above state machine by adding two extra states, one in each of the two quoted paths.
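Here is a stripped-down sketch of the three paths, just to show the shape. Only the state names InToken, InSingleQuote and InDoubleQuote come from the description above. The Ready state, the function name lexSketch, and the rule that braces are single-character lexemes are assumptions made for this example, and escaped quotes are not handled.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Ready is an assumed idle state; the other three are the paths named above.
enum class State { Ready, InToken, InSingleQuote, InDoubleQuote };

std::vector<std::string> lexSketch(const std::string& input) {
    std::vector<std::string> lexemes;
    std::string current;
    State state = State::Ready;

    // Push the accumulated lexeme and return to the idle state.
    auto emit = [&]() {
        lexemes.push_back(current);
        current.clear();
        state = State::Ready;
    };

    for (char c : input) {
        bool space = std::isspace(static_cast<unsigned char>(c)) != 0;
        bool brace = (c == '{' || c == '}');  // assumed single-char lexemes
        switch (state) {
        case State::Ready:
            if (c == '\'') state = State::InSingleQuote;
            else if (c == '"') state = State::InDoubleQuote;
            else if (brace) lexemes.push_back(std::string(1, c));
            else if (!space) { current.push_back(c); state = State::InToken; }
            break;
        case State::InToken:
            if (space) emit();
            else if (brace) { emit(); lexemes.push_back(std::string(1, c)); }
            else current.push_back(c);
            break;
        case State::InSingleQuote:
            // The closing quote ends the lexeme; the quotes are not stored.
            if (c == '\'') emit(); else current.push_back(c);
            break;
        case State::InDoubleQuote:
            if (c == '"') emit(); else current.push_back(c);
            break;
        }
    }

    if (state == State::InToken) {
        emit();
    } else if (state != State::Ready) {
        throw std::runtime_error("unterminated quoted string");
    }
    return lexemes;
}
```

The loop only ever looks at the current character and the current state, so nothing from earlier input has to be kept around. Supporting escaped quotes would mean one extra state per quoted path, entered on a backslash, that appends the next character unconditionally and returns to the quoted state.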
I'm going to be checking in the first version of the lexer soon, and hosting the test files which I'll link to. I just want to make sure the files are in a clean format first.
Questions, comments, suggestions?