Anatomy of an interpreter: the Parser

Posts in this series: Lexer, Parser, Evaluator

Subs has reached version 1.3.4, which means that it can successfully run all the tests from chapter 1 of SICP. This is very exciting.

Last time I explained a bit about the Lexer, which takes in a stream of characters and emits a stream of tokens: individual elements of code such as a bracket, a keyword or a symbol.

Generally, parsers emit some kind of tree structure – they understand the raw tokens as a hierarchical structure which (conceptually, at least) will be executed from the bottom up, with each branch-point in the tree being an operation of some kind.

Our parser takes in a stream of tokens, and emits a stream of parsed trees.

Parsing Scheme is very easy, because (except for a couple of exceptions I haven’t implemented yet) there is essentially one rule: start with an open bracket, see a list of things, and then find a close bracket. Of course, one of the “things” you see may itself be another bracketted list, so after parsing you get a tree structure of nested lists.

The parser in Subs looks like this:

class Parser
{
public:
    Parser( ILexer& lexer );
    std::auto_ptr<Value> NextValue();
private:
    ILexer& lexer_;
};

We supply a Lexer in the constructor, which we know will provide us with tokens when we need them via its NextToken() method. The Parser’s NextValue() method returns a pointer to a Value, which is the base class for all the “things” in the Subs interpreter.

There are lots of types of things that inherit from the Value class, but the “parse tree” (the output of the parser) will only consist of a very small subset of them:

  • CombinationValue
  • DecimalValue
  • IntegerValue
  • StringValue
  • SymbolValue

The CombinationValue class forms the tree structure. Its declaration looks like this:

class CombinationValue : public Value, public std::vector<Value*>
{
    // ...
};

It is simply a list of other Values.

Note that it “owns” those Values in the sense that it deletes them when it is deleted. I have recently made the jump to make Subs depend on BOOST, so it’s on my TODO list to make containers like this use the BOOST smart containers to manage this job for me.

DecimalValue, IntegerValue and StringValue are relatively self-explanatory: they contain numbers and strings that were found as literals in the source code.

SymbolValue is essentially everything else – if the code that recognises the type of a token can’t recognise it as a bracket, a number or a string, we assume it is a symbol, and tuck it away in a SymbolValue to be understood later.

The core of the Parser looks like this (with some error-checking removed):

std::auto_ptr<Value> next_value( ILexer& lexer, Token token )
{
    if( token.Name() == "(" )
    {
        auto_ptr<CombinationValue> ret( new CombinationValue );
        while( true )
        {
            token = lexer.NextToken();
            if( token.Name() == ")" )
            {
                break;
            }
            // Recursive call
            ret->push_back( next_value( lexer, token ).release() );
        }
        return auto_ptr<Value>( ret.release() );
    }
    else
    {
        return ValueFactory::CreateValue( token );
    }
}

(Full code here: Parser.cpp) It’s a simple recursive function that creates a CombinationValue whenever it finds a bracket, and otherwise uses a ValueFactory to create an individual value.

Side note: the wisdom of using recursion could certainly be questioned, since it limits the depth of bracketting we can handle to the size of the C++ stack, but the only other way to get the same result would be to keep our own manual stack of unfinished combinations, and it just seems perverse to re-implement language features like that. What might well be more interesting would be to consider whether we can actually evaluate parts of the tree as we go, without parsing it all at once. This might make the whole setup scale rather better, but would most likely be quite complex. The implementation presented here will work fine for almost any imaginable program – remember we would need not just code whose execution is deeply nested, but whose expression in code had thousands of levels of nesting before the parser would fail.

The ValueFactory uses some basic rules such as “starts and ends with a quote” or “consists of only numbers and a decimal point” to recognise what type of Value to create, and if no rules match it defaults to a SymbolValue.

When we have completed a bracketted expression, we return a complete tree of numbers, strings and symbols, and it is ready to be evaluated, which you can think of as simply expanding the tree we already have into the full expression of itself, and then reducing it back down again to an answer.

Next time, the Evaluator and the famous eval-apply loop.

Anatomy of an interpreter: the Lexer

Posts in this series: Lexer, Parser, Evaluator

I have been having a lot of fun recently writing my Scheme interpreter Subs. I have never implemented a full programming language before, so I am learning fast (mostly through mistakes) and wanted to write down some of the stuff I am discovering.

Note: if you want to learn more about what Scheme is I recommend Scheme (Wikipedia) and the book SICP, which is the inspiration for all this.

I am writing everything from scratch, just because it’s fun (certainly not because I think it is in any way better to do it that way…). As we will see, that gives me opportunities to do things in different ways from the normal way such things are done. So far, every time I find out I’ve deviated from the normal way I’ve quickly discovered why I am wrong, and had to learn the true path.

Text-based programming languages, whether interpreted or compiled, need a lexer. A lexer takes in characters and spits out “tokens”, which are groups of characters that represent a single thing, such as a bracket, a variable name or a number. (Those tokens are then passed on to the parser, which I will cover in a different post.)

Scheme (and other Lisp variants) are fairly easy to lex because they don’t have much syntax – you just need to be able to understand round brackets, numbers and strings, and a couple of special cases that I won’t go into because I haven’t actually implemented them yet. (Mind you, I haven’t implemented strings yet either…)

When I started Subs I took my normal approach of doing whatever I wanted without any research or even much thought, and wrote something that I called a lexer, but which was really something else. It took in a stream of characters, read it one “word” at a time (using whitespace as separators), broke up the word if it contained bracket characters, and emitted a tree structure with each branch representing a bracketted list. It just seemed sensible, while I was watching the brackets flow by, to understand them and create a tree structure.

However, for a number of reasons, that approach turned out to be wrong.

First, reading a “word” at a time made things much harder than simply stepping through each character. It made my initial implementation slightly faster, but as soon as I realised I cared about white space (e.g. keeping track of what line we are on) it had to go. When it went it also turned out to be easier to deal with unusual code layout – for example “a(b” should be lexed as 3 tokens, but would be handed to us as a single word.

Second, and more importantly, creating a tree structure at this point was a waste of time. Creating tree structures is normally the job of a parser, and mixing these responsibilities gave me some pointless inefficiency: the lexer emitted a tree of tokens, which the parser then translated into another tree (of fully-understood code objects). It turned out that walking the tree of tokens and copying that structure in the parser was at least as hard as just taking in a flat stream of tokens and constructing the tree just once.

So, I re-wrote Lexer into something that is starting to become worthy of the name. The most interesting parts of its signature look like this:

class Lexer
{
public:
    Lexer( std::istream& instream );
    Token NextToken();
};

It takes in a reference to a stream, which will provide the characters, and when NextToken is called, it reads enough characters to determine what the next token will be, and returns it.

Side note: Subs is written using Test-Driven Development. I re-implemented the Lexer and Parser from scratch (naming the new classes NewLexer and NewParser until they were ready), modified the code that used them to use the new interfaces, ran the tests, and immediately knew that the new classes worked as well as the old ones. That level of confidence is incredibly freeing. I can’t imagine how I would ever have convinced myself the new classes were ready had I not had that safety net of 100s of tests that ensure the interpreter correctly responds to each type of input.

Currently the Token class it returns is pretty much just a string, with some information attached about where that string was in the original text. In researching this article I realised that most lexers attach more information than that to their tokens – they understand its basic type such as integer, decimal, string, bracket or symbol. At the moment in Subs, this work is done in the parser (so for the lexer each token is just a string), but I can see why it is helpful to do this work in the lexer, because for most types we have the information anyway. For example, in order to recognise that ‘”foo (bar”‘ (where the double-quotes are in the real code) is a single token we must understand that it is a string. Since we know it at this stage, we might as well record it in the Token object so we don’t have to work it out again later. When Subs supports strings, I will probably add a “type” field to the token and move this work from the parser.

On a more general programming point, following on from comments I made in a previous post, it is worth noting that the structure of the lexer (and, as we will see later, the parser) uses a technique called “streams”. What this means is that we write functions like NextToken that process a small part of the total problem, and return their answer. If we chain together functions like that (for example the parser’s equivalent function calls NextToken whenever it needs a new token) we can process arbitrarily large input without using lots of memory or slowing down. The lexer is able to process any number of tokens using a very small amount of memory, and will only fall over if it encounters a single token that is ridiculously large.

The stream style can be very useful for some problems, not only because it can be efficient with memory, but also because it can help break problems into neat, small pieces that are easier to implement correctly. It is also useful for writing code in a functional style, because it allows us to avoid having any internal state (e.g. we could easily implement NextToken to take in the stream it should read from and avoid any member variables at all in Lexer), by pushing state out into the input and output, instead of having it in our program. In this case that means instead of reading, storing and then processing a program, our lexer can simply process a few characters and emit a token without knowing anything about the surrounding code or wider context. This makes it much easier to test, and (potentially) easier to do clever things with, like prove mathematically that it is correct (!)

Next time, the parser.