Anatomy of an interpreter: the Parser

Posts in this series: Lexer, Parser, Evaluator

Subs has reached version 1.3.4, which means that it can successfully run all the tests from chapter 1 of SICP. This is very exciting.

Last time I explained a bit about the Lexer, which takes in a stream of characters and emits a stream of tokens: individual elements of code such as a bracket, a keyword or a symbol.

Generally, parsers emit some kind of tree structure – they understand the raw tokens as a hierarchical structure which (conceptually, at least) will be executed from the bottom up, with each branch-point in the tree being an operation of some kind.

Our parser takes in a stream of tokens, and emits a stream of parsed trees.

Parsing Scheme is very easy because (apart from a couple of exceptions I haven’t implemented yet) there is essentially one rule: start with an open bracket, see a list of things, and then find a close bracket. Of course, one of the “things” you see may itself be another bracketed list, so after parsing you get a tree structure of nested lists.
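For example, in

(define square (lambda (x) (* x x)))

the outer list contains the symbol define, the symbol square, and a nested list, which itself contains further nested lists.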

The parser in Subs looks like this:

class Parser
{
public:
    Parser( ILexer& lexer );
    std::auto_ptr<Value> NextValue();
private:
    ILexer& lexer_;
};

We supply a Lexer in the constructor, which we know will provide us with tokens when we need them via its NextToken() method. The Parser’s NextValue() method returns a pointer to a Value, which is the base class for all the “things” in the Subs interpreter.
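Driving it might look something like this (a sketch: the concrete Lexer class and how it is constructed are assumptions, based on the previous post):

Lexer lexer( input );   // assumed: a concrete ILexer over some input
Parser parser( lexer );

// Pull one complete parse tree out of the token stream.
std::auto_ptr<Value> tree = parser.NextValue();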

There are lots of types of things that inherit from the Value class, but the “parse tree” (the output of the parser) will only consist of a very small subset of them:

  • CombinationValue
  • DecimalValue
  • IntegerValue
  • StringValue
  • SymbolValue
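So, for example, the expression (+ 1 2.5 "hello" x) would parse into a CombinationValue containing a SymbolValue for +, an IntegerValue for 1, a DecimalValue for 2.5, a StringValue for "hello" and another SymbolValue for x.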

The CombinationValue class forms the tree structure. Its declaration looks like this:

class CombinationValue : public Value, public std::vector<Value*>
{
    // ...
};

It is simply a list of other Values.

Note that it “owns” those Values in the sense that it deletes them when it is deleted. I have recently taken the plunge and made Subs depend on BOOST, so it’s on my TODO list to make containers like this use the BOOST smart containers to manage this job for me.
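A minimal sketch of what that might look like, using boost::ptr_vector from the Boost Pointer Container library (this is speculative, not what Subs currently does):

#include <boost/ptr_container/ptr_vector.hpp>

class CombinationValue : public Value, public boost::ptr_vector<Value>
{
    // push_back( new SomeValue ) hands ownership to the container,
    // which deletes all of its elements in its own destructor.
};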

DecimalValue, IntegerValue and StringValue are relatively self-explanatory: they contain numbers and strings that were found as literals in the source code.

SymbolValue is essentially everything else – if the code that recognises the type of a token can’t recognise it as a bracket, a number or a string, we assume it is a symbol, and tuck it away in a SymbolValue to be understood later.

The core of the Parser looks like this (with some error-checking removed):

std::auto_ptr<Value> next_value( ILexer& lexer, Token token )
{
    if( token.Name() == "(" )
    {
        std::auto_ptr<CombinationValue> ret( new CombinationValue );
        while( true )
        {
            token = lexer.NextToken();
            if( token.Name() == ")" )
            {
                break;
            }
            // Recursive call
            ret->push_back( next_value( lexer, token ).release() );
        }
        return std::auto_ptr<Value>( ret.release() );
    }
    else
    {
        return ValueFactory::CreateValue( token );
    }
}

(Full code here: Parser.cpp) It’s a simple recursive function that creates a CombinationValue whenever it finds an open bracket, and otherwise uses a ValueFactory to create an individual value.
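NextValue() itself is then presumably little more than a wrapper that kicks off the recursion; something like this sketch, ignoring end-of-input handling:

std::auto_ptr<Value> Parser::NextValue()
{
    return next_value( lexer_, lexer_.NextToken() );
}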

Side note: the wisdom of using recursion could certainly be questioned, since it limits the depth of bracketing we can handle to the size of the C++ stack. The only other way to get the same result, though, would be to keep our own manual stack of unfinished combinations, and it just seems perverse to re-implement language features like that. What might be more interesting would be to consider whether we can actually evaluate parts of the tree as we go, without parsing it all at once. This might make the whole setup scale rather better, but would most likely be quite complex. The implementation presented here will work fine for almost any imaginable program: remember, the parser would only fail on code whose textual form had thousands of levels of nesting, not merely code whose execution is deeply nested.

The ValueFactory uses some basic rules such as “starts and ends with a quote” or “consists of only numbers and a decimal point” to recognise what type of Value to create, and if no rules match it defaults to a SymbolValue.
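A sketch of the kind of checks involved (the Value constructor signatures here are assumptions; the real rules live in the Subs source):

#include <cstdlib>
#include <memory>
#include <string>

std::auto_ptr<Value> ValueFactory::CreateValue( const Token& token )
{
    const std::string& name = token.Name();

    // Starts and ends with a quote: a string literal.
    if( name.size() >= 2 && name[0] == '"' && name[name.size() - 1] == '"' )
    {
        return std::auto_ptr<Value>(
            new StringValue( name.substr( 1, name.size() - 2 ) ) );
    }

    // Only digits: an integer literal.  (Signs are ignored in this sketch.)
    if( name.find_first_not_of( "0123456789" ) == std::string::npos )
    {
        return std::auto_ptr<Value>(
            new IntegerValue( std::atoi( name.c_str() ) ) );
    }

    // Digits and a decimal point: a decimal literal.
    if( name.find_first_not_of( "0123456789." ) == std::string::npos )
    {
        return std::auto_ptr<Value>(
            new DecimalValue( std::atof( name.c_str() ) ) );
    }

    // Anything else: a symbol, to be understood later.
    return std::auto_ptr<Value>( new SymbolValue( name ) );
}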

When we have completed a bracketed expression, we return a complete tree of numbers, strings and symbols, ready to be evaluated. You can think of evaluation as expanding the tree we already have into the full expression of itself, and then reducing it back down to an answer.

Next time, the Evaluator and the famous eval-apply loop.

Scalable Graph Coverage

If you’re interested in dealing with large directed graphs of dependent objects and want some tips on how to process them in a way that scales in terms of memory usage, you may be interested in the article I wrote for Overload, which is a journal of the ACCU (an organisation promoting “professionalism in programming”).

The article is Scalable Graph Coverage, or, in my original title, “Comparing scalable algorithms for walking all nodes of a dependency graph”.

It contains lots of code, written in C++ using the BOOST Graph Library. The code demonstrates some of the algorithms that are available for choosing batches of nodes to be processed together, reducing the number of nodes that have to be loaded more than once without running out of memory.

NNDB 0.1

I’ve managed to get NNDB, my C++ data storage library which is almost, but not quite, entirely unlike SQL, into a fit state for a release.

You can create tables, set indices on columns, insert data, retrieve data using something like a SELECT, filter it using something like WHERE (which uses indices where available), and order it using something like an ORDER BY.
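In code, a query has roughly this flavour (a purely hypothetical sketch: these method and variable names are invented for illustration and are not NNDB’s real API):

// Hypothetical sketch only -- something like:
//   SELECT * FROM person WHERE last_name = 'Balaam' ORDER BY first_name
Rows rows = person_table.Select()
    .Where( last_name_column, std::string( "Balaam" ) )  // uses an index if set
    .OrderBy( first_name_column );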

So far it has been a fantastic way to get my hands dirty with some template metaprogramming, and some C++ as it should be*, but the reason I started this was to help me think about how databases work, so I’m really looking forward to getting into how to implement JOINs. At the moment I have only very vague ideas.

NNDB is based heavily on the STL (part of C++’s standard library), BOOST (a playground for things that might one day be in C++’s standard library, and a hang-out for some of the cleverest people alive), and Loki, the continuation of Andrei Alexandrescu’s template metaprogramming library (used for good, not evil?), written for and explained in Modern C++ Design, a book that ranks in the top 5 most exciting books I have read. I continue to be more impressed by all three the more I learn.

I have even been having discussions with the Loki devs about some code I needed for NNDB that I think might be helpful for other people using Loki. It’s called ForEachType and it allows you to loop (at runtime) through all the types in a Typelist and do something for each one.
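To give a feel for the idea, here is a tiny self-contained imitation of it (illustrative only: this is not the code under discussion, and it uses its own typelist rather than Loki’s):

#include <iostream>

struct NullType {};

template<class H, class T>
struct Typelist { typedef H Head; typedef T Tail; };

// Call f's templated operator() once for each type in the list.
template<class TList>
struct ForEachType
{
    template<class F>
    static void Run( F& f )
    {
        f.template operator()<typename TList::Head>();
        ForEachType<typename TList::Tail>::Run( f );
    }
};

template<>
struct ForEachType<NullType>
{
    template<class F>
    static void Run( F& ) {}
};

struct PrintSize
{
    template<class T>
    void operator()() { std::cout << sizeof( T ) << std::endl; }
};

int main()
{
    typedef Typelist<char, Typelist<double, NullType> > Types;
    PrintSize print_size;
    ForEachType<Types>::Run( print_size );  // prints the size of each type
    return 0;
}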

The project is already working in terms of helping me think about databases. For example, I really hadn’t thought before about how expensive ORDER BY is. To implement it I needed to create a temporary std::map covering the entire result set: in a real database this obviously means reading every single row before we can even begin to return any results, as the sketch below illustrates. The way to avoid this is to have an index, which reminds me: the next thing I need to do is make ORDER BY able to use indices (at the moment only WHEREs take advantage of them).
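A stripped-down illustration of the problem (not NNDB’s real code; a std::multimap here, since a sort key like a surname can repeat):

#include <map>
#include <string>
#include <utility>
#include <vector>

struct Row
{
    unsigned long id;
    std::string last_name;
};

// ORDER BY last_name with no index: every row must be read and inserted
// into the sorted container before the first result can be returned.
std::vector<Row> order_by_last_name( const std::vector<Row>& table )
{
    std::multimap<std::string, Row> sorted;
    for( std::vector<Row>::const_iterator it = table.begin();
         it != table.end(); ++it )
    {
        sorted.insert( std::make_pair( it->last_name, *it ) );
    }

    std::vector<Row> result;
    for( std::multimap<std::string, Row>::const_iterator it = sorted.begin();
         it != sorted.end(); ++it )
    {
        result.push_back( it->second );
    }
    return result;
}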

So next on my list are:

  • ORDER BY uses indices
  • Non-unique indices (presumably implemented with a std::multimap)
  • Joins

I am still very excited, so you may see more releases over the next few months.

[* NNDB so far contains zero (0) calls to new and zero (0) calls to delete. Obviously the code it uses (e.g. std::vector) calls them, but that code manages all the memory for me, and most of it uses custom allocators to make it very fast. I have no idea how fast NNDB is, but maybe it could be quite fast. I am pretty confident it doesn’t contain any memory errors. Famous last words…]

NNDB’s Not a Database

My latest project is called NNDB.

I’ve worked with databases for quite a long time now, and for a while I’ve been thinking about how they work under the hood. I know very little about it, but I thought I could learn a bit by trying to implement something similar myself.

I’m interested in how queries work against joined tables, how to implement indices and so on.

I’ve also been feeling that I want to do some C++ as an open source project. I do it all day at work, and for some problems it feels like the right tool for the job.

NNDB is sort-of like an in-memory database, but it works with C++ types for its columns, instead of a fixed set like varchar, int etc. You can put your own value-typed classes in the columns, and all values are type-checked at compile time.

It’s always struck me as strange that with a traditional code+SQL setup you have to keep your SQL in sync with your code manually. Of course, there are lots of trendy Object-Relational-Mapping thingies that solve that problem, but I felt it could be approached from another direction: instead of generating code to match your data, or generating SQL to match your code, why not specify your data structure in code?

In NNDB you define a table something like this:

typedef nndb::Values< unsigned long, std::string, std::string, MyDate >
    PersonValues;

class PersonTable : public nndb::Table
{
public:
    enum Columns
    {
        id,
        first_name,
        last_name,
        date_of_birth
    };
};

Actually, defining your own class is unnecessary, but it’s nice to have an enum to name your columns, and making a class gives you a nice place to put it.

To insert a row you do something like this:

PersonTable person_table;
person_table.Insert( PersonValues( 0,
    "Andy", "Balaam", MyDate( 12000000 ) ) );

You can do simple queries with WHERE and ORDER BY clauses, and I’m working on indices.

After that will come JOINs, and anything else that takes my fancy.

I don’t anticipate NNDB being useful to anyone – it’s really for me to understand why things are as they are in the world of databases. However, you never know – it may turn out to be a fast and convenient way to store data in the C++ world. I think some of the applications that use databases don’t actually need the concurrent, multi-user, network-accessible features they offer, but really just want to search, join and store reliably, and NNDB might one day grow into something that can find a niche there.

To explore more, check out the complete example.