Archive for the ‘C++’ Category

Anatomy of an interpreter: the Lexer

Wednesday, September 1st, 2010

I have been having a lot of fun recently writing my Scheme interpreter Subs. I have never implemented a full programming language before, so I am learning fast (mostly through mistakes) and wanted to write down some of the stuff I am discovering.

Note: if you want to learn more about what Scheme is I recommend Scheme (Wikipedia) and the book SICP, which is the inspiration for all this.

I am writing everything from scratch, just because it’s fun (certainly not because I think it is in any way better to do it that way…). As we will see, that gives me opportunities to do things in different ways from the normal way such things are done. So far, every time I find out I’ve deviated from the normal way I’ve quickly discovered why I am wrong, and had to learn the true path.

Text-based programming languages, whether interpreted or compiled, need a lexer. A lexer takes in characters and spits out “tokens”, which are groups of characters that represent a single thing, such as a bracket, a variable name or a number. (Those tokens are then passed on to the parser, which I will cover in a different post.)

Scheme (and other Lisp variants) are fairly easy to lex because they don’t have much syntax – you just need to be able to understand round brackets, numbers and strings, and a couple of special cases that I won’t go into because I haven’t actually implemented them yet. (Mind you, I haven’t implemented strings yet either…)

When I started Subs I took my normal approach of doing whatever I wanted without any research or even much thought, and wrote something that I called a lexer, but which was really something else. It took in a stream of characters, read it one “word” at a time (using whitespace as separators), broke up the word if it contained bracket characters, and emitted a tree structure with each branch representing a bracketted list. It just seemed sensible, while I was watching the brackets flow by, to understand them and create a tree structure.

However, for a number of reasons, that approach turned out to be wrong.

First, reading a “word” at a time made things much harder than simply stepping through each character. It made my initial implementation slightly faster, but as soon as I realised I cared about white space (e.g. keeping track of what line we are on) it had to go. When it went it also turned out to be easier to deal with unusual code layout – for example “a(b” should be lexed as 3 tokens, but would be handed to us as a single word.

Second, and more importantly, creating a tree structure at this point was a waste of time. Creating tree structures is normally the job of a parser, and mixing these responsibilities gave me some pointless inefficiency: the lexer emitted a tree of tokens, which the parser then translated into another tree (of fully-understood code objects). It turned out that walking the tree of tokens and copying that structure in the parser was at least as hard as just taking in a flat stream of tokens and constructing the tree just once.

So, I re-wrote Lexer into something that is starting to become worthy of the name. The most interesting parts of its signature look like this:

class Lexer
{
public:
    Lexer( std::istream& instream );
    Token NextToken();
};

It takes in a reference to a stream, which will provide the characters, and when NextToken is called, it reads enough characters to determine what the next token will be, and returns it.

Side note: Subs is written using Test-Driven Development. I re-implemented the Lexer and Parser from scratch (naming the new classes NewLexer and NewParser until they were ready), modified the code that used them to use the new interfaces, ran the tests, and immediately knew that the new classes worked as well as the old ones. That level of confidence is incredibly freeing. I can’t imagine how I would ever have convinced myself the new classes were ready had I not had that safety net of 100s of tests that ensure the interpreter correctly responds to each type of input.

Currently the Token class it returns is pretty much just a string, with some information attached about where that string was in the original text. In researching this article I realised that most lexers attach more information than that to their tokens – they understand its basic type such as integer, decimal, string, bracket or symbol. At the moment in Subs, this work is done in the parser (so for the lexer each token is just a string), but I can see why it is helpful to do this work in the lexer, because for most types we have the information anyway. For example, in order to recognise that ‘”foo (bar”‘ (where the double-quotes are in the real code) is a single token we must understand that it is a string. Since we know it at this stage, we might as well record it in the Token object so we don’t have to work it out again later. When Subs supports strings, I will probably add a “type” field to the token and move this work from the parser.

On a more general programming point, following on from comments I made in a previous post, it is worth noting that the structure of the lexer (and, as we will see later, the parser) uses a technique called “streams”. What this means is that we write functions like NextToken that process a small part of the total problem, and return their answer. If we chain together functions like that (for example the parser’s equivalent function calls NextToken whenever it needs a new token) we can process arbitrarily large input without using lots of memory or slowing down. The lexer is able to process any number of tokens using a very small amount of memory, and will only fall over if it encounters a single token that is ridiculously large.

The stream style can be very useful for some problems, not only because it can be efficient with memory, but also because it can help break problems into neat, small pieces that are easier to implement correctly. It is also useful for writing code in a functional style, because it allows us to avoid having any internal state (e.g. we could easily implement NextToken to take in the stream it should read from and avoid any member variables at all in Lexer), by pushing state out into the input and output, instead of having it in our program. In this case that means instead of reading, storing and then processing a program, our lexer can simply process a few characters and emit a token without knowing anything about the surrounding code or wider context. This makes it much easier to test, and (potentially) easier to do clever things with, like prove mathematically that it is correct (!)

Next time, the parser.

Subs Scheme Lisp Interpreter

Wednesday, August 18th, 2010

Why would you write a Lisp interpreter?

I find that question difficult to answer, but the joy of open source is that I don’t have to answer it.

Subs is a Scheme interpreter written in C++, and growing out of the excitement I have felt while reading Structure and Interpretation of Computer Programs.

Subs is very incomplete, and will probably never be otherwise, but it is exciting how quickly you can write a functional Lisp interpreter. I plan to go through SICP section by section and ensure all the examples work in Subs. So far I am failing on the last example in section 1.1.1 (because I don’t support new-lines inside statements yet!) but in reality most of the work is done to support quite a lot of stuff (NOT including mutable variables…yuck).

One possible explanation, if such a thing were necessary, is that one day I want to write a new programming language that has all the simplicity and metaprogramming capability of Lisp, with the native performance and deployability of C++, and the syntactic elegance of Python. I’ve got a lot to learn first.

Scalable Graph Coverage

Monday, June 21st, 2010

If you’re interested in dealing with large directed graphs of dependent objects and want some tips on how to process them in a way that scales in terms of memory usage, you may be interested in the article I wrote for Overload, which is a journal of the ACCU (an organisation promoting “professionalism in programming”).

The article is Scalable Graph Coverage, or, in my original title, “Comparing scalable algorithms for walking all nodes of a dependency graph”.

It contains lots of code, written in C++, using BOOST Graph Library. The code demonstrates some of the algorithms that are available for choosing batches of nodes to be processed together to reduce the number of nodes that are loaded several times, and without running out of memory.

Don’t design for performance until it’s too late

Friday, November 20th, 2009

There is a piece of ancient wisdom which states:

Premature optimisation is the root of all evil

This ancient wisdom is, like all ancient wisdom, correct.

However.

It appears to have been reinterpreted as essentially meaning:

Don’t design for performance until it’s too late

which is clearly, and very importantly, very wrong.

Performance is a feature

Before I begin I want us all to agree that performance is a feature.

I work on a real-life "enterprise" application. Its features are entirely driven by the need for immediate cash, not by developers following pipe dreams. And yet, for the last 6-12 months the majority of my time has been spent trying to retrofit performance into this application. Believe me, this is not because we have users who are obsessive about wasting valuable seconds – it’s because our performance sucks so hard it’s deeply embarrassing.

What is your favourite program? How well does it perform? What is your least favourite? Why?

For me, and many other people, their answers to those questions demonstrate the importance of performance. Firefox was launched to improve the performance of Mozilla. People love git because of how fast it is. Lotus Notes is hated so much partly because of its performance. My main complaints about programs I use involve performance (e.g. Thunderbird is too slow for IMAP email.)

A fast response to the user is one of those crucial inches on the journey to software that makes people happy. Making people happy gives you the kind of scary fanboyism that surrounds git. Wouldn’t you like that for your product?

What is optimisation?

When my hero said that premature optimisation was the root of all evil, he was talking in the days when you had to hand-optimise your C in assembly language. Or, more likely in his case, you had to hand-optimise your assembly language into faster assembly language. Optimisation like that very often obfuscates your code.

These days, 99% of the time, your compiler does all of this work for you, so you can have relatively comprehensible code in whatever trendy language you like, and still have the fastest possible local implementation of that in machine code.

Meanwhile, Knuth knew that 99% of your code is simply not performance-critical – it only runs a few times, or it’s just so much faster than some other bit that it doesn’t matter. The lesson we all learn eventually is that the slow bit is never quite what you thought, and you have to measure to find out where to concentrate your effort.

So, if optimisation is obfuscation, and 99% of your code isn’t the bit you need to make faster, it becomes clear that premature optimisation is the root of much evil.

But optimisation is not designing for performance.

Design for performance

Fundamentally, to get good performance, you are going to need to measure the time spent in various parts of your code (I suggest Very Sleepy if you’re on Windows) and make the slow bits faster (or happen less often). However, there are still some principles you can follow that will mean you need to spend less time doing this*.

*Which is a pity, because I really love doing it.

If you don’t design for performance you are almost certainly going to need to restructure large parts of your program later, which is very difficult and time-consuming.

There are two aspects to designing for performance: writing good local code, and creating good global structure.

Write good local code

Before you write an algorithm, think for a few minutes about how to make it work efficiently. e.g. if you’re writing C++, consider whether a deque or a list would be better than a vector for how you’re going to use it.

Think about what is idiomatic for your language and why. Think about what the computer really has to do to produce the results you are asking for. Are there going to be a lot of objects about? Maybe you can avoid copying them too many times. Are you creating and deleting a lot of objects? Can you reuse some instead? (Exercise caution with that one, though – if you start obfuscating you come into conflict with the ancient wisdom.)

Often, if you think through what you are doing, and the most efficient way to do it, you will end up with a faster and more memory-efficient algorithm, that expresses your intention better than if you’d written the first thing that came into your head. There is no downside to that.

Try to minimise the number of times you need to ask the operating system for a chunk of memory: this is surprisingly slow. E.g. in C++, prefer creating by-value data members instead of pointers to objects allocated with their own call to new.

By the way, don't worry if this sounds intimidating. The way to learn this stuff is to measure what you have done and then work out why it is slow. Next time you'll jump straight to the fast solution without the detour to the slow one.

Of course, none of this will matter if you don't have good global structure.

Create good global structure

The hardest and most important work you need to do to have good performance is to have good structure in the ways the different parts of your program interact.

This means thinking about how classes and components communicate with and control each other.

It may be helpful to use a streaming style of communication - can you send little chunks of information to be processed one by one instead of a huge great blob?

Try to make sure your components to use a common infrastructure: if different parts use different string classes you are going to spend most of your time copying from one to the other.

The hardest and deepest mystery in getting good performance (and in programming generally) is choosing the right fundamental data structures. I'll never forget the lesson I learnt** when a friend of mine had a conversation with me about a toy project I was doing (that was particularly focussed on trying to be fast) and then went away and produced code that was orders of magnitude faster, simply because he had chosen the right data structure.

**The lesson I learnt was that I am not as good as I think I am.

To be honest this section is a little shorter than I'd like because I know I don't have a lot of answers about how to do this well. I do know, though, that if you don't think about it now you will have the pain of restructuring your program later, when it's full of bug fixes that are going to get rebroken by the restructuring.

Of course, if you do think about it now you're still pretty likely to need to change it later...

Ancient wisdom

Ancient wisdom is usually right, but misinterpreting it and using it as a license to write bad code is a bad idea.

Carry on.

NNDB 0.1

Friday, October 23rd, 2009

I’ve managed to get NNDB, my C++ data storage library which is almost, but not entirely unlike SQL, into a fit state for a release.

You can create tables, set indices on columns, insert data, retrieve data using something like a SELECT, filter it using something like WHERE (which uses indices where available), and order it using something like an ORDER BY.

So far it has been a fantastic way to get my hands dirty with some Template metaprogramming, and some C++ as it should be*, but the reason why I started this was to help me think about how databases work, so I’m really looking forward to getting into how to implement JOINs. At the moment I have only very vague ideas.

NNDB is based heavily on the STL (part of C++’s standard library), BOOST (a playground for things that might one day be in C++’s standard library, and hang-out for some of the cleverest people alive), and Loki (the continuation of Andrei Alexandrescu’s Template metaprogramming (but used for good, not evil?) library written for and explained in Modern C++ Design. This book ranks in the top 5 most exciting books I have read). I continue to be more impressed by all three the more I learn.

I have even been having discussions with the Loki devs about some code I needed for NNDB that I think might be helpful for other people using Loki. It’s called TypelistForEach and it allows you to loop (at runtime) through all the types in a Typelist and do something for each one.

The project is already working in terms of helping me think about databases. For example, I really hadn’t thought before about how expensive ORDER BY is. To implement it I needed to create a temporary std::map covering the entire result set – in a real database this obviously requires reading every single row before we can even begin to return any results. The way to avoid this is to have an index. Which reminds me: the next thing I need to do is make ORDER BY able to use indices (at the moment it’s only WHEREs that take advantage of them).

So next on my list are:

  • ORDER By uses indices
  • Non-unique indices (presumably implemented with a std::multimap)
  • Joins

I am still very excited so you may see more releases over the next few months.

[* NNDB so far contains zero (0) calls to new and zero (0) calls to delete. Obviously the code it uses (e.g. std::vector) calls them, but that code manages all the memory for me, and most of it uses custom allocators to make it very fast. I have no idea how fast NNDB is, but maybe it could be quite fast. I am pretty confident it doesn't contain any memory errors. Famous last words...]

NNDB’s Not a Database

Saturday, October 10th, 2009

My latest project is called NNDB.

I’ve worked with databases for quite a long time now, and for a while I’ve been thinking about how they work under the hood. I know very little about it, but I thought I could learn a bit by trying to implement something similar myself.

I’m interested in how queries work against joined tables, how to implement indices and so on.

I’ve also been feeling that I want to do some C++ as an open source project. I do it all day at work, and for some problems it feels like the right tool for the job.

NNDB is sort-of like an in-memory database, but it works with C++ types for its columns, instead of a fixed set like varchar, int etc. You can put your own value-typed classes in the columns, and all values are type-checked at compile time.

It’s always struck me as strange that with a traditional code+SQL setup you have to keep your SQL in sync with your code manually. Of course, there are lots of trendy Object-Relational-Mapping thingies that solve that problem, but I felt it could be approached from another direction: instead of generating code to match your data, or generating SQL to match your code, why not specify your data structure in code?

In NNDB you define a table something like this:

typedef nndb::Values< unsigned long, std::string, std::string, MyDate >
    PersonValues;

class PersonTable : public nndb::Table
{
public:
    enum Columns
    {
        id,
        first_name,
        last_name,
        date_of_birth
    };
};

Actually, defining your own class is unnecessary, but it’s nice to have an enum to name your columns, and making a class gives you a nice place to put it.

To insert a row you do something like this:

PersonTable person_table;
person_table.Insert( PersonValues( 0,
    "Andy", "Balaam", MyDate( 12000000 ) ) );

You can do simple queries with WHERE and ORDER BY clauses, and I’m working on indexes.

After that will come JOINs, and anything else that takes my fancy.

I don’t anticipate NNDB being useful to anyone – it’s really for me to understand why things are as they are in the world of databases. However, you never know – it may turn out to be a fast and convenient way to store data in the C++ world. I think some of the applications that use databases don’t really need the kind of concurrent multi-user network-accessible features they have, but really just want to search, join and store reliably, and NNDB might one day grow into something that can find a niche.

To explore more, check out the complete example.

Firefox keyword search for finding C++ keywords

Wednesday, September 23rd, 2009

I often want to search the SGI C++ reference for a keyword. The best way I have found to jump straight to the page I want is to use Google’s “I’m Feeling Lucky” search limited to searching within sgi.com.

You can create a Firefox keyword search to allow you to do this quickly from the location bar. Now I just type Ctrl-L then e.g. “c vector” to jump straight to the page about std::vector.

To do this, make a bookmark (to anything) and then right-click it in the Bookmarks menu and choose Properties. Edit it to look like this:

Firefox C++ search keyword bookmark

The Location field is set to http://www.google.com/search?as_q=%s&as_sitesearch=sgi.com&btnI=1, and the Keyword field is just c.

This post was largely for my own benefit to remember this next time I need to set it up, but I thought it might be useful to others as the information about how to set these searches up is not easy to find.

If you want to go to a google search results page, instead of jumping to the “I’m Feeling Lucky” result, remove the &btnI=1 part.

IGCC – a real-eval-print loop for C/C++

Monday, August 31st, 2009

When you first hear about the Read-Eval-Print Loop you might well think “So what?” as I did.

What’s so great about being able to type commands interactively?

But the thing is that it creeps up on you.

Everyone already knows programming is an interactive thing – we need constant feedback to validate our ideas. Programming on paper is incredibly frustrating because you have to plough on with assumptions that are probably wrong.

It’s just so comfortable to be able to try out ideas in an interactive interpreter.

Foosball

I mean, it’s really not much hassle to create a new directory, make a new file, edit the file to contain the code you want to try, remember the right command to compile it, then run the program and see the results, is it?

Well, no, it isn’t, but it’s enough of a hassle that sometimes you don’t bother and you try it out in the code you are really working on, and if your work is like mine that means a minimum of 5 minutes to compile and link, and there you are playing foosball again when you could be getting something done.

The REPL gives you a place to try throwaway things extremely quickly, and when you’re working with something beautiful like Python it’s easy to get addicted.

So my mind started to wander and it struck me that a pale imitation of the REPL could be made for us poor C++ programmers, and it would generally serve the purposes I’ve described above.

So IGCC was born. Its name means “Interactive GCC” and it’s a read-eval-print loop for C++ (and, for most cases it will work for C too).

It uses the real GCC underneath, so you know you are running the exact code you would be (and it’s somewhat easier to write than a custom C/C++ interpreter) and all it does is take away the hassle of creating a simple program and compiling it with GCC.

It wraps your code in a standard C program, includes some common dependencies, and compiles it, printing the results of running them immediately. Using it looks like this:

$ ./igcc
g++> int a = 5;
g++> a += 2;
g++> cout << a << endl;
7
g++> --a;
g++> cout << a << endl;
6
g++>

Apart from all the sugar that I’d love to add, the main missing features are some kind of equivalent of the Python dir command, and code completion.

It’s not rocket science, but it might make you a little bit more interactive in your C and C++ coding, which might save you valuable foosball time.

Enjoy, improve, etc. IGCC.

Foosball image taken from http://en.wikipedia.org/wiki/File:Baby_foot_artlibre_jnl.jpg

Analog literals

Wednesday, May 20th, 2009

I love this: C++ Multi-Dimensional Analog Literals.

I quote:

Have you ever felt that integer literals like “4″ don’t convey the true size of the value they denote? If so, use an analog integer literal instead:

unsigned int b = I---------I;

It goes on to explain that you can use 2- and 3-dimensional “analog literals”. Genius. Read the article. Try to read the code :)

Isn’t C++ … erm … powerful?

You’ll notice that there are 9 dashes used to denote 4. This is because the trick it is using uses operator--. I’m sure the original author did this in his/her sleep and thought it was too trivial to post (or posted it before?) but I thought: if we can use operator! instead, can’t we create analog literals that use the same number of symbols as the number we want?

The answer is yes, and it’s pretty simple:

notliterals.h:

class NotLiteral
{
public:
	NotLiteral( unsigned int ival )
	: val_( ival )
	{
	}

	NotLiteral operator!() const
	{
		return NotLiteral( val_ + 1 );
	}

	operator unsigned int() const
	{
		return val_;
	}

	unsigned int val_;
};

const NotLiteral NL( 0 );

test_notliterals.cpp:

#include "notliterals.h"
#include <cassert>

int main()
{
	assert( !!!!NL == 4 );
	assert( !!NL == 2 );

	assert( !!!!!!!!!!!!!!!NL == 15 );
}

With this simpler form, it’s almost believable that there might be some kind of useful application?

Extending this to 3 dimensions is left as an exercise for the reader. For 2 dimensions, if you just want the area (not the width and height), how about this?:

	assert( !!!
	        !!!
	        !!!NL == 9 );

Update: By the way, if you don’t like all the emphasis! of! using! exclamation! marks! you can do the same thing with the unary one’s complement operator, ~. Just replace “!” everywhere above with “~” and you’re done. Unfortunately, you can’t do the same with – or + because the parser recognises “–” as the decrement operator instead of seeing that it is clearly two calls to the unary negation operator.

Talk in code

Thursday, March 5th, 2009

Last week we had an extended discussion at work about how we were going to implement a specific feature.

This discussion hijacked our entire Scrum sprint planning meeting (yes, I know, we should have time-boxed it). It was painful, but the guy who was going to implement it (yes, I know, we should all collectively own our tasks) needed the discussion: otherwise it wasn’t going to get implemented. It certainly wasn’t going to get broken into short tasks until we knew how we were going to do it.

Anyway, asides aside, I came out of that discussion bruised but triumphant. We had a plan not only on how to write the code, but also how to test it. I believe the key thing that slowly led the discussion from a FUD-throwing contest into a constructive dialogue was the fact that we began to talk in code.

There are two facets to this principle:

1. Show me the code

As Linus once said, “Talk is cheap. Show me the code.“.

If you are at all disagreeing about how what you’re doing will work, open up the source files in question. Write example code – modify the existing methods or sketch a new one. Outline the classes you will need. Code is inherently unambiguous. White board diagrams and hand-waving are not.

Why wouldn’t you do this? Fear you might be wrong? Perhaps you should have phrased your argument a little less strongly?

Is this slower than drawing boxes on a whiteboard? Not if you include time spent resolving the confusion caused by the ambiguities inherent in line drawings.

Does UML make whiteboards less ambiguous? Yes, if all your developers can be bothered to learn it. But why learn a new language when you can communicate using the language you all speak all day – code?

2. Create a formal language to describe the problem

If your problem is sufficiently complex, you may want to codify the problem into a formal (text-based) language.

In last week’s discussion we were constantly bouncing back and forth between different corner cases until we started writing them down in a formal language.

The language I chose was an adaptation of a Domain-specific language I wrote to test a different part of our program. I would love to turn the cases we wrote down in that meeting into real tests that run after every build (in fact I am working on it) but their immediate value was to turn very confusing “what-if”s into concrete cases we could discuss.

Before we started using the formal language, the conversations went something like this:

Developer: “If we implement it like that, this bad thing will happen.”

Manager: “That’s fine – it’s a corner case that we can tidy up later if we need it.”

Developer: (Muttering) “He clearly doesn’t understand what I mean.”

Repeat

After we started using the formal language they went something like this:

Developer: “If we implement it like that, this bad thing will happen.”

Me: “Write it down, I tell you.”

Developer: (Typing) “See, this will happen!”

Manager: “That’s fine – it’s a corner case that we can tidy up later if we need it.”

Developer: (Muttering) “Flipping managers.”

Summary

The conversation progresses if all parties believe the others understand what they are saying. It is not disagreement that paralyses conversations – it is misunderstanding.

To avoid misunderstanding, talk in code – preferably a real programming language, but if that’s too verbose, a text-based code that is unambiguous and understood by everyone involved.

Note on whiteboards

You can’t copy and paste them, and you can’t (easily) keep what you did with them, and you can’t use them to communicate over long distances.

And don’t even try and suggest an electronic whiteboard. In a few years they may solve all of the above problems, but not now. They fail the “can I draw stuff?” test at the moment.

Even when electronic whiteboards solve those problems, they won’t solve the fact that lines and boxes are more ambiguous and less detailed than code in text form.

If you all know and like UML, that makes your diagrams less ambiguous, but still they often don’t allow enough detail: why bother?