Tail call optimisation in C++

This series: Lightning talk, Explanation, Performance, Generalisation.

Some programming languages make recursive programming more practical by providing the tail call optimisation. For a tiny talk at the recent ACCU conference I looked at how we might do something similar in C++.

Update: Source code for this article is available.

Multiplying by two

Here’s a toy problem we will use as our example.

Imagine for a second that you want to write a function that multiplies a number by two. OK, we can do that:

long times_two_hardware( long value )
{
    return value * 2;
}

Now imagine that you don’t have the “*” operator. We’re going have to use “+”:

long times_two_loop( long value )
{
    long ret = 0;
    for ( long i = 0; i < value; ++i )
    {
        ret += 2;
    }
    return ret;
}

(Obviously, this is just a silly example designed to be easy to follow.)

Now imagine that you read somewhere that state was bad, and you could always replace a loop with recursion. Then you might get something like this:

long times_two_naive_recursive( long value )
{
    if ( value == 0 )
    {
        return 0;
    }
    else
    {
        return 2 + times_two_naive_recursive( value - 1 );
    }
}

This is fine, but what happens when you run it for a large input?

$ ulimit -S -s 16
$ ./times_two naive_recursive 100000
Segmentation fault

Note that I set my stack size to be very small (16K) to make the point - actually, this will run successfully for very large arguments, but it will eat all your memory and take a long time to finish.

Why does this fail? Because every time you call a function, the state of the current function is saved, and new information is pushed onto the stack about the new function. When you call a function from within a function multiple times, the stack grows and grows, remembering the state all the way down to the place where you started.

So is programming like this useless in practice?

Tail call optimisation

No, because in several programming languages, the compiler or interpreter performs the "tail call optimisation".

When you call a function from within some other code, you normally need the state of the current code to be preserved. There is a special case where you don't need it, though, and this is called a tail call. A tail call is just the situation where you call a function and immediately return its return value as your return value. In this case, we don't need any of the state of the current code any more - we are just about to throw it away and return.

The tail call optimisation throws away this unneeded state before calling the new function, instead of after.

[In practice, in compiled code, this involves popping all the local variables off the stack, pushing the new function parameters on, and jmping to the new function, instead of calling it. This means that when we hit the ret at the end of the new function, we return to the original caller, instead of the location of the tail call.]

Many recursive functions can be re-cast as tail-call versions (sometimes called iterative versions). The one we're looking at is one of those. Here is the tail-call version:

long times_two_recursive_impl( long total, long counter )
{
    if ( counter == 0 )
    {
        return total;
    }
    else
    {
        return times_two_recursive_impl(
            total + 2, counter - 1 );
    }
}

long times_two_recursive( long value )
{
    return times_two_recursive_impl( 0, value );
}

It consists of an outer function times_two_recursive which just hands off control to the inner function times_two_recursive_impl. The inner function uses a counter variable and calls itself recursively, reducing that counter by one each time, until it reaches zero, when it returns the total, which is increased by 2 each time.

The key feature of this implementation is that the recursive function times_two_recursive_impl uses a tail call to do the recursion: the value of calling itself is immediately returned, without reference to anything else in the function, even temporary variables.

So, let's see what happens when we compile and run this:

$ ulimit -S -s 16
$ ./times_two recursive 100000
Segmentation fault

Did I mention that C++ doesn't do the tail call optimisation?*

* Tail call optimisation isn't in the C++ standard. Apparently, some compilers, including MS Visual Studio and GCC, do provide tail call optimisation under certain circumstances (when optimisations are enabled, obviously). It is difficult to implement for all cases, especially in C++ since destruction of objects can cause code to be executed where you might not have expected it, and it doesn't appear to be easy to tell when a compiler will or will not do it without examining the generated assembly language. Languages which have this feature by design, like Scheme (and D?) can do it more predictably.

So how would we write code that is tail call optimised in C++? Possibly of more interest to me personally: if we were generating C++ as the output format for some other language, what code might we generate for tail call optimised functions?

Tail call optimised C++

Let's imagine for a second we have some classes, which I'll define later. FnPlusArgs holds a function pointer, and some arguments to be passed to it. Answer holds on to one of 2 things: either a FnPlusArgs to call later, or an actual answer (return value) for our function.

Now we can write our function like this:

Answer times_two_tail_call_impl( long acc, long i )
{
    if ( i == 0 )
    {
        return Answer( true, null_fn_plus_args, acc );
    }
    else
    {
        return Answer(
            false,
            FnPlusArgs(
                times_two_tail_call_impl,
                acc + 2,
                i - 1
            ),
            0
        );
    }
}

long times_two_tail_call( long n )
{
    return trampoline( Answer(
        false,
        FnPlusArgs( times_two_tail_call_impl, 0, n ),
        0 ) );
}

This has the same structure as times_two_recursive, if a little more verbose. The important point to note, though, is that times_two_tail_call_impl doesn't call itself recursively. Instead, it returns an Answer object, which is a delegate saying that we have more work to do: calling the provided function with the supplied arguments.

The Trampoline

All we need now is some infrastructure to call this function, and deal with its return value, calling functions repeatedly until we have an answer. This function is called a "trampoline", and you can sort of see why:

long trampoline( Answer answer )
{
    while ( !answer.finished_ )
    {
        answer = answer.tail_call_();
    }
    return answer.value_;
}

While the answer we get back tells us we have more work to do, we call functions, and when we're finished we return the answer.

Now all we need to get this working is the definition of Answer and FnPlusArgs:

struct Answer;
typedef Answer (*impl_fn_type)( long, long );

struct FnPlusArgs
{
    impl_fn_type fn_;
    long arg1_;
    long arg2_;

    FnPlusArgs(
        impl_fn_type fn,
        long arg1,
        long arg2
    );

    Answer operator()();
};

impl_fn_type null_fn = NULL;
FnPlusArgs null_fn_plus_args( null_fn, 0, 0 );

struct Answer
{
    bool finished_;
    FnPlusArgs tail_call_;
    long value_;

    Answer( bool finished, FnPlusArgs tail_call, long value );
};


FnPlusArgs::FnPlusArgs(
    impl_fn_type fn,
    long arg1,
    long arg2
)
: fn_( fn )
, arg1_( arg1 )
, arg2_( arg2 )
{
}

Answer FnPlusArgs::operator()()
{
    return fn_( arg1_, arg2_ );
}


Answer::Answer( bool finished, FnPlusArgs tail_call, long value )
: finished_( finished )
, tail_call_( tail_call )
, value_( value )
{
}

The only notable thing about this is that we use operator() on FnPlusArgs to call the function it holds.

Results

Now, when we run this code, we get what we wanted:

$ ulimit -S -s 16
$ ./times_two tail_call 100000
200000

(I.e. it doesn't crash.)

So, it turns out that the tail call optimisation is just a while loop. Sort of.

Lighting talk – Tail call optimisation in C++

This series: Lightning talk, Explanation, Performance, Generalisation.

Update: watch the video

Here’s the lightning talk I gave at the ACCU 2012 Conference:

Tail Call Optimisation in C++

It’s about how you would generate C++ code that represents a recursive function, without running out of stack space.

Scheme: feel the cool

Update: watch the video

In this series on Scheme: Intro, Basics, Closures.

Here’s a presentation I did recently, introducing the Scheme programming language (a dialect of Lisp).

Scheme: feel the cool

Metaprogramming is very powerful, but is usually so hard that we shy away from it. In Scheme, most metaprogramming is just “normal” programming – you can do it in the same language, and in a macro language that is very similar the normal language.

Anatomy of an interpreter: the Evaluator

Posts in this series: Lexer, Parser, Evaluator

I’m still really enjoying writing my Scheme interpreter Subs, which can now succesfully run all the example code from SICP up to section 2.3.4. I’ve made the changes I mentioned I would in the Lexer article, so now the Lexer returns Tokens that contain information about their basic types, and I’ve gone through a significant refactoring to replace one of the several massive switch statements with a virtual function call (Martin Fowler would be proud).

Last time I explained how the Parser takes the stream of tokens coming from the Lexer and returns a hierarchical tree of Values, each of which represents an operation or thing in the program.

The Evaluator takes in a tree of Values, “evaluates” it, and returns another Value object, which is the answer. The Evaluator class is by far the most complex part of Subs, so in this post we’ll start with an overview of how it works. Future posts will break down the different parts in more detail.

The most interesting parts of the Evaluator class interface look like this:

class Evaluator
{
public:
    std::auto_ptr<Value> EvalInContext( const Value* value,
        boost::shared_ptr<Environment>& environment );
};

The EvalInContext method takes in a Value to evaluate, and an “environment” *in which to evaluate it. Note that the in the real code it takes a couple more arguments, including a mysterious and annoying boolean called is_tail_call which will be explained later.

* More on environments later. All you need to know for now is that they provide a way of keeping hold of all the things we currently know about, identified by name.

A very simplified version of EvalInContext would look like this:

std::auto_ptr<Value> Evaluator::EvalInContext( const Value* value,
    boost::shared_ptr<Environment>& environment )
{
    if( is_symbol( value ) )
    {
        return eval_symbol( value, environment );
    }

    if( !is_combination( value ) )
    {
        return auto_ptr<Value>( value->Clone() );
    }

    const CombinationValue* combo = to_combination( value );
    CombinationValue::const_iterator it = combo->begin();

    auto_ptr<Value> evaldoptr = EvalInContext( *it, environment );

    if( special_symbol( evaldoptr ) )
    {
        return process_special_symbol( evaldoptr, combo );
    }
    else
    {
        ++it;

        CombinationValue argvalues;
        for( ; it != combo->end(); ++it )
        {
            argvalues.push_back( EvalInContext( *it, environment ).release() );
        }

        return run_procedure( evaldoptr.get(), &argvalues, *this, environment );
    }
}

If the Value to be evaluated is just a symbol, we call eval_symbol which basically looks up the symbol’s name in the environment and returns the value it finds.

If the Value is not a combination (i.e. the root of a tree of other values) it must be a basic type such as a string or an integer. In this case we simple copy the Value and return it.

Otherwise, it’s a combination. To evaluate a combination, we follow the “eval-apply” pattern. The principle is to evaluate all the Values in the combination separately, and then “apply” (run) the first value (the “operator”) as a procedure, using the other values as arguments. The first value must evaluate to something that is recognisable as a procedure, or this doesn’t make sense and we will throw an error.

In practice it’s a tiny bit more complicated. We evaluate the first Value in the combination (by calling EvalInContext recursively), then we check whether it’s a special symbol such as if or let and if so, deal with it separately. Otherwise, we evaluate all the other Values (calling EvalInContext recursively again) and put them into a new CombinationValue, and pass the operator and the arguments to run_procedure, which looks something like this:

std::auto_ptr<Value> run_procedure( const Value* operator,
    const CombinationValue* args, Evaluator& ev,
    boost::shared_ptr<Environment>& environment )
{
    if( is_builtin_procedure( operator ) )
    {
        return handle_builtin_procedure( operator, args, environment );
    }
    else
    {
        std::auto_ptr<Value> ret;

        const CompoundProcedureValue* proc = to_compound_procedure( operator );

        boost::shared_ptr<Environment> new_env =
            proc->ExtendEnvironmentWithArgs( args );

        for( CombinationValue::const_iterator it = proc->GetBody()->begin();
            it != proc->GetBody()->end(); ++it )
        {
            ret = ev.EvalInContext( *it, new_env );
        }

        return ret;
    }
}

Running a procedure means either doing something built-in (for example adding up two numbers and returning the result) or evaluating some other code, which comes from the definition of the procedure being run. First we call ExtendEnvironmentWithArgs to create a new Environment, which contains the argument Values that were supplied, and then we loop through all the sections of the body of the procedure, evaluating each one. Note that we throw away the returned Values for all sections except the last one (this is how Scheme works).

Once we’ve evaluated and applied our procedure, which of course potentially includes numerous recursive calls to EvalInContext, we end up with a Value that is returned, and we’re done.

Simple eh?

But now I must make a confession: almost everything I have written above is lie. Why would I lie to you? Because I missed out something wonderful and strange called “tail-call optimisation”. I’ll explain next time.