Generalising tail call optimised C++

This series: Lightning talk, Explanation, Performance, Generalisation.

In previous posts I discussed the construction of some C++ that does the same job that the tail call optimisation does in some other languages. The example code given showed the case where we know that every function in the recursion will take two long integer parameters, and return a long as well.

In fact, it is perfectly possible to generalise this code to cover more complex cases. Fortunately, the trampoline function doesn’t need to know the arguments taken by the functions being called, only the return value. It looks like this:

template<typename RetT>
const RetT trampoline_templ(
    std::auto_ptr< IAnswer<RetT> > answer )
{
    while( !answer->finished() )
    {
        answer = answer->tail_call()();
    }
    return answer->value();
}

So we can call this with the type of the return value as our template parameter, and supply an object which satisfies the IAnswer<RetT> interface:

template<typename RetT>
class IAnswer
{
public:
    virtual const bool             finished()  const = 0;
    virtual const ICallable<RetT>& tail_call() const = 0;
    virtual const RetT             value()     const = 0;
};

where ICallable looks like this:

template<typename RetT>
class ICallable
{
public:
    typedef std::auto_ptr< IAnswer<RetT> > AnswerPtr;
    virtual AnswerPtr operator()() const = 0;
};

The concrete classes that implement these interfaces need to know the types (and number) of the arguments, but that’s ok because they only get instantiated by code that would otherwise (in the standard, non-tail-call recursion case) call the functions themselves. In the toy case we are using of repeatedly adding up numbers to multiply by two, the outer function looks like this:

const long times_two_tail_call_templ( const long n )
{
    typedef Answer2<long, long, long> AnswerType;

    return trampoline_templ(
        AnswerType::newFn(
            times_two_tail_call_impl, 0, n, 0 )
    );
}

and the inner one looks like this:

std::auto_ptr< IAnswer<long> > times_two_tail_call_impl(
    const long acc, const long i )
{
    typedef Answer2<long, long, long> AnswerType;

    if( i == 0 )
    {
        return return AnswerType::newAns( acc );
    }
    else
    {
        return AnswerType::newFn(
            times_two_tail_call_impl,
            acc + 2, i - 1, 0 );
    }
}

Both of the above use static convenience methods newFn and newAns that I have defined on Answer2 to create smart pointers to newly-allocated Answer2 objects. newAns creates an Answer2 that contains a final answer, and newFn creates an Answer2 specifying another function (with arguments) to call.

Answer2 looks like this:

template<typename RetT, typename Arg1T, typename Arg2T>
class Answer2 : public IAnswer<RetT>
{
private:
    typedef FnPlusArgs2<RetT, Arg1T, Arg2T> FnArgs;
    typedef std::auto_ptr< IAnswer<RetT> > AnswerPtr;

    const bool finished_;
    const FnArgs tail_call_;
    const RetT value_;

private:
    Answer2( const bool finished, const FnArgs tail_call, const RetT value )
    : finished_( finished )
    , tail_call_( tail_call )
    , value_( value )
    {
    }

    static AnswerPtr newPtr(
        const bool finished, const FnArgs tail_call, const RetT value )
    {
        return AnswerPtr( new Answer2<RetT, Arg1T, Arg2T>(
            finished, tail_call, value ) );
    }
public:
    static AnswerPtr newFn(
        const typename FnArgs::fn_type fn,
        const Arg1T arg1,
        const Arg2T arg2,
        const RetT zero_val )
    {
        return newPtr( false, FnArgs( fn, arg1, arg2 ), zero_val );
    }

    static AnswerPtr newAns( const RetT value )
    {
        return newPtr( true, FnArgs::null(), value );
    }

    virtual const bool    finished()  const { return finished_; };
    virtual const FnArgs& tail_call() const { return tail_call_; };
    virtual const RetT    value()     const { return value_; };
};

and uses FnPlusArgs2, which looks like this:

template<typename RetT, typename Arg1T, typename Arg2T>
class FnPlusArgs2 : public ICallable<RetT>
{
private:
    typedef typename ICallable<RetT>::AnswerPtr AnswerPtr;
public:
    typedef AnswerPtr (*fn_type)( const Arg1T, const Arg2T );
private:
    const fn_type fn_;
    const Arg1T arg1_;
    const Arg2T arg2_;

public:
    FnPlusArgs2( const fn_type fn, const Arg1T arg1, const Arg2T arg2 )
    : fn_( fn )
    , arg1_( arg1 )
    , arg2_( arg2 )
    {
    }

    virtual AnswerPtr operator()() const
    {
        return fn_( arg1_, arg2_ );
    }

    static FnPlusArgs2<RetT, Arg1T, Arg2T> null()
    {
        return FnPlusArgs2<RetT, Arg1T, Arg2T>( NULL, 0, 0 );
    }
};

I have continued with the 2 long arguments, and long return value example here, but with the above code it is possible to construct recursive code using more than one function, and the functions can have different numbers of arguments, and different argument types, so long as they all co-operate to produce an answer of the required type. The Source code for this article includes an example, in the file tail_call_templ_2fns.cpp, of two different functions that call each other recursively, and take different arguments, using the trampoline function and interfaces listed above, and Answer3 and FnPlusArgs3 class templates similar to Answer2 and FnPlusArgs2 shown above. Implementing the N-args case using variadic templates (C++11) or template meta-programming is left as an exercise for the reader.

This more realistic case where the number and types of arguments are not known beforehand forces us to use dynamic memory to store our AnswerN objects, and causes more pointer dereferences and virtual function calls, and these do hurt performance. In tests on my machine, this code ran approximately 10 times slower than the version using only stack memory. Perhaps we C++ programmers should comfort ourselves that many languages supporting tail-call optimisation require dynamic memory, virtual functions and pointer indirection to do absolutely everything.

JavaScript WTFs Videos

I recorded some videos of my JavaScript WTFs presentations:

You can get the JavaScript WTFs slides.

Update: all six episodes:

Find them all on PeerTube:

Closures in Scheme

Update: watch the video

In this series on Scheme: Intro, Basics, Closures.

Here’s a presentation I did recently, on Closures in the Scheme programming language. Closures are the way the environment in which a function was created hangs around with it as long as the function itself exists. They allow many flexible programming styles, and in this presentation I demonstrate how a simple object-oriented class structure can be implemented using closures.

Closures in Scheme

Performance of tail call optimised C++

This series: Lightning talk, Explanation, Performance, Generalisation.

After I wrote a version of tail-call optimised code in C++ I became interested in its performance relative to normal recursion. The tail call version can process arbitrarily large input, but how much do you pay for that in terms of performance?

Recap: there are 4 different versions of our function, called times_two. The first, “hardware”, uses the “*” operator to multiply by 2. The second, “loop” uses a for loop to add up lots of 2s until we get the answer. The third, “recursive”, uses a recursive function to add up 2s. The fourth, “tail_call” is a reimplementation of “recursive”, with a manual version of the tail call optimisation – see the original article for details.

Let’s look first at memory usage. Here is the stack memory usage over time as reported by Massif of calling the four functions for a relatively small input value of 100000:

Stack usage by the four functions.  recursive uses loads, and the others are constant.

The recursive function uses way more memory than the others (note the logarithmic scale), because it keeps all those stack frames, and the tail_call version takes much longer than the others (possibly because it puts more strain on Massif?), but keeps its memory usage low. Let’s look at how that affects its performance, for different sizes of input:

Time taken by each function for different input values.  When recursive starts using more memory than is available on the machine, its execution time grows rapidly

For these much larger input values, the recursive and tail_call functions take similar amounts of time, until the recursive version starts using all the physical memory on my computer. At this point, its execution times become huge, and erratic, whereas the tail_call function plods on, working fine.

So the overhead of the infrastructure of the tail call doesn’t have much impact on execution time for large input values, but it’s clear from the barely-visible green line at the bottom that using a for-loop with a mutable loop variable instead of function calls is way, way faster, with my compiler, on my computer, in C++. About 18 times faster, in fact.

And, just in case you were wondering: yes those pesky hardware engineers with their new-fangled “*” operator managed to defeat all comers with their unreasonable execution times of 0 seconds every time (to the nearest 10ms). I suppose that shows you something.