Performance – Page 3 – Andy Balaam's Blog

Basic ideas of Python 3 asyncio concurrency

Series: asyncio basics, large numbers in parallel, parallel HTTP requests, adding to stdlib

Update: see the Python Async Basics video on this topic.

Python 3’s asyncio module and the async and await keywords combine to allow us to do cooperative concurrent programming, where a code path voluntarily yields control to a scheduler, trusting that it will get control back when some resource has become available (or just when the scheduler feels like it). This way of programming can be very confusing, and has been popularised by Twisted in the Python world, and nodejs (among others) in other worlds.

I have been trying to get my head around the basic ideas as they surface in Python 3’s model. Below are some definitions and explanations that have been useful to me as I tried to grasp how it all works.

Futures and coroutines are both things that you can wait for.

You can make a coroutine by declaring it with async def:

import asyncio
async def mycoro(number):
    print("Starting %d" % number)
    await asyncio.sleep(1)
    print("Finishing %d" % number)
    return str(number)

Almost always, a coroutine will await something such as some blocking IO. (Above we just sleep for a second.) When we await, we actually yield control to the scheduler so it can do other work and wake us up later, when something interesting has happened.

You can make a future out of a coroutine, but often you don’t need to. Bear in mind that if you do want to make a future, you should use ensure_future, but this does more than create a future: it also schedules the coroutine you pass to it, so it starts running as soon as the event loop gets control:

myfuture1 = asyncio.ensure_future(mycoro(1))
# Runs mycoro!

But, to get its result, you must wait for it – it is only scheduled in the background:

# Assume mycoro is defined as above
myfuture1 = asyncio.ensure_future(mycoro(1))
# We end the program without waiting for the future to finish

So the above fails like this:

$ python3 ./python-async.py
Task was destroyed but it is pending!
task: <Task pending coro=<mycoro() running at ./python-async:10>>
sys:1: RuntimeWarning: coroutine 'mycoro' was never awaited

The right way to block waiting for a future outside of a coroutine is to ask the event loop to do it:

# Keep on assuming mycoro is defined as above for all the examples
myfuture1 = asyncio.ensure_future(mycoro(1))
loop = asyncio.get_event_loop()
loop.run_until_complete(myfuture1)
loop.close()

Now this works properly (although we’re not yet getting any benefit from being asynchronous):

$ python3 python-async.py
Starting 1
Finishing 1
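On Python 3.7 and later, asyncio.run can look after the event loop for us. The following is a minimal sketch of the same idea under that assumption, not the code used in this article: mycoro is repeated (with a shorter sleep) so the snippet is self-contained, and main is a name made up for the example.

```python
import asyncio

async def mycoro(number):
    print("Starting %d" % number)
    await asyncio.sleep(0.1)
    print("Finishing %d" % number)
    return str(number)

async def main():
    # Inside main there is a running event loop, so the future is scheduled on it
    myfuture1 = asyncio.ensure_future(mycoro(1))
    # Awaiting the future waits for it to finish and gives us its result
    return await myfuture1

# asyncio.run creates a loop, runs main() to completion and closes the loop
result = asyncio.run(main())
print(result)
```

Here the waiting is done with await inside a coroutine, and asyncio.run plays the part of get_event_loop, run_until_complete and close.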

To run several things concurrently, we make a future that is the combination of several other futures. asyncio can make a future like that out of coroutines using asyncio.gather:

several_futures = asyncio.gather(
    mycoro(1), mycoro(2), mycoro(3))
loop = asyncio.get_event_loop()
print(loop.run_until_complete(several_futures))
loop.close()

The three coroutines all run at the same time, so this only takes about 1 second to run, even though we are running 3 tasks, each of which takes 1 second:

$ python3 python-async.py
Starting 3
Starting 1
Starting 2
Finishing 3
Finishing 1
Finishing 2
['1', '2', '3']

asyncio.gather won’t necessarily run your coroutines in order, but it will return a list of results in the same order as its input.

Notice also that run_until_complete returns the result of the future created by gather – a list of all the results from the individual coroutines.
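To make the ordering point easier to see, here is a small sketch in which the coroutines deliberately finish in a different order from the one they were passed in. The names slow_echo and main are invented for this example, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio

async def main():
    finished = []  # values in the order the coroutines actually finish

    async def slow_echo(value, delay):
        await asyncio.sleep(delay)
        finished.append(value)
        return value

    # gather returns results in the order of its arguments
    results = await asyncio.gather(
        slow_echo("first", 0.3),
        slow_echo("second", 0.1),
        slow_echo("third", 0.2),
    )
    return finished, results

finished, results = asyncio.run(main())
print("finished:", finished)  # completion order: second, third, first
print("results: ", results)   # argument order: first, second, third
```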

To do the next bit we need to know how to call a coroutine from a coroutine. As we’ve already seen, just calling a coroutine in the normal Python way doesn’t run it, but gives you back a “coroutine object”. To actually run the code, we need to wait for it. When we want to block everything until we have a result, we can use something like run_until_complete, but in an async context we want to yield control to the scheduler and let it give us back control when the coroutine has finished. We do that by using await:

import asyncio
async def f2():
    print("start f2")
    await asyncio.sleep(1)
    print("stop f2")
async def f1():
    print("start f1")
    await f2()
    print("stop f1")
loop = asyncio.get_event_loop()
loop.run_until_complete(f1())
loop.close()

This prints:

$ python3 python-async.py
start f1
start f2
stop f2
stop f1

Now we know how to call a coroutine from inside a coroutine, we can continue.
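As a quick check of the point that a bare call only creates a coroutine object, the sketch below inspects the object before awaiting it. The function names are made up for the example, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio
import inspect

async def answer():
    return 42

async def main():
    coro = answer()                      # nothing in answer() has run yet
    is_coro = inspect.iscoroutine(coro)  # it is just a coroutine object
    value = await coro                   # now the body runs and we get its result
    return is_coro, value

print(asyncio.run(main()))  # prints (True, 42)
```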

We have seen that asyncio.gather takes in some futures/coroutines and returns a future that collects their results (in order).

If, instead, you want to get results as soon as they are available, you need to write a second coroutine that deals with each result by looping through the results of asyncio.as_completed and awaiting each one.

# Keep on assuming mycoro is defined as at the top
async def print_when_done(tasks):
    for res in asyncio.as_completed(tasks):
        print("Result %s" % await res)
coros = [mycoro(1), mycoro(2), mycoro(3)]
loop = asyncio.get_event_loop()
loop.run_until_complete(print_when_done(coros))
loop.close()

This prints:

$ python3 python-async.py
Starting 1
Starting 3
Starting 2
Finishing 3
Result 3
Finishing 2
Result 2
Finishing 1
Result 1

Notice that task 3 finishes first and its result is printed, even though tasks 1 and 2 are still running.

asyncio.as_completed returns an iterable sequence of futures, each of which must be awaited, so it must run inside a coroutine, which must be waited for too.
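Because mycoro always sleeps for the same time, the order in the output above depends on the scheduler. A sketch with different delays makes the behaviour easier to predict. The names slow_echo and collect_in_completion_order are invented here, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio

async def slow_echo(value, delay):
    await asyncio.sleep(delay)
    return value

async def collect_in_completion_order(coros):
    results = []
    # Each item from as_completed is awaited to get the next finished result
    for next_done in asyncio.as_completed(coros):
        results.append(await next_done)
    return results

async def main():
    return await collect_in_completion_order([
        slow_echo("slow", 0.3),
        slow_echo("fast", 0.1),
        slow_echo("medium", 0.2),
    ])

print(asyncio.run(main()))  # ['fast', 'medium', 'slow']
```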

The argument to asyncio.as_completed is consumed in full up front: every coroutine or future in it is collected before any result comes back, so you can’t feed it lazily from a very large sequence of items that won’t fit in memory.

Side note: if we want to work with very large lists, asyncio.wait won’t help us here – it also takes a list of futures and waits for all of them to complete (like gather), or, with other arguments, for one of them to complete or one of them to fail. It then returns two sets of futures: done and not-done. Each of these must be awaited to get their results, so:

asyncio.gather

# is roughly equivalent to:

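async def mygather(*args):
    # Wrap in futures first: asyncio.wait rejects bare coroutines on Python 3.11+
    futures = [asyncio.ensure_future(a) for a in args]
    ret = []
    # Unlike gather, the "done" set is unordered, hence "roughly"
    for r in (await asyncio.wait(futures))[0]:
        ret.append(await r)
    return ret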
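The “with other arguments” case can be sketched like this, using return_when=asyncio.FIRST_COMPLETED to come back as soon as one future is done. This is only an illustration: slow_echo and main are names made up for it, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio

async def slow_echo(value, delay):
    await asyncio.sleep(delay)
    return value

async def main():
    # asyncio.wait is given futures here, not bare coroutines
    tasks = [
        asyncio.ensure_future(slow_echo("slow", 0.3)),
        asyncio.ensure_future(slow_echo("fast", 0.1)),
    ]
    done, pending = await asyncio.wait(
        tasks, return_when=asyncio.FIRST_COMPLETED)
    # Futures in "done" have finished, so result() returns immediately
    first_results = [task.result() for task in done]
    # Wait for the rest too, so nothing is still pending when the loop closes
    rest = await asyncio.gather(*pending)
    return first_results, rest

print(asyncio.run(main()))  # (['fast'], ['slow'])
```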

Further reading: realpython.com/async-io-python (a very complete and clear explanation, with lots of links)

I am interested in running very large numbers of tasks with limited concurrency – see the next article for how I managed it.

Java game programming: image rendering hints make no difference to rendering time

I am writing an Open Source/Free Software Java desktop game (Rabbit Escape) and am experimenting with allowing you to zoom in so that the images being rendered are fairly big, and I’m seeing some slow-down.

[Note: Rabbit Escape is also available on Android, where it appears to have no problems with rendering speed…]

To speed it up, I have been experimenting with passing various values to Graphics2D’s setRenderingHint method.

My conclusion is that on my platform (OpenJDK Java 7 on Linux) it makes no difference whatsoever. Please leave a comment if I’m doing it wrong.

I am timing how long each frame takes to render, and comparing the time when I insert this code before my rendering code:

// As slow and high-quality as possible, please!
g.setRenderingHint( RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON );
g.setRenderingHint( RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_QUALITY );
g.setRenderingHint( RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_QUALITY );
g.setRenderingHint( RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_QUALITY );

with inserting this before my rendering code:

// As fast and low-quality as possible, please!
g.setRenderingHint( RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_OFF );
g.setRenderingHint( RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_SPEED );
g.setRenderingHint( RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_SPEED );
g.setRenderingHint( RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_SPEED );

[Documentation, such as it is, is here: RenderingHints.]

I see no appreciable difference in rendering time or image quality between these two setups, or when I leave out the calls to setRenderingHint completely.

Platform details:

$ java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Tail Call Optimisation in C++ – lightning talk video

You can watch the video of my lightning talk, Tail Call Optimisation in C++, which I gave at the ACCU 2012 conference in April.

You can also read the (clearer and more correct) writeup I did later: Tail Call Optimisation in C++ or the subsequent article published in Overload 109.

Tail Call Optimisation in C++ published in Overload journal

You read it here first, but now you can have a paper version of “Tail Call Optimisation in C++”, published almost as-is in Overload 109, the journal of ACCU.

Generalising tail call optimised C++

This series: Lightning talk, Explanation, Performance, Generalisation.

In previous posts I discussed the construction of some C++ that does the same job that the tail call optimisation does in some other languages. The example code given showed the case where we know that every function in the recursion will take two long integer parameters, and return a long as well.

In fact, it is perfectly possible to generalise this code to cover more complex cases. Fortunately, the trampoline function doesn’t need to know the arguments taken by the functions being called, only the return value. It looks like this:

template<typename RetT>
const RetT trampoline_templ(
    std::auto_ptr< IAnswer<RetT> > answer )
{
    while( !answer->finished() )
    {
        answer = answer->tail_call()();
    }
    return answer->value();
}

So we can call this with the type of the return value as our template parameter, and supply an object which satisfies the IAnswer<RetT> interface:

template<typename RetT>
class IAnswer
{
public:
    virtual ~IAnswer() {}  // objects are deleted through std::auto_ptr< IAnswer<RetT> >
    virtual const bool             finished()  const = 0;
    virtual const ICallable<RetT>& tail_call() const = 0;
    virtual const RetT             value()     const = 0;
};

where ICallable looks like this:

template<typename RetT>
class ICallable
{
public:
    typedef std::auto_ptr< IAnswer<RetT> > AnswerPtr;
    virtual AnswerPtr operator()() const = 0;
};

The concrete classes that implement these interfaces need to know the types (and number) of the arguments, but that’s ok because they only get instantiated by code that would otherwise (in the standard, non-tail-call recursion case) call the functions themselves. In the toy case we are using of repeatedly adding up numbers to multiply by two, the outer function looks like this:

const long times_two_tail_call_templ( const long n )
{
    typedef Answer2<long, long, long> AnswerType;

    return trampoline_templ(
        AnswerType::newFn(
            times_two_tail_call_impl, 0, n, 0 )
    );
}

and the inner one looks like this:

std::auto_ptr< IAnswer<long> > times_two_tail_call_impl(
    const long acc, const long i )
{
    typedef Answer2<long, long, long> AnswerType;

    if( i == 0 )
    {
        return AnswerType::newAns( acc );
    }
    else
    {
        return AnswerType::newFn(
            times_two_tail_call_impl,
            acc + 2, i - 1, 0 );
    }
}

Both of the above use static convenience methods newFn and newAns that I have defined on Answer2 to create smart pointers to newly-allocated Answer2 objects. newAns creates an Answer2 that contains a final answer, and newFn creates an Answer2 specifying another function (with arguments) to call.

Answer2 looks like this:

template<typename RetT, typename Arg1T, typename Arg2T>
class Answer2 : public IAnswer<RetT>
{
private:
    typedef FnPlusArgs2<RetT, Arg1T, Arg2T> FnArgs;
    typedef std::auto_ptr< IAnswer<RetT> > AnswerPtr;

    const bool finished_;
    const FnArgs tail_call_;
    const RetT value_;

private:
    Answer2( const bool finished, const FnArgs tail_call, const RetT value )
    : finished_( finished )
    , tail_call_( tail_call )
    , value_( value )
    {
    }

    static AnswerPtr newPtr(
        const bool finished, const FnArgs tail_call, const RetT value )
    {
        return AnswerPtr( new Answer2<RetT, Arg1T, Arg2T>(
            finished, tail_call, value ) );
    }
public:
    static AnswerPtr newFn(
        const typename FnArgs::fn_type fn,
        const Arg1T arg1,
        const Arg2T arg2,
        const RetT zero_val )
    {
        return newPtr( false, FnArgs( fn, arg1, arg2 ), zero_val );
    }

    static AnswerPtr newAns( const RetT value )
    {
        return newPtr( true, FnArgs::null(), value );
    }

    virtual const bool    finished()  const { return finished_; };
    virtual const FnArgs& tail_call() const { return tail_call_; };
    virtual const RetT    value()     const { return value_; };
};

and uses FnPlusArgs2, which looks like this:

template<typename RetT, typename Arg1T, typename Arg2T>
class FnPlusArgs2 : public ICallable<RetT>
{
private:
    typedef typename ICallable<RetT>::AnswerPtr AnswerPtr;
public:
    typedef AnswerPtr (*fn_type)( const Arg1T, const Arg2T );
private:
    const fn_type fn_;
    const Arg1T arg1_;
    const Arg2T arg2_;

public:
    FnPlusArgs2( const fn_type fn, const Arg1T arg1, const Arg2T arg2 )
    : fn_( fn )
    , arg1_( arg1 )
    , arg2_( arg2 )
    {
    }

    virtual AnswerPtr operator()() const
    {
        return fn_( arg1_, arg2_ );
    }

    static FnPlusArgs2<RetT, Arg1T, Arg2T> null()
    {
        return FnPlusArgs2<RetT, Arg1T, Arg2T>( NULL, 0, 0 );
    }
};

I have continued with the example of two long arguments and a long return value here, but with the above code it is possible to construct recursive code using more than one function, and the functions can take different numbers and types of arguments, so long as they all co-operate to produce an answer of the required type. The source code for this article includes an example, in the file tail_call_templ_2fns.cpp, of two different functions that call each other recursively and take different arguments. It uses the trampoline function and interfaces listed above, along with Answer3 and FnPlusArgs3 class templates similar to the Answer2 and FnPlusArgs2 shown above. Implementing the N-args case using variadic templates (C++11) or template meta-programming is left as an exercise for the reader.
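For readers who find the templates heavy going, the shape of the trampoline can be sketched in a few lines of Python, which (at least in the standard CPython interpreter) does not optimise tail calls either. This is only an illustration of the idea, not a translation of the C++ above: the names Done, Call and trampoline are made up for the sketch, and a tuple of arguments stands in for the AnswerN and FnPlusArgsN class templates.

```python
class Done:
    """A final answer (plays the role of newAns)."""
    def __init__(self, value):
        self.value = value

class Call:
    """A function plus its arguments, to be called later (like newFn)."""
    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args

def trampoline(answer):
    # Make the requested calls in a loop, so the stack never grows
    while isinstance(answer, Call):
        answer = answer.fn(*answer.args)
    return answer.value

def times_two_impl(acc, i):
    if i == 0:
        return Done(acc)
    return Call(times_two_impl, acc + 2, i - 1)

def times_two(n):
    return trampoline(Call(times_two_impl, 0, n))

# Two functions with different argument lists that call each other
def is_even(n):
    return Done(True) if n == 0 else Call(is_odd, n - 1, "unused label")

def is_odd(n, label):
    return Done(False) if n == 0 else Call(is_even, n - 1)

print(times_two(100000))                  # 200000
print(trampoline(Call(is_even, 100001)))  # False
```

Both calls go far deeper than Python’s default recursion limit would allow with ordinary recursion, which is the point of the trampoline. Every step allocates a new object, which matches the dynamic memory cost discussed for the C++ version.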

This more realistic case where the number and types of arguments are not known beforehand forces us to use dynamic memory to store our AnswerN objects, and causes more pointer dereferences and virtual function calls, and these do hurt performance. In tests on my machine, this code ran approximately 10 times slower than the version using only stack memory. Perhaps we C++ programmers should comfort ourselves that many languages supporting tail-call optimisation require dynamic memory, virtual functions and pointer indirection to do absolutely everything.