Java game programming: image rendering hints make no difference to rendering time

I am writing an Open Source/Free Software Java desktop game (Rabbit Escape) and am experimenting with allowing you to zoom in so that the images being rendered are fairly big, and I’m seeing some slow-down.

[Note: Rabbit Escape is also available on Android, where it appears to have no problems with rendering speed…]

To speed it up, I have been experimenting with passing various values to Graphics2D’s setRenderingHint method.

My conclusion is that on my platform (OpenJDK Java 7 on Linux) it makes no difference whatsoever. Please leave a comment if I’m doing it wrong.

I am timing how long each frame takes to render, and comparing the time when I insert this code before my rendering code:

// As slow and high-quality as possible, please!
g.setRenderingHint( RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON );
g.setRenderingHint( RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_QUALITY );
g.setRenderingHint( RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_QUALITY );
g.setRenderingHint( RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_QUALITY );

with the time when I insert this instead:

// As fast and low-quality as possible, please!
g.setRenderingHint( RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_OFF );
g.setRenderingHint( RenderingHints.KEY_ALPHA_INTERPOLATION, RenderingHints.VALUE_ALPHA_INTERPOLATION_SPEED );
g.setRenderingHint( RenderingHints.KEY_COLOR_RENDERING, RenderingHints.VALUE_COLOR_RENDER_SPEED );
g.setRenderingHint( RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_SPEED );

[Documentation, such as it is, is here: RenderingHints.]

I see no appreciable difference in rendering time or image quality between these two setups, or when I leave out the calls to setRenderingHint completely.

Platform details:

$ java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Tail Call Optimisation in C++ – lightning talk video

You can watch the Tail Call Optimisation in C++ lightning talk video, which I gave at the ACCU 2012 conference in April.

You can also read the (clearer and more correct) writeup I did later: Tail Call Optimisation in C++ or the subsequent article published in Overload 109.

Generalising tail call optimised C++

This series: Lightning talk, Explanation, Performance, Generalisation.

In previous posts I discussed the construction of some C++ that does the same job as the tail call optimisation in some other languages. The example code shown covered the case where we know that every function in the recursion takes two long integer parameters, and returns a long as well.

In fact, it is perfectly possible to generalise this code to cover more complex cases. Fortunately, the trampoline function doesn’t need to know anything about the arguments taken by the functions being called, only their return type. It looks like this:

#include <memory>  // for std::auto_ptr, which carries each answer

// Keep making the next tail call until an answer reports that it is
// finished, then return its value.
template<typename RetT>
const RetT trampoline_templ(
    std::auto_ptr< IAnswer<RetT> > answer )
{
    while( !answer->finished() )
    {
        answer = answer->tail_call()();
    }
    return answer->value();
}

So we can call this with the type of the return value as our template parameter, and supply an object which satisfies the IAnswer<RetT> interface:

template<typename RetT>
class IAnswer
{
public:
    // Answers are deleted through this interface (via auto_ptr), so a
    // virtual destructor is needed for that to be well-defined.
    virtual ~IAnswer() {}

    virtual const bool             finished()  const = 0;
    virtual const ICallable<RetT>& tail_call() const = 0;
    virtual const RetT             value()     const = 0;
};

where ICallable looks like this:

template<typename RetT>
class ICallable
{
public:
    typedef std::auto_ptr< IAnswer<RetT> > AnswerPtr;
    virtual AnswerPtr operator()() const = 0;
};

The concrete classes that implement these interfaces need to know the types (and number) of the arguments, but that’s ok because they only get instantiated by code that would otherwise (in the standard, non-tail-call recursion case) call the functions themselves. In our toy example, which multiplies a number by two through repeated addition, the outer function looks like this:

const long times_two_tail_call_templ( const long n )
{
    typedef Answer2<long, long, long> AnswerType;

    return trampoline_templ(
        AnswerType::newFn(
            times_two_tail_call_impl, 0, n, 0 )
    );
}

and the inner one looks like this:

std::auto_ptr< IAnswer<long> > times_two_tail_call_impl(
    const long acc, const long i )
{
    typedef Answer2<long, long, long> AnswerType;

    if( i == 0 )
    {
        return AnswerType::newAns( acc );
    }
    else
    {
        return AnswerType::newFn(
            times_two_tail_call_impl,
            acc + 2, i - 1, 0 );
    }
}

Both of the above use static convenience methods newFn and newAns that I have defined on Answer2 to create smart pointers to newly-allocated Answer2 objects. newAns creates an Answer2 that contains a final answer, and newFn creates an Answer2 specifying another function (with arguments) to call.

Answer2 looks like this:

template<typename RetT, typename Arg1T, typename Arg2T>
class Answer2 : public IAnswer<RetT>
{
private:
    typedef FnPlusArgs2<RetT, Arg1T, Arg2T> FnArgs;
    typedef std::auto_ptr< IAnswer<RetT> > AnswerPtr;

    const bool finished_;
    const FnArgs tail_call_;
    const RetT value_;

private:
    Answer2( const bool finished, const FnArgs tail_call, const RetT value )
    : finished_( finished )
    , tail_call_( tail_call )
    , value_( value )
    {
    }

    static AnswerPtr newPtr(
        const bool finished, const FnArgs tail_call, const RetT value )
    {
        return AnswerPtr( new Answer2<RetT, Arg1T, Arg2T>(
            finished, tail_call, value ) );
    }
public:
    static AnswerPtr newFn(
        const typename FnArgs::fn_type fn,
        const Arg1T arg1,
        const Arg2T arg2,
        const RetT zero_val )
    {
        return newPtr( false, FnArgs( fn, arg1, arg2 ), zero_val );
    }

    static AnswerPtr newAns( const RetT value )
    {
        return newPtr( true, FnArgs::null(), value );
    }

    virtual const bool    finished()  const { return finished_; };
    virtual const FnArgs& tail_call() const { return tail_call_; };
    virtual const RetT    value()     const { return value_; };
};

and uses FnPlusArgs2, which looks like this:

template<typename RetT, typename Arg1T, typename Arg2T>
class FnPlusArgs2 : public ICallable<RetT>
{
private:
    typedef typename ICallable<RetT>::AnswerPtr AnswerPtr;
public:
    typedef AnswerPtr (*fn_type)( const Arg1T, const Arg2T );
private:
    const fn_type fn_;
    const Arg1T arg1_;
    const Arg2T arg2_;

public:
    FnPlusArgs2( const fn_type fn, const Arg1T arg1, const Arg2T arg2 )
    : fn_( fn )
    , arg1_( arg1 )
    , arg2_( arg2 )
    {
    }

    virtual AnswerPtr operator()() const
    {
        return fn_( arg1_, arg2_ );
    }

    static FnPlusArgs2<RetT, Arg1T, Arg2T> null()
    {
        return FnPlusArgs2<RetT, Arg1T, Arg2T>( NULL, 0, 0 );
    }
};

I have stuck with the example of two long arguments and a long return value here, but with the above code it is possible to construct recursive code using more than one function, and the functions can have different numbers and types of arguments, so long as they all co-operate to produce an answer of the required type. The source code for this article includes an example, in the file tail_call_templ_2fns.cpp, of two different functions that call each other recursively with different arguments, using the trampoline function and interfaces listed above, plus Answer3 and FnPlusArgs3 class templates similar to the Answer2 and FnPlusArgs2 shown above. Implementing the N-args case using variadic templates (C++11) or template meta-programming is left as an exercise for the reader.
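
To give the flavour of the two-function case without reproducing tail_call_templ_2fns.cpp here, below is a simplified sketch of my own: an even/odd pair (the names is_even_impl and is_odd_impl are invented for this post), where, unlike the real example, both functions keep the same two-long-argument shape so they can reuse the Answer2 machinery above:

// Sketch only: invented functions, not from the article's source code.
std::auto_ptr< IAnswer<long> > is_odd_impl( const long n, const long ignored );

std::auto_ptr< IAnswer<long> > is_even_impl( const long n, const long /*unused*/ )
{
    typedef Answer2<long, long, long> AnswerType;

    if( n == 0 )
    {
        return AnswerType::newAns( 1 );  // 0 is even, so yes
    }
    else
    {
        return AnswerType::newFn( is_odd_impl, n - 1, 0, 0 );
    }
}

std::auto_ptr< IAnswer<long> > is_odd_impl( const long n, const long /*unused*/ )
{
    typedef Answer2<long, long, long> AnswerType;

    if( n == 0 )
    {
        return AnswerType::newAns( 0 );  // 0 is not odd, so no
    }
    else
    {
        return AnswerType::newFn( is_even_impl, n - 1, 0, 0 );
    }
}

const long is_even( const long n )
{
    return trampoline_templ(
        Answer2<long, long, long>::newFn( is_even_impl, n, 0, 0 ) );
}

The trampoline neither knows nor cares that control is bouncing between two different functions: each Answer2 just names the next function to call, and the stack never grows.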

This more realistic case where the number and types of arguments are not known beforehand forces us to use dynamic memory to store our AnswerN objects, and causes more pointer dereferences and virtual function calls, and these do hurt performance. In tests on my machine, this code ran approximately 10 times slower than the version using only stack memory. Perhaps we C++ programmers should comfort ourselves that many languages supporting tail-call optimisation require dynamic memory, virtual functions and pointer indirection to do absolutely everything.

Don’t design for performance until it’s too late

There is a piece of ancient wisdom which states:

Premature optimisation is the root of all evil

This ancient wisdom is, like all ancient wisdom, correct.

However.

It appears to have been reinterpreted as essentially meaning:

Don’t design for performance until it’s too late

which is clearly, and very importantly, very wrong.

Performance is a feature

Before I begin I want us all to agree that performance is a feature.

I work on a real-life "enterprise" application. Its features are entirely driven by the need for immediate cash, not by developers following pipe dreams. And yet, for the last 6-12 months the majority of my time has been spent trying to retrofit performance into this application. Believe me, this is not because we have users who obsess over every wasted second – it’s because our performance sucks so hard it’s deeply embarrassing.

What is your favourite program? How well does it perform? What is your least favourite? Why?

For me, and for many other people, the answers to those questions demonstrate the importance of performance. Firefox was launched to improve the performance of Mozilla. People love git because of how fast it is. Lotus Notes is hated so much partly because of its performance. My main complaints about programs I use involve performance (e.g. Thunderbird is too slow for IMAP email).

A fast response to the user is one of those crucial inches on the journey to software that makes people happy. Making people happy gives you the kind of scary fanboyism that surrounds git. Wouldn’t you like that for your product?

What is optimisation?

When my hero said that premature optimisation was the root of all evil, he was talking in the days when you had to hand-optimise your C in assembly language. Or, more likely in his case, you had to hand-optimise your assembly language into faster assembly language. Optimisation like that very often obfuscates your code.

These days, 99% of the time, your compiler does all of this work for you, so you can have relatively comprehensible code in whatever trendy language you like, and still have the fastest possible local implementation of that in machine code.

Meanwhile, Knuth knew that 99% of your code is simply not performance-critical – it only runs a few times, or it’s just so much faster than some other bit that it doesn’t matter. The lesson we all learn eventually is that the slow bit is never quite what you thought, and you have to measure to find out where to concentrate your effort.

So, if optimisation is obfuscation, and 99% of your code isn’t the bit you need to make faster, it becomes clear that premature optimisation is the root of much evil.

But optimisation is not designing for performance.

Design for performance

Fundamentally, to get good performance, you are going to need to measure the time spent in various parts of your code (I suggest Very Sleepy if you’re on Windows) and make the slow bits faster (or happen less often). However, there are still some principles you can follow that will mean you need to spend less time doing this*.

*Which is a pity, because I really love doing it.
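
For what it’s worth, the crudest possible measurement is only a few lines (a sketch; a profiler such as Very Sleepy gives you this per function without editing your code at all):

#include <ctime>
#include <iostream>

// Crude manual timing around a suspected slow bit (here a stand-in loop).
int main()
{
    std::clock_t start = std::clock();

    long total = 0;
    for( long i = 0; i < 100000000; ++i )  // stand-in for the slow bit
    {
        total += i;
    }

    double seconds = double( std::clock() - start ) / CLOCKS_PER_SEC;
    std::cout << "total " << total << " took " << seconds << "s\n";
}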

If you don’t design for performance you are almost certainly going to need to restructure large parts of your program later, which is very difficult and time-consuming.

There are two aspects to designing for performance: writing good local code, and creating good global structure.

Write good local code

Before you write an algorithm, think for a few minutes about how to make it work efficiently. e.g. if you’re writing C++, consider whether a deque or a list would be better than a vector for how you’re going to use it.
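
Here is a sketch of the sort of difference I mean (an invented illustration, not code from a real program): inserting at the front of a vector shifts every existing element along each time, where a deque does not.

#include <deque>
#include <vector>

int main()
{
    // vector: inserting at the front is O(n) per insert, because every
    // existing element has to shuffle along to make room.
    std::vector<int> v;
    for( int i = 0; i < 100000; ++i )
    {
        v.insert( v.begin(), i );
    }

    // deque: inserting at the front is O(1) per insert.
    std::deque<int> d;
    for( int i = 0; i < 100000; ++i )
    {
        d.push_front( i );
    }
}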

Think about what is idiomatic for your language and why. Think about what the computer really has to do to produce the results you are asking for. Are there going to be a lot of objects about? Maybe you can avoid copying them too many times. Are you creating and deleting a lot of objects? Can you reuse some instead? (Exercise caution with that one, though – if you start obfuscating you come into conflict with the ancient wisdom.)
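
As one sketch of the reuse idea (invented code, and note the caution above): hoist a buffer out of a loop so it is allocated once and recycled, rather than constructed and destroyed on every iteration.

#include <string>
#include <vector>

void greet_all( const std::vector<std::string>& names )
{
    std::string line;  // one buffer, created once outside the loop
    for( size_t i = 0; i != names.size(); ++i )
    {
        line.clear();  // in practice this keeps the allocated memory
        line += "Hello, ";
        line += names[i];
        // ... write line somewhere ...
    }
}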

Often, if you think through what you are doing, and the most efficient way to do it, you will end up with a faster and more memory-efficient algorithm, that expresses your intention better than if you’d written the first thing that came into your head. There is no downside to that.

Try to minimise the number of times you need to ask the operating system for a chunk of memory: this is surprisingly slow. E.g. in C++, prefer creating by-value data members instead of pointers to objects allocated with their own call to new.
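
A small sketch of the difference (Widget and its members are invented for illustration):

#include <string>

// By-value member: the string lives inside the object, so constructing
// a Widget asks the allocator for nothing extra on its behalf.
class Widget
{
    std::string name_;
};

// Pointer member: every SlowWidget pays for a separate new (and later
// delete), plus a dereference on each access.
class SlowWidget
{
    std::string* name_;
public:
    SlowWidget() : name_( new std::string() ) {}
    ~SlowWidget() { delete name_; }
private:
    SlowWidget( const SlowWidget& );             // copying disabled
    SlowWidget& operator=( const SlowWidget& );  // (C++03 style)
};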

By the way, don’t worry if this sounds intimidating. The way to learn this stuff is to measure what you have done and then work out why it is slow. Next time you’ll jump straight to the fast solution without the detour to the slow one.

Of course, none of this will matter if you don’t have good global structure.

Create good global structure

The hardest and most important work you need to do to have good performance is to have good structure in the ways the different parts of your program interact.

This means thinking about how classes and components communicate with and control each other.

It may be helpful to use a streaming style of communication – can you send little chunks of information to be processed one by one instead of a huge great blob?
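
For example (a sketch with an arbitrary made-up task): reading a line at a time keeps memory use constant however large the input grows, where slurping the whole file into one string does not.

#include <iostream>
#include <string>

void count_long_lines( std::istream& in )
{
    long count = 0;
    std::string line;
    while( std::getline( in, line ) )  // one little chunk at a time
    {
        if( line.size() > 80 )
        {
            ++count;
        }
    }
    std::cout << count << " long lines\n";
}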

Try to make sure your components use a common infrastructure: if different parts use different string classes you are going to spend most of your time copying from one to the other.

The hardest and deepest mystery in getting good performance (and in programming generally) is choosing the right fundamental data structures. I’ll never forget the lesson I learnt** when a friend of mine had a conversation with me about a toy project I was doing (that was particularly focussed on trying to be fast) and then went away and produced code that was orders of magnitude faster, simply because he had chosen the right data structure.

**The lesson I learnt was that I am not as good as I think I am.

To be honest this section is a little shorter than I’d like because I know I don’t have a lot of answers about how to do this well. I do know, though, that if you don’t think about it now you will have the pain of restructuring your program later, when it’s full of bug fixes that are going to get rebroken by the restructuring.

Of course, if you do think about it now you’re still pretty likely to need to change it later…

Ancient wisdom

Ancient wisdom is usually right, but misinterpreting it and using it as a license to write bad code is a bad idea.

Carry on.