Avoid backslashes anywhere in Java code (Java error “illegal unicode escape”)

Did you know you can insert unicode escapes anywhere in a Java program?

Most of us are familiar with using unicode escapes like this:

String pound = "\u00A3";

but in fact constructs like \u00A3 can go anywhere, including in a comment, because the compiler translates them before it even looks for comments or string literals.

This is all fine so long as they’re valid, but what if you’re generating Java code without due care and attention?

And what if you’re inserting file paths into your generated code?

And what if one of your directories has a name starting with a “u”?

Then you get code like this (edited example from our real project!):

// DO NOT EDIT!  This file was generated from:
// C:\usr\foo.xml

And, only on the machine with the dir called “usr”, we got a compile error like this:

MyClass.java:14: illegal unicode escape
// C:\usr\foo.xml
       ^
1 error

Which took me a while to track down.

Reference: JLS-3.3 Unicode Escapes.

Note this only applies to unicode escapes, not others like \n or \t – they are only processed within character or string literals (JLS-3.10.6).
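One way to guard generated code against this, sketched here on the assumption that the path only ever appears in a comment, is to swap the backslashes for forward slashes before writing it out. The names GeneratedHeader, commentSafePath and header are made up for illustration and are not from our real project.

```java
class GeneratedHeader
{
    // Replace every backslash with a forward slash, so the generated
    // comment can never contain a backslash followed by the letter u.
    static String commentSafePath( String path )
    {
        return path.replace( '\\', '/' );
    }

    // Build the "DO NOT EDIT" header for a generated file.
    static String header( String sourcePath )
    {
        return "// DO NOT EDIT!  This file was generated from:\n"
            + "// " + commentSafePath( sourcePath ) + "\n";
    }

    public static void main( String[] args )
    {
        // Inside a string literal the backslashes are doubled, which
        // also stops them being read as the start of a unicode escape.
        System.out.print( header( "C:\\usr\\foo.xml" ) );
    }
}
```

With the path from the example above, the second line of the header comes out as "// C:/usr/foo.xml", which compiles on any machine.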

Behaviour of Java String.split() when some answers are the empty string

Can you guess the output of this program?

class SplitTest
{
    static void split( String s )
    {
        System.out.println( s.split( ";" ).length );
    }

    public static void main( String[] args )
    {
        split("");
        split(";");
        split("x;");
        split(";y");
    }
}

Here it is:

$ javac SplitTest.java && java SplitTest
1
0
1
2

Wow. Docs: String.split.
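The surprising counts come from the one-argument split dropping trailing empty strings. If you want every piece kept, a minimal sketch is to use the two-argument form with a negative limit. The class name SplitLimitTest and the helper count are made up for this example.

```java
class SplitLimitTest
{
    static int count( String s )
    {
        // A negative limit tells split to keep trailing empty strings
        return s.split( ";", -1 ).length;
    }

    public static void main( String[] args )
    {
        System.out.println( count( "" ) );
        System.out.println( count( ";" ) );
        System.out.println( count( "x;" ) );
        System.out.println( count( ";y" ) );
    }
}
```

This prints 1, 2, 2 and 2. The empty string still gives one (empty) answer, but each separator now adds exactly one more.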

Side note: It’s not easy to fit Java examples into tweets.

Checking the case of a filename on Windows

Windows generally uses a case-insensitive but case-preserving file system.

When writing some code that is intended to be used on Linux as well as Windows, I wanted it to fail on Windows in the same cases that it would fail on Linux, and this meant detecting when the case of a filename differed from its canonical case on the file system.

I want to ask “is this file name correct in terms of case?”

I was working in Java, but I think this issue would be similar in other languages: it’s difficult to ask for the canonical-case version of a file name when all we have is a file name with arbitrary case.

The only solution I came up with was to list the contents of the parent directory and check whether my arbitrary filename is listed with the correct case in the results:

// CaseCheck.java

import java.util.Arrays;
import java.io.File;
import java.io.IOException;

class CaseCheck
{
    private static File parentFile( File f )
    {
        File ret = f.getParentFile();
        if ( ret == null )
        {
            ret = new File( "." );
        }
        return ret;
    }

    private static boolean existsAndCaseCorrect( String fileName )
    {
        File f = new File( fileName );
        return Arrays.asList( parentFile( f ).list() ).contains( f.getName() );
    }

    public static void main( String[] args ) throws IOException
    {
        System.out.println( existsAndCaseCorrect( args[0] ) );
    }
}

Checking it on its own source file:

javac CaseCheck.java && java CaseCheck cASEcheck.java
false

javac CaseCheck.java && java CaseCheck CaseCheck.java
true

It seems to work.

Note that this also returns false if the file doesn’t exist, and will throw a NullPointerException if the file name specifies a parent directory that doesn’t exist, because File.list() returns null in that case.
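If you would rather get false than an exception when the parent directory is missing, here is a sketch of a variant that checks the result of list() before using it. It is the same directory-listing idea, and the class name SafeCaseCheck is made up for this example.

```java
import java.io.File;
import java.util.Arrays;

class SafeCaseCheck
{
    static boolean existsAndCaseCorrect( File f )
    {
        File parent = f.getParentFile();
        if ( parent == null )
        {
            // No parent in the path: look in the current directory
            parent = new File( "." );
        }

        // list() returns null if parent is not a directory we can read
        String[] names = parent.list();
        return names != null && Arrays.asList( names ).contains( f.getName() );
    }

    public static void main( String[] args )
    {
        System.out.println( existsAndCaseCorrect( new File( args[0] ) ) );
    }
}
```

It is run in the same way as CaseCheck above, with the file name as the only argument.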

Goodness in programming languages, part 4 – Ownership & Memory

Posts in this series: Syntax, Deployment, Metaprogramming, Ownership

There is often a trade-off between programming language features and how fast (and predictably) the programs run. From web sites that serve millions of visitors to programs running on small devices we need to be able to make our programs run quickly.

One trade-off that is made in many modern programming languages (including Python, Ruby, C#, Java and JVM-based languages) is that the system owns all the memory. This avoids the need for the programmer to think about how long pieces of memory need to live, but it means a lot of memory can hang around much longer than it really needs to. In addition, it can mean the CPU has to jump around to lots of different locations to find pieces of dynamically-allocated memory. Where this jumping around causes cache misses, it can really slow things down.

While these garbage collection-based languages have been evolving, C++ has been developing along a different track. C++ allows the programmer to allocate and free up memory manually (as in C), but over time the community of C++ programmers has been developing a new way of thinking about memory, and developing tools in the C++ language to make it easier to work in this way.

Modern C++ code rarely or never uses “delete” or “free” to deallocate memory, but instead defines clearly which object owns each other object. When the owning object is no longer needed, everything it owns can be deleted, immediately freeing their memory. The top-level objects are owned by the current scope, so when the function or block of code we are in ends, the system knows these objects and the ones they own can be deleted. Objects that last for the whole life of the program are owned by the scope of the main function or equivalent.

One advantage of explicit ownership is that the right thing happens automatically when something unexpected happens (e.g. an exception is thrown, or we return early from a function). Because the objects are owned by a scope, as soon as we exit that scope they are automatically deleted, and no memory is “leaked”.

Because ownership is explicit, we can often group owned objects in memory immediately next to the objects that own them. This means we jump around to different memory locations less often, and we have to do less work to find and delete regions of memory. This makes our programs faster.

Here are some things I like:

  • Modern C++’s clarity about who owns what. By expressing ownership explicitly we make clear our intentions, and avoid memory leaks.
  • Modern C++’s fast and cache-friendly memory handling. Allocating memory for several objects together reduces time spent looking for space, and means the data we need is more likely to be in the cache already.

In my experience, the most frequent performance problems I have had to solve have really been memory problems. Explicit ownership can reduce unnecessary memory management overhead by taking back the work from the system (the garbage collector) and allowing programmers to be explicit about who owns what.

setUp and tearDown considered harmful

Some unit test frameworks provide methods (often called setUp and tearDown, or annotated with @Before and @After) that are called automatically before a unit test executes, and afterwards.

This structure is presumably intended to avoid repetition of code that is identical in all the tests within one test file.

I have always instinctively avoided using these methods, and when a colleague used them recently I thought I should try to write up why I feel negative about them. Here it is:

Update: note I am talking about unit tests here. I know these frameworks can be used for other types of test, and maybe in that context these methods could be useful.

1. It’s action at a distance

setUp and tearDown are called automatically, with no indication in your code that you use them, or don’t use them. They are “magic”, and everyone hates magic.

If someone is reading your test (because it broke, probably) they don’t know whether some setUp will be called without manually scanning your code to find out whether it exists. Do you hate them?

2. setUp contains useless stuff

How many tests do you have in one file? When you first write it, maybe, just maybe, all the tests need the exact same setup. Later, you’ll write new tests that only use part of it.

Very soon, you grow an uber-setUp that does all the setup for various different tests, creating objects you don’t need. This adds complexity for everyone who has to read your tests – they don’t know which bits of setUp are used in this test, and which are cruft for something else.

3. They require member variables

The only useful work you can do inside setUp and tearDown is creating and modifying member variables.

Now your tests aren’t self-contained – they use these member variables, and you must make absolutely sure that your test works no matter what state they are in. These member variables are not useful for anything else – they are purely an artifact of the choice to use setUp and tearDown.

4. A named function is better

When you have setup code to share, write a function or method. Give it a name, make it return the thing it creates. By giving it a name you make your test easier to read. By returning what it creates, you avoid the use of member variables. By avoiding the magic setUp method, you give yourself the option of calling more than one setup function, making code re-use more granular (if you want).
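Here is a rough sketch of what I mean. So that it runs without any test framework, the tests are plain static methods called from main, where in a real project they would be test methods in your framework of choice. The names BasketTest, basketWithTwoItems and check are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class BasketTest
{
    // Named setup function: it returns what it creates, so the tests
    // need no member variables.
    static List<String> basketWithTwoItems()
    {
        List<String> basket = new ArrayList<>();
        basket.add( "apple" );
        basket.add( "banana" );
        return basket;
    }

    static void addingAnItemGrowsTheBasket()
    {
        // The setup this test relies on is visible right here
        List<String> basket = basketWithTwoItems();
        basket.add( "cherry" );
        check( basket.size() == 3 );
    }

    static void newBasketIsEmpty()
    {
        // This test needs no shared setup, so it calls none
        check( new ArrayList<String>().isEmpty() );
    }

    // Stand-in for a framework's assertion methods
    static void check( boolean condition )
    {
        if ( !condition )
        {
            throw new AssertionError( "test failed" );
        }
    }

    public static void main( String[] args )
    {
        addingAnItemGrowsTheBasket();
        newBasketIsEmpty();
        System.out.println( "All tests passed" );
    }
}
```

Anyone reading addingAnItemGrowsTheBasket can see exactly what it starts with, and newBasketIsEmpty pays for no setup it doesn’t use.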

5. What goes in tearDown?

If you’re using tearDown, what are you doing?

Are you tearing down some global state? I thought this was a unit test?

Are you ensuring nothing is left in an unpredictable state for future tests? Surely those tests guarantee their state at the start?

What possible use is there for this function?

Conclusion

A unit test should be a self-contained pure function, with no dependencies on other state. setUp and tearDown force you to depend on member variables of your test class, for no benefits over named functions, except that you don’t have to type their names. I consider typing the name of a properly-named function to be a benefit.