Checking the case of a filename on Windows

Windows generally uses a case-insensitive but not case-preserving file system.

When writing some code that is intended to be used on Linux as well as Windows, I wanted it to fail on Windows in the same cases that it would fail on Linux, and this meant detecting when the case of a filename differed from its canonical case on the file system.

I want to ask “is this file name correct in terms of case?”

I was working in Java, but I think this issue would be similar in other languages: it’s difficult to ask for the canonical case version of a file name when we currently have a filename with abitrary case.

The only solution I came up with was to list the contents of the parent directory and check whether my arbitrary filename is listed with the correct case in the results:

// CaseCheck.java

import java.util.Arrays;
import java.io.File;
import java.io.IOException;

class CaseCheck
{
    private static File parentFile( File f )
    {
        File ret = f.getParentFile();
        if ( ret == null )
        {
            ret = new File( "." );
        }
        return ret;
    }

    private static boolean existsAndCaseCorrect( String fileName )
    {
        File f = new File( fileName );
        return Arrays.asList( parentFile( f ).list() ).contains( f.getName() );
    }

    public static void main( String[] args ) throws IOException
    {
        System.out.println( existsAndCaseCorrect( args[0] ) );
    }
}

Checking it on its own source file:

javac CaseCheck.java && java CaseCheck cASEcheck.java
false

javac CaseCheck.java && java CaseCheck CaseCheck.java
true

It seems to work.

Note that this also returns false if the file doesn’t exist, and will throw an error if the file name specifies a parent directory that doesn’t exist.

Passing several values through a pipe in bash

I have been fiddling with some git-related shell scripts, and decided to try and follow the same approach as git in their structure. This means using the Unix system where each piece of functionality is a separate script (or executable) that communicates by using command-line arguments, reading from the standard input stream, and writing output to the standard output stream.

This allows each piece of functionality to be written in any programming or scripting language. In git’s case this has allowed initial versions to be written in bash or perl and later optimised versions (sometimes written in C) to be dropped in, piece by piece. It’s an incredibily flexible way of working and can also be very efficient.

Most of my prototyping has been in bash, and I’ve found sometimes I need to write out multiple values from a script and collect them as input in another script.

Writing the output is simple:

#!/bin/bash

# outputter.bash

# Imagine A, B and C have been created by some complex process:
A="foo bar"
B="  bar"
C="baz   "

# At the end of our script we simply write them out on separate lines in a known order
echo "${A}"
echo "${B}"
echo "${C}"

But reading them in somewhere else gave me some trouble until I learned this recipe:

#!/bin/bash

# inputter.bash

# Read in the values one per line:
IFS=$'\n' read A
IFS=$'\n' read B
IFS=$'\n' read C

# Now we can use them.
echo "A='${A}'"
echo "B='${B}'"
echo "C='${C}'"

And now the values transfer succesfully, preserving whitespace:

$ ./outputter.bash | ./inputter.bash 
A='foo bar'
B='  bar'
C='baz   '

The recipe uses bash’s built-in read command to populate the variables, but sets the IFS variable (Internal Field Separator) to a newline, meaning all the whitespace in the line is treated as part of the value to be read. The $'\n' syntax is a literal newline.

Goodness in programming languages, part 4 – Ownership & Memory

Posts in this series: Syntax, Deployment, Metaprogramming, Ownership

There is often a trade-off between programming language features and how fast (and predictably) the programs run. From web sites that serve millions of visitors to programs running on small devices we need to be able to make our programs run quickly.

One trade-off that is made in many modern programming languages (including Python, Ruby, C#, Java and JVM-based languages) is that the system owns all the memory. This avoids the need for the programmer to think about how long pieces of memory need to live, but it means a lot of memory can hang around a lot longer than it really needs to. In addition, it can mean the CPU has to jump around to lots of different memory locations to find pieces of dynamically-allocated memory in different locations. Where this jumping around causes caches to be invalidated that can really slow things down.

While these garbage collection-based languages have been evolving, C++ has been developing along a different track. C++ allows the programmer to allocate and free up memory manually (as in C), but over time the community of C++ programmers has been developing a new way of thinking about memory, and developing tools in the C++ language to make it easier to work in this way.

Modern C++ code rarely or never uses “delete” or “free” to deallocate memory, but instead defines clearly which object owns each other object. When the owning object is no longer needed, everything it owns can be deleted, immediately freeing their memory. The top-level objects are owned by the current scope, so when the function or block of code we are in ends, the system knows these objects and the ones they own can be deleted. Objects that last for the whole life of the program are owned by the scope of the main function or equivalent.

One advantage of explicit ownership is that the right thing happens automatically when something unexpected happens (e.g. an exception is thrown, or we return early from a function). Because the objects are owned by a scope, as soon as we exit that scope they are automatically deleted, and no memory is “leaked”.

Because ownership is explicit, we can often group owned objects in memory immediately next to the objects that own them. This means we jump around to different memory locations less often, and we have to do less work to find and delete regions of memory. This makes our programs faster.

Here are some things I like:

  • Modern C++’s clarity about who owns what. By expressing ownership explicitly we make clear our intentions, and avoid memory leaks.
  • Modern C++’s fast and cache-friendly memory handling. Allocating memory for several objects together reduces time spent looking for space, and means caches are more likely to be used.

In my experience, the most frequent performance problems I have had to solve have really been memory problems. Explicit ownership can reduce unnecessary memory management overhead by taking back the work from the system (the garbage collector) and allowing programmers to be explicit about who owns what.

How to use git (the basics)

Series: Why git?, Basics, Branches, Merging, Remotes

Git is a very powerful tool, but somewhat intimidating at first. I will be making some videos working through how to use it step by step.

First, we look at how to track your own code on your own computer, and then get a brief look at a killer feature: stash, which lets you pause what you were doing and come back to it later.

Slides: How to use git (the basics) slides.