Diffident – command line side-by-side diff editor

I really like Beyond Compare, which is a proprietary diff program with all those little touches that make it a joy to use*. The way I write code at work generally involves a bit of hacking in jEdit, checking the code myself, and then reviewing the code with a colleague.

*Recently, though, they’ve brought out a newer version that seems to overcomplicate things.

Both my own checking and the code review with a colleague generally involve comparing the code with the previous version in the (Perforce) repository. Beyond Compare integrates nicely with the Perforce tools and allows you to see a change as diffs of each file involved.

None of this is anything new to the Free Software world, of course. All the version control programs I’ve used allow you to do the equivalent of cvs diff and see what changes you’ve made. Git has a particularly good git diff mode which by default colours your diff and pipes it to less, making it easy to read and use.

What I have found myself missing recently, though, is the ability to edit the files as you diff them. The whole point of reviewing what you’ve done is to make changes as they occur to you, and with Perforce + Beyond Compare it is really natural to make those changes within the diff tool.

Incidentally, I also really like the side-by-side style of Beyond Compare. By default it inserts “missing” lines so that all the similar lines are aligned, rather than trying to indicate with balloon-like lines that code has been inserted or transformed. I find those balloons very annoying and confusing (and they take up space on my screen, grr).

Having a side-by-side view also makes copying lines from one side to the other much easier. I often copy from one side to the other – especially when I realise what I have done is stupid and I want to revert a section back to how it was.

So, to curtail an already-long story, I decided there was yet another area of my life where the only solution was to run software I had written, rather than relying on the shoddy stuff put out by others, so I wrote a diff tool.

Diffident is a diff tool inspired by Beyond Compare and git. It shows you a side-by-side diff of your files in a terminal (one day it may have a GTK+ interface too) and allows you to edit them. The editing part is in development – expect a release fairly soon, or get the code from the git repository. The output is coloured, and you can see the whole of both files, using keyboard shortcuts to jump to the next and previous differences.
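The aligned side-by-side layout (padding with “missing” lines so that similar lines sit next to each other) is easy to sketch with Python’s standard difflib module. This is only an illustration of the idea, not Diffident’s actual code; the names align and render are made up for the example.

```python
import difflib


def align(left, right):
    """Pair up the lines of two files for side-by-side display.

    Returns a list of (changed, left_line, right_line) tuples. A line
    that is missing on one side is represented by None.
    """
    rows = []
    matcher = difflib.SequenceMatcher(a=left, b=right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left_chunk = left[i1:i2]
        right_chunk = right[j1:j2]
        # Pad the shorter chunk so both sides take up the same number of rows.
        height = max(len(left_chunk), len(right_chunk))
        left_chunk += [None] * (height - len(left_chunk))
        right_chunk += [None] * (height - len(right_chunk))
        for lhs, rhs in zip(left_chunk, right_chunk):
            rows.append((tag != "equal", lhs, rhs))
    return rows


def render(rows, width=30):
    """Format aligned rows as text, marking changed rows with '*'."""
    out = []
    for changed, lhs, rhs in rows:
        marker = "*" if changed else " "
        out.append("%s %-*s | %s" % (marker, width, lhs or "", rhs or ""))
    return "\n".join(out)
```

Calling render(align(old_lines, new_lines)) on two lists of lines gives one row per aligned pair, with a blank on whichever side has no corresponding line.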

You might ask “Why not just use Beyond Compare?” For a number of reasons:

  1. It is not Free Software. I can’t improve it or trust it.
  2. It’s sort of Windows-y. I know there is a Linux version, but I bet it’s not very Linux-y. (Disclaimer: I’ve never tried it ;)
  3. It doesn’t work in a terminal.
  4. The inline editing support is not great. Its real strength is copying from side to side.
  5. It doesn’t feel right when used with git. I have got this set up in my Cygwin environment – it works, but it’s no fun.
  6. It isn’t written by me.

I’d really like Diffident to become the de facto diff tool for git people (and everyone else). That proves to be a bit trickier than I’d like because of the way git interacts with diff tools, but I’ve got a decent solution for using it as the GIT_EXTERNAL_DIFF tool, and I hope to improve it in the future. (For those who are interested, the kind of thing I’m thinking about is how to get git diff --cached to allow me to edit the files in the index.)
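For anyone who hasn’t used GIT_EXTERNAL_DIFF: git runs the program named in that variable once per changed file, passing seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode). Below is a minimal sketch of a wrapper that receives those arguments and hands the two files to a side-by-side tool. The command name "diffident" and the assumption that it takes two file paths are guesses made for the example, not a description of how Diffident is really invoked.

```python
import collections
import subprocess

# The seven arguments git passes to a GIT_EXTERNAL_DIFF program.
GitDiffArgs = collections.namedtuple(
    "GitDiffArgs",
    "path old_file old_hex old_mode new_file new_hex new_mode",
)


def parse_git_args(argv):
    """Turn git's argument list (without the program name) into a tuple."""
    if len(argv) != 7:
        raise ValueError("expected 7 arguments from git, got %d" % len(argv))
    return GitDiffArgs(*argv)


def build_command(args, tool="diffident"):
    """Build the command line for the diff tool.

    ASSUMPTION: the tool is on the PATH under this name and accepts
    the old and new file paths as its two arguments.
    """
    return [tool, args.old_file, args.new_file]


def main(argv):
    """Run the diff tool on the pair of files git gave us."""
    return subprocess.call(build_command(parse_git_args(argv)))
```

To try it, call main(sys.argv[1:]) at the bottom of the script, make it executable and point GIT_EXTERNAL_DIFF at it before running git diff. Note that old_file is typically a temporary copy made by git (or /dev/null for an added file), which is part of why editing through this route is awkward.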

So anyway, check out Diffident and if you like it, help me make it better.

Analog literals

I love this: C++ Multi-Dimensional Analog Literals.

I quote:

Have you ever felt that integer literals like “4” don’t convey the true size of the value they denote? If so, use an analog integer literal instead:

unsigned int b = I---------I;

It goes on to explain that you can use 2- and 3-dimensional “analog literals”. Genius. Read the article. Try to read the code :)

Isn’t C++ … erm … powerful?

You’ll notice that there are 9 dashes used to denote 4. This is because the trick it is using uses operator--. I’m sure the original author did this in his/her sleep and thought it was too trivial to post (or posted it before?) but I thought: if we can use operator! instead, can’t we create analog literals that use the same number of symbols as the number we want?

The answer is yes, and it’s pretty simple:

notliterals.h:

class NotLiteral
{
public:
	NotLiteral( unsigned int ival )
	: val_( ival )
	{
	}

	NotLiteral operator!() const
	{
		return NotLiteral( val_ + 1 );
	}

	operator unsigned int() const
	{
		return val_;
	}

	unsigned int val_;
};


const NotLiteral NL( 0 );

test_notliterals.cpp:

#include "notliterals.h"
#include <cassert>

int main()
{
	assert( !!!!NL == 4 );
	assert( !!NL == 2 );

	assert( !!!!!!!!!!!!!!!NL == 15 );
}

With this simpler form, it’s almost believable that there might be some kind of useful application?

Extending this to 3 dimensions is left as an exercise for the reader. For 2 dimensions, if you just want the area (not the width and height), how about this?:

	assert( !!!
	        !!!
	        !!!NL == 9 );

Update: By the way, if you don’t like all the emphasis! of! using! exclamation! marks! you can do the same thing with the unary one’s complement operator, ~. Just replace “!” everywhere above with “~” and you’re done. Unfortunately, you can’t do the same with - or + because the parser recognises “--” as the decrement operator (and “++” as the increment operator) instead of seeing that it is clearly two calls to the unary negation operator.
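For comparison, here is a rough Python version of the same trick. Python won’t let you make “not” return anything other than a bool, so this sketch uses ~ instead; and because Python has no decrement operator, a run of minus signs really is treated as repeated unary negation. The class simply mirrors the C++ one above and is not meant to be useful.

```python
class NotLiteral:
    """Counts how many unary operators have been applied to it."""

    def __init__(self, val=0):
        self.val = val

    def __invert__(self):
        # ~x gives a literal one bigger than x.
        return NotLiteral(self.val + 1)

    # Unary minus behaves the same way, since Python has no "--" token.
    __neg__ = __invert__

    def __int__(self):
        return self.val

    def __eq__(self, other):
        return self.val == int(other)


NL = NotLiteral()

assert ~~~~NL == 4
assert --NL == 2
assert (~~~
        ~~~
        ~~~NL) == 9
```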

Separate regular expressions, or one more complex one?

I have asked myself this question several times, so I thought it was about time I did a test and found an answer.

If the user of your program can supply you with a list of regular expressions to match against some text, should you combine those expressions into one big one, or treat them separately?

In my case I need an OR relationship, so combining them just means putting a pipe symbol between them.*
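A small sketch of the combining step, with one caveat. Wrapping each expression in a non-capturing group is a defensive habit rather than a necessity here, and the helper name combine is made up for the example. The second half shows one case where joining does change behaviour: a numbered backreference ends up pointing at the wrong group once another expression’s groups come first. (Expressions that begin with an inline flag such as (?i) are another likely trouble spot, since those flags are meant to appear at the very start of the whole expression.)

```python
import re


def combine(patterns):
    """OR the patterns together, each wrapped in a non-capturing group."""
    return re.compile("|".join("(?:%s)" % p for p in patterns))


# For simple patterns, the combined form agrees with checking each in turn.
simple = ["foo", "ba[rz]", r"\d+"]
combined = combine(simple)
for line in ["has foo", "has baz", "has 42", "has none"]:
    separately = any(re.search(p, line) for p in simple)
    assert bool(combined.search(line)) == separately

# Caveat: in the second pattern, \1 is meant to refer to (x). After
# combining, group 1 is (a) from the first pattern, so "xx" stops matching.
tricky = [r"(a)b", r"(x)\1"]
assert re.search(tricky[1], "xx") is not None
assert combine(tricky).search("xx") is None
```

If the expressions come from users, it is probably worth either rejecting numbered backreferences or falling back to separate expressions when they appear.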

So: one expression made by ORing, or looping through several – which is better? There’s only one way to find out:

import re, sys

line_with_match_foo = "This line contains foo."
line_with_match_baz = "This line contains baz."
line_without_match = "This line does not contain it."

re_strings = ( "foo", "bar1", "bar2", "baz", "bar3", "bar4", )

piped_re = re.compile( "|".join( re_strings ) )

separate_res = list( re.compile( r ) for r in re_strings )

NUM_ITERATIONS = 1000000

def piped( line ):
    for i in range( NUM_ITERATIONS ):
        if piped_re.search( line ):
            print( "match!" ) # do something

def separate( line ):
    for i in range( NUM_ITERATIONS ):
        for s in separate_res:
            if s.search( line ):
                print( "match!" ) # do something
                break # stop looping because we matched

arg = sys.argv[1]

if arg == "--piped-nomatch":
    piped( line_without_match )
elif arg == "--piped-match-begin":
    piped( line_with_match_foo )
elif arg == "--piped-match-middle":
    piped( line_with_match_baz )
elif arg == "--separate-nomatch":
    separate( line_without_match )
elif arg == "--separate-match-begin":
    separate( line_with_match_foo )
elif arg == "--separate-match-middle":
    separate( line_with_match_baz )

And here are the results:

$ time python re_timings.py --piped-nomatch > /dev/null

real    0m0.987s
user    0m0.943s
sys     0m0.032s
$ time python re_timings.py --separate-nomatch > /dev/null

real    0m3.695s
user    0m3.641s
sys     0m0.037s

So when no regular expressions match, the combined expression is about 3.7 times faster.

$ time python re_timings.py --piped-match-middle > /dev/null

real    0m1.900s
user    0m1.858s
sys     0m0.033s
$ time python re_timings.py --separate-match-middle > /dev/null

real    0m3.543s
user    0m3.439s
sys     0m0.042s

And when an expression near the middle of the list matches, the combined expression is about 1.9 times faster.

$ time python re_timings.py --piped-match-begin > /dev/null

real    0m1.847s
user    0m1.797s
sys     0m0.035s
$ time python re_timings.py --separate-match-begin > /dev/null

real    0m1.649s
user    0m1.597s
sys     0m0.032s

But in the (presumably much rarer) case where all lines match the first expression in the list, the separate expressions are marginally faster.

A clear win for combining the expressions, unless you think it’s likely that most lines will match expressions early in the list.

Note also that if you combine the expressions, performance is similar wherever the matching expression sits in the list, whereas with separate expressions the order matters a lot. So there is probably no need for you or your user to second-guess what order to put the expressions in, which makes life easier for everyone.

I would guess the results would be similar in other programming languages. I certainly found it to be similar in C# on .NET when I tried it a while ago.

By combining the expressions we ask the regular expression engine to do the heavy lifting for us, and it is specifically designed to be good at that job.
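The figures above come from the shell’s time command, so they include interpreter start-up and, in the matching cases, the cost of a million prints. Here is a sketch of the same comparison using the standard timeit module, which times only the searching. The function and variable names are made up for the example, and no numbers are shown because they will vary by machine and Python version.

```python
import re
import timeit

RE_STRINGS = ("foo", "bar1", "bar2", "baz", "bar3", "bar4")
PIPED_RE = re.compile("|".join(RE_STRINGS))
SEPARATE_RES = [re.compile(r) for r in RE_STRINGS]

LINES = {
    "nomatch": "This line does not contain it.",
    "match-begin": "This line contains foo.",
    "match-middle": "This line contains baz.",
}


def piped(line):
    """True if the single combined expression matches."""
    return PIPED_RE.search(line) is not None


def separate(line):
    """True if any individual expression matches (stops at the first)."""
    return any(r.search(line) for r in SEPARATE_RES)


def run(number=100000):
    """Time both approaches on each line; returns {(name, label): seconds}."""
    results = {}
    for label, line in LINES.items():
        for func in (piped, separate):
            seconds = timeit.timeit(lambda: func(line), number=number)
            results[(func.__name__, label)] = seconds
    return results


def report(results):
    """Format the timings as one line of text per measurement."""
    return "\n".join(
        "%-10s %-14s %.3fs" % (name, label, seconds)
        for (name, label), seconds in sorted(results.items())
    )
```

Running print(report(run())) gives six timings that can be compared directly with each other.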

Open questions:

1. Have I made a mistake that makes these results invalid?

2. * Can arbitrary regular expressions be ORed together simply by concatenating them with a pipe symbol in between?

3. Can we do something similar if the problem requires us to AND expressions?
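On the last question, one possible approach (a sketch, not something I have timed) is to turn each expression into a lookahead, so that a single compiled expression only matches when every sub-expression is found somewhere in the line. The helper name all_of is made up for the example, and the same caveats about backreferences and inline flags apply as for ORing.

```python
import re


def all_of(patterns):
    """Build one expression that matches only if every pattern is found.

    Each pattern becomes a lookahead that scans forward from the start
    of the line, so the order of the patterns does not matter.
    """
    lookaheads = "".join("(?=.*?(?:%s))" % p for p in patterns)
    return re.compile(lookaheads)


both = all_of(["foo", "baz"])

assert both.match("baz comes before foo here")
assert both.match("only foo here") is None

# The straightforward alternative, for comparison.
separate_res = [re.compile(p) for p in ("foo", "baz")]
for line in ["foo and baz", "just baz", "neither"]:
    looped = all(r.search(line) for r in separate_res)
    assert bool(both.match(line)) == looped
```

Because “.” does not match a newline by default, this only works line by line unless re.DOTALL is used.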