Diffident 0.3

My original plan for Diffident, the side-by-side diff viewer and editor that works in a terminal, was to implement basic editing capabilities before making another release.

Of course, that turned out to be quite ambitious. It involves essentially implementing a full text editor, which is not really what I want to do. I may actually implement a “jump out to $EDITOR” option before the basic text editing facilities.

What I have implemented for this release is the ability to add and remove lines, and copy lines from one side to the other. For my personal use, this covers about 90% of cases, so I think it’s worthy of a release.

There is no undo/redo as yet, but the framework for that is in place, so I may make another release sometime soonish that is just for that.

So the dream of a diff viewer and editor is starting to come true…

Separate regular expressions, or one more complex one?

I have asked myself this question several times, so I thought it was about time I did a test and found an answer.

If the user of your program can supply you with a list of regular expressions to match against some text, should you combine those expressions into one big one, or treat them separately?

In my case I need an OR relationship, so combining them just means putting a pipe symbol between them.*
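As a minimal sketch of that combining step (the helper name `combine_or` is made up for illustration), it could look like this in Python. Wrapping each expression in a non-capturing group is an extra precaution that the test script below does not take. It keeps each user-supplied expression self-contained, although expressions that use numbered backreferences or inline flags such as `(?i)` may still misbehave once combined.

```python
import re

def combine_or(patterns):
    """Join patterns into one alternation (hypothetical helper)."""
    # (?:...) groups each pattern without creating a capturing group,
    # so the pieces cannot bleed into one another when joined with "|".
    return re.compile("|".join("(?:%s)" % p for p in patterns))

combined = combine_or(["foo", "bar1", "baz"])
print(bool(combined.search("This line contains baz.")))         # True
print(bool(combined.search("This line does not contain it.")))  # False
```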

So: one expression made by ORing, or looping through several – which is better? There’s only one way to find out:

import re, sys

line_with_match_foo = "This line contains foo."
line_with_match_baz = "This line contains baz."
line_without_match = "This line does not contain it."

re_strings = ( "foo", "bar1", "bar2", "baz", "bar3", "bar4", )

piped_re = re.compile( "|".join( re_strings ) )

separate_res = list( re.compile( r ) for r in re_strings )

NUM_ITERATIONS = 1000000

def piped( line ):
    # Search with the single combined expression.
    for i in range( NUM_ITERATIONS ):
        if piped_re.search( line ):
            print( "match!" ) # do something

def separate( line ):
    # Try each expression in list order until one matches.
    for i in range( NUM_ITERATIONS ):
        for s in separate_res:
            if s.search( line ):
                print( "match!" ) # do something
                break # stop looping because we matched

arg = sys.argv[1]

if arg == "--piped-nomatch":
    piped( line_without_match )
elif arg == "--piped-match-begin":
    piped( line_with_match_foo )
elif arg == "--piped-match-middle":
    piped( line_with_match_baz )
elif arg == "--separate-nomatch":
    separate( line_without_match )
elif arg == "--separate-match-begin":
    separate( line_with_match_foo )
elif arg == "--separate-match-middle":
    separate( line_with_match_baz )

And here are the results:

$ time python re_timings.py --piped-nomatch > /dev/null

real    0m0.987s
user    0m0.943s
sys     0m0.032s
$ time python re_timings.py --separate-nomatch > /dev/null

real    0m3.695s
user    0m3.641s
sys     0m0.037s

So when no regular expressions match, the combined expression is about 3.7 times faster.

$ time python re_timings.py --piped-match-middle > /dev/null

real    0m1.900s
user    0m1.858s
sys     0m0.033s
$ time python re_timings.py --separate-match-middle > /dev/null

real    0m3.543s
user    0m3.439s
sys     0m0.042s

And when an expression near the middle of the list matches, the combined expression is about 1.9 times faster.

$ time python re_timings.py --piped-match-begin > /dev/null

real    0m1.847s
user    0m1.797s
sys     0m0.035s
$ time python re_timings.py --separate-match-begin > /dev/null

real    0m1.649s
user    0m1.597s
sys     0m0.032s

But in the (presumably much rarer) case where all lines match the first expression in the list, the separate expressions are marginally faster.

A clear win for combining the expressions, unless you think it’s likely that most lines will match expressions early in the list.

Note also that with the combined expression, performance is similar wherever the matching expression sits in the list, whereas with separate expressions the list order matters a lot. So there is probably no need for you or your user to second-guess what order to put the expressions in, which makes life easier for everyone.

I would guess the results would be similar in other programming languages. I certainly found it to be similar in C# on .NET when I tried it a while ago.

By combining the expressions we ask the regular expression engine to do the heavy lifting for us, and it is specifically designed to be good at that job.
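The timings above also include interpreter start-up and, in the matching cases, the cost of printing. A rough sketch of the same comparison using the standard `timeit` module, which times only the matching calls, might look like the following. The function names are invented for illustration, and no numbers are shown because they will vary by machine and Python version.

```python
import re
import timeit

RE_STRINGS = ("foo", "bar1", "bar2", "baz", "bar3", "bar4")
PIPED_RE = re.compile("|".join(RE_STRINGS))
SEPARATE_RES = [re.compile(r) for r in RE_STRINGS]

def piped_matches(line):
    """True if the single combined expression matches the line."""
    return PIPED_RE.search(line) is not None

def separate_matches(line):
    """True if any expression matches, trying them one at a time in order."""
    for r in SEPARATE_RES:
        if r.search(line):
            return True
    return False

def time_both(line, number=100000):
    """Return (piped_seconds, separate_seconds) for `number` calls each."""
    piped = timeit.timeit(lambda: piped_matches(line), number=number)
    separate = timeit.timeit(lambda: separate_matches(line), number=number)
    return piped, separate

if __name__ == "__main__":
    cases = (
        ("no match", "This line does not contain it."),
        ("match first", "This line contains foo."),
        ("match middle", "This line contains baz."),
    )
    for label, line in cases:
        piped, separate = time_both(line)
        print("%-12s piped %.3fs  separate %.3fs" % (label, piped, separate))
```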

Open questions:

1. Have I made a mistake that makes these results invalid?

2. * Can arbitrary regular expressions be ORed together simply by concatenating them with a pipe symbol in between?

3. Can we do something similar if the problem requires us to AND expressions?
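For question 3, one possibility is to chain lookahead assertions, so that a single expression only matches when every sub-expression matches somewhere in the text. This is a sketch rather than a tested answer: `combine_and` is a hypothetical helper, it assumes each text is a single line (since `.` does not match newlines by default), and it is untimed, so whether it beats looping is still open.

```python
import re

def combine_and(patterns):
    """Build one expression that matches only if every pattern matches."""
    # Each (?=.*(?:p)) is a zero-width lookahead: it checks that p occurs
    # somewhere ahead without consuming any text, so the checks stack up.
    lookaheads = "".join("(?=.*(?:%s))" % p for p in patterns)
    return re.compile(lookaheads)

both = combine_and(["foo", "baz"])
# match() anchors at the start, so every lookahead scans the whole line.
print(bool(both.match("foo then baz")))   # True
print(bool(both.match("baz then foo")))   # True
print(bool(both.match("only foo here")))  # False
```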

Announcing Record TV

Last night I uploaded the first public version of my latest project, Record TV. Record TV is a system for recording TV (on a Linux desktop computer) that is designed to allow lots of different user interfaces all to use the same back end. It is currently only useful for people who are quite familiar with the Linux command line. It essentially has no user interface at all, but the back end stuff works for recording TV.

Perhaps more excitingly, I have also managed to get my recorded programmes to play back on my Nintendo Wii, so I can watch them on my TV.

Find out more on the project page linked above. I’ve released this code very early, in the spirit of “release early, release often,” so expect to hack on it a bit to get it working.

If you think MythTV just goes about things the wrong way, and you’d like to help do it right, it might be of interest.

It’s mostly Python, with some PHP and shell scripts.

One job I’d like to tackle is making FreeGuide usable as a UI for selecting programmes to record.