Separate regular expressions, or one more complex one?

I have asked myself this question several times, so I thought it was about time I did a test and found an answer.

If the user of your program can supply you with a list of regular expressions to match against some text, should you combine those expressions into one big one, or treat them separately?

In my case I need an OR relationship, so combining them just means putting a pipe symbol between them.*

So: one expression made by ORing, or looping through several – which is better? There’s only one way to find out:

import re, sys

line_with_match_foo = "This line contains foo."
line_with_match_baz = "This line contains baz."
line_without_match = "This line does not contain it."

re_strings = ( "foo", "bar1", "bar2", "baz", "bar3", "bar4", )

# One combined expression: "foo|bar1|bar2|baz|bar3|bar4"
piped_re = re.compile( "|".join( re_strings ) )

# The same expressions, compiled individually
separate_res = [ re.compile( r ) for r in re_strings ]

NUM_ITERATIONS = 1000000

def piped( line ):
    for i in range( NUM_ITERATIONS ):
        if piped_re.search( line ):
            print "match!" # do something

def separate( line ):
    for i in range( NUM_ITERATIONS ):
        for s in separate_res:
            if s.search( line ):
                print "match!" # do something
                break # stop looping because we matched

arg = sys.argv[1]

if arg == "--piped-nomatch":
    piped( line_without_match )
elif arg == "--piped-match-begin":
    piped( line_with_match_foo )
elif arg == "--piped-match-middle":
    piped( line_with_match_baz )
elif arg == "--separate-nomatch":
    separate( line_without_match )
elif arg == "--separate-match-begin":
    separate( line_with_match_foo )
elif arg == "--separate-match-middle":
    separate( line_with_match_baz )

And here are the results:

$ time python re_timings.py --piped-nomatch > /dev/null

real    0m0.987s
user    0m0.943s
sys     0m0.032s
$ time python re_timings.py --separate-nomatch > /dev/null

real    0m3.695s
user    0m3.641s
sys     0m0.037s

So when no regular expressions match, the combined expression is about 3.7 times faster.
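If you want to sanity-check that ratio without running the full script, a couple of timeit calls show the same effect (this uses the same expressions and non-matching line as above; the absolute numbers will of course vary by machine and Python version):

import re, timeit

piped_re = re.compile( "foo|bar1|bar2|baz|bar3|bar4" )
separate_res = [ re.compile( s ) for s in ( "foo", "bar1", "bar2", "baz", "bar3", "bar4" ) ]
line = "This line does not contain it."

# One combined (failing) search per iteration...
print( timeit.timeit( lambda: piped_re.search( line ), number=1000000 ) )

# ...versus six separate failing searches per iteration.
print( timeit.timeit( lambda: [ r.search( line ) for r in separate_res ], number=1000000 ) )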

$ time python re_timings.py --piped-match-middle > /dev/null

real    0m1.900s
user    0m1.858s
sys     0m0.033s
$ time python re_timings.py --separate-match-middle > /dev/null

real    0m3.543s
user    0m3.439s
sys     0m0.042s

And when an expression near the middle of the list matches, the combined expression is nearly 1.9 times faster.

$ time python re_timings.py --piped-match-begin > /dev/null

real    0m1.847s
user    0m1.797s
sys     0m0.035s
$ time python re_timings.py --separate-match-begin > /dev/null

real    0m1.649s
user    0m1.597s
sys     0m0.032s

But in the (presumably much rarer) case where all lines match the first expression in the list, the separate expressions are marginally faster, presumably because the first search in the loop succeeds immediately, having only one simple pattern to try, while the combined expression has to try every alternative at each position up to the match.

A clear win for combining the expressions, unless you think it’s likely that most lines will match expressions early in the list.

Note also that if you combine the expressions, performance is similar wherever the matching expression sits in the list, whereas with separate expressions list order matters a lot. So there is probably no need for you or your user to second-guess what order to put the expressions in, which makes life easier for everyone.

I would guess the results would be similar in other programming languages. I certainly found it to be similar in C# on .NET when I tried it a while ago.

By combining the expressions we ask the regular expression engine to do the heavy lifting for us, and it is specifically designed to be good at that job.
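One way to see this in action: if you compile the combined pattern with the re.DEBUG flag, the re module prints its parse of the expression, and the joined sub-expressions show up as branches of a single compiled program, which the engine can try together as it scans the string once. (The exact debug output varies between Python versions, so treat this as a rough illustration.)

import re

# Compiling with re.DEBUG prints the parsed pattern: the joined
# sub-expressions appear as BRANCHes of one compiled program.
re.compile( "foo|bar1|bar2|baz|bar3|bar4", re.DEBUG )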

Open questions:

1. Have I made a mistake that makes these results invalid?

2. * Can arbitrary regular expressions be ORed together simply by concatenating them with a pipe symbol in between?

3. Can we do something similar if the problem requires us to AND expressions? (A tentative sketch for this and question 2 follows below.)
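On questions 2 and 3, here is the approach I would try first; a sketch, not a definitive answer, and the patterns here ("bar[0-9]" and friends) are just made-up examples. For ORing, wrapping each user-supplied expression in a non-capturing group (?:...) keeps each alternative self-contained, although numbered groups and backreferences can still clash between expressions. For ANDing, turning each expression into a lookahead requires all of them to match somewhere in the line (note that . does not match newlines unless you pass re.DOTALL):

import re

re_strings = ( "foo", "bar[0-9]", "baz" )

# OR: wrap each expression in a non-capturing group, then join with pipes.
piped_re = re.compile( "|".join( "(?:%s)" % s for s in re_strings ) )

# AND: turn each expression into a lookahead, so all must match somewhere.
anded_re = re.compile( "".join( "(?=.*(?:%s))" % s for s in re_strings ) )

print( bool( piped_re.search( "contains bar7" ) ) )  # True: one expression matches
print( bool( anded_re.search( "foo bar2 baz" ) ) )   # True: all three match
print( bool( anded_re.search( "foo bar2 only" ) ) )  # False: no baz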