<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Andy Balaam's Blog &#187; Regular Expressions</title>
	<atom:link href="http://www.artificialworlds.net/blog/category/regular-expressions/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.artificialworlds.net/blog</link>
	<description>Four in the morning, still writing Free Software</description>
	<lastBuildDate>Mon, 21 Jun 2010 00:27:31 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Separate regular expressions, or one more complex one?</title>
		<link>http://www.artificialworlds.net/blog/2009/04/03/separate-regular-expressions-or-one-more-complex-one/</link>
		<comments>http://www.artificialworlds.net/blog/2009/04/03/separate-regular-expressions-or-one-more-complex-one/#comments</comments>
		<pubDate>Fri, 03 Apr 2009 06:44:21 +0000</pubDate>
		<dc:creator>Andy Balaam</dc:creator>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Regular Expressions]]></category>
		<category><![CDATA[Tech]]></category>

		<guid isPermaLink="false">http://www.artificialworlds.net/blog/?p=140</guid>
		<description><![CDATA[I have asked myself this question several times, so I thought it was about time I did a test and found an answer.
If the user of your program can supply you with a list of regular expressions to match against some text, should you combine those expressions into one big one, or treat them separately?
In [...]]]></description>
			<content:encoded><![CDATA[<p>I have asked myself this question several times, so I thought it was about time I did a test and found an answer.</p>
<p>If the user of your program can supply you with a list of regular expressions to match against some text, should you combine those expressions into one big one, or treat them separately?</p>
<p>In my case I need an OR relationship, so combining them just means putting a pipe symbol between them.*</p>
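<p>A minimal sketch of that combination, assuming the patterns arrive as a list of plain strings. The names <code>combine_any</code> and <code>user_patterns</code> are made up for illustration. Each pattern is wrapped in a non-capturing group as a defensive habit, so it stays a self-contained alternative:</p>

```python
import re

# Hypothetical user-supplied patterns, for illustration only.
user_patterns = ["foo", "bar[0-9]", "baz"]

def combine_any(patterns):
    """Compile one regex that matches wherever any of the patterns matches."""
    # Wrap each pattern in (?:...) so it remains a separate alternative
    # whatever operators it contains, then join the groups with pipes.
    return re.compile("|".join("(?:" + p + ")" for p in patterns))

combined = combine_any(user_patterns)
print(bool(combined.search("This line contains baz.")))         # True
print(bool(combined.search("This line does not contain it.")))  # False
```

<p>Patterns that rely on numbered backreferences or on inline flags such as <code>(?i)</code> may not survive being joined this way, because joining changes the group numbering and the position of the flags.</p>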
<p>So: one expression made by ORing, or looping through several &#8211; which is better?  <a href="http://www.youtube.com/watch?v=Np6gyUb0E7o">There&#8217;s only one way to find out</a>:</p>
<pre>import re, sys

line_with_match_foo = "This line contains foo."
line_with_match_baz = "This line contains baz."
line_without_match = "This line does not contain it."

re_strings = ( "foo", "bar1", "bar2", "baz", "bar3", "bar4", )

piped_re = re.compile( "|".join( re_strings ) )

separate_res = list( re.compile( r ) for r in re_strings )

NUM_ITERATIONS = 1000000

def piped( line ):
    # Search the line with the single combined expression
    for i in range( NUM_ITERATIONS ):
        if piped_re.search( line ):
            print( "match!" ) # do something

def separate( line ):
    # Try each expression in turn until one matches
    for i in range( NUM_ITERATIONS ):
        for s in separate_res:
            if s.search( line ):
                print( "match!" ) # do something
                break # stop looping because we matched

arg = sys.argv[1]

if arg == "--piped-nomatch":
    piped( line_without_match )
elif arg == "--piped-match-begin":
    piped( line_with_match_foo )
elif arg == "--piped-match-middle":
    piped( line_with_match_baz )
elif arg == "--separate-nomatch":
    separate( line_without_match )
elif arg == "--separate-match-begin":
    separate( line_with_match_foo )
elif arg == "--separate-match-middle":
    separate( line_with_match_baz )
</pre>
<p>And here are the results:</p>
<pre>$ time python re_timings.py --piped-nomatch > /dev/null

real    0m0.987s
user    0m0.943s
sys     0m0.032s
$ time python re_timings.py --separate-nomatch > /dev/null

real    0m3.695s
user    0m3.641s
sys     0m0.037s
</pre>
<p>So when no regular expressions match, the combined expression is about 3.7 times faster by wall-clock time.</p>
<pre>
$ time python re_timings.py --piped-match-middle > /dev/null

real    0m1.900s
user    0m1.858s
sys     0m0.033s
$ time python re_timings.py --separate-match-middle > /dev/null

real    0m3.543s
user    0m3.439s
sys     0m0.042s
</pre>
<p>And when an expression near the middle of the list matches, the combined expression is about 1.9 times faster.</p>
<pre>
$ time python re_timings.py --piped-match-begin > /dev/null

real    0m1.847s
user    0m1.797s
sys     0m0.035s
$ time python re_timings.py --separate-match-begin > /dev/null

real    0m1.649s
user    0m1.597s
sys     0m0.032s
</pre>
<p>But in the (presumably much rarer) case where all lines match the first expression in the list, the separate expressions are marginally faster.</p>
<p>A clear win for combining the expressions, unless you think it&#8217;s likely that most lines will match expressions early in the list.</p>
<p>Note also that with the combined expression the performance is similar wherever the matching expression sits in the list, whereas with separate expressions the list order matters a lot. So there is probably no need for you or your user to second-guess what order to put the expressions in, which makes life easier for everyone.</p>
<p>I would guess the results would be similar in other programming languages.  I certainly found it to be similar in C# on .NET when I tried it a while ago.</p>
<p>By combining the expressions we ask the regular expression engine to do the heavy lifting for us, and it is specifically designed to be good at that job.</p>
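<p>The timings above include interpreter start-up and, on the matching runs, the cost of printing a million lines. A rough sketch of the same comparison using the standard <code>timeit</code> module, which times only the searches, might look like this. The reduced iteration count and the helper names are assumptions, and no timings are shown because they will vary by machine and Python version:</p>

```python
import re
import timeit

re_strings = ("foo", "bar1", "bar2", "baz", "bar3", "bar4")
piped_re = re.compile("|".join(re_strings))
separate_res = [re.compile(r) for r in re_strings]

def piped(line):
    """Return True if the single combined expression matches."""
    return piped_re.search(line) is not None

def separate(line):
    """Return True as soon as one of the separate expressions matches."""
    for r in separate_res:
        if r.search(line):
            return True
    return False

lines = {
    "nomatch": "This line does not contain it.",
    "match-begin": "This line contains foo.",
    "match-middle": "This line contains baz.",
}

# Assumed value: smaller than the 1000000 used above, to keep the run short.
NUM_ITERATIONS = 100000

for name in sorted(lines):
    line = lines[name]
    for func in (piped, separate):
        # Time only the repeated search calls, with no printing inside the loop.
        seconds = timeit.timeit(lambda: func(line), number=NUM_ITERATIONS)
        print("%-8s %-12s %.3fs" % (func.__name__, name, seconds))
```

<p>Returning a boolean instead of printing keeps the output cost out of the measurement, so the match and no-match cases can be compared more directly.</p>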
<p>Open questions:</p>
<p>1. Have I made a mistake that makes these results invalid?</p>
<p>2. * Can arbitrary regular expressions be ORed together simply by concatenating them with a pipe symbol in between?</p>
<p>3. Can we do something similar if the problem requires us to AND expressions?</p>
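<p>On question 3, one possible approach is a chain of lookahead assertions, one per expression. This is only a sketch: <code>combine_all</code> is a made-up name, it assumes single-line input, and whether it is faster than looping would need measuring.</p>

```python
import re

def combine_all(patterns):
    """Compile one regex that matches only if every pattern matches somewhere."""
    # Each (?=.*(?:p)) lookahead checks one pattern without consuming any
    # text, so the checks do not depend on the order of the list.
    # Assumes the input has no newlines, since "." does not match them here.
    return re.compile("".join("(?=.*(?:" + p + "))" for p in patterns))

both = combine_all(["foo", "baz"])
# match() anchors at the start, so the lookaheads are evaluated only once.
print(bool(both.match("foo then baz")))   # True
print(bool(both.match("baz then foo")))   # True
print(bool(both.match("only foo here")))  # False
```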
]]></content:encoded>
			<wfw:commentRss>http://www.artificialworlds.net/blog/2009/04/03/separate-regular-expressions-or-one-more-complex-one/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
