Friday, January 27, 2006

Regex Performance - RegexOptions.Compile

I have been working on a small project that utilizes Regular Expressions to do most of the heavy lifting. I have been aware of the Compile option, but haven't really experienced any benefit before. However, in this project it became clear that certain expressions were really bogging down the engine - causing the overall run to crawl. This expression was one that was slow:

(?<1>^\s*)Foo(?<2>[ ]\w*[ ])(Bar)(?<3>[ ][fF]un[ ])(?<4>\w*)(\s')

Strangely, it seems very like the others I was using that were not slow.

In any case, I was using the static IsMatch from the Regex class, so I switched to creating an instance of Regex with the Compile option and everything sped back up. Very handy. I haven't done any real analysis on what caused these certain expressions to be slow; I am sure my moderate understanding of RE has caused some inefficient syntax. I would like to look into it further, but it will have to wait...
Submit this story to DotNetKicks


j. montgomery said...

Bad performance in regular expressions are typically from using greedy quatifiers. Greedy quatifiers are:

* + ? {n} {n,} {x,y}

In regex coach using the step-through feature and you'll see exactly why greedy affects performance. Often, the '*' will consume an entire string (absolutely everything) and then work backwards from the end to find if there are other matches in the regular expression (i think i'm saying this correctly).

Rule 2 of regular expression (from Mastering Regular Expressions, ch#4) : "Items that are allowed to match a variable number of times always attempt to match as much as possible. They are greedy."

Basically you want to try and not be "too greedy" when writing a regular expression, unless it makes good sense.

Check out the difference between greedy and non-greedy (lazy) matching.

Non-greedy quantifiers are:

*? +? ?? {m,n}?

Which match as little text as possible.

j. montgomery said...

BTW, a good way to illustrate this point is the following regular expression:

greedy vs non-greedy:




See how they behave differently against the following target string:

<a href="blah">Blah</a>, <b>BOLD</b>