Home » Articles » Unique Product Features » There is more than one way to evaluate a regular expression

There is more than one way to evaluate a regular expression

By Mark Joseph - September 6, 2010 @ 3:04 pm

P6R’s RGX™ 1.0 Regular Expression Engine has some unique features that make it stand out from the crowd. One such feature is its regex-oriented tracing. Several others are described below.

First, we have added a new global modifier ‘P6MOD_FULLLOOKBEHIND’. This modifier allows the look behind meta character sequences ‘(?<=’ & ‘(?<!’ to match anywhere in the input string already seen. So for example, given the regex of ‘(?<=44 )\\btime\\b’ using the full look behind modifier would match the input string of: ‘Now 44 is the time to pay your bills’. The ’44′ is not right next to the ‘time’ string, but 44 is still found. In normal (Perl implemented) look behind the input string would have to be something like ‘Now is the 44 time to pay your bills’. Notice that in this last example, ’44′ is right before ‘time’.

Also note that with the ‘P6MOD_FULLLOOKBEHIND’ modifier there are NO restrictions on what regex can be placed in the look behind. So as another example: “\\d+-\\d+-\\d+(?<=^\\w+:)” is allowed, and looks to the very beginning of the string. Testing this regex against the input string of: “Subject: unlimited 123-45-5 abc”, (using the search function) will successfully find a match at offset 19 matching 8 characters (i.e., “123-45-5″). The look behind expression “(?<=^\\w+:)” matches the “Subject:” part of the input string. (By the way, our string offsets start at zero.) Another way to put this, is that our look behind feature supports regular expressions that can match a variable amount of text. This provides a powerful look behind capability in P6R’s regex engine.

Second, P6R’s regex engine provides two separate evaluation methods. The first method is the standard Perl Non-deterministic Finite State Automaton (NFA) algorithm ([Friedl, Chapter 4, The Mechanics of Expression Processing). This is the default behavior. The second method uses a hybrid NFA-DFA (Deterministic Finite State Automaton) evaluation. This means that the standard DFA matching is done but back references are also supported in this mode. (Standard DFAs do not support back references). This evaluation results in all greedy matches (e.g., alternation is greedy where in the first method this is not the case) [Friedl, "DFA Speed with NFA Capabilities", pp.121].

A DFA performs a match evaluation by keeping track of all possible matches at the same time. While an NFA, essentially, does a depth first search in the NFA graph and back tracks back up the graph when a match fails. (See [Friedl, and Wall] for a detailed explanation of these topics.) To use the hybrid NFA-DFA method in the P6R::p6IWRegex component, pass the ‘P6MOD_FASTGREEDY’ value in the ‘P6WREGEXMODIFIER modifiers’ parameter in any of the functions: match, search, replace. Thus the evaluation method is not “hard coded” during the regex compile step, but is specified during the call of an evaluation method. It is possible to use both methods after a single compile.

Perl’s Atomic Grouping, (?> …), is treated the same as a non-capturing set of parenthesis (?: …) when using the P6MOD_FASTGREEDY evaluation method. This is because Atomic Grouping is defined to control part of the backtrack algorithm, but the fast & greedy (or DFA oriented) evaluation algorithm does not use backtracking. Also the fast & greedy (or DFA oriented) regex evaluation algorithm, makes alternations greedy. So in the standard Perl backtracking algorithm the alternation: (green|greenhouse|blue), and the input string ‘greenhouse’ will match ‘green’ because it will match the first alternation possible. However in the fast & greedy evaluation algorithm the ‘greenhouse’ choice will match (greedly matching more characters).

Here are some examples of using the P6MOD_FASTGREEDY evaluation method:

p6IWRegex* pRegex = NULL;
P6UINT32 offset = 0;
P6UINT32 strLength = 0;

* * * *
pRegex->compile( P6TEXT("(tour|to|tournament)"), P6MOD_NULL );   

offset = 0;
strLength = 0;
pRegex->search( P6TEXT("three tournaments won"), P6MOD_NULL, &offset, &strLength );
// offset will equal 6, and strLength will equal 4

offset = 0;
strLength = 0;
pRegex->search( P6TEXT("three tournaments won"), P6MOD_FASTGREEDY, &offset, &strLength );
// offset will equal 6 and strLength will equal 10

Another Example:

pRegex->compile( P6TEXT("<B>.*?</B>"), P6MOD_NULL );   

offset= 0;
strLength = 0;
pRegex->search( P6TEXT("See <B>Billions</B> and <B>Zillions</B> of suns"), P6MOD_NULL, 
&offset, &strLength );
// offset will equal 4, and strLength will equal 15

// this should generate a greedy match skipping the first '</B>' tag found
offset = 0;
strLength = 0;
pRegex->search( P6TEXT("See <B>Billions</B> and <B>Zillions</B> of suns"), P6MOD_FASTGREEDY, 
&offset, &strLength );
// offset will equal 4, and strLength will equal 35


There are also examples where the result of using the fast & greedy evaluation method results in the same output as the standard Perl evaluation method. So why would we use the fast & greedy method? The simple answer is speed since no backtracking is performed.

"There is more than one way to evaluate a regular expression" was published on September 6th, 2010 and is listed in Unique Product Features.

Follow comments via the RSS Feed | Leave a comment | Trackback URL


Leave Your Comment