Developer's Guide
RGX Reference Guide

Introduction

The P6R::p6IWRegex interface provides regular expression (regex) support for platform-specific wide (Unicode) strings, while the P6R::p6IRegex interface provides the same support for ASCII (i.e., narrow) strings. The following regex flavors are currently supported: Perl, ECMAScript, Egrep, and Brief. A description of each flavor is presented below.

References

1) J.E.F. Friedl, "Mastering Regular Expressions", O'Reilly, 2nd Edition, 2002, ISBN 0-596-00289-0.

2) Wall et al., "Programming Perl", O'Reilly, 3rd Edition, 2000, ISBN 0-596-00027-8.

3) ECMA-262, ECMAScript Language Specification, 5.1 Edition, June 2011.

4) D. Flanagan, "JavaScript: The Definitive Guide", O'Reilly, 5th Edition, 2006, ISBN 0-596-10199-6.

5) BRIEF, DOS-OS/2 User's Guide, "The Programmer's Editor", 1991, SDC Software Partners II, L.P., Section Regular Expression Characters, pp. 103-112.

6) Another reference on BRIEF: http://msdn.microsoft.com/en-us/library/ms925487.aspx

Includes

Interfaces

Perl Regular Expression Support

P6R's implementation of Perl's regex flavor is nearly complete. The following describes a few differences and enhancements.

a) The 'x' modifier is only partially supported. When the P6MOD_SKIPWHITESPACE modifier (P6MODIFIER_SKIPWHITESPACE for p6IRegex) is passed to the compile method, ASCII white space in the regex definition is ignored, except inside character classes. As in standard Perl, white space still acts as a delimiter but is not searched for in the input string. For example, '( a\d | b\d )' looks for something like 'a5' or 'b6', but not ' a5 ' or ' b6 ' (the latter have spaces before and after) [Friedl, p. 110]. Note that this is a global setting for the compile. The 'x' modifier is NOT supported in a cloister (e.g., in '(?isx-m: abcdef)', the 'x' in '?isx-m:' will not work).

b) Comments, both (?# ) and # under the 'x' modifier, are not supported.

c) The tr/// transliteration operation is not supported.

d) Unicode properties are supported in the p6wregex component but not in the (narrow) p6regex component. Supported: { IsASCII, IsAlnum, IsAlpha, IsCntrl, IsDigit, IsGraph, IsLower, IsPrint, IsPunct, IsSpace, IsUpper, IsWord, IsXDigit }

e) The wide version (P6R::p6IWRegex) supports the following Unicode characters as line terminators: {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029 }, which are respectively: ASCII Line Feed, ASCII Vertical Tab, ASCII Form Feed, ASCII Carriage Return, Unicode Next Line, Unicode Line Separator, and Unicode Paragraph Separator.

f) Unicode characters can be represented by the \x{} notation which defines the Unicode character's hexadecimal value inside the curly braces. For example, the \x{263A} is the Unicode smiley face.

g) We have also added a new global modifier, 'P6MOD_FULLLOOKBEHIND'. This modifier allows the lookbehind meta character sequences '(?<=' and '(?<!' to match anywhere in the part of the input string already seen. For example, with the full lookbehind modifier the regex '(?<=44 )\btime\b' matches the input string 'Now 44 is the time to pay your bills'. The '44' is not right next to the 'time' string, but it is still found. With normal (Perl-implemented) lookbehind, the input string would have to be something like 'Now is the 44 time to pay your bills', where '44' is right before 'time'.

Also note that with the 'P6MOD_FULLLOOKBEHIND' modifier there are NO restrictions on what regex can be placed in the lookbehind. As another example, "\\d+-\\d+-\\d+(?<=^\\w+:)" is allowed, and looks to the very beginning of the string. Testing this regex against the input string "Subject: unlimited 123-45-5 abc" (using the search function) finds a match at offset 19 matching 8 characters (i.e., "123-45-5"). The lookbehind expression "(?<=^\\w+:)" matches the "Subject:" part of the input string. (String offsets start at zero.) Put another way, our lookbehind feature supports regular expressions that can match a variable amount of text, which provides a powerful lookbehind capability in P6R regular expressions.
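To make the "anywhere in the input already seen" rule concrete, the following C++ sketch emulates it with std::regex, which has no lookbehind of its own. It is not RGX code: the Hit struct, the searchWithFullLookbehind() helper, and its lookbehindAtEnd flag are hypothetical names, and the emulation only covers a single positive lookbehind placed at the very start or the very end of the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical result type for this sketch (not an RGX type).
struct Hit {
    bool found;
    std::size_t offset;  // zero-based, as in the text above
    std::size_t length;
};

// Emulates "(?<=behind)main" (lookbehindAtEnd == false) or
// "main(?<=behind)" (lookbehindAtEnd == true) under the full-lookbehind rule:
// 'behind' may match anywhere in the text seen so far, not only directly
// before the current position.
Hit searchWithFullLookbehind(const std::string& input,
                             const std::regex& behind,
                             const std::regex& mainRe,
                             bool lookbehindAtEnd) {
    const std::sregex_iterator end;
    // Simplification: only non-overlapping candidates of 'mainRe' are tried.
    for (std::sregex_iterator it(input.begin(), input.end(), mainRe); it != end; ++it) {
        const std::size_t start = static_cast<std::size_t>(it->position(0));
        const std::size_t length = static_cast<std::size_t>(it->length(0));
        // Text "already seen" at the point where the lookbehind sits.
        const std::string seen = input.substr(0, lookbehindAtEnd ? start + length : start);
        if (std::regex_search(seen, behind)) {
            return Hit{true, start, length};
        }
    }
    return Hit{false, 0, 0};
}
```

Called with the two examples above, searchWithFullLookbehind("Now 44 is the time to pay your bills", std::regex("44 "), std::regex("\\btime\\b"), false) reports the word 'time', and the "Subject:" example with lookbehindAtEnd set to true reports offset 19 and length 8.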

Mapping from Perl to RGX's API

Perl

RGX

m// (match operator)

Narrow Methods:
P6R::p6IRegex::match()
P6R::p6IRegex::search()
Wide Methods:
P6R::p6IWRegex::match()
P6R::p6IWRegex::search()

s/// (substitution)

Narrow Methods:
P6R::p6IRegex::replace()
P6R::p6IRegex::replaceInPlace()
P6R::p6IRegex::replaceWithCallback()
Wide Methods:
P6R::p6IWRegex::replace()
P6R::p6IWRegex::replaceInPlace()
P6R::p6IWRegex::replaceWithCallback()

$1, .. $n (get backreference values)

Narrow Methods: P6R::p6IRegex::getCaptureText()
Wide Methods: P6R::p6IWRegex::getCaptureText()

/i

P6MODIFIER_INSENSITIVE

Narrow Methods:
P6R::p6IRegex::match()
P6R::p6IRegex::search()
P6R::p6IRegex::replace()
P6R::p6IRegex::replaceInPlace()
P6R::p6IRegex::replaceWithCallback()
Wide Methods:
P6R::p6IWRegex::match()
P6R::p6IWRegex::search()
P6R::p6IWRegex::replace()
P6R::p6IWRegex::replaceInPlace()
P6R::p6IWRegex::replaceWithCallback()

/s

P6MODIFIER_NEWLINE

Narrow Methods:
P6R::p6IRegex::match()
P6R::p6IRegex::search()
P6R::p6IRegex::replace()
P6R::p6IRegex::replaceInPlace()
P6R::p6IRegex::replaceWithCallback()
Wide Methods:
P6R::p6IWRegex::match()
P6R::p6IWRegex::search()
P6R::p6IWRegex::replace()
P6R::p6IWRegex::replaceInPlace()
P6R::p6IWRegex::replaceWithCallback()

/m

P6MODIFIER_MULTILINE

Narrow Methods:
P6R::p6IRegex::match()
P6R::p6IRegex::search()
P6R::p6IRegex::replace()
P6R::p6IRegex::replaceInPlace()
P6R::p6IRegex::replaceWithCallback()
Wide Methods:
P6R::p6IWRegex::match()
P6R::p6IWRegex::search()
P6R::p6IWRegex::replace()
P6R::p6IWRegex::replaceInPlace()
P6R::p6IWRegex::replaceWithCallback()

/g (all instances)

P6MODIFIER_GLOBAL

When applied to searches, calling P6R::p6IRegex::search() multiple times will replicate /g's functionality. See Example 7 - Multiple Matches for an example.

When applied to replace:
Narrow Methods:
P6R::p6IRegex::replace()
P6R::p6IRegex::replaceInPlace()
P6R::p6IRegex::replaceWithCallback()
Wide Methods:
P6R::p6IWRegex::replace()
P6R::p6IWRegex::replaceInPlace()
P6R::p6IWRegex::replaceWithCallback()

/x

P6MODIFIER_SKIPWHITESPACE

Narrow Methods:
P6R::p6IRegex::compile()
Wide Methods:
P6R::p6IWRegex::compile()
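
The white-space rule behind the /x mapping (the skip-white-space modifier passed at compile time) can be approximated by pre-processing the pattern text. The sketch below is an illustration only and not how RGX implements the modifier: stripPatternWhitespace() and matchesSkippingWhitespace() are hypothetical helpers, std::regex stands in for the RGX engine, and ']' as the first member of a class and POSIX '[: :]' classes are deliberately not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not part of RGX): remove ASCII white space from a
// regex definition, except inside character classes. An escaped character
// (backslash plus the next character) is copied unchanged.
std::string stripPatternWhitespace(const std::string& pattern) {
    std::string out;
    bool inClass = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\' && i + 1 < pattern.size()) {
            out += c;
            out += pattern[++i];
            continue;
        }
        if (c == '[') {
            inClass = true;
        } else if (c == ']') {
            inClass = false;
        }
        if (!inClass && std::isspace(static_cast<unsigned char>(c)) != 0) {
            continue;  // white space in the definition is not searched for
        }
        out += c;
    }
    return out;
}

// Whole-string match using the stripped pattern (std::regex as a stand-in).
bool matchesSkippingWhitespace(const std::string& pattern, const std::string& input) {
    return std::regex_match(input, std::regex(stripPatternWhitespace(pattern)));
}
```

With the pattern "( a\\d | b\\d )", the helper accepts "a5" and "b6" and rejects " a5 ", which is the behavior described for the modifier in the Perl section.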

In P6R's regex engine an expression is first compiled by calling the P6R::p6IWRegex::compile() function. After the compile, the regex can be used over and over again for match, search, and replace functions. A call to the P6R::p6IWRegex::setTrace() function will produce a (detailed) log showing how P6R's regex engine evaluates the current compiled regular expression.
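
Because a compiled regex can be reused, Perl's /g on a search reduces to a loop of search calls, each starting where the previous match ended. The sketch below shows that loop with std::regex as a stand-in; findAll() is a hypothetical helper, and how an RGX search call is told to resume at a given offset is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collect (offset, length) for every non-overlapping
// match by searching repeatedly, as /g would. Offsets are zero-based.
std::vector<std::pair<std::size_t, std::size_t>> findAll(const std::string& input,
                                                         const std::regex& re) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    std::size_t base = 0;
    std::smatch m;
    while (base <= input.size()) {
        // Caveat: each search sees only the remaining suffix, so anchors such
        // as ^ and \b treat the resume point as the start of the input.
        const auto first = input.begin() + static_cast<std::ptrdiff_t>(base);
        if (!std::regex_search(first, input.end(), m, re)) {
            break;
        }
        const std::size_t offset = base + static_cast<std::size_t>(m.position(0));
        const std::size_t length = static_cast<std::size_t>(m.length(0));
        hits.emplace_back(offset, length);
        // Step past the match; advance by one on an empty match to avoid looping forever.
        base = offset + (length > 0 ? length : 1);
    }
    return hits;
}
```

For the input "abcabc-abc" and the regex "abc", the loop reports three matches, at offsets 0, 3, and 7.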

Egrep Regular Expression Support

Our Egrep flavor supports all the basic egrep metacharacters. In addition, we support the following extensions, which make it a considerably more powerful tool.

1) Start and end word boundaries: \< and \>

2) Back references with the same syntax as used in Perl

3) Counting quantifiers with the same syntax as used in Perl (e.g., {3,7} and {2} )

4) Empty cases in alternations: (|a|b|c) or (a|b|) or (a||b)

5) Standard character shortcuts: \d \D \w \W \s \S \b \e \f \n \r \v, all of which can appear in a character class or outside of a character class. (e.g., \d digit, \D non-digit, \r carriage return).

6) As in our Perl implementation, our Egrep flavor supports all modifiers (e.g., P6MOD_INSENSITIVE, see p6wregex.h or p6regex.h), except P6MOD_FULLLOOKBEHIND, which is a Perl-specific modifier.

7) Our Egrep implementation supports both narrow and wide characters.

8) As in our Perl implementation, our Egrep flavor supports a hybrid NFA-DFA evaluation engine (see below).
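
For readers who want to try the \< and \> word boundaries from item 1 outside RGX, the sketch below rewrites them to \b so that std::regex (ECMAScript grammar) can run the pattern. This is an approximation under stated assumptions: egrepWordBoundsToEcma() is a hypothetical helper, and \b is looser than \< or \> because it matches at both the start and the end of a word.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: replace the egrep-style word boundaries \< and \>
// with \b. Any other escape sequence is copied unchanged.
std::string egrepWordBoundsToEcma(const std::string& pattern) {
    std::string out;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (pattern[i] == '\\' && i + 1 < pattern.size()) {
            const char next = pattern[++i];
            if (next == '<' || next == '>') {
                out += "\\b";
            } else {
                out += '\\';
                out += next;
            }
        } else {
            out += pattern[i];
        }
    }
    return out;
}
```

A case-insensitive search for the converted form of "\\<cat\\>" in "Catorama DogAndCats Cat dog lizard" skips the two embedded occurrences and finds the stand-alone word at offset 20.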

What is special about our Regex Engine

Our regex engine provides two separate evaluation methods. The first method is the standard Perl Non-deterministic Finite State Automaton (NFA) algorithm [Friedl, Chapter 4, The Mechanics of Expression Processing]. This is the default behavior.

The second method uses a hybrid NFA-DFA (Deterministic Finite State Automaton) evaluation. This means that standard DFA matching is done, but backreferences are also supported in this mode (standard DFAs do not support backreferences). This evaluation makes all matches greedy (e.g., alternation is greedy, whereas in the first method it is not) [Friedl, "DFA Speed with NFA Capabilities", pp.121].

A DFA performs a match evaluation by keeping track of all possible matches at the same time, while an NFA essentially does a depth-first search of the NFA graph and backtracks up the graph when a match fails. (Please see [Friedl] and [Wall] for a detailed explanation of these topics.)

To use the hybrid NFA-DFA method in the P6R::p6IRegex component, pass the 'P6MODIFIER_FASTGREEDY' value in the 'P6REGEXMODIFIER modifiers' parameter in any of the functions: match, search, replace. For the P6R::p6IWRegex component pass the 'P6MOD_FASTGREEDY' value. Thus the evaluation method is not "hard coded" during the regex compile step, but is specified during the call of an evaluation method. It is possible to use both methods after a single compile.
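
The difference in alternation behavior between the two evaluation methods can be pictured with literal alternatives only. The following sketch is not RGX code and does not show how either engine works internally; matchFirstAlternative() and matchLongestAlternative() are hypothetical helpers that mimic the first-alternative (backtracking) outcome and the greedy (fast & greedy modifier) outcome at the start of an input string.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Backtracking-style outcome: the first listed alternative that matches
// at the start of the input wins, even if a later one is longer.
std::string matchFirstAlternative(const std::vector<std::string>& alternatives,
                                  const std::string& input) {
    for (const std::string& alt : alternatives) {
        if (input.compare(0, alt.size(), alt) == 0) {
            return alt;
        }
    }
    return std::string();
}

// Greedy outcome: every alternative is considered and the longest one
// that matches at the start of the input wins.
std::string matchLongestAlternative(const std::vector<std::string>& alternatives,
                                    const std::string& input) {
    std::string best;
    for (const std::string& alt : alternatives) {
        if (input.compare(0, alt.size(), alt) == 0 && alt.size() > best.size()) {
            best = alt;
        }
    }
    return best;
}
```

For the alternatives green, greenhouse, and blue and the input "greenhouse", the first helper returns "green" and the second returns "greenhouse".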

We closely follow the definition of Perl's regex flavor given in [Wall].

One last note: our regex implementation is designed to be extensible. If there is a regex flavor or feature that you use and we do not currently support, then we should be able to add it easily. Just make a request for a new product feature.

Usage Tips

  1. A regex only needs to be compiled once and can then be used many times. A compiled regex can be used with any number of match, search, and replace calls, and each of those calls can use a different set of modifiers (e.g., P6MOD_INSENSITIVE, P6MOD_FASTGREEDY). Note that the same compiled regex can be used with different evaluation algorithms: the standard backtracking algorithm used in Perl, as well as the DFA-oriented algorithm selected by P6MOD_FASTGREEDY (see the previous section).
  2. Perl's Atomic Grouping, (?> ...), is treated the same as a non-capturing set of parentheses, (?: ...), when using P6MOD_FASTGREEDY. This is because Atomic Grouping is defined to control part of the backtracking algorithm, but the fast & greedy (or DFA-oriented) evaluation algorithm does not use backtracking.
  3. The fast & greedy (or DFA-oriented) regex evaluation algorithm makes alternations greedy. With the standard Perl backtracking algorithm, the alternation (green|greenhouse|blue) applied to the input string 'greenhouse' matches 'green', because the first alternative that can match is taken. With the fast & greedy evaluation algorithm, the 'greenhouse' choice matches instead (greedily matching more characters).
  4. Two levels of regex evaluation tracing are supported. REGEX_TRACE_BASIC shows how each part of the input string matches the particular parts of the regex. As an example, the following is a log excerpt:
* (Note our string offsets start at zero.)
* WRegex regex: '([abc])+(?(1)(\\w+)|(\\d*))' flavor: Perl
* WRegex function search in input string: 'startcounting 383838383'
* WRegex found '[abc]' at offset 2
* WRegex found '(1)' at offset 3
* WRegex found '\\w' at offset 3
* WRegex found '\\w' at offset 4
* WRegex found '\\w' at offset 5
* WRegex found '\\w' at offset 6
* WRegex found '\\w' at offset 7
* WRegex found '\\w' at offset 8
* WRegex found '\\w' at offset 9
* WRegex found '\\w' at offset 10
* WRegex found '\\w' at offset 11
* WRegex found '\\w' at offset 12
*

At the REGEX_TRACE_DEBUG level, the entire regex compile result is shown by outputting the NFA graph(s) produced. In addition, a very verbose output of every step the evaluation algorithm takes is produced. This type of tracing can produce large output files and will slow execution down.

  5. Perl's lazy (or minimal) meta character sequences *? +? {3,5}? {3,}? {3}? are evaluated as their greedy counterparts WHEN USING P6MOD_FASTGREEDY. This is because the P6MOD_FASTGREEDY algorithm only uses greedy evaluations, which cannot be overridden. So '*?' is treated like '*', and so on. In the standard backtracking evaluation algorithm (the default), the lazy sequences work as expected.
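
The effect described in item 5 can be seen by running a lazy pattern and its greedy counterpart side by side. The sketch uses std::regex as a stand-in backtracking engine, so it only shows the two possible outcomes; it does not call RGX. firstMatch() is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: text of the first match of 'pattern' in 'input',
// or an empty string when there is no match.
std::string firstMatch(const std::string& input, const std::string& pattern) {
    std::smatch m;
    const std::regex re(pattern);
    if (std::regex_search(input, m, re)) {
        return m.str(0);
    }
    return std::string();
}
```

On the input "<a><b>", the lazy pattern "<.+?>" stops at the first '>' and yields "<a>", while the greedy pattern "<.+>" yields "<a><b>". According to item 5, the fast & greedy evaluation produces the second result for both patterns.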

Examples

* Example 1:
* p6IWRegex *pRegex;   // must point to a valid p6IWRegex instance before use (creation not shown)
* P6ERR err = pRegex->initialize( WREGEX_PERL );
* -> should return err == eOk
*
* err = pRegex->compile( P6TEXT("(\\w)(a|b|c)(?:99)(frank)\\1\\3"), P6MOD_NULL );
* -> should return err == eOk
*
* err = pRegex->match( P6TEXT("Ac99frankAfrank"), P6MOD_NULL );
* -> should return err == eOk
*
* P6UINT32 offset = 0;
* P6UINT32 strLength = 0;
* err = pRegex->getCaptureText( 2, &offset, &strLength );
* -> should return offset == 1 strLength == 1
*
*
* Example 2:
* err = pRegex->compile( P6TEXT("(abc){3,7}"), P6MOD_NULL );
* -> should return err == eOk
*
* err = pRegex->search( P6TEXT("confabcabcabc-9393939"), P6MOD_NULL, &offset, &strLength );
* -> should return offset == 4 strLength == 9
*
*
* Example 3:
* err = pRegex->compile( P6TEXT("abc ((?i)bar \\w)\\1"), P6MOD_NULL );
* -> should return err == eOk
*
* err = pRegex->search( P6TEXT("abc BaR yBaR y"), P6MOD_FASTGREEDY, &offset, &strLength );
* -> should return offset == 0 strLength == 14
*
*
* Example 4:
* err = pRegex->initialize( WREGEX_EGREP );
* -> should return err == eOk
*
* a word by itself:
* err = pRegex->compile( P6TEXT("\\<cat\\>"), P6MOD_NULL );
* -> should return err == eOk
*
* err = pRegex->search( P6TEXT("Catorama DogAndCats Cat dog lizard"), (P6MOD_FASTGREEDY | P6MOD_INSENSITIVE), &offset, &strLength );
* -> should return offset == 20 strLength == 3
*
*
* Example 5:
* err = pRegex->initialize( WREGEX_PERL );
* -> should return err == eOk
*
* err = pRegex->compile( P6TEXT("[01[:alpha:][:blank:]%]+"), P6MOD_NULL );
* -> should return err == eOk
* -> notice the support for POSIX "[: :]" construct
*
* offset = 0;
* strLength = 0;
* err = pRegex->search( P6TEXT("44abc %1de\t0045"), P6MOD_NULL, &offset, &strLength );
* -> should return offset == 2 strLength == 11
*
*