XPath 2.0 and DOM Tree Implementation

Introduction

P6R's XPath functionality is integrated with its DOM XML parser (i.e., P6R::p6IDOMXML). XPath expressions are compiled into a P6R::p6IXpathExpression component. After compilation the expression can be evaluated repeatedly. The expression is evaluated in the context of an XML (or JSON) tree (i.e., a p6IDOMXML component). The same compiled expression can be evaluated against one or more DOM trees.

The DOM XML parser can also support data encoded in JSON. Only two things are required: (a) the DOM parser's initialize() method must be given the P6DOMXML_USEJSON flag, and (b) all XPath step expressions must start with "/JSON-document" (e.g., "/JSON-document/book/chapter[1]/title" to find the title of chapter 1). The "/JSON-document" prefix is necessary because some JSON documents, unlike XML documents, have no topmost element.

The P6R::p6IXpathExpression component has been integrated with P6R's XSLT processor (i.e., P6R::p6IXSLT) and the XML-based Rule Engine components (i.e., P6R::p6IRuleEngine). Thus any XML-based application (written by P6R or our customers) can embed XPath 2.0 as we have done. In doing so these applications can access data in both the XML and JSON encodings. For example, our XSLT processor handles XSL templates, an XML-based dialect that reads and writes XML or JSON data.

XPath 2.0 contains many more features than XPath 1.0, for example regular expressions; see reference 1 below.

In addition, XPath 2.0 can be used entirely by itself; that is, it does not have to be embedded in XML to be used. Our component architecture allows direct application use of the XPath 2.0 and related components. Thus an application wishing to use this powerful expression language directly would simply perform the following steps:

 // This is a code sketch
 P6XPATH_RESULT      result;
 p6IXpathExpression  *pExpress;
 p6IDOMXML           *pDOM; 
 p6IDataStream       *pStream;  

 // get a DOM tree component
 p6CreateInstance( NULL, CID_p6DOMXML, VALIDATEIF( p6IDOMXML, &pDOM ));
 // get an XPath component
 p6CreateInstance( NULL, CID_p6XpathExpression, VALIDATEIF( p6IXpathExpression, &pExpress ));
 // fill the pDOM with XML to be accessed via XPath expressions
 // the XML is streamed in from the p6IDataStream object
 pDOM->parse( &pStream ); ....
 // compile the XPath 2.0 expression
 pExpress->compileExpression( "7 ge 5", ... );
 // evaluate the expression against the pDOM tree with the 'result' type returned.
 pExpress->eval( pDOM, NULL, NULL, &result ); 
 // the same XPath component, i.e., pExpress, can be reused to compile many expressions and evaluate them against the same DOM tree

References

1) M.Kay, XPath 2.0, Programmer's Reference, Wiley Publishing Inc, 2004, ISBN 0-7645-6910-4.

2) M.Kay, XSLT 2.0, Programmer's Reference, 3rd Edition, Wiley Publishing Inc, 2004, ISBN 0-7645-6909-0.

3) N.Bradley, The XSL Companion, Addison-Wesley, 2000, ISBN 0-201-67487-4.

4) J.Friedl, Mastering Regular Expressions, 2nd edition, O'Reilly, 2002, ISBN 0-596-00289-0.

Collations

Several XPath functions take an optional collation string parameter (e.g., compare, starts-with). The XPath standard defines these collation strings to be URIs. However, in P6R's XPath implementation these collation strings are not URIs. Instead, they are the strings that the underlying operating system expects for its I18n support.

Extensions to Existing XPath Functions

1) tokenize( input string, regex, flags )

The standard definition of this function states that it has the following limitation: it is not possible to do anything with the separator substrings. That is, only the substrings between the separators are returned. However, this is not true for our implementation. We have based our implementation of tokenize() on the Perl split() function (see P6R::p6ISplit).

With the Perl split() function, the separator substrings can be obtained by use of capturing parentheses. See reference #4 above, p. 326, "Split's Match Operand with Capturing Parentheses". For example, tokenize( "1:2-3;4", "([:;-])" ) returns the sequence: '1' ':' '2' '-' '3' ';' '4'. Without the capturing parentheses the regex would be "[:;-]", and the sequence returned would instead be: '1' '2' '3' '4'. Note that the hyphen is placed last in the character class so that it is a literal character; written as "[:-;]" it would denote the range from ':' to ';' and would not match '-'.

So P6R's implementation of tokenize() is more powerful than what is defined by the XPath 2.0 standard, yet it follows standard Perl regex rules.
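
The effect of capturing parentheses on a split can be illustrated outside of P6R with the C++ standard library. The following is only a sketch: splitKeepingCaptures() is a hypothetical helper, not part of the P6R API, and it uses std::regex (ECMAScript syntax) rather than P6R's Perl-style regex engine. It also simplifies Perl's rules by ignoring empty separator matches and by not trimming trailing empty fields. The hyphen is written last in the character class so that it is taken literally.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not a P6R function): splits 'input' on 'pattern' and,
// in the manner of Perl's split(), also emits the text captured by each
// parenthesized group of every separator match.
std::vector<std::string> splitKeepingCaptures(const std::string& input,
                                              const std::string& pattern)
{
    std::vector<std::string> tokens;
    const std::regex separator(pattern);
    std::sregex_iterator it(input.begin(), input.end(), separator);
    const std::sregex_iterator end;
    std::size_t fieldStart = 0;

    for (; it != end; ++it) {
        const std::smatch& m = *it;
        if (m.length(0) == 0) {
            continue;  // simplification: ignore empty separator matches
        }
        const std::size_t sepStart = static_cast<std::size_t>(m.position(0));

        // the field between the previous separator and this one
        tokens.push_back(input.substr(fieldStart, sepStart - fieldStart));

        // the captured parts of the separator itself, if any
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {
                tokens.push_back(m[g].str());
            }
        }
        fieldStart = sepStart + static_cast<std::size_t>(m.length(0));
    }

    // the field after the last separator
    tokens.push_back(input.substr(fieldStart));
    return tokens;
}
```

Under these assumptions, splitKeepingCaptures( "1:2-3;4", "([:;-])" ) yields the seven values '1' ':' '2' '-' '3' ';' '4', while the same call with "[:;-]" yields only '1' '2' '3' '4'.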

P6R's Functions Not Yet Implemented

The following functions have not yet been implemented in P6R's XPath 2.0 implementation: base-uri, collection, deep-equal, doc, document-uri, format-number (see our extension below), id, idref, iri-to-uri, nilled, normalize-unicode, and resolve-uri.

P6R's Extension Functions

These added functions require the use of the P6R namespace: http://www.p6r.com/XPath/extensions

1) P6R:base64encode

This function encodes a given byte array into a base64 encoded string. Note, however, that all strings in our XPath implementation are stored in a wide character, Unicode representation. The resulting string can be written out in UTF-8 format along with all other template output. The input to this function can come from an externally defined variable, which can contain any binary data, via the P6R::p6IXpathVariables::lookupVariable() interface.

 Argument    Data Type     Meaning
 input       byte array    Function takes a standard XPath expression as input.   
 result      xs:string     A base64 encoded character string in wide string format.
                           XPath returned type of P6R::P6XPATH_TYPE_STR.

2) P6R:base64decode

This function removes the base64 encoding of the input string. Warning: care must be taken when using these base64 functions to encode Unicode strings. On Linux each wide character is represented in 4 bytes, while Windows and Solaris use 2 bytes. Thus encoding on Linux and decoding on Windows or Solaris will not work properly. (Likewise, encoding on Windows or Solaris and decoding on Linux will not work.) When passing a base64 result between Linux and other operating systems, the calling application needs to normalize strings before applying this standard base64 algorithm.

 Argument    Data Type     Meaning
 input       xs:string     A base64 encoded string 
 result      byte array    XPath returned type of P6R::P6XPATH_TYPE_STR.
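
One way an application might normalize wide strings before base64 encoding is to convert them to UTF-8, whose byte layout does not depend on the platform's wide character size. The sketch below is an illustration under that assumption only; wideToUtf8() and base64Encode() are hypothetical helpers, not P6R functions, and the code does not describe how P6R's own implementation works internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts a wide string to UTF-8 so that the bytes
// are the same whether wchar_t is 2 bytes or 4 bytes wide.
std::string wideToUtf8(const std::wstring& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);
        // With 2-byte wide characters, combine a UTF-16 surrogate pair.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: standard base64 (RFC 4648 alphabet, '=' padding)
// applied to a sequence of raw bytes.
std::string base64Encode(const std::string& bytes)
{
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    std::size_t i = 0;
    for (; i + 2 < bytes.size(); i += 3) {          // full 3-byte groups
        const std::uint32_t n =
            (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 16) |
            (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8) |
             static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 2]));
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += table[n & 63];
    }
    const std::size_t rest = bytes.size() - i;      // 0, 1 or 2 trailing bytes
    if (rest > 0) {
        std::uint32_t n = static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 16;
        if (rest == 2) {
            n |= static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8;
        }
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += (rest == 2) ? table[(n >> 6) & 63] : '=';
        out += '=';
    }
    return out;
}
```

With this approach, base64Encode( wideToUtf8( L"Hi there1" ) ) produces the same text on any platform, because the wide string is reduced to the same nine bytes before it is encoded. The receiving side would decode the base64 text and then convert the UTF-8 bytes back to its own wide character representation.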

3) P6R:match-attribute

This is an extension of the standard lang() function. This function allows the caller to match any attribute of the context node.

 Argument         Data Type     Meaning
 attribute name   xs:string     To match the lang() function this would be "lang" 
 attribute value  xs:string     To match the lang() function this could be "fr-CA"
 result           xs:boolean    True if the named attribute of the context node matches the given value

4) P6R:matches-with-capture

This is an extension of the XPath matches() function. It takes the exact same parameters as the matches() function but returns a node set as a result instead. The returned node set is composed of the following values: the first value is the matching string, all other values (if any), are substrings of the first value which are captured by the regular expression via back references. If no match occurs, then an empty node set is returned to indicate false. XPath itself does not currently have a way to return the captured strings.

 Argument         Data Type     Meaning
 input            xs:string     See the XPath 2.0 standard for the standard meaning
 regex            xs:string         of these arguments.
 flags            xs:string     (optional)
 result           item()*       A node set with zero or more values as defined above
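
The shape of the returned sequence can be illustrated with the C++ standard library. This is a sketch only: matchesWithCapture() below is a hypothetical stand-alone helper, not the P6R function; it uses std::regex (ECMAScript syntax) and omits the optional flags argument.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not the P6R function): returns the matching string
// followed by the substrings captured by each parenthesized group, or an
// empty vector if the pattern does not match anywhere in 'input'.
std::vector<std::string> matchesWithCapture(const std::string& input,
                                            const std::string& pattern)
{
    std::vector<std::string> values;
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(input, m, re)) {
        // m[0] is the whole match, m[1..] are the captured groups
        for (std::size_t i = 0; i < m.size(); ++i) {
            values.push_back(m[i].str());
        }
    }
    return values;
}
```

For instance, under these assumptions matchesWithCapture( "2010-06-15", "(\\d{4})-(\\d{2})" ) returns '2010-06' '2010' '06', and a pattern that does not match returns an empty result, which plays the role of false.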

5) P6R:format-number

This is meant as an alternative to the XSLT format-number() function. The XSLT function is complex and does not allow the explicit selection of a language and locale. The fourth parameter of this function is a standard locale string, for example: 'en' (for English), 'en_us' (for English in the United States), and 'fr_ca' (for French Canadian).

 Argument       Data Type                   Meaning
 numeric        xs:double or xs:integer     any one of the numeric values is valid
                or xs:decimal or xs:float     
 format         xs:string                   a standard format string as used in P6R::p6i18n::formatString function (i.e., %1$)
 field width    xs:integer                  if zero then no default width is used, otherwise use size as maximum length of number
 locale         xs:string                   (optional) indicates language and locale (e.g., en_us)
 result         xs:string                   The format parameter expanded with the 'numeric' parameter

An example XSLT stylesheet using the base64 encode function.

 <?xml version='1.0' encoding='ISO-8859-1'?>
 <xsl:stylesheet version='2.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
                               xmlns:P6R='http://www.p6r.com/XPath/extensions'>
 <xsl:output method='html'/>
                                                      
 <xsl:variable name='gv1' select='output/bye' />
 <xsl:variable name='gv2' select='output/hello'/>

 <xsl:template match='/'>     
 <HTML>
 <BODY>
 <P/>
     Base64 of " <xsl:value-of select='$gv2' /> " is:<BR/>
     <xsl:value-of select="P6R:base64encode( $gv2 )" />
 </BODY>
 </HTML>                           
 </xsl:template>
 </xsl:stylesheet>

Apply the following XML input data to the stylesheet defined above.

 <?xml version='1.0' encoding='UTF-8' ?>
 <output>
     <hello>Hi there1</hello>
     <hello>Hi There2</hello>
     <hello>HI THERE3</hello>
     <hello>HI4</hello>
     <bye>simple period test</bye>
 </output>

Applying the XSLT stylesheet to the XML input, using the P6R:base64encode() function, produces the following output.

 <HTML><BODY><P>
 Base64 of " Hi there1" is:
 SGkgdGhlcmUx
 </BODY></HTML>

To call one of the extension functions outside of XSLT requires the use of the following qualified names (i.e., QNames):

http-&&www.p6r.com&XPath&extensions&p-base64encode( string )

http-&&www.p6r.com&XPath&extensions&p-base64decode( string )

http-&&www.p6r.com&XPath&extensions&p-match-attribute( string, string )

http-&&www.p6r.com&XPath&extensions&p-matches-with-capture( string, string, string )

http-&&www.p6r.com&XPath&extensions&p-format-number( numeric, string, integer, string )

The QName encoding is simple: (1) all '/' characters are replaced with '&', (2) all ':' characters are replaced with '-', and (3) the name of the extension function is placed at the very end of the string with a "&p-" connector.
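
The three encoding rules above can be sketched as a small function. The name encodeExtensionQName() is hypothetical and is not part of the P6R API; it is shown only to make the rules concrete.

```cpp
#include <string>

// Hypothetical helper (not a P6R function): builds the encoded QName for an
// extension function from its namespace URI and its function name.
std::string encodeExtensionQName(const std::string& namespaceUri,
                                 const std::string& functionName)
{
    std::string encoded;
    for (char c : namespaceUri) {
        if (c == '/') {
            encoded += '&';      // rule 1: '/' becomes '&'
        } else if (c == ':') {
            encoded += '-';      // rule 2: ':' becomes '-'
        } else {
            encoded += c;
        }
    }
    // rule 3: the function name goes last, joined with "&p-"
    return encoded + "&p-" + functionName;
}
```

Given the namespace "http://www.p6r.com/XPath/extensions" and the name "base64encode", this yields "http-&&www.p6r.com&XPath&extensions&p-base64encode", which matches the first QName listed above.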

Copyright © 2004 - 2010 P6R Inc. - All Rights Reserved.