SAX2 Implementation

Introduction

P6R's XML parser provides a C++ implementation of the SAX2 interface. Our parser implementation is designed for high performance and directly supports a streaming IO model. The parser can be invoked with the entire XML document in one buffer, or the XML document can be fed into the parser a chunk at a time over multiple calls. This streaming IO model is implemented via the P6R::p6IDataStream interface (see p6datastream.h). This interface allows chaining of components (e.g., filters, sources and sinks). To assist a developer in debugging, a detailed parse trace can be turned on programmatically.

References

1) N.Bradley, The XML Companion, Addison-Wesley, 3rd edition, 2002, ISBN 0-201-77059-8.

2) D.Brownell, SAX2, O'Reilly, 2002, ISBN 0-596-00237-8.

3) http://sax.sourceforge.net/

Differences between the Java SAX2 Interface and P6R's C++ Interface

a) P6R's SAX2 parser reduces the amount of string copying to improve performance. In the Java definition, 'String' objects are frequently passed to the application. However, all this object creation and string copying comes at a cost. We have taken a different approach: our SAX2 parser returns a pointer and a length for each parsed string (e.g., a start element name). These pointers point into the application-provided buffer that contains the XML document to be parsed.

 So, for example, in Java one method in the Content Handler is:
 void startElement( java.lang.String uri, java.lang.String localName, java.lang.String qName, Attributes atts );

 P6R's modified method is defined as:
 P6ERR startElement( P6SAX2STRING* pURI, P6SAX2STRING* pLocalName, P6SAX2STRING* pQName, p6ISAX2Attributes* pAtts );

 Where P6SAX2STRING is defined as:
 typedef struct 
 {
   const P6CHAR* pStart; 
   P6UINT32      length;
 } P6SAX2STRING;

The length field is important because the XML is not copied into a separate string object; the parsed pieces remain in the original document. For example, given the XML:

 <?xml version='1.0' ?><money> 35 dollars and 15 cents </money>

a pointer to the 'money' start element comes with a length of 5 characters. Reading past that length means the application reads other XML that is not part of the start element's name.
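The pointer-plus-length convention can be illustrated with a minimal, self-contained sketch. It does not use the P6R headers: Sax2String is a stand-in for P6SAX2STRING (with char and std::uint32_t assumed in place of P6CHAR and P6UINT32), and toStdString() and findName() are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Stand-in for P6SAX2STRING. The real typedef uses P6CHAR and P6UINT32;
// char and std::uint32_t are assumptions so the sketch compiles on its own.
struct Sax2String {
    const char*   pStart;
    std::uint32_t length;
};

// Copy exactly `length` characters. The view is NOT NUL-terminated at
// `length`, so functions such as strlen() or strcpy() must not be used on it.
std::string toStdString(const Sax2String& s) {
    if (s.pStart == nullptr || s.length == 0) return std::string();
    return std::string(s.pStart, s.length);
}

// Hypothetical helper that builds the kind of view a parser might hand to a
// callback: a pointer into the original document plus the name's length.
Sax2String findName(const char* document, const char* name) {
    const char* hit = std::strstr(document, name);
    Sax2String view = { hit, hit ? static_cast<std::uint32_t>(std::strlen(name)) : 0u };
    return view;
}
```

Applied to the document above, findName(doc, "money") yields a view of length 5, and toStdString() on that view returns only "money". Constructing a std::string from pStart alone would instead run on through the rest of the document.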

b) A special streaming interface is used, defined by the p6ISAX2XMLReader component. P6R::p6ISAX2XMLReader (p6sax2xmlreader.h) has no separate parse function because the component implements the P6R::p6IDataStream interface (see p6datastream.h). To parse either a single buffer or a stream of XML buffers, perform the following steps:

 1. Get an XML reader object:
 P6ERR err = eOk;
 p6ISAX2XMLReader *pReader;

 err = p6CreateInstance( NULL, CID_p6SAX2XMLReader, VALIDATEIF( p6ISAX2XMLReader, &pReader ));

 2. Then using the XML reader, get the p6IDataStream interface on that component:
 p6IDataStream *pStream;
 err = pReader->queryInterface( VALIDATEIF( p6IDataStream, &pStream ));

 3. Initialize the data stream interface:
 err = pStream->beginStream();
 
 4. Pass the buffer(s) to be parsed one at a time:
 err = pStream->processStream( buffer, bufSize );     // -> 1st buffer of data of stream
             . . . . . .
 err = pStream->processStream( buffer, bufSize );     // -> nth buffer of data of stream
 
 The input 'buffer' passed to processStream() is the memory that the P6SAX2STRING pointers point into.
 processStream() can return the "eEndOfFile" error code to indicate that it has consumed the buffer
 provided but that the document is still incomplete (i.e., the topmost XML element has not yet been closed).

 5. Close the stream down:
 err = pStream->endStream();
 pStream->release();
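The begin / process / end sequence in the steps above can be sketched as a chunk-feeding loop. The sketch below is self-contained and does not use the P6R headers: MockStream is a hypothetical stand-in that only mimics the three calls, SketchErr stands in for P6ERR (kEndOfFile playing the role of "eEndOfFile"), and feedInChunks() is an illustrative helper. The real method signatures should be taken from p6datastream.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-ins for P6R error codes (assumed values, for illustration only).
enum SketchErr { kOk = 0, kEndOfFile = 1, kFail = 2 };

// Hypothetical stream: records what it receives and reports kEndOfFile
// until the closing tag of the topmost element has been seen.
class MockStream {
public:
    explicit MockStream(const std::string& closingTag)
        : closingTag_(closingTag), calls_(0) {}

    SketchErr beginStream() { received_.clear(); calls_ = 0; return kOk; }

    SketchErr processStream(const char* buffer, std::uint32_t size) {
        received_.append(buffer, size);
        ++calls_;
        return received_.find(closingTag_) == std::string::npos ? kEndOfFile : kOk;
    }

    SketchErr endStream() { return kOk; }

    const std::string& received() const { return received_; }
    std::size_t calls() const { return calls_; }

private:
    std::string closingTag_;
    std::string received_;
    std::size_t calls_;
};

// Feed `doc` to the stream in pieces of at most `chunkSize` bytes.
// kEndOfFile is treated as "send more data", not as a failure.
template <typename Stream>
SketchErr feedInChunks(Stream& stream, const std::string& doc, std::size_t chunkSize) {
    if (chunkSize == 0) return kFail;
    SketchErr err = stream.beginStream();
    if (err != kOk) return err;

    SketchErr last = kEndOfFile;  // nothing complete has been parsed yet
    for (std::size_t off = 0; off < doc.size(); off += chunkSize) {
        const std::size_t n = std::min(chunkSize, doc.size() - off);
        last = stream.processStream(doc.data() + off, static_cast<std::uint32_t>(n));
        if (last != kOk && last != kEndOfFile) break;  // a real error: stop feeding
    }

    const SketchErr endErr = stream.endStream();  // always close the stream down
    return (last == kOk) ? endErr : last;
}
```

With this helper, a caller that still sees kEndOfFile after the last chunk knows the document was truncated, which is the situation the "eEndOfFile" return code describes.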

The Lifetime of P6SAX2STRING pointers

P6SAX2STRING pointers have two types of lifetimes. In case one, P6SAX2STRINGs that point to strings associated with namespaces are valid until that namespace goes out of scope. If a namespace stays in scope for the entire XML document, then the namespace P6SAX2STRINGs will last until the 'endDocument' event occurs. These are the only strings of which our parser makes copies.

In case two, all other P6SAX2STRINGs can point into application-provided or SAX2 internal buffers. These P6SAX2STRING values are ONLY valid during a callback into an application-written content handler. An application that wants to keep a string MUST make a copy during the callback routine. This way an application has total control of how it manages its own memory and related performance concerns.

Note that in the standard SAX2 definition an application defines a set of 'handlers' to process events (e.g., startDocument). These handlers are the aforementioned callback routines. See the P6R::p6ISAX2XMLReader interface for how to register these handlers.
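The copy-during-callback rule can be sketched as follows. This is a self-contained illustration, not P6R code: Sax2String stands in for P6SAX2STRING, NameCollector is a hypothetical handler that mirrors only the pLocalName argument of startElement(), and copySurvivesBufferReuse() simulates the application reusing its buffer after the callback returns.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Stand-in for P6SAX2STRING (char and std::uint32_t are assumptions).
struct Sax2String {
    const char*   pStart;
    std::uint32_t length;
};

// Hypothetical handler: keeps element names by copying them while the
// callback is running, because the view is only valid during the callback.
class NameCollector {
public:
    void startElement(const Sax2String* pLocalName) {
        if (pLocalName != nullptr && pLocalName->pStart != nullptr) {
            names_.push_back(std::string(pLocalName->pStart, pLocalName->length));
        }
    }
    const std::vector<std::string>& names() const { return names_; }

private:
    std::vector<std::string> names_;  // owned copies, safe after the callback
};

// Simulates one callback followed by the buffer being overwritten.
std::string copySurvivesBufferReuse() {
    char buffer[] = "<money> 35 dollars </money>";
    Sax2String name = { buffer + 1, 5 };           // view of "money" inside buffer
    NameCollector handler;
    handler.startElement(&name);                   // the copy happens here
    std::memset(buffer, 'x', sizeof(buffer) - 1);  // buffer reused afterwards
    return handler.names().front();
}
```

Had the handler stored the Sax2String itself instead of a std::string, the stored view would now point at the overwritten bytes.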

Example of using the XML parser

 const P6CHAR* temp1  = "<?xml version='1.0' encoding='UTF-8' ?>\n";
 const P6CHAR* temp2  = "<first xmlns:xslt='http://www.w3.org/1999/XSL/Transform'>\n";
 const P6CHAR* temp3  = "     <second temp='55' xmlns:X='http://www.w3.org/TR/REC-html40'>\n";
 const P6CHAR* temp4  = "          <xslt:if test='123'><![CDATA[ one two three]]></xslt:if>\n";
 const P6CHAR* temp5  = "     </second>\n";
 const P6CHAR* temp6  = "     <third color  =  'green' name='Jon\r\nSmith' xslt:weight='3453'>\n";
 const P6CHAR* temp7  = "          four five six seven\n";
 const P6CHAR* temp8  = "     </third>\n";
 const P6CHAR* temp9  = "     <test>Now is the time<![CDATA[ for <all >good ]]>men to come to the aid.</test>\n";
 const P6CHAR* temp10 = "</first>";

 p6ISAX2XMLReader *pReader;
 p6IDataStream    *pStream;
 p6ISafeString    *pStr;
 P6ERR             err = eOk;
 . . . . .
 err = p6GetRuntimeIface( VALIDATEIF( p6ISafeString, &pStr ));
 err = p6CreateInstance( NULL, CID_p6SAX2XMLReader, VALIDATEIF( p6ISAX2XMLReader, &pReader ));
 err = pReader->initialize( P6SAX2_TRACEON );
 err = pReader->queryInterface( VALIDATEIF( p6IDataStream, &pStream ));


 // applications must define, implement, and register an instance of p6ISAX2ContentHandler
 p6ISAX2ContentHandler *pContent;
 . . . . . 
 err = pReader->setContentHandler( pContent );

 err = pStream->beginStream();
 pStr->strlen( temp1, 1000, &bufSize );            // many calls to these application defined callbacks are made by the parser:
 err = pStream->processStream( temp1, bufSize );   // pContent->startElement() 
                                                   // pContent->characters()
 pStr->strlen( temp2, 1000, &bufSize );            // pContent->endElement()
 err = pStream->processStream( temp2, bufSize );

       . . . . . . . . . .

 pStr->strlen( temp10, 1000, &bufSize );
 err = pStream->processStream( temp10, bufSize );
 err = pStream->endStream();

A Default Implementation of p6ISAX2ErrorHandler

An application using P6R's SAX2 parser can define and register its own implementation of P6R::p6ISAX2ErrorHandler. However, P6R also provides a default component implementation of this interface. The file p6sax2errorhandler.h defines the component p6ISAX2ErrorHandlerInit, which takes a p6IDataStream as an output destination. To use the P6R::p6ISAX2ErrorHandlerInit component, follow this code snippet:

 p6IDataStream            *pErrorResult;     // create this beforehand
 p6ISAX2ErrorHandlerInit  *pInitError;       // created by p6CreateInstance() below
 p6ISAX2ErrorHandler      *pErrorHandler;    // obtained from the queryInterface call
 P6ERR                     err;

 if (P6SUCCEEDED( err = p6CreateInstance( NULL, CID_p6SAX2ErrorHandlerInit, VALIDATEIF( p6ISAX2ErrorHandlerInit, &pInitError )))) 
 {
     if (P6SUCCEEDED( err = pInitError->initialize( pErrorResult ))) 
     {
         if (P6FAILED( err = pInitError->queryInterface( VALIDATEIF( p6ISAX2ErrorHandler, &pErrorHandler )))) return err;
     }
 }

 ...->setErrorHandler( pErrorHandler );
Copyright © 2004 - 2010 P6R Inc. - All Rights Reserved.