Event handlers


Name                                              Type
startElementHandler                               XML_START_ELEMENT_HANDLER
endElementHandler                                 XML_END_ELEMENT_HANDLER
charactersHandler                                 XML_CHARACTERS_HANDLER
resolveEntityHandler/externalEntityParsedHandler  XML_RESOLVE_ENTITY_HANDLER
errorHandler                                      XML_ERROR_HANDLER
ignorableWhitespaceHandler                        XML_CHARACTERS_HANDLER
startDocumentHandler/endDocumentHandler           XML_EVENT_HANDLER
commentHandler                                    XML_CHARACTERS_HANDLER
startCDATAHandler/endCDATAHandler                 XML_EVENT_HANDLER
processingInstructionHandler                      XML_PI_HANDLER
skippedEntityHandler                              XML_SKIPPED_ENTITY_HANDLER
xmlDeclHandler                                    XML_XMLDECL_HANDLER
startDTDHandler                                   XML_START_DTD_HANDLER
endDTDHandler                                     XML_EVENT_HANDLER
encodingAliasHandler                              XML_ENCODINGALIAS_HANDLER
startEntityHandler/endEntityHandler               XML_ENTITY_EVENT_HANDLER
elementDeclHandler                                XML_ELEMENTDECL_HANDLER
attributeDeclHandler                              XML_ATTRIBUTEDECL_HANDLER
entityDeclHandler                                 XML_ENTITY_EVENT_HANDLER
notationDeclHandler                               XML_NOTATIONDECL_HANDLER

Parsifal events (callbacks) work in much the same way as events in the original Java SAX2 interface, so besides reading this page you should look at a SAX2 reference - start here: www.saxproject.org. See also parsifal.h for more information on the Parsifal API.

A good way to learn SAX is to run xmlplint with the -f 1 flag and examine the order in which events are called when parsing different documents.

Note: Most of the time you're likely to work with the first 5 handlers in the above list.


The first parameter of every callback is UserData, which is usually set before parsing to glue "the this pointer" to the callbacks, e.g.

parser->UserData = parser;

Of course UserData is often some kind of "state machine" structure that contains current parser as one of its members.

Return XML_ABORT in callbacks if you want to abort parsing, XML_OK (value 0) otherwise.



startDocumentHandler/endDocumentHandler

int (*XML_EVENT_HANDLER)(void *UserData);

startDocumentHandler reports the start of the document. It is guaranteed to be the first event triggered during processing, apart from an optional xmlDecl event that may precede it. There is one exception: if the parser fails to detect the document encoding (from the byte order mark and/or XML declaration, or from the forced encoding given in the encoding parameter of XMLParser_Parse), startDocumentHandler/endDocumentHandler won't be called. This is a special case because the encoding must be known at the startDocument stage.

endDocument is always called in conjunction with startDocument (even in error condition), thus it's safe to initialize/allocate data in startDocument and clean up your data in endDocument.
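
The allocate/clean-up pairing described above can be sketched as follows. This is only an illustration: AppState and its members are hypothetical names, and XML_ABORT is given a stand-in value here because this page only states that XML_OK is 0. With the real library you would include parsifal.h instead of defining the constants yourself.

```c
#include <stdlib.h>

#define XML_OK 0      /* value stated on this page */
#define XML_ABORT 1   /* assumption: stand-in value, use the constant from parsifal.h */

/* Hypothetical application state that would be assigned to parser->UserData. */
typedef struct AppState {
    int *depthCounts;   /* scratch data that lives as long as one document */
    int  capacity;
} AppState;

/* Shaped like XML_EVENT_HANDLER: allocate per-document data. */
int StartDocument(void *UserData)
{
    AppState *st = (AppState *)UserData;
    st->capacity = 16;
    st->depthCounts = calloc((size_t)st->capacity, sizeof *st->depthCounts);
    return st->depthCounts ? XML_OK : XML_ABORT;   /* abort parsing if allocation fails */
}

/* Shaped like XML_EVENT_HANDLER: release whatever StartDocument allocated. */
int EndDocument(void *UserData)
{
    AppState *st = (AppState *)UserData;
    free(st->depthCounts);      /* free(NULL) is a no-op, so a failed start is harmless */
    st->depthCounts = NULL;
    st->capacity = 0;
    return XML_OK;
}
```

Because endDocument is paired with startDocument even when parsing fails, the clean-up code does not need to be repeated in errorHandler.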



startElementHandler

int (*XML_START_ELEMENT_HANDLER)(void *UserData, const XMLCH *uri, const XMLCH *localName, 
			         const XMLCH *qName, LPXMLVECTOR atts);

endElementHandler

int (*XML_END_ELEMENT_HANDLER)(void *UserData, const XMLCH *uri, const XMLCH *localName,
			       const XMLCH *qName);

These are similar to the SAX2 events. Note that when the uri or localName parameter is unavailable its value is not NULL but an empty string, which can be tested with

if (*uri) ...

Unlike SAX, qName always has a valid value. localName is an empty string when the element doesn't belong to any namespace or belongs to the default namespace (note: this changed starting from v.0.9.3). Namespace attributes xmlns:XXX and xmlns belong to the predefined http://www.w3.org/2000/xmlns/ namespace, and predefined xml:XXX attributes belong to http://www.w3.org/XML/1998/namespace.

Common attribute enumeration example:

if (atts->length) 
{
    int i;
    LPXMLRUNTIMEATT att;
    
    printf("Tag %s has %d attributes:\n", qName, atts->length);
    for (i=0; i<atts->length; i++) {
        att = (LPXMLRUNTIMEATT) XMLVector_Get(atts, i);
        
        if (*att->uri)
            printf("  Name: %s Value: %s Prefix: %s LocalName: %s Uri: %s\n", 
                att->qname, att->value,
                        att->prefix, att->localName, 
                            att->uri);
        else 
            printf("  Name: %s Value: %s\n", 
                att->qname, att->value);
        
        
    }
}


An example of getting a named attribute using the XMLParser_GetNamedItem function:
(this example assumes that UserData contains the current parser object in the StartElement callback)

LPXMLRUNTIMEATT att = XMLParser_GetNamedItem((LPXMLPARSER)UserData, "myattribute");
if (att)
    printf("myattribute has value: %s\n", att->value);



Some SAX attribute handling methods and their Parsifal equivalents:

SAX                    Parsifal
atts.getQName(index)   ((LPXMLRUNTIMEATT)XMLVector_Get(atts, index))->qname
atts.getValue(name)    ((LPXMLRUNTIMEATT)XMLParser_GetNamedItem(parser, name))->value


charactersHandler

int (*XML_CHARACTERS_HANDLER)(void *UserData, const XMLCH *chars, int cbSize);

Reports character data. Parameter cbSize is the length of the buffer chars, which is not NUL terminated.

Whether the parser reports data in a single chunk or splits it into several chunks depends on the parser configuration. If you've compiled the Parsifal library without DTD_SUPPORT, or the flag XMLFLAG_PRESERVE_GENERAL_ENTITIES is TRUE, Parsifal calls charactersHandler once for every run of text content. Character references and predefined entities (e.g. &#60; and &lt;) are always expanded into this contiguous data. When general entities are in use, however, Parsifal splits the data into several chunks and you must use a stringbuffer to collect it. A common way to handle this:

  1. Initialize stringbuffer in startElement (or in endElement, see examples)
  2. Append data into stringbuffer in charactersHandler call
  3. Store the buffer in endElement.

charactersHandler, ignorableWhitespaceHandler and commentHandler all share the same signature, so a common trick is to declare one handler function and assign it to all XML_CHARACTERS_HANDLER members. Note that only charactersHandler splits its data into several chunks; the other handlers report contiguous data.
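
The three collection steps above might look like the sketch below. It deliberately uses a small hand-rolled buffer (the hypothetical TextBuf) rather than the XMLStringbuf API, so that the example does not guess at signatures not shown on this page. XMLCH and XML_ABORT are stand-in definitions here; the real ones come from Parsifal's headers.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned char XMLCH;   /* assumption: stand-in for Parsifal's character type */
#define XML_OK 0
#define XML_ABORT 1            /* assumption: stand-in value */

/* Hypothetical growable text buffer; XMLStringbuf plays this role in Parsifal. */
typedef struct TextBuf {
    char  *data;
    size_t len;
    size_t cap;
} TextBuf;

/* Step 1: forget old content but keep the allocation (call from startElement). */
void TextBuf_Reset(TextBuf *b) { b->len = 0; }

/* Step 2: append one chunk; shaped like XML_CHARACTERS_HANDLER. */
int Characters(void *UserData, const XMLCH *chars, int cbSize)
{
    TextBuf *b = (TextBuf *)UserData;
    size_t need = b->len + (size_t)cbSize + 1;   /* +1 for the terminator */
    if (need > b->cap) {
        size_t newCap = b->cap ? b->cap : 64;
        char *p;
        while (newCap < need) newCap *= 2;
        p = realloc(b->data, newCap);
        if (!p) return XML_ABORT;                /* out of memory: stop parsing */
        b->data = p;
        b->cap = newCap;
    }
    memcpy(b->data + b->len, chars, (size_t)cbSize);   /* chars is not NUL terminated */
    b->len += (size_t)cbSize;
    b->data[b->len] = '\0';
    return XML_OK;
}
```

Step 3 then happens in endElement, where b->data holds the complete text no matter how many chunks were delivered.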

startCDATAHandler reports the start of a CDATA section. The contents of the CDATA section are reported through the characters callback; the startCDATAHandler call is intended only to report the boundary. endCDATAHandler reports the end of a CDATA section.



ignorableWhitespaceHandler

int (*XML_CHARACTERS_HANDLER)(void *UserData, const XMLCH *chars, int cbSize);

Reports "whitespace only" content.

If you're using Parsifal in validating mode, ignorableWhitespaceHandler works exactly as the XML 1.0 spec defines it: whitespace-only content is ignorable only in the content of elements that aren't declared as #PCDATA elements. Thus when validating, the description below becomes meaningless - it describes using this handler in non-validating mode only.

<root>CRLF
SPACE<node>CRLF
TAB<data>text</data>CRLF
</node></root>

SPACE, CRLF and TAB (shown as alias values, meaning real tab etc. characters) here are reported in ignorableWhitespaceHandler. xml:space attribute isn't currently handled by Parsifal although xml namespace is special "pre-defined" namespace and Parsifal tracks the scope of all xml prefixed attributes for you -  see XMLParser_GetPrefixMapping for more info on handling xml:space.

Note that if data tag would have been <data>textCRLF</data>, CRLF in element content wouldn't be reported in ignorableWhitespaceHandler but charactersHandler (whitespace in the middle of content isn't considered ignorable whitespace). Parsifal normalizes CRLF pairs and single CR into single LF character. The only way to insert CR into parser character data is to use &#13; character reference or something like
<!ENTITY CRLF "&#13;&#10;">
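
To make the line-end rule concrete, the function below applies the same mapping (CRLF pair or single CR becomes one LF) to a buffer. It is only an illustration of the rule as described here, not Parsifal's internal code, and NormalizeEol is a hypothetical name.

```c
#include <stddef.h>

/* Rewrites buf in place: CRLF -> LF, lone CR -> LF. Returns the new length. */
size_t NormalizeEol(char *buf, size_t len)
{
    size_t r = 0, w = 0;
    while (r < len) {
        if (buf[r] == '\r') {
            buf[w++] = '\n';
            r++;
            if (r < len && buf[r] == '\n') r++;   /* swallow the LF of a CRLF pair */
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```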

If whitespace is significant for you, you should set ignorableWhitespaceHandler to point to charactersHandler. Non-validating parsers should actually treat all whitespace as significant but making parsifal non-conformant here and calling ignorableWhitespaceHandler can save you some coding when you want to distinguish between whitespace and content - it's easier/faster this way, you don't have to check the whitespace in charactersHandler.
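
For comparison, this is roughly the check you would otherwise have to perform yourself inside charactersHandler. IsWhitespaceOnly is a hypothetical helper and the character set is the XML 1.0 whitespace set (space, tab, CR, LF).

```c
/* Returns 1 when the chunk holds only XML whitespace, 0 otherwise.
   chars is not NUL terminated, so the length is passed explicitly. */
int IsWhitespaceOnly(const unsigned char *chars, int cbSize)
{
    int i;
    for (i = 0; i < cbSize; i++) {
        unsigned char c = chars[i];
        if (c != ' ' && c != '\t' && c != '\r' && c != '\n') return 0;
    }
    return 1;
}
```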



processingInstructionHandler

int (*XML_PI_HANDLER)(void *UserData, const XMLCH *target, const XMLCH *data);

Parameters:

target Processing instruction target.
data Processing instruction data if present, empty string otherwise.

Parsifal skips all whitespace between target and data i.e. in <?targetTABTABTABTABTABI'm dataTABTABTAB?> first tabs will be skipped but tabs at the end will be included in the data.
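
The effect on the data parameter can be pictured with the small splitter below. It is a sketch of the behaviour described above, not the parser's own code; SplitPI and IsXmlSpace are hypothetical names and the input is assumed to be the text between "<?" and "?>".

```c
#include <stddef.h>

static int IsXmlSpace(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Stores the target length in *targetLen and returns the offset where data begins. */
size_t SplitPI(const char *content, size_t len, size_t *targetLen)
{
    size_t i = 0;
    while (i < len && !IsXmlSpace(content[i])) i++;   /* target runs to the first whitespace */
    *targetLen = i;
    while (i < len && IsXmlSpace(content[i])) i++;    /* separator whitespace is dropped */
    return i;                                         /* the rest, trailing tabs included, is data */
}
```

If trailing whitespace in the data is not wanted, it has to be trimmed in the handler.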



resolveEntityHandler/externalEntityParsedHandler

int (*XML_RESOLVE_ENTITY_HANDLER)(void *UserData, LPXMLENTITY entity, LPBUFFEREDISTREAM reader);

Parser calls this handler when it encounters reference to external general entity in element content (e.g. &ent2; in entities) or reference to external parameter entity within DTD (e.g. %latin1; in entities) or when !DOCTYPE tag has systemID and publicID (optional) specified.

Normally you use systemID to resolve external resource (you can think the resource is a file if you like). 

You must set reader->inputData to point to the resource you open, usually based on entity->systemID. By default reader->inputSrc is NULL, which means that the reader will use the same input source as your main document (specified in XMLParser_Parse). If you don't set reader->inputData (you leave it at its default NULL), the entity won't be parsed but silently skipped (you can call skippedEntity yourself if you like!).

If you fail to open/initialize inputData for the reader, you might want to return XML_ABORT. You must implement some kind of error reporting yourself since Parsifal's inputsource handling is very "abstract". Using XML_ABORT in conjunction with your own "failed to open file entity->systemID" message, for example, would do the trick. See example.

Once the parser has finished parsing your inputData you'll get an externalEntityParsedHandler call, which gives you a chance to close/free your reader->inputData.

If the parser wants to open an external DTD, entity->type is XML_ENTITY_DOCTYPE and entity->name is "[dtd]". Usually this information isn't needed since there's no difference between opening an external DTD and an external entity - for instance the following example handles both situations. If the parser isn't opening an external DTD, the entity type is XML_ENTITY_EXT_GEN or XML_ENTITY_EXT_PARAM. (See parsifal.h for more type info).

resolveEntityHandler example:
(this example also takes special xml:base attribute into account) 

See xmlplint/uriresolver.h for more complete uri resolver

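int ResolveEntity(void *UserData, LPXMLENTITY entity, LPBUFFEREDISTREAM reader)
{
  FILE *f;
  XMLCH filename[MAX_FILE]; 
  XMLCH *base = XMLParser_GetPrefixMapping((LPXMLPARSER)UserData, "xml:base");
  size_t baseLen = base ? strlen(base) : 0;
  
  /* refuse paths that would overflow filename (terminating NUL included) */
  if (baseLen + strlen(entity->systemID) >= MAX_FILE) {
    printf("path too long for entity '%s'!\n", entity->systemID);
    return XML_ABORT;
  }
  
  if (base) {
    strcpy(filename, base);
    strcat(filename, entity->systemID);
  }
  else {
    strcpy(filename, entity->systemID);
  }
  
  if (!(f = fopen(filename, "rb"))) {
    printf("error opening file '%s'!\n", filename);
    return XML_ABORT;
  }
  
  reader->inputData = f; 
  return XML_OK;
}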

Example assumes that parser uses FILE* stream (uses same reader->inputSrc as provided in XMLParser_Parse call)


The subsequent externalEntityParsedHandler call can simply contain:

fclose((FILE*)reader->inputData);

Note that externalEntityParsedHandler isn't called if you don't specify reader->inputData in resolveEntityHandler.



skippedEntityHandler

int (*XML_SKIPPED_ENTITY_HANDLER)(void *UserData, const XMLCH *Name);

Parsifal calls this handler if it encounters reference to general entity in text content for which it has not seen the declaration or if it encounters reference to external general entity and XMLFLAG_EXTERNAL_GENERAL_ENTITIES flag is FALSE.

If XMLFLAG_UNDEF_GENERAL_ENTITIES flag is set, parsifal triggers ERR_XMLP_UNDEF_ENTITY error for every undeclared entity.



xmlDeclHandler

int (*XML_XMLDECL_HANDLER)(void *UserData, const XMLCH *version, const XMLCH *encoding, const XMLCH *standalone);

Reports the XML declaration. Note that although external entities/DTDs might contain an XML declaration (called a text declaration there), it isn't reported via xmlDeclHandler. If a parameter is NULL, the XML declaration doesn't specify that attribute (the version parameter is mandatory and is never NULL).
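
A handler therefore has to guard the encoding and standalone parameters before using them. The sketch below assumes a hypothetical DeclInfo structure as UserData and uses a stand-in typedef for XMLCH; with the real library both XMLCH and XML_OK come from parsifal.h.

```c
#include <stdio.h>

typedef unsigned char XMLCH;   /* assumption: stand-in for Parsifal's character type */
#define XML_OK 0

/* Hypothetical record of which optional attributes the declaration carried. */
typedef struct DeclInfo {
    int hasEncoding;
    int hasStandalone;
} DeclInfo;

/* Shaped like XML_XMLDECL_HANDLER. */
int XmlDecl(void *UserData, const XMLCH *version,
            const XMLCH *encoding, const XMLCH *standalone)
{
    DeclInfo *info = (DeclInfo *)UserData;
    info->hasEncoding = (encoding != NULL);
    info->hasStandalone = (standalone != NULL);

    /* version is never NULL; the other two are replaced by a label when absent */
    printf("XML %s, encoding: %s, standalone: %s\n",
           (const char *)version,
           encoding ? (const char *)encoding : "(not specified)",
           standalone ? (const char *)standalone : "(not specified)");
    return XML_OK;
}
```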



startDTDHandler

int (*XML_START_DTD_HANDLER)(void *UserData, const XMLCH *Name, const XMLCH *publicId,
                             const XMLCH *systemId, int hasInternalSubset);

Parameters:

Name Name of the document (root) element.
publicId publicId or NULL if not specified. Value is trimmed and normalized (linefeeds etc. are removed) but publicID or systemID is not validated by parsifal i.e. they're not necessarily valid uris, filenames etc.
systemId systemId or NULL if not specified. 
hasInternalSubset (boolean) whether XML document contains internal subset

Reports start of the document type declaration (!DOCTYPE tag). startDTDHandler and endDTDHandler are mainly used when roundtripping/serializing input document to output (you want to preserve !DOCTYPE data from input document to output document). If document contains internal DTD subset (stuff between [ and ]), its data will be reported via declaration handlers such as entityDeclHandler or via processingInstructionHandler and commentHandler (depending on what's in the subset of course) between startDTDHandler and endDTDHandler call.

Here's an example !DOCTYPE tag (the result is complex looking but it shows you the order DTD events are executed, normally you don't use most of these events, you probably only implement  resolveEntityHandler and externalEntityParsedHandler)


<!DOCTYPE mydoc SYSTEM "mydoc.dtd" [
  <!-- this is my doctype -->
  <!ENTITY ent1 "wheeeeooo">
]>

Will result in:

  1. startDTDHandler - Name: "mydoc", publicId: NULL, systemId: "mydoc.dtd", hasInternalSubset: 1
  2. commentHandler - " this is my doctype "
  3. entityDeclHandler - reports entity declaration for entity ent1. see entityDeclHandler
  4. startEntityHandler - Name: "[dtd]", type: XML_ENTITY_DOCTYPE, systemId: "mydoc.dtd" ...
  5. resolveEntityHandler - this lets you open file/resource for entity "[dtd]"
    - Here you get declaration events for external DTD
  6. externalEntityParsedHandler - this lets you close/free file/resource for entity "[dtd]" (this call happens only if resolveEntityHandler parsed the external DTD)
  7. endEntityHandler same params as for startEntityHandler
  8. endDTDHandler

Calls 4-7 happened because SYSTEM "mydoc.dtd" specified an external DTD that needed to be loaded. The internal subset is parsed first, and if there are duplicate entity declarations, attribute declarations etc., declarations in the internal subset have higher priority than declarations in the external DTD. Note also that the parser doesn't report comments and processing instructions if they reside in the external DTD.



elementDeclHandler

int (*XML_ELEMENTDECL_HANDLER)(void *UserData,
    const XMLCH *name, void *contentModel);

Report an element type declaration.

From SAX docs: The content model will consist of the string "EMPTY", the string "ANY", or a parenthesised group, optionally followed by an occurrence indicator. The model will be normalized so that all parameter entities are fully resolved and all whitespace is removed, and will include the enclosing parentheses.

contentModel parameter is of type void* and you can cast it to XMLCH* unless XMLFLAG_REPORT_DTD_EXT has been set TRUE, then contentModel contains tree representation of content particles for declared element (when for example validating).

See dtddecl.c for more info.



attributeDeclHandler

int (*XML_ATTRIBUTEDECL_HANDLER)(void *UserData, const XMLCH *eName,
    const XMLCH *aName, int type, const XMLCH *typeStr, int valueDef,
    const XMLCH *def);

Report an attribute type declaration.

Parameters:

eName The name of the associated element
aName The name of the attribute
type Attribute type. See XMLATTDECL_TYPE_ types in parsifal.h
typeStr Attribute type string. If type parameter is XMLATTDECL_TYPE_NOTATION or XMLATTDECL_TYPE_ENUMERATED, this parameter contains a parenthesized token group with the separator "|" and all whitespace removed.
valueDef XMLATTDECL_DEF_FIXED, XMLATTDECL_DEF_REQUIRED, XMLATTDECL_DEF_IMPLIED or 0.
def Attribute default value or NULL if there is none.

See dtddecl.c for more info.



entityDeclHandler

int (*XML_ENTITY_EVENT_HANDLER)(void *UserData, LPXMLENTITY entity);

Report an entity declaration.

LPXMLENTITY members:

type Entity type XMLENTITYTYPE. See parsifal.h
len length of value parameter if this is internal general or internal parameter entity
name Name of entity. Starts with '%' if this is parameter entity.
value Value of an internal general or internal parameter entity. NULL otherwise.
publicID, systemID and notation XMLCH * or NULL if not available.

See parsifal.h and dtddecl.c for more info.



notationDeclHandler

int (*XML_NOTATIONDECL_HANDLER)(void *UserData, const XMLCH *name,
    const XMLCH *publicID, const XMLCH *systemID);

Report a notation declaration.

At least one of publicId and systemId must be non-NULL

See dtddecl.c for more info.



encodingAliasHandler

XMLCH* (*XML_ENCODINGALIAS_HANDLER)(void *UserData, const XMLCH *enc);

Sets encoding alias for encoding specified in enc parameter (this value comes from encoding detection). Return NULL if you don't want to set alias. This handler can also be used to completely override document encoding by simply returning wanted encoding. Note that in that case you should also specify some bogus encoding in XMLParser_Parse to ensure that this handler gets called - this bogus encoding will then be the value of enc parameter. Note also that document might trigger multiple calls to this handler because of external entities/dtd etc. and they might all use different encoding.


XMLCH* encAlias(void *UserData, const XMLCH *enc)
{
  return (!strcmp(enc, "windows-1252")) ? "ISO-8859-1" : NULL;
}


startEntityHandler/endEntityHandler

int (*XML_ENTITY_EVENT_HANDLER)(void *UserData, LPXMLENTITY entity);

Reports entity boundaries.



errorHandler

void (*XML_ERROR_HANDLER)(LPXMLPARSER parser);

Examine ErrorCode, ErrorString, ErrorLine and ErrorColumn for error information. You can also check the XMLParser_Parse return value, which will be 0 if an error occurred.

Note: If you don't specify errorHandler, only ErrorCode will have valid value (this is like using parser in unattended mode when you don't need detailed error info).

See parsifal.h for list of XMLERRCODE values


Error handler example:

void ErrorHandler(LPXMLPARSER parser) 
{
  /* you should treat ERR_XMLP_ABORT as "user error" and give some kind of
     description before returning from callbacks, otherwise we present parser error: */
     
  if (parser->ErrorCode != ERR_XMLP_ABORT) {
    XMLCH *SystemID = XMLParser_GetSystemID(parser); 		
    printf("\nParsing Error: %s\nCode: %d", parser->ErrorString, parser->ErrorCode);
    if (SystemID) printf("\nSystemID: '%s'", SystemID);
  }
  printf("\nLine: %d Column: %d", parser->ErrorLine, parser->ErrorColumn);
}


Property members of LPXMLPARSER struct


Name             Type    Read-only  Description
DocumentElement  XMLCH*  yes        root element of current document (test for !NULL)
UserData         void*   no         UserData passed to all callbacks
ErrorCode        int     yes        see errorHandler
ErrorString      XMLCH*  yes        see errorHandler
ErrorLine        int     yes        see errorHandler
ErrorColumn      int     yes        see errorHandler


XMLFlags


Name Default Description
XMLFLAG_NAMESPACES TRUE Namespace processing
XMLFLAG_NAMESPACE_PREFIXES FALSE Report namespace (xmlns and xmlns:XXX) attributes in startElementHandler
XMLFLAG_EXTERNAL_GENERAL_ENTITIES TRUE Process external general entities (if FALSE, they are skipped)
XMLFLAG_PRESERVE_GENERAL_ENTITIES FALSE Preserve all general entity references (character references, predefined entities and parameter entities are expanded. See entities). You can set this flag during parsing but don't set it in the midst of charactersHandler calls! This flag overrides all other entity flags.
XMLFLAG_UNDEF_GENERAL_ENTITIES FALSE Setting this flag makes parsifal trigger ERR_XMLP_UNDEF_ENTITY error when it encounters undefined general entity  instead of skippedEntity call.
XMLFLAG_PRESERVE_WS_ATTRIBUTES FALSE By default all attribute values are normalized (linefeeds and extra space removed) and trimmed. Set this flag to prevent attribute normalization.
XMLFLAG_CONVERT_EOL TRUE Always TRUE starting from v0.7.2. By default all CRLF pairs and CRs are converted into single LF. Set this flag to false if you want to preserve linefeeds as they appear in the input document.
XMLFLAG_REPORT_DTD_EXT FALSE Reports element declaration content particles as tree structure in the elementDeclHandler and gathers additional attribute declaration information from DTDs. Currently used only internally (not documented).
XMLFLAG_VALIDATION_WARNINGS FALSE When this flag is set validation errors are treated as warnings - this means that parsing continues after validation errors. Tip: You might want to set your startElement etc. handlers to NULL when first validation warning occurs or set some user flag to inform your handlers that from this point on you're only collecting warnings/parsing errors and not building any data model in your handlers.
XMLFLAG_SPLIT_LARGE_CONTENT FALSE Splits character handler calls when content bytes grow larger than 4000 bytes (not exact value, you can't rely on that). When parsing for example documents that contain large binary content w/o this flag memory usage might become a problem and performance might degrade. See also characters handler.
XMLFLAG_USE_SIMPLEPULL FALSE Uses simple pull/progressive model for parsing. See XMLParser_HasMoreEvents


These flags can be set by macro _XMLParser_SetFlag(LPXMLPARSER parser, [FLAG], [TRUE/FALSE]).

Values can be retrieved by calling macro _XMLParser_GetFlag(LPXMLPARSER parser, [FLAG]).



Methods


XMLParser_Create

LPXMLPARSER XMLParser_Create(LPXMLPARSER *parser);

Creates new parser. Parser parameter is an address of pointer which will be allocated and prepared by XMLParser_Create. Return value and pointer will be NULL if parser can't be created due to memory allocation problem.

After parser has been created, you can reuse it calling XMLParser_Parse multiple times or you can create for example pool of pre-allocated parser objects for some special case.



XMLParser_Parse

int XMLParser_Parse(LPXMLPARSER parser, LPFNINPUTSRC inputSrc, 
                    void *inputData, const XMLCH *encoding);

Parameters:

parser Pointer to parser object
inputSrc Pointer to callback function that will feed data to parser.
inputData For example FILE* to be passed to inputSrc function
encoding Sets document encoding. Note that a forced encoding doesn't override the encoding declaration or BOM (see encodingAliasHandler for info about overriding encodings) and should be used only when the document doesn't contain an encoding declaration and/or BOM, i.e. when the encoding is specified in a MIME header etc. A forced encoding can in fact conflict with the encoding set by the BOM or XML declaration. Set NULL (recommended) to let the BOM, the document encoding declaration or the default UTF-8 be used.

Valid values are:
  • NULL (most of the time)
  • UTF-8
  • ISO-8859-1
  • US-ASCII
  • UTF-16, UTF-32 etc. when compiled with GNU libiconv support. See Notes about encodings
Case is insignificant.

inputSrc function has the following prototype:


int (*LPFNINPUTSRC)(BYTE *buf, int cBytes, int *cBytesActual, void *inputData);


Parser will send read request to this callback each time it needs more data. buf is allocated for cBytes, set cBytesActual to byte count you actually put into buf.

Return TRUE if EOF was reached or an error occurred. EOF can normally be determined as follows:


return (*cBytesActual < cBytes);


NOTE: If you want to distinguish between end of stream (EOF) and a stream error, return BIS_ERR_INPUT on a stream error and BIS_EOF (or TRUE) on EOF. Returning BIS_ERR_INPUT triggers the ERR_XMLP_IO error. An explicit stream error check matters because a document, and more importantly an external entity, can be well-formed and parse successfully even though a stream error occurred! For this reason it's always good practice to do error checking in the inputsource callback.

One option for error checking is to check feof(inputData) after parsing. In fact this is the preferred method, and it works well when you wrap the XMLParser_Parse call to handle different input sources.


Here's an example FILE* inputsrc that handles ferror() too:

int cstream(BYTE *buf, int cBytes, int *cBytesActual, void *inputData)
{
	*cBytesActual = fread(buf, 1, cBytes, (FILE*)inputData);	
	if (ferror((FILE*)inputData)) return BIS_ERR_INPUT;
	
	return (feof((FILE*)inputData)) ? BIS_EOF : 0;
}

Samples include an example of parsing C file stream and parsing from Windows specific url stream. Xmlplint project contains curlread.c and curlread.h for parsing libcurl specific input sources like http and ftp. See also isrcmem.h for info on parsing memory buffers.
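
As a further illustration of the LPFNINPUTSRC contract, here is a minimal in-memory input source. It is a hand-rolled sketch and not the isrcmem.h API: MemSource and MemInputSrc are hypothetical names, BYTE is a stand-in typedef, and BIS_EOF is given a stand-in value because this page only says it is interchangeable with TRUE.

```c
#include <string.h>

typedef unsigned char BYTE;   /* assumption: stand-in, Parsifal's headers define BYTE */
#define BIS_EOF 1             /* assumption: stand-in value, use the library constant */

/* Hypothetical cursor over an in-memory document, passed as inputData. */
typedef struct MemSource {
    const char *data;
    size_t      size;
    size_t      pos;
} MemSource;

/* Shaped like LPFNINPUTSRC: copy up to cBytes, report the count, flag EOF. */
int MemInputSrc(BYTE *buf, int cBytes, int *cBytesActual, void *inputData)
{
    MemSource *src = (MemSource *)inputData;
    size_t left = src->size - src->pos;
    size_t n = left < (size_t)cBytes ? left : (size_t)cBytes;

    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    *cBytesActual = (int)n;

    /* same rule as above: a short read means the source is exhausted;
       a memory buffer cannot fail, so there is no BIS_ERR_INPUT path */
    return (*cBytesActual < cBytes) ? BIS_EOF : 0;
}
```

With the real library this would be passed as XMLParser_Parse(parser, MemInputSrc, &src, NULL), where src is a MemSource initialized with the document bytes.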



XMLParser_GetNamedItem

LPXMLRUNTIMEATT XMLParser_GetNamedItem(LPXMLPARSER parser, XMLCH *name);

Returns attribute of type LPXMLRUNTIMEATT by name or NULL if there isn't such attribute.
Call this function only in startElementHandler when attributes are valid!



XMLParser_GetPrefixMapping

XMLCH *XMLParser_GetPrefixMapping(LPXMLPARSER parser, const XMLCH *prefix);

Returns uri for prefix or NULL if there isn't such prefix in scope. Parsifal tracks also the scope for predefined xml prefix i.e. xml:space, xml:base or xml:whatever can be retrieved by calling XMLParser_GetPrefixMapping(parser, "xml:space");

Tip: checking xml:space equals 'preserve' in ignorableWhitespaceHandler and delegating call to charactersHandler is method for handling xml:space properly. Of course using validation and declaring #PCDATA is the best way to handle whitespace.



XMLParser_GetSystemID / GetPublicID

XMLCH *XMLParser_GetSystemID(LPXMLPARSER parser);

Returns systemID/publicID for current external entity or NULL if parsing the main document.
These functions (and XMLParser_GetCurrentEntity) can be used to provide more information on error condition for example. They return meaningful values only during parsing i.e. in the callback events - usually you call these in errorHandler. Return value of these functions is undefined AFTER errorHandler event call (when parser->ErrorCode has been set)



XMLParser_GetCurrentLine

int XMLParser_GetCurrentLine(LPXMLPARSER parser);

Returns line number of current parsing position. Return value -1 means that position info isn't available

NOTE: during parsing internal entity line and column can be relative to internal entity string not current document position. If you need REALLY accurate error information you must call XMLParser_GetCurrentEntity (LPXMLENTITY or NULL if parsing main document). Note also that parser's line and column information isn't file/resource byte offset information.



XMLParser_GetCurrentColumn

int XMLParser_GetCurrentColumn(LPXMLPARSER parser);

Returns column number of current parsing position (in UTF-8 characters - not bytes). Return value -1 means that position info isn't available



XMLParser_GetContextBytes

int XMLParser_GetContextBytes(LPXMLPARSER parser, XMLCH **Bytes, int *cBytes);

Returns column number of current parsing position (in bytes 0-based). Bytes parameter can be address of XMLCH* that receives context bytes. cBytes receives length of Bytes. Both might be NULL when you need only the return value. Note that Bytes might be NUL terminated before cBytes and it might not be complete line, see helper.c for example of handling context bytes. Return value -1 means that position info/context isn't available



XMLParser_GetCurrentEntity

LPXMLENTITY XMLParser_GetCurrentEntity(LPXMLPARSER parser);

Returns LPXMLENTITY or NULL if parsing main document. This info can be used for determining accurate error position - whether parser was processing internal entity when error occurred etc. Return value of this function is undefined AFTER errorHandler event call (when parser->ErrorCode has been set)



XMLParser_SetExternalSubset

LPXMLENTITY XMLAPI XMLParser_SetExternalSubset(LPXMLPARSER parser,
                const XMLCH *publicID, const XMLCH *systemID);

This function can be used to set external subset (DTD) for documents that have no DOCTYPE declaration or to set additional DTD to be loaded before any other DTD (if present) gets loaded. This means that startDTDHandler and endDTDHandler are always called once (no matter whether there is only external DTD set by this function or both DOCTYPE declaration and DTD set by this function)

XMLParser_SetExternalSubset returns pointer to LPXMLENTITY object (return value is always valid LPXMLENTITY) - this can be used to distinguish user DTD from other external DTD in resolveEntityHandler.

To turn off user subset loading call this with NULL values for publicID and systemID parameters.

See also SAX2 java API getExternalSubset



XMLParser_HasMoreEvents

int XMLAPI XMLParser_HasMoreEvents(LPXMLPARSER parser);

When XMLFLAG_USE_SIMPLEPULL flag is set XMLParser_Parse (or ParseValidateDTD) only initializes parser and comes out of the event loop without parsing. You then call this method to trigger/parse next event(s). This feature can be used to implement pull parser (for language bindings or XMPP etc. parser) on top of Parsifal. See xmlreader sample which implements pull parser and xmlreader README which explains the technical details.

XMLParser_Free

void XMLParser_Free(LPXMLPARSER parser);

Frees parser and its resources.



XMLNormalizeBuf

int XMLNormalizeBuf(XMLCH *buf, int len);

XMLNormalizeBuf can be used to normalize character content. Normalization converts all LFs, CRs, CRLF pairs and TABS in buf into single space and trims the start and end of the buffer. XMLNormalizeBuf returns length of normalized buffer (most likely less than len after normalization) but it doesn't nul terminate the buffer. You must do this yourself.



XMLVector...

Dynamically allocating array implementation that is also exported/available for end-user of Parsifal. See XMLVector.c and XMLVector.h for details. Samples demonstrate XMLVector usage. Samples (and Parsifal internally) use XMLVector as stack too. xmldef.h provides stack macro wrappers for XMLVector.


XMLStringbuf...

Simple stringbuffer implementation that is also exported/available for end-user of Parsifal. See XMLStringbuf.c and XMLStringbuf.h for details. Sample files demonstrate XMLStringbuf usage.


XMLPool...

Can be used in conjunction with XMLStringbuf for allocating fixed length strings etc. See also test_pool.c for an example of using XMLVector, XMLStringbuf and XMLPool.


XMLHtable...

Hashtable implementation. Can be used to store generic data. See xmlhash.h




Validation


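DTD validation with Parsifal requires the following steps in addition to creating the usual LPXMLPARSER object:

  1. Create an LPXMLDTDVALIDATOR object by calling XMLParser_CreateDTDValidator
  2. Call XMLParser_ParseValidateDTD instead of XMLParser_Parse
  3. Free the LPXMLDTDVALIDATOR object by calling XMLParser_FreeDTDValidator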

You can reuse LPXMLDTDVALIDATOR objects and call XMLParser_ParseValidateDTD multiple times without doing any cleaning up etc. LPXMLDTDVALIDATOR is "tied into" LPXMLPARSER only during parsing (during XMLParser_ParseValidateDTD call) so XMLParser_FreeDTDValidator can also be called independently of XMLParser_Free.

Validation function prototypes:

LPXMLDTDVALIDATOR XMLAPI XMLParser_CreateDTDValidator(void);
void XMLAPI XMLParser_FreeDTDValidator(LPXMLDTDVALIDATOR dtd);
int XMLAPI XMLParser_ParseValidateDTD(LPXMLDTDVALIDATOR dtd,
    LPXMLPARSER parser, LPFNINPUTSRC inputSrc, 
    void *inputData, const XMLCH *encoding);

An example (assumes LPXMLPARSER has been created):

  LPXMLDTDVALIDATOR dtd = XMLParser_CreateDTDValidator();
  if (!dtd) {
    puts("Out of memory!");
    return 1;
  }
  parser->startElementHandler = MyStartElement;
  XMLParser_ParseValidateDTD(dtd, parser, cstream, stdin, 0);
  XMLParser_FreeDTDValidator(dtd);

Important:

When parsing with XMLParser_ParseValidateDTD, UserData parameter for your LPXMLPARSER will be LPXMLDTDVALIDATOR (which contains your LPXMLPARSER as one of its members called simply 'parser'). LPXMLDTDVALIDATOR contains UserData and extra int UserFlag that you can use.

Another thing is that if you want to change your handlers during validation, you must change the handlers that are members of the LPXMLDTDVALIDATOR object (see dtdvalid.h). Modifying the same handlers in your LPXMLPARSER has no effect. Also, the parser's elementDeclHandler gets called with a content particle tree structure and not with a string representation of the content model, so if you need the string representation you must study parsifal.c, the ContentModelToString() function etc.


Validation errors/warnings

When validation errors occur, XMLParser_ParseValidateDTD returns false, and the parser's ErrorHandler (if specified) is called before that as usual. If ErrorCode is ERR_XMLP_VALIDATION, the parser's ErrorColumn, ErrorString etc. WILL NOT BE SET; use ErrorCode, ErrorString etc. of the LPXMLDTDVALIDATOR instead. If XMLFLAG_VALIDATION_WARNINGS has been set, parsing continues after the "error" (which in this case is merely a warning) has been reported to ErrorHandler.

An example ErrorHandler:

void ErrorHandler(LPXMLPARSER parser) 
{
    if (parser->ErrorCode == ERR_XMLP_VALIDATION) {
        LPXMLDTDVALIDATOR vp = (LPXMLDTDVALIDATOR)parser->UserData;
        printf("Validation Error: %s\nErrorLine: %d ErrorColumn: %d\n", 
            vp->ErrorString, vp->ErrorLine, vp->ErrorColumn);
    }
    else {
        printf("Parsing Error: %s\nErrorLine: %d ErrorColumn: %d\n", 
            parser->ErrorString, parser->ErrorLine, parser->ErrorColumn);
    }
}


DTD, namespaces and selective (filtering) validation

You can use DTDValidate_StartElement and the other DTDValidate... handlers specified in dtdvalid.h to filter and selectively validate documents. This feature also makes it possible to correctly validate elements that are namespace prefixed: in your StartElement filter you just check that uri matches your preferred one and then pass localName as qName to DTDValidate_StartElement.
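
A minimal sketch of such a filter is shown below. It is not taken from nsvalid.c. So that it compiles without the library, XMLCH, LPXMLVECTOR and DTDValidate_StartElement are replaced by stand-ins (the stub is assumed to take the same parameters as a startElementHandler); real code would include parsifal.h and dtdvalid.h instead. MY_URI and FilterStartElement are hypothetical names, and what to do with elements from other namespaces is an application decision (here they are simply not validated).

```c
#include <string.h>

/* Stand-ins so the sketch compiles on its own; real code gets these
   from parsifal.h and dtdvalid.h. */
typedef unsigned char XMLCH;
typedef struct VectorStandIn *LPXMLVECTOR;

/* Stub for DTDValidate_StartElement: records the qName it was given. */
const XMLCH *lastValidatedName = NULL;

int DTDValidate_StartElement(void *UserData, const XMLCH *uri,
    const XMLCH *localName, const XMLCH *qName, LPXMLVECTOR atts)
{
    (void)UserData; (void)uri; (void)localName; (void)atts;
    lastValidatedName = qName;
    return 0;
}

/* Hypothetical namespace that the DTD is written for. */
#define MY_URI "http://www.example.org/ns"

/* Filter to be installed as the start element handler. */
int FilterStartElement(void *UserData, const XMLCH *uri,
    const XMLCH *localName, const XMLCH *qName, LPXMLVECTOR atts)
{
    (void)qName; /* the prefixed name is deliberately not validated */

    if (uri != NULL && strcmp((const char *)uri, MY_URI) == 0) {
        /* pass localName in place of qName, so any prefix bound to
           MY_URI validates against the unprefixed names in the DTD */
        return DTDValidate_StartElement(UserData, uri, localName,
                                        localName, atts);
    }
    return 0; /* other namespaces: skip validation, keep parsing */
}
```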

You can filter attributes in the filter handlers too. The recommended way is to use a separate, new LPXMLVECTOR object into which you selectively copy attributes from startElementHandler's atts parameter using XMLVector_Append(newVect, att); Adding or modifying attributes isn't currently recommended (not documented properly).

See the nsvalid.c example and the xmlplint sources (vfilter.c).

Note: You can use a default namespace in your root element, like xhtml1-transitional.dtd which uses a #FIXED xmlns attribute in its html element, BUT if you want to allow any prefixes and/or elements containing elements from other namespaces you must do the filtering.


Basic tutorials/info about DTDs:

Wikipedia about DTD

ZVON.ORG XML tutorials




Building


On Windows:


If you have the Visual C++ compiler, all sample programs can be built by running BUILD.BAT in the sample directories. Sample executables will be built into the WIN32\BIN directory.

Directory WIN32\VC6 contains VC6 project files and sources for building a Windows dll to be linked to your own projects. Simply link to parsifal.lib, include parsifal.h and add the preprocessor definition XMLAPI=__declspec(dllimport) to your target project. Also make sure that the dll is found in your path.

You can find a prebuilt parsifal.dll in the WIN32\BIN directory and parsifal.lib in the WIN32\LIB directory. The prebuilt dll is linked with the static multithreaded run-time library (option -MT).

The VC6 project file also contains a target for building the dll with libiconv support, tested with the binary distribution libiconv-1.9.1.bin.woe32. This builds a parsifal.dll that is linked with the dynamic multithreaded run-time library - using the same RTL as libiconv is essential. See also the win_iconv option in links.

VC6 is quite an old version of Microsoft's Visual C++. Microsoft offers Visual C++ Express versions for free and you should be able to import VC6 projects into newer versions easily. Since Parsifal is a cross-platform project you might want to use gcc on Windows too, see the MinGW options below.


MinGW:

Directory WIN32\MINGW\DLL contains makefile for creating parsifal.dll. See makefile for options/usage. Sample SAMPLES\PULL\BUILD_MINGW.BAT is a batch file example for linking with WIN32\MINGW\DLL\PARSIFAL.DLL.

Directory WIN32\MINGW etc. contains static library for mingw compiler and Dev-Cpp IDE project files. See README in that directory.


C++ Builder command line tools V.5.5:

Directory WIN32\BCC etc. contains makefile for building static library for free BCC 5.5 compiler. Makefile was contributed by Carsten Heuer. Links with static and singlethreaded runtime library which is default for BCC. See README in that directory.


Other:

If you aren't using one of the compilers mentioned above, you must build your Parsifal library by means specific to your C compiler. Just generate a static or dynamic library from the files in the src directory, link your program with it and include parsifal.h in your source. The WIN32\MINGW\DLL makefile can also be a starting point for you.


See also README


On Linux/Unix:


Install library
tar xzf libparsifal-X.X.X.tar.gz
cd libparsifal-X.X.X
./configure
make
make install

installs shared library into /usr/lib and include files into /usr/include/libparsifal by default.

Each sample directory contains a Makefile that builds the sample executables and installs them to /usr/local/bin by default. Build the samples by executing:
make
make install (optional, you can place executables in to directory of 
              your choice of course)
Samples aren't built when installing the shared library. The library must be built and installed before making the sample files.

Starting from version 1.1, configure offers --disable-gccflags for disabling the default compiler flags for gcc. 1.1 also contains the parsifal-config tool for querying information about how libparsifal was configured and installed.

See also README


Compiling options:
#define Default Description
DTD_SUPPORT TRUE Compiles with DTD support. Compiling without DTD_SUPPORT defined makes a smaller parser which doesn't support parsing DTDs at all (it accepts a simple DOCTYPE tag optionally containing PUBLIC and SYSTEM identifiers BUT NOT an internal subset of any sort). Without DTD_SUPPORT the library size is around 48K with dynamic RTL, against 68K for the full DTD support build. This also affects how character data is reported via charactersHandler.
DTDVALID_SUPPORT TRUE DTD Validation support. Includes dtdvalid.c
MAX_SPEED TRUE Performs some optimizations (basically inlining stuff). Makes the parser about 4 KB bigger but a bit faster.
ICONV_SUPPORT FALSE/TRUE* Compiles with GNU libiconv support. Note: Starting from version 1.0 linux configure script tries to locate and use iconv, this can be disabled by setting --disable-iconv configure option.

All makefiles and pre-built libraries use these default compiling options.



Notes about encodings


Parsifal uses UTF-8 encoding internally. All XMLCH type parameters passed to callbacks are UTF-8 encoded, i.e. Parsifal output is always UTF-8 encoded no matter what input encoding the document uses. This makes Parsifal usable with many languages, although sometimes you need to do some extra work converting UTF-8 to the character encoding used by your target program. See also helper.c for a UTF-8 to iso-latin1 conversion routine.
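
As an illustration of the kind of extra work meant here, the following is a minimal sketch of a UTF-8 to iso-latin1 conversion. It is not the routine from helper.c; Utf8ToLatin1 is a hypothetical name, and replacing characters that don't exist in iso-latin1 with '?' is just one possible policy.

```c
#include <stddef.h>

/* Converts NUL-terminated UTF-8 text to ISO-8859-1 (iso-latin1).
   Code points above U+00FF and malformed sequences become '?'.
   dst must have room for strlen(src) + 1 bytes, the output is never
   longer than the input. Returns the number of bytes written,
   not counting the terminating NUL. */
size_t Utf8ToLatin1(const unsigned char *src, unsigned char *dst)
{
    size_t n = 0;

    while (*src) {
        unsigned char c = *src;

        if (c < 0x80) {
            /* plain ASCII is identical in both encodings */
            dst[n++] = c;
            src++;
        } else if ((c == 0xC2 || c == 0xC3) && (src[1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            dst[n++] = (unsigned char)(((c & 0x03) << 6) | (src[1] & 0x3F));
            src += 2;
        } else {
            /* not representable: emit one '?' and skip the lead byte
               together with any continuation bytes that follow it */
            dst[n++] = '?';
            src++;
            while ((*src & 0xC0) == 0x80)
                src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```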

Since UTF-8 is becoming a very popular form of Unicode on various platforms, you may consider using UTF-8 as your program's internal encoding. Here's an excerpt from the UTF-8 and Unicode FAQ for Unix/Linux - http://www.cl.cam.ac.uk/~mgk25/unicode.html:

"the major Linux distributors and application developers now foresee and hope that Unicode will eventually replace all these older legacy encodings, primarily in the UTF-8 form."

TIP: to enable UTF-8 support for xterm use

xterm -u8
Read linux Unicode-HOWTO for more details/language/font support for xterm etc.


GNU libiconv

If you compile with GNU libiconv you must make sure the Parsifal build uses the same kind of runtime library as your libiconv. If you're going to use libiconv on Windows, this is easily accomplished by building the libiconv target from the VC6 project file and downloading the libiconv binary distribution from the GNU ftp sites (tested with 1.9.1). See also the libiconv READMEs for more information.

You can test your build by running IconvTest() in your code. test_iconv.c can be found in samples/misc directory.

Note: Libiconv defaults to big endian Unicode encodings. If a little endian document doesn't have a byte order mark, you must explicitly specify UTF-16LE or UCS-4LE; when, for example, only UCS-4 is specified, libiconv defaults to UCS-4BE.
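
If your program receives raw document bytes and has to decide whether an explicit encoding name is needed, a check along the following lines can help. This is only a sketch and not part of Parsifal; EncodingFromBom is a hypothetical helper name. It returns a byte-order-specific name when a UTF-16 or UCS-4 byte order mark is present and NULL when there is none, in which case the byte order has to be known from elsewhere.

```c
#include <stddef.h>

/* Returns an encoding name with explicit byte order if buf starts
   with a UTF-16 or UCS-4 byte order mark, NULL otherwise. */
const char *EncodingFromBom(const unsigned char *buf, size_t len)
{
    /* UCS-4 is checked first because the UCS-4LE mark (FF FE 00 00)
       begins with the UTF-16LE mark (FF FE) */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return "UCS-4LE";
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF)
        return "UCS-4BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    return NULL; /* no mark: byte order must be specified by the caller */
}
```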



Entities


Parsifal supports character references, predefined entities, and internal and external general entities, plus internal and external parameter entities (used in DTD authoring only).


Example document with the supported entity types:

<!DOCTYPE doc [
<!ENTITY ent1 "internal entity data">
<!ENTITY ent2 SYSTEM "external.ent">
]>
<doc>
    &#60;  <!-- this is character reference -->
    &lt;   <!-- this is reference to predefined entity -->
    &ent1; <!-- this is reference to internal general entity -->
    &ent2; <!-- this is reference to external general entity -->
</doc>
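
To make the first kind concrete: a character reference such as &#60; (or &#x3C; in hexadecimal) names a character by its code point, so both forms stand for the same '<' that &lt; does. The sketch below shows how such a reference maps to a code point. It is not Parsifal's own code and ParseCharRef is a hypothetical name; it also doesn't check whether the code point is a character XML actually allows.

```c
/* Parses a character reference like "&#60;" or "&#x3C;" and stores
   its code point in *cp. Returns 1 on success, 0 if s is not a
   complete character reference. */
int ParseCharRef(const char *s, unsigned long *cp)
{
    unsigned long value = 0;
    int base = 10, digits = 0;

    if (s[0] != '&' || s[1] != '#')
        return 0;
    s += 2;
    if (*s == 'x') {        /* hexadecimal form, lowercase x only */
        base = 16;
        s++;
    }
    for (; *s && *s != ';'; s++, digits++) {
        int d;
        if (*s >= '0' && *s <= '9')
            d = *s - '0';
        else if (base == 16 && *s >= 'a' && *s <= 'f')
            d = *s - 'a' + 10;
        else if (base == 16 && *s >= 'A' && *s <= 'F')
            d = *s - 'A' + 10;
        else
            return 0;       /* not a digit in this base */
        value = value * base + d;
        if (value > 0x10FFFF)
            return 0;       /* beyond the Unicode range */
    }
    if (digits == 0 || *s != ';')
        return 0;           /* no digits or missing terminator */
    *cp = value;
    return 1;
}
```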


An example that includes common xhtml entity declarations into document:


<!DOCTYPE doc [
<!ENTITY % latin1 SYSTEM "xhtml-lat1.ent">
%latin1; <!-- this is a reference to external parameter entity -->
]>
<doc>
     Copyright &copy; 2004 Sir Elmo Von Eltonzon
</doc>



Thread safety


Parsifal is re-entrant - this means that you can run multiple parsers concurrently and parser works in the threaded environments as long as you handle thread synchronization etc. when needed.



Bug reports


Bug reports should include minimal source code that produces the bug, and the minimal XML document used if it's relevant for producing the buggy behaviour. Double-check that the bug originates from the Parsifal library and not, for example, from a specific compiler configuration before sending bug reports. You should of course test your XML documents with other tools (XML 1.0 compliant browsers etc.) to rule out simple well-formedness or validation errors - if other tools process your document and Parsifal doesn't (or vice versa) you definitely have found a bug!


Copyright © 2002-2008 Toni Uusitalo.
Send mail, suggestions and bug reports to

Last modified: 04.10.2008 00:00