Name | Type |
---|---|
startElementHandler | XML_START_ELEMENT_HANDLER |
endElementHandler | XML_END_ELEMENT_HANDLER |
charactersHandler | XML_CHARACTERS_HANDLER |
resolveEntityHandler/externalEntityParsedHandler | XML_RESOLVE_ENTITY_HANDLER |
errorHandler | XML_ERROR_HANDLER |
ignorableWhitespaceHandler | XML_CHARACTERS_HANDLER |
startDocumentHandler/endDocumentHandler | XML_EVENT_HANDLER |
commentHandler | XML_CHARACTERS_HANDLER |
startCDATAHandler/endCDATAHandler | XML_EVENT_HANDLER |
processingInstructionHandler | XML_PI_HANDLER |
skippedEntityHandler | XML_SKIPPED_ENTITY_HANDLER |
xmlDeclHandler | XML_XMLDECL_HANDLER |
startDTDHandler | XML_START_DTD_HANDLER |
endDTDHandler | XML_EVENT_HANDLER |
encodingAliasHandler | XML_ENCODINGALIAS_HANDLER |
startEntityHandler/endEntityHandler | XML_ENTITY_EVENT_HANDLER |
elementDeclHandler | XML_ELEMENTDECL_HANDLER |
AttributeDeclHandler | XML_ATTRIBUTEDECL_HANDLER |
entityDeclHandler | XML_ENTITY_EVENT_HANDLER |
notationDeclHandler | XML_NOTATIONDECL_HANDLER |
Parsifal events (callbacks) work much the same way as events in the original Java SAX2 interface, so besides reading this page you should consult a SAX2 reference; www.saxproject.org is a good place to start. See also parsifal.h for more information on the Parsifal API.
A good way to learn SAX is to run xmlplint with the -f 1 flag and examine the order in which events are called when parsing different documents.
Note: Most of the time you're likely to work with the first 5 handlers in the above list.
The first parameter of every callback is UserData, which is usually set before parsing to glue "the this pointer" to the callbacks, e.g.
parser->UserData = parser;
Of course UserData is often some kind of "state machine" structure that contains current parser as one of its members.
Return XML_ABORT in callbacks if you want to abort parsing, XML_OK (value 0) otherwise.
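The UserData and return-value conventions above can be sketched as follows. This is only an illustration: the ParseState structure, the OnStartElement name and the nesting limit are hypothetical, XMLCH is typedef'd locally as a stand-in, and the value used for XML_ABORT is an assumption (the real definitions come from parsifal.h). The attribute vector is passed as a plain void* to keep the sketch free of library types.

```c
#include <assert.h>
#include <stddef.h>

typedef char XMLCH;    /* stand-in for Parsifal's XMLCH */
#define XML_OK 0       /* value stated in the text above */
#define XML_ABORT 1    /* assumed value; take the real one from parsifal.h */

/* Hypothetical application state that UserData points to. */
typedef struct ParseState {
    void *parser;      /* would hold the LPXMLPARSER in real code */
    int depth;         /* current element nesting depth */
    int maxDepth;      /* abort parsing when nesting goes deeper than this */
} ParseState;

/* Shaped like XML_START_ELEMENT_HANDLER (atts simplified to void*). */
int OnStartElement(void *UserData, const XMLCH *uri, const XMLCH *localName,
                   const XMLCH *qName, void *atts)
{
    ParseState *st = (ParseState *)UserData;
    (void)uri; (void)localName; (void)qName; (void)atts;
    assert(st != NULL);
    if (++st->depth > st->maxDepth)
        return XML_ABORT;   /* tells the parser to stop */
    return XML_OK;          /* tells the parser to continue */
}
```

In real code you would presumably assign the state with parser->UserData = &state before calling XMLParser_Parse and decrement depth in the matching end-element callback.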
int (*XML_EVENT_HANDLER)(void *UserData);
startDocumentHandler reports (unsurprisingly) the start of the document. This is guaranteed to be the first event triggered in the processing (preceded only by the optional xmlDecl event). There is one exception: if the parser fails to detect the encoding of the document (by reading the byte order mark and/or XML declaration, or from the forced encoding given in the encoding parameter of XMLParser_Parse), startDocumentHandler/endDocumentHandler won't be called. This is a special case because the encoding must be known at the startDocument stage.
endDocument is always called in conjunction with startDocument (even in error condition), thus it's safe to initialize/allocate data in startDocument and clean up your data in endDocument.
int (*XML_START_ELEMENT_HANDLER)(void *UserData, const XMLCH *uri, const XMLCH *localName, const XMLCH *qName, LPXMLVECTOR atts);
int (*XML_END_ELEMENT_HANDLER)(void *UserData, const XMLCH *uri, const XMLCH *localName, const XMLCH *qName);
These are similar to the SAX2 events. Note that the uri and localName parameters are never NULL when unavailable; their values are empty strings instead, which can be tested with if (*uri) ... A difference from SAX is that qName always has a valid value. localName is an empty string when the element doesn't belong to any namespace.
Common attribute enumeration example:
if (atts->length)
{
    int i;
    LPXMLRUNTIMEATT att;
    printf("Tag %s has %d attributes:\n", qName, atts->length);
    for (i=0; i<atts->length; i++) {
        att = (LPXMLRUNTIMEATT) XMLVector_Get(atts, i);
        if (*att->uri)
            printf(" Name: %s Value: %s Prefix: %s LocalName: %s Uri: %s\n",
                att->qname, att->value,
                att->prefix, att->localName,
                att->uri);
        else
            printf(" Name: %s Value: %s\n",
                att->qname, att->value);
    }
}
An example of getting a named attribute using the XMLParser_GetNamedItem function
(this example assumes that UserData contains the current parser object in the StartElement callback):
LPXMLRUNTIMEATT att = XMLParser_GetNamedItem((LPXMLPARSER)UserData, "myattribute");
if (att)
    printf("myattribute has value: %s\n", att->value);
Some SAX attribute handling methods and their Parsifal equivalents:
SAX | Parsifal |
---|---|
atts.GetQName(index) | ((LPXMLRUNTIMEATT)XMLVector_Get(atts, index))->qname |
atts.getValueFromQName(name) | ((LPXMLRUNTIMEATT)XMLParser_GetNamedItem(parser, name))->value |
int (*XML_CHARACTERS_HANDLER)(void *UserData, const XMLCH *chars, int cbSize);
Reports character data. The cbSize parameter is the length of the chars buffer, which is not NUL terminated.
Whether the parser reports data in a single chunk or splits it into several chunks depends on the parser configuration:
If you've compiled the Parsifal library without DTD_SUPPORT, or the XMLFLAG_PRESERVE_GENERAL_ENTITIES flag is TRUE, Parsifal calls charactersHandler once for each run of text content. Character references and predefined entities (e.g. &#60; and &lt;) are always expanded into this contiguous data. When general entities are in use, however, Parsifal splits the data into several chunks and you must use a stringbuffer to collect it. A common way to handle this:
- Initialize stringbuffer in startElement (or in endElement, see examples)
- Append data into stringbuffer in charactersHandler call
- Store the buffer in endElement.
charactersHandler, ignorableWhitespaceHandler and commentHandler all share the same signature, so a common trick is to declare one handler and use it for all XML_CHARACTERS_HANDLER types. Note that only charactersHandler splits its data into several chunks; the other handlers report contiguous data.
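The chunk-collecting steps listed above can be sketched like this. Parsifal ships its own XMLStringbuf for this purpose; the hand-rolled TextBuf below is a hypothetical stand-in used only so the snippet compiles without the library. XMLCH is typedef'd locally and the value of XML_ABORT is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char XMLCH;    /* stand-in for Parsifal's XMLCH */
#define XML_OK 0       /* value stated in this document */
#define XML_ABORT 1    /* assumed value; take the real one from parsifal.h */

/* Hypothetical growable buffer that UserData points to. */
typedef struct TextBuf {
    XMLCH *data;   /* NUL terminated once non-NULL */
    size_t len;    /* bytes stored, excluding the terminator */
    size_t cap;    /* bytes allocated */
} TextBuf;

/* Call from startElement: forget the text of the previous element. */
void TextBuf_Reset(TextBuf *tb)
{
    assert(tb != NULL);
    tb->len = 0;
    if (tb->data) tb->data[0] = '\0';
}

/* Shaped like XML_CHARACTERS_HANDLER: appends one chunk of cbSize bytes. */
int Characters(void *UserData, const XMLCH *chars, int cbSize)
{
    TextBuf *tb = (TextBuf *)UserData;
    size_t need;
    assert(tb != NULL && chars != NULL && cbSize >= 0);
    need = tb->len + (size_t)cbSize + 1;   /* +1 for our own terminator */
    if (need > tb->cap) {
        size_t cap = tb->cap ? tb->cap : 64;
        XMLCH *p;
        while (cap < need) cap *= 2;
        p = (XMLCH *)realloc(tb->data, cap);
        if (!p) return XML_ABORT;          /* out of memory: stop parsing */
        tb->data = p;
        tb->cap = cap;
    }
    memcpy(tb->data + tb->len, chars, (size_t)cbSize);
    tb->len += (size_t)cbSize;
    tb->data[tb->len] = '\0';              /* chars itself is not terminated */
    return XML_OK;
}
```

In endElement the collected text is available as tb->data with length tb->len; free(tb->data) once parsing is finished.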
startCDATAHandler reports the start of a CDATA section. The contents of the CDATA section are reported through the characters callback; the startCDATAHandler call is intended only to report the boundary. endCDATAHandler reports the end of a CDATA section.
int (*XML_CHARACTERS_HANDLER)(void *UserData, const XMLCH *chars, int cbSize);
Reports "whitespace only" content.
If you're using Parsifal in validating mode, ignorableWhitespaceHandler works exactly as the XML 1.0 spec defines it: whitespace-only content is ignorable only in the content of elements that aren't declared as #PCDATA elements. Thus when validating, the description below does not apply; it describes this handler in non-validating mode only.
<root>CRLF
SPACE<node>CRLF
TAB<data>text</data>CRLF
</node></root>
SPACE, CRLF and TAB (shown as alias values, meaning real tab etc. characters) are here reported in ignorableWhitespaceHandler. The xml:space attribute isn't currently handled by Parsifal, although the xml namespace is a special "pre-defined" namespace and Parsifal tracks the scope of all xml-prefixed attributes for you; see XMLParser_GetPrefixMapping for more info on handling xml:space.
Note that if the data tag had been <data>textCRLF</data>, the CRLF in element content wouldn't be reported in ignorableWhitespaceHandler but in charactersHandler (whitespace in the middle of content isn't considered ignorable whitespace). Parsifal normalizes CRLF pairs and single CRs into a single LF character. The only way to insert a CR into parser character data is to use a character reference or something like <!ENTITY CRLF "&#13;&#10;">
If whitespace is significant for you, you should set ignorableWhitespaceHandler to point to charactersHandler. Non-validating parsers should actually treat all whitespace as significant, but making Parsifal non-conformant here and calling ignorableWhitespaceHandler can save you some coding when you want to distinguish between whitespace and content: it's easier and faster this way because you don't have to check for whitespace in charactersHandler.
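For comparison, this is roughly the check you would otherwise have to write yourself inside charactersHandler. It is a minimal sketch: the IsWhitespaceOnly name is hypothetical and XMLCH is typedef'd locally as a stand-in for the library type.

```c
#include <assert.h>
#include <stddef.h>

typedef char XMLCH;   /* stand-in for Parsifal's XMLCH */

/* Returns 1 when the cbSize bytes at chars consist only of XML whitespace
   (space, tab, CR, LF), 0 otherwise. An empty chunk counts as whitespace. */
int IsWhitespaceOnly(const XMLCH *chars, int cbSize)
{
    int i;
    assert(chars != NULL || cbSize == 0);
    for (i = 0; i < cbSize; i++) {
        switch (chars[i]) {
        case ' ': case '\t': case '\r': case '\n':
            break;          /* still whitespace, keep scanning */
        default:
            return 0;       /* found real content */
        }
    }
    return 1;
}
```

Note that the length is passed explicitly because the chars buffer is not NUL terminated.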
int (*XML_PI_HANDLER)(void *UserData, const XMLCH *target, const XMLCH *data);
Parameters:
target | Processing instruction target. |
data | Processing instruction data if present, empty string otherwise. |
Parsifal skips all whitespace between target and data, i.e. in <?targetTABTABTABTABTABI'm dataTABTABTAB?> the leading tabs will be skipped but the tabs at the end will be included in the data.
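If the trailing whitespace is unwanted, the handler can compute the length of data without it. A minimal sketch, assuming data is a NUL-terminated string as the handler prototype suggests; the PiDataTrimmedLength name is hypothetical and XMLCH is typedef'd locally as a stand-in.

```c
#include <assert.h>
#include <string.h>

typedef char XMLCH;   /* stand-in for Parsifal's XMLCH */

/* Returns the length of data with trailing XML whitespace
   (space, tab, CR, LF) left out. The string itself is not modified. */
size_t PiDataTrimmedLength(const XMLCH *data)
{
    size_t n;
    assert(data != NULL);
    n = strlen(data);
    while (n > 0 && (data[n - 1] == ' ' || data[n - 1] == '\t' ||
                     data[n - 1] == '\r' || data[n - 1] == '\n'))
        n--;
    return n;
}
```

The result can be used with a precision specifier, for example printf("%.*s", (int)len, data), so no copy of the data is needed.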
int (*XML_RESOLVE_ENTITY_HANDLER)(void *UserData, LPXMLENTITY entity, LPBUFFEREDISTREAM reader);
The parser calls this handler when it encounters a reference to an external general entity in element content (e.g. &ent2; in entities), a reference to an external parameter entity within a DTD (e.g. %latin1; in entities), or a !DOCTYPE tag that has systemID and (optionally) publicID specified.
Normally you use systemID to resolve the external resource (you can think of the resource as a file if you like). You must set reader->inputData to point to the resource you open, usually based on entity->systemID. By default reader->inputSrc is NULL, which means that the reader will use the same input source as your main document (specified in XMLParser_Parse). If you don't set reader->inputData (you leave it at its default NULL), the entity won't be parsed but silently skipped (you can call skippedEntity yourself if you like!).
If you fail to open/initialize inputData for the reader, you might want to return XML_ABORT. You must implement some kind of error reporting yourself since Parsifal's inputsource handling is very "abstract". Using XML_ABORT in conjunction with your own "failed to open file entity->systemID" message, for example, would do the trick. See example.
Once the parser has finished parsing your inputData you'll get an externalEntityParsedHandler call, which gives you a chance to close/free your reader->inputData.
If the parser wants to open an external DTD, entity->type is XML_ENTITY_DOCTYPE and entity->name is "[dtd]". Usually this information isn't needed since there's no difference between opening an external DTD and an external entity; for instance the following example would handle both situations. If the parser isn't opening an external DTD, the entity type is XML_ENTITY_EXT_GEN or XML_ENTITY_EXT_PARAM. (See parsifal.h for more type info.)
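resolveEntityHandler example
(this example also takes the special xml:base attribute into account).
See xmlplint/uriresolver.h for a more complete uri resolver.
int ResolveEntity(void *UserData, LPXMLENTITY entity, LPBUFFEREDISTREAM reader)
{
    FILE *f;
    XMLCH filename[MAX_FILE];
    /* xml:base value in scope, or NULL if there is none */
    XMLCH *base = XMLParser_GetPrefixMapping((LPXMLPARSER)UserData, "xml:base");
    size_t baseLen = (base) ? strlen(base) : 0;

    /* abort instead of overflowing filename */
    if (baseLen + strlen(entity->systemID) >= MAX_FILE) {
        printf("file name for '%s' is too long!\n", entity->systemID);
        return XML_ABORT;
    }
    filename[0] = '\0';
    if (base) strcpy(filename, base);
    strcat(filename, entity->systemID);
    if (!(f = fopen(filename, "rb"))) {
        printf("error opening file '%s'!\n", filename);
        return XML_ABORT;
    }
    reader->inputData = f; /* closed later in externalEntityParsedHandler */
    return XML_OK;
}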
The example assumes that the parser uses a FILE* stream (the same reader->inputSrc as provided in the XMLParser_Parse call).
The subsequent externalEntityParsedHandler call can contain simply:
fclose((FILE*)reader->inputData);
Note that externalEntityParsedHandler isn't called if you don't set reader->inputData in resolveEntityHandler.
int (*XML_SKIPPED_ENTITY_HANDLER)(void *UserData, const XMLCH *Name);
Parsifal calls this handler if it encounters reference to general entity in text content for which it has not seen the declaration or if it encounters reference to external general entity and XMLFLAG_EXTERNAL_GENERAL_ENTITIES flag is FALSE.
If XMLFLAG_UNDEF_GENERAL_ENTITIES flag is set, parsifal triggers ERR_XMLP_UNDEF_ENTITY error for every undeclared entity.
int (*XML_XMLDECL_HANDLER)(void *UserData, const XMLCH *version, const XMLCH *encoding, const XMLCH *standalone);
Reports the XML declaration. Note that although external entities/DTDs might contain an XML declaration (called a text declaration there), it isn't reported via xmlDeclHandler. If a parameter has the value NULL, the XML declaration doesn't have that attribute specified (the version parameter is mandatory and is never NULL, though).
int (*XML_START_DTD_HANDLER)(void *UserData, const XMLCH *Name, const XMLCH *publicId, const XMLCH *systemId, int hasInternalSubset);
Parameters:
Name | Name of the document (root) element. |
publicId | publicId or NULL if not specified. Value is trimmed and normalized (linefeeds etc. are removed) but publicID or systemID is not validated by Parsifal, i.e. they're not necessarily valid uris, filenames etc. |
systemId | systemId or NULL if not specified. |
hasInternalSubset | (boolean) whether the XML document contains an internal subset |
Reports the start of the document type declaration (!DOCTYPE tag).
startDTDHandler and endDTDHandler are mainly used when roundtripping/serializing an input document to output (you want to preserve !DOCTYPE data from the input document in the output document). If the document contains an internal DTD subset (the stuff between [ and ]), its data will be reported via declaration handlers such as entityDeclHandler, or via processingInstructionHandler and commentHandler (depending on what's in the subset, of course), between the startDTDHandler and endDTDHandler calls.
Here's an example !DOCTYPE tag (the result is complex looking but it shows you the order DTD events are executed, normally you don't use most of these events, you probably only implement resolveEntityHandler and externalEntityParsedHandler)
<!DOCTYPE mydoc SYSTEM "mydoc.dtd" [
<!-- this is my doctype -->
<!ENTITY ent1 "wheeeeooo">
]>
Will result in:
1. startDTDHandler - Name: "mydoc", publicId: NULL, systemId: "mydoc.dtd", hasInternalSubset: 1
2. commentHandler - " this is my doctype "
3. entityDeclHandler - reports entity declaration for entity ent1. see entityDeclHandler
4. startEntityHandler - Name: "[dtd]", type: XML_ENTITY_DOCTYPE, systemId: "mydoc.dtd" ...
5. resolveEntityHandler - this lets you open file/resource for entity "[dtd]"
   - Here you get declaration events for external DTD
6. externalEntityParsedHandler - this lets you close/free file/resource for entity "[dtd]" (this call happens only if resolveEntityHandler parsed the external DTD)
7. endEntityHandler - same params as for startEntityHandler
8. endDTDHandler
Calls 4-7 happened because SYSTEM "mydoc.dtd" specified external DTD that needed to be loaded. Internal subset is parsed first and if there's duplicate entity declarations or attribute declarations etc, declarations in internal subset have higher priority than declarations in external DTD. Note also that parser doesn't report comments and processing instructions if they reside in the external DTD.
int (*XML_ELEMENTDECL_HANDLER)(void *UserData, const XMLCH *name, void *contentModel);
Report an element type declaration.
From SAX docs: The content model will consist of the string "EMPTY", the string "ANY", or a parenthesised group, optionally followed by an occurrence indicator. The model will be normalized so that all parameter entities are fully resolved and all whitespace is removed, and will include the enclosing parentheses.
The contentModel parameter is of type void* and you can cast it to XMLCH* unless XMLFLAG_REPORT_DTD_EXT has been set to TRUE, in which case contentModel contains a tree representation of the content particles for the declared element (used when validating, for example).
See dtddecl.c for more info.
int (*XML_ATTRIBUTEDECL_HANDLER)(void *UserData, const XMLCH *eName, const XMLCH *aName, int type, const XMLCH *typeStr, int valueDef, const XMLCH *def);
Report an attribute type declaration.
Parameters:
eName | The name of the associated element |
aName | The name of the attribute |
type | Attribute type. See XMLATTDECL_TYPE_ types in parsifal.h |
typeStr | Attribute type string. If the type parameter is XMLATTDECL_TYPE_NOTATION or XMLATTDECL_TYPE_ENUMERATED, this parameter contains a parenthesized token group with the separator "|" and all whitespace removed. |
valueDef | XMLATTDECL_DEF_FIXED, XMLATTDECL_DEF_REQUIRED, XMLATTDECL_DEF_IMPLIED or 0. |
def | Attribute default value or NULL if there is none. |
See dtddecl.c for more info.
int (*XML_ENTITY_EVENT_HANDLER)(void *UserData, LPXMLENTITY entity);
Report an entity declaration.
LPXMLENTITY members:
type | Entity type XMLENTITYTYPE. See parsifal.h |
len | length of value parameter if this is internal general or internal parameter entity |
name | Name of entity. Starts with '%' if this is parameter entity. |
value | Value of an internal general or internal parameter entity. NULL otherwise. |
publicID, systemID and notation | XMLCH * or NULL if not available. |
See parsifal.h and dtddecl.c for more info.
int (*XML_NOTATIONDECL_HANDLER)(void *UserData, const XMLCH *name, const XMLCH *publicID, const XMLCH *systemID);
Report a notation declaration.
At least one of publicId and systemId must be non-NULL
See dtddecl.c for more info.
XMLCH* (*XML_ENCODINGALIAS_HANDLER)(void *UserData, const XMLCH *enc);
Sets an encoding alias for the encoding specified in the enc parameter (this value comes from encoding detection). Return NULL if you don't want to set an alias. This handler can also be used to completely override the document encoding by simply returning the wanted encoding. Note that in that case you should also specify some bogus encoding in XMLParser_Parse to ensure that this handler gets called; this bogus encoding will then be the value of the enc parameter. Note also that a document might trigger multiple calls to this handler because of external entities/DTDs etc., and they might all use different encodings.
XMLCH* encAlias(void *UserData, const XMLCH *enc)
{
    return (!strcmp(enc, "windows-1251")) ? "ISO-8859-1" : NULL;
}
int (*XML_ENTITY_EVENT_HANDLER)(void *UserData, LPXMLENTITY entity);
Reports entity boundaries.
void (*XML_ERROR_HANDLER)(LPXMLPARSER parser);
Examine ErrorCode, ErrorString, ErrorLine and ErrorColumn for error information. You can also check the XMLParser_Parse return value, which will be 0 if an error occurred.
Note: If you don't specify errorHandler, only ErrorCode will have a valid value (this is like using the parser in unattended mode when you don't need detailed error info).
See parsifal.h for the list of XMLERRCODE values.
Error handler example:
void ErrorHandler(LPXMLPARSER parser)
{
    /* you should treat ERR_XMLP_ABORT as "user error" and give some kind of
       description before returning from callbacks, otherwise we present parser error: */
    if (parser->ErrorCode != ERR_XMLP_ABORT) {
        XMLCH *SystemID = XMLParser_GetSystemID(parser);
        printf("\nParsing Error: %s\nCode: %d", parser->ErrorString, parser->ErrorCode);
        if (SystemID) printf("\nSystemID: '%s'", SystemID);
    }
    printf("\nLine: %d Column: %d", parser->ErrorLine, parser->ErrorColumn);
}
Name | Type | Read-only | Description |
---|---|---|---|
DocumentElement | XMLCH* | yes | root element of current document (test for !NULL) |
UserData | void* | no | UserData passed to all callbacks |
ErrorCode | int | yes | see errorHandler |
ErrorString | XMLCH* | yes | see errorHandler |
ErrorLine | int | yes | see errorHandler |
ErrorColumn | int | yes | see errorHandler |
Name | Default | Description |
---|---|---|
XMLFLAG_NAMESPACES | TRUE | Namespace processing |
XMLFLAG_NAMESPACE_PREFIXES | FALSE | Report namespace (xmlns and xmlns:XXX) attributes in startElementHandler |
XMLFLAG_EXTERNAL_GENERAL_ENTITIES | TRUE | Process external general entities (if FALSE they will be skipped) |
XMLFLAG_PRESERVE_GENERAL_ENTITIES | FALSE | Preserve all general entity references (character references, predefined entities and parameter entities are expanded. See entities). You can set this flag during parsing but don't set it in the midst of charactersHandler calls! This flag overrides all other entity flags. |
XMLFLAG_UNDEF_GENERAL_ENTITIES | FALSE | Setting this flag makes Parsifal trigger an ERR_XMLP_UNDEF_ENTITY error when it encounters an undefined general entity, instead of a skippedEntity call. |
XMLFLAG_PRESERVE_WS_ATTRIBUTES | FALSE | By default all attribute values are normalized (linefeeds and extra space removed) and trimmed. You can set this flag to prevent the attribute normalization from happening. |
XMLFLAG_CONVERT_EOL | TRUE | Always TRUE starting from v0.7.2. By default all CRLF pairs and CRs are converted into a single LF. Set this flag to FALSE if you want to preserve linefeeds as they appear in the input document. |
XMLFLAG_REPORT_DTD_EXT | FALSE | Reports element declaration content particles as a tree structure in the elementDeclHandler and gathers additional attribute declaration information from DTDs. Currently used only internally (not documented). |
XMLFLAG_VALIDATION_WARNINGS | FALSE | When this flag is set, validation errors are treated as warnings, which means that parsing continues after validation errors. Tip: You might want to set your startElement etc. handlers to NULL when the first validation warning occurs, or set some user flag to inform your handlers that from this point on you're only collecting warnings/parsing errors and not building any data model in your handlers. |
XMLFLAG_SPLIT_LARGE_CONTENT | FALSE | Splits character handler calls when content grows larger than 4000 bytes (not an exact value, you can't rely on it). When parsing, for example, documents that contain large binary content without this flag, memory usage might become a problem and performance might degrade. See also characters handler. |
XMLFLAG_USE_SIMPLEPULL | FALSE | Uses the simple pull/progressive model for parsing. See XMLParser_HasMoreEvents |
These flags can be set with the macro _XMLParser_SetFlag(LPXMLPARSER parser, [FLAG], [TRUE/FALSE]).
Values can be retrieved by calling the macro _XMLParser_GetFlag(LPXMLPARSER parser, [FLAG]).
LPXMLPARSER XMLParser_Create(LPXMLPARSER *parser);
Creates a new parser. The parser parameter is the address of a pointer which will be allocated and prepared by XMLParser_Create. The return value and the pointer will be NULL if the parser can't be created due to a memory allocation problem.
After the parser has been created, you can reuse it by calling XMLParser_Parse multiple times, or you can create, for example, a pool of pre-allocated parser objects for some special case.
int XMLParser_Parse(LPXMLPARSER parser, LPFNINPUTSRC inputSrc, void *inputData, const XMLCH *encoding);
Parameters:
parser | Pointer to parser object |
inputSrc | Pointer to callback function that will feed data to parser. |
inputData | For example FILE* to be passed to the inputSrc function |
encoding | Sets document encoding. Note that a forced encoding doesn't override an encoding declaration or BOM (see encodingAliasHandler for info about overriding encodings) and should be used only in situations where the document doesn't contain an encoding declaration and/or BOM, i.e. the encoding is specified in a MIME header etc. A forced encoding can in fact conflict with the encoding set by the BOM or XML declaration. Set NULL (recommended) to let the BOM, the document encoding declaration or the default UTF-8 be used. Valid values are: |
The inputSrc function has the following prototype:
int (*LPFNINPUTSRC)(BYTE *buf, int cBytes, int *cBytesActual, void *inputData);
The parser sends a read request to this callback each time it needs more data. buf is allocated for cBytes bytes; set cBytesActual to the byte count you actually put into buf.
Return TRUE if EOF or some error occurred. EOF can normally be determined as follows:
return (*cBytesActual < cBytes);
NOTE: If you want to distinguish between end of stream (EOF) and a stream error, you must return BIS_ERR_INPUT on stream error and BIS_EOF (or TRUE) on EOF. Returning BIS_ERR_INPUT will trigger an ERR_XMLP_IO error accordingly. A stream error check makes the code more explicit: a document, and more importantly an external entity, can be well-formed and parsing can succeed even though there was a stream error! For this reason it's always good practice to do error checking in the inputsource callback.
One option for error checking would be to check feof(inputdata) after parsing. In fact this is the preferred method and works well when you wrap the XMLParser_Parse call to handle different inputsources.
Here's an example FILE* inputsrc that handles ferror() too:
int cstream(BYTE *buf, int cBytes, int *cBytesActual, void *inputData)
{
    *cBytesActual = (int)fread(buf, 1, cBytes, (FILE*)inputData);
    if (ferror((FILE*)inputData)) return BIS_ERR_INPUT; /* triggers ERR_XMLP_IO */
    return (feof((FILE*)inputData)) ? BIS_EOF : 0;
}
Samples include an example of parsing C file stream and parsing from Windows specific url stream. Xmlplint project contains curlread.c and curlread.h for parsing libcurl specific input sources like http and ftp. See also isrcmem.h for info on parsing memory buffers.
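To make the input source contract concrete, here is a minimal sketch of a callback that feeds the parser from a memory buffer. It is not the isrcmem.h implementation: the MemSource structure and MemInputSrc name are hypothetical, and BYTE is typedef'd locally as a stand-in so the snippet compiles without the library. It uses the short-read rule described above to signal EOF.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char BYTE;   /* stand-in for Parsifal's BYTE */

/* Hypothetical cursor over an in-memory document, passed as inputData. */
typedef struct MemSource {
    const BYTE *data;   /* start of the document */
    size_t size;        /* total number of bytes */
    size_t pos;         /* bytes already handed to the parser */
} MemSource;

/* Shaped like LPFNINPUTSRC: returns nonzero at end of input, 0 otherwise. */
int MemInputSrc(BYTE *buf, int cBytes, int *cBytesActual, void *inputData)
{
    MemSource *src = (MemSource *)inputData;
    size_t left, n;
    assert(buf != NULL && cBytesActual != NULL && src != NULL && cBytes >= 0);
    left = src->size - src->pos;
    n = (left < (size_t)cBytes) ? left : (size_t)cBytes;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    *cBytesActual = (int)n;
    return (*cBytesActual < cBytes);   /* a short read means EOF */
}
```

A memory source cannot fail the way a stream can, so this sketch never needs to return BIS_ERR_INPUT. It would presumably be passed to XMLParser_Parse together with a pointer to an initialized MemSource as inputData.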
LPXMLRUNTIMEATT XMLParser_GetNamedItem(LPXMLPARSER parser, XMLCH *name);
Returns the attribute, of type LPXMLRUNTIMEATT, by name, or NULL if there is no such attribute. Call this function only in startElementHandler, when attributes are valid!
XMLCH *XMLParser_GetPrefixMapping(LPXMLPARSER parser, const XMLCH *prefix);
Returns the uri for prefix, or NULL if there is no such prefix in scope.
Parsifal also tracks the scope of the predefined xml prefix, i.e. xml:space, xml:base or xml:whatever can be retrieved by calling
XMLParser_GetPrefixMapping(parser, "xml:space");
Tip: checking that xml:space equals 'preserve' in ignorableWhitespaceHandler and delegating the call to charactersHandler is a method for handling xml:space properly. Of course using validation and declaring #PCDATA is the best way to handle whitespace.
XMLCH *XMLParser_GetSystemID(LPXMLPARSER parser);
Returns the systemID/publicID of the current external entity, or NULL if parsing the main document.
These functions (and XMLParser_GetCurrentEntity) can be used to provide more information on an error condition, for example. They return meaningful values only during parsing, i.e. in the callback events; usually you call them in errorHandler. The return value of these functions is undefined AFTER the errorHandler event call (when parser->ErrorCode has been set).
int XMLParser_GetCurrentLine(LPXMLPARSER parser);
Returns the line number of the current parsing position. A return value of -1 means that position info isn't available.
NOTE: while parsing an internal entity, line and column can be relative to the internal entity string rather than the current document position. If you need REALLY accurate error information you must call XMLParser_GetCurrentEntity (which returns LPXMLENTITY, or NULL if parsing the main document). Note also that the parser's line and column information isn't file/resource byte offset information.
int XMLParser_GetCurrentColumn(LPXMLPARSER parser);
Returns column number of current parsing position (in UTF-8 characters - not bytes). Return value -1 means that position info isn't available
int XMLParser_GetContextBytes(LPXMLPARSER parser, XMLCH **Bytes, int *cBytes);
Returns the column number of the current parsing position (in bytes, 0-based). The Bytes parameter can be the address of an XMLCH* that receives the context bytes. cBytes receives the length of Bytes. Both may be NULL when you need only the return value. Note that Bytes might be NUL terminated before cBytes and it might not be a complete line; see helper.c for an example of handling context bytes. A return value of -1 means that position info/context isn't available.
LPXMLENTITY XMLParser_GetCurrentEntity(LPXMLPARSER parser);
Returns LPXMLENTITY or NULL if parsing main document. This info can be used for determining accurate error position - whether parser was processing internal entity when error occurred etc. Return value of this function is undefined AFTER errorHandler event call (when parser->ErrorCode has been set)
LPXMLENTITY XMLAPI XMLParser_SetExternalSubset(LPXMLPARSER parser, const XMLCH *publicID, const XMLCH *systemID);
This function can be used to set external subset (DTD) for documents that have no DOCTYPE declaration or to set additional DTD to be loaded before any other DTD (if present) gets loaded. This means that startDTDHandler and endDTDHandler are always called once (no matter whether there is only external DTD set by this function or both DOCTYPE declaration and DTD set by this function)
XMLParser_SetExternalSubset returns pointer to LPXMLENTITY object (return value is always valid LPXMLENTITY) - this can be used to distinguish user DTD from other external DTD in resolveEntityHandler.
To turn off user subset loading call this with NULL
values for publicID
and systemID
parameters.
See also SAX2 java API getExternalSubset
int XMLAPI XMLParser_HasMoreEvents(LPXMLPARSER parser);
When the XMLFLAG_USE_SIMPLEPULL flag is set, XMLParser_Parse (or ParseValidateDTD) only initializes the parser and comes out of the event loop without parsing. You then call this method to trigger/parse the next event(s). This feature can be used to implement a pull parser (for language bindings, an XMPP parser etc.) on top of Parsifal. See the xmlreader sample, which implements a pull parser, and the xmlreader README, which explains the technical details.
void XMLParser_Free(LPXMLPARSER parser);
Frees parser and its resources.
int XMLNormalizeBuf(XMLCH *buf, int len);
XMLNormalizeBuf can be used to normalize character content. Normalization converts all LFs, CRs, CRLF pairs and TABs in buf into a single space and trims the start and end of the buffer. XMLNormalizeBuf returns the length of the normalized buffer (most likely less than len after normalization) but it doesn't NUL terminate the buffer; you must do this yourself.
Dynamically allocating array implementation that is also exported/available for end-user of Parsifal. See XMLVector.c and XMLVector.h for details. Samples demonstrate XMLVector usage. Samples (and Parsifal internally) use XMLVector as stack too. xmldef.h provides stack macro wrappers for XMLVector.
Simple stringbuffer implementation that is also exported/available for end-user of Parsifal. See XMLStringbuf.c and XMLStringbuf.h for details. Sample files demonstrate XMLStringbuf usage.
XMLPool can be used in conjunction with XMLStringbuf for allocating fixed length strings etc. See also test_pool.c for an example of using XMLVector, XMLStringbuf and XMLPool.
Hashtable implementation. Can be used to store generic data. See xmlhash.h
DTD validation with parsifal requires the following steps besides creating the usual LPXMLPARSER object:
libparsifal/dtdvalid.h
You can reuse LPXMLDTDVALIDATOR objects and call XMLParser_ParseValidateDTD multiple times without doing any cleaning up etc. LPXMLDTDVALIDATOR is "tied into" LPXMLPARSER only during parsing (during XMLParser_ParseValidateDTD call) so XMLParser_FreeDTDValidator can also be called independently of XMLParser_Free.
Validation function prototypes:
LPXMLDTDVALIDATOR XMLAPI XMLParser_CreateDTDValidator(void);
void XMLAPI XMLParser_FreeDTDValidator(LPXMLDTDVALIDATOR dtd);
int XMLAPI XMLParser_ParseValidateDTD(LPXMLDTDVALIDATOR dtd,
LPXMLPARSER parser, LPFNINPUTSRC inputSrc,
void *inputData, const XMLCH *encoding);
An example (assumes LPXMLPARSER has been created):
LPXMLDTDVALIDATOR dtd = XMLParser_CreateDTDValidator();
if (!dtd) {
    puts("Out of memory!");
    return 1;
}
parser->startElementHandler = MyStartElement;
XMLParser_ParseValidateDTD(dtd, parser, cstream, stdin, 0);
XMLParser_FreeDTDValidator(dtd);
Important:
When parsing with XMLParser_ParseValidateDTD, UserData parameter for your LPXMLPARSER will be LPXMLDTDVALIDATOR (which contains your LPXMLPARSER as one of its members called simply 'parser'). LPXMLDTDVALIDATOR contains UserData and extra int UserFlag that you can use.
Another thing is that if you want to alter your handlers during validation you must alter the handlers that are members of the LPXMLDTDVALIDATOR object (see dtdvalid.h). Modifying the same handlers in your LPXMLPARSER doesn't do what you want (in fact it doesn't change anything). Also, the parser's elementDeclHandler gets called with the content particle tree structure and not with the string representation of the content model, so if you need the string representation you must study parsifal.c and the ContentModelToString() function etc.
Validation errors/warnings
When validation errors occur XMLParser_ParseValidateDTD returns false and parser ErrorHandler gets normally called before that (if specified). If ErrorCode is ERR_XMLP_VALIDATION then parser ErrorColumn or ErrorString etc. WILL NOT BE SET but ErrorCode, ErrorString etc. of LPXMLDTDVALIDATOR will be available. If XMLFLAG_VALIDATION_WARNINGS has been set parsing continues after reporting "error" (which in this case is merely a warning) in ErrorHandler.
An example ErrorHandler:
void ErrorHandler(LPXMLPARSER parser)
{
    if (parser->ErrorCode == ERR_XMLP_VALIDATION) {
        LPXMLDTDVALIDATOR vp = (LPXMLDTDVALIDATOR)parser->UserData;
        printf("Validation Error: %s\nErrorLine: %d ErrorColumn: %d\n",
            vp->ErrorString, vp->ErrorLine, vp->ErrorColumn);
    }
    else {
        printf("Parsing Error: %s\nErrorLine: %d ErrorColumn: %d\n",
            parser->ErrorString, parser->ErrorLine, parser->ErrorColumn);
    }
}
DTD, namespaces and selective (filtering) validation
You can use DTDValidate_StartElement and the other DTDValidate... handlers declared in dtdvalid.h for filtering and selective validation. This feature also makes it possible to correctly validate elements that are namespace prefixed: in your StartElement filter, check that uri matches your preferred one and then pass localName as qName to DTDValidate_StartElement.
You can filter attributes in the filter handlers too. The recommended way is to use a separate, new LPXMLVECTOR object into which you copy attributes selectively from startElementHandler's atts parameter using XMLVector_Append(newVect, att). Adding or modifying attributes isn't currently recommended (not documented properly).
See the nsvalid.c example and the xmlplint sources (vfilter.c).
Note: You can use a default namespace in your root element, the way xhtml1-transitional.dtd uses a #FIXED xmlns attribute in its html element, BUT if you want to allow any prefixes and/or elements containing elements from other namespaces, you must do the filtering.
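The filtering decision described above can be sketched in plain C. This is only an illustration under stated assumptions: filter_start_element, record_qname and the validate_start_fn type are hypothetical names, and the callback type is a simplified stand-in for DTDValidate_StartElement (which also receives the attribute vector), so the sketch compiles without Parsifal headers.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for the validator's start-element entry point.
   In real code this role is played by DTDValidate_StartElement from
   dtdvalid.h; this reduced signature is an assumption of the sketch. */
typedef int (*validate_start_fn)(void *userData, const char *uri,
                                 const char *localName, const char *qName);

/* Hypothetical filter: validate only elements whose namespace is wantedUri
   and hand the validator localName as qName, so the DTD does not need to
   know the prefix. Other elements are skipped and 0 is returned. */
int filter_start_element(void *userData, const char *uri,
                         const char *localName, const char *qName,
                         const char *wantedUri, validate_start_fn validate)
{
    assert(validate != NULL && wantedUri != NULL);
    (void)qName; /* the prefixed name is intentionally not forwarded */
    if (uri == NULL || strcmp(uri, wantedUri) != 0)
        return 0;
    return validate(userData, uri, localName, localName);
}

/* Demo stand-in validator: records the qName it was given. userData must
   point to a buffer of at least 32 chars. */
int record_qname(void *userData, const char *uri,
                 const char *localName, const char *qName)
{
    (void)uri;
    (void)localName;
    strncpy((char *)userData, qName, 31);
    ((char *)userData)[31] = '\0';
    return 0;
}
```

With record_qname as the validator, an element h:p in the wanted namespace reaches it as plain p, while an element from any other namespace never reaches it. See nsvalid.c for the actual Parsifal-based version.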
On Windows:
If you have the Visual C++ compiler, all sample programs can be built by running BUILD.BAT in the sample directories. Sample executables will be built into the WIN32\BIN directory.
Directory WIN32\VC6 contains VC6 project files and sources for building a Windows dll to be linked to your own projects. Simply link to parsifal.lib, include parsifal.h and define the XMLAPI=__declspec(dllimport) preprocessor definition in your target project. Also make sure that the dll is found in your path.
You can find a prebuilt parsifal.dll in the WIN32\BIN directory and parsifal.lib in the WIN32\LIB directory. The prebuilt dll is linked with the static multithreaded run-time library (option -MT).
The VC6 project file also contains a target for building the dll with libiconv support, which is tested with the binary distribution libiconv-1.9.1.bin.woe32. This builds a parsifal.dll that is linked with the dynamic multithreaded run-time library; using the same RTL as libiconv is essential. See also the win_iconv option in links.
VC6 is quite an old version of Microsoft's Visual C++. Microsoft offers versions of Visual C++ Express for free, and you should be able to import VC6 projects into newer versions easily. Since Parsifal is a cross-platform project you might want to use gcc on Windows too; see the MinGW options below.
MinGW:
Directory WIN32\MINGW\DLL contains a makefile for creating parsifal.dll. See the makefile for options/usage. Sample SAMPLES\PULL\BUILD_MINGW.BAT is a batch file example for linking with WIN32\MINGW\DLL\PARSIFAL.DLL.
Directory WIN32\MINGW etc. contains a static library for the mingw compiler and Dev-Cpp IDE project files. See README in that directory.
C++ Builder command line tools V.5.5:
Directory WIN32\BCC etc. contains a makefile for building a static library with the free BCC 5.5 compiler. The makefile was contributed by Carsten Heuer. It links with the static, single-threaded runtime library, which is the default for BCC. See README in that directory.
Other:
If you aren't using one of the compilers mentioned above, you must build your Parsifal library by means specific to your C compiler. Just generate a static or dynamic library from the files in the src directory, link your program with it and include parsifal.h in your source. The WIN32\MINGW\DLL makefile can also be a starting point.
See also README
On Linux/Unix:
tar xzf libparsifal-X.X.X.tar.gz
cd libparsifal-X.X.X
./configure
make
make install
This installs the library into /usr/lib and include files into /usr/include/libparsifal by default.
Samples are installed into /usr/local/bin by default; build the samples by executing:
make
make install (optional; you can of course place the executables into a directory of your choice)
Samples aren't built when installing the shared library. The library must be built and installed before making the samples.
Starting from version 1.1, configure offers --disable-gccflags for disabling the default compiler flags for gcc. Version 1.1 also contains the parsifal-config tool for querying information about how libparsifal was configured and installed.
See also README
#define | Default | Description
---|---|---
DTD_SUPPORT | TRUE | Compiles with DTD support. Compiling w/o DTD_SUPPORT defined makes smaller parser which doesn't support parsing DTDs at all (accepts simple DOCTYPE tag optionally containing PUBLIC and SYSTEM declared BUT NOT an internal subset of any sort). w/o DTD_SUPPORT library size will be around 48K with dynamic RTL against 68K full dtd support build. This has also effect on how character data is reported via charactersHandler.
DTDVALID_SUPPORT | TRUE | DTD Validation support. Includes dtdvalid.c
MAX_SPEED | TRUE | Performs some optimizations (basically inlining stuff). Makes the parser about 4 KB bigger but a bit faster.
ICONV_SUPPORT | FALSE/TRUE* | Compiles with GNU libiconv support.
* Note: Starting from version 1.0 linux configure script tries to locate and use iconv, this can be disabled by setting --disable-iconv configure option.
All makefiles and prebuilt libraries use these default compile options.
Parsifal uses UTF-8 encoding internally. All XMLCH type parameters passed to callbacks are UTF-8 encoded, i.e. Parsifal output is always UTF-8 encoded no matter what input encoding the document uses. This makes Parsifal usable with many languages, although sometimes you need to do some extra work converting UTF-8 to the character encoding used by your target program. See also helper.c for a UTF-8 to iso-latin1 conversion routine.
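A minimal sketch of what such a conversion involves, assuming the target program wants ISO-8859-1. utf8_to_latin1 is a hypothetical name and this is not the routine from helper.c; it only illustrates that the code points U+0080..U+00FF arrive as two-byte sequences starting with 0xC2 or 0xC3, and that anything above U+00FF has no Latin-1 equivalent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Parsifal's helper.c routine): converts a
   NUL-terminated UTF-8 string to ISO-8859-1. Code points above U+00FF
   and malformed sequences are replaced with '?'. Output is truncated
   if dst is too small. Returns the number of bytes written, excluding
   the terminating NUL. */
size_t utf8_to_latin1(const unsigned char *src, char *dst, size_t dstsize)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL && dstsize > 0);
    while (*src && n + 1 < dstsize) {
        unsigned char c = *src++;
        if (c < 0x80) {
            dst[n++] = (char)c;                  /* plain ASCII */
        } else if ((c & 0xFE) == 0xC2 && (*src & 0xC0) == 0x80) {
            /* lead byte 0xC2 or 0xC3: U+0080..U+00FF, one Latin-1 byte */
            dst[n++] = (char)(((c & 0x03) << 6) | (*src++ & 0x3F));
        } else {
            dst[n++] = '?';                      /* not representable */
            while ((*src & 0xC0) == 0x80)        /* skip continuation bytes */
                src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

In a charactersHandler you would first copy the reported characters into a NUL-terminated buffer, since this sketch expects a terminated string; the cast to unsigned char assumes XMLCH is a byte-sized character type.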
Since UTF-8 is becoming a very popular form of Unicode on various platforms, you may consider using UTF-8 as your program's internal encoding. Here's an excerpt from the UTF-8 and Unicode FAQ for Unix/Linux - http://www.cl.cam.ac.uk/~mgk25/unicode.html:
"the major Linux distributors and application developers now foresee and hope that Unicode will eventually replace all these older legacy encodings, primarily in the UTF-8 form."
TIP: to enable UTF-8 support for xterm use
xterm -u8
Read the Linux Unicode-HOWTO for more details and language/font support for xterm etc.
GNU libiconv
If you compile with GNU libiconv, you must make sure the Parsifal build uses the same kind of runtime library as your libiconv. If you're going to use libiconv on Windows, this is easily accomplished by building the libiconv target from the VC6 project file and downloading the libiconv binary distribution from the GNU ftp sites (tested with 1.9.1). See also the libiconv READMEs for more information.
You can test your build by running IconvTest() in your code. test_iconv.c can be found in the samples/misc directory.
Note: Libiconv defaults to big-endian Unicode encodings, i.e. if your document doesn't have a byte order mark you must explicitly specify UTF-16LE or UCS-4LE; when, for example, only UCS-4 is specified, libiconv defaults to UCS-4BE.
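To decide whether an explicit little-endian name is needed, a program can peek at the first bytes of the document. The following is a rough sketch only: bom_encoding is a hypothetical helper, not part of Parsifal or libiconv, and it assumes the caller has already read the first few bytes into a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const unsigned char BOM_UCS4LE[]  = {0xFF, 0xFE, 0x00, 0x00};
static const unsigned char BOM_UCS4BE[]  = {0x00, 0x00, 0xFE, 0xFF};
static const unsigned char BOM_UTF8[]    = {0xEF, 0xBB, 0xBF};
static const unsigned char BOM_UTF16LE[] = {0xFF, 0xFE};
static const unsigned char BOM_UTF16BE[] = {0xFE, 0xFF};

/* Hypothetical helper: returns the encoding name implied by a byte order
   mark at the start of buf, or NULL if no BOM is present. */
const char *bom_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* The 4-byte marks go first: FF FE 00 00 begins with the UTF-16LE mark. */
    if (len >= 4 && memcmp(buf, BOM_UCS4LE, 4) == 0)
        return "UCS-4LE";
    if (len >= 4 && memcmp(buf, BOM_UCS4BE, 4) == 0)
        return "UCS-4BE";
    if (len >= 3 && memcmp(buf, BOM_UTF8, 3) == 0)
        return "UTF-8";
    if (len >= 2 && memcmp(buf, BOM_UTF16LE, 2) == 0)
        return "UTF-16LE";
    if (len >= 2 && memcmp(buf, BOM_UTF16BE, 2) == 0)
        return "UTF-16BE";
    return NULL;
}
```

A NULL result means there is no byte order mark, which is exactly the case where a little-endian document needs UTF-16LE or UCS-4LE specified explicitly.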
Parsifal supports character references, predefined entities, and internal and external general entities, plus internal and external parameter entities (used in DTD authoring only).
Example documents with supported entity types:
<!DOCTYPE doc [
<!ENTITY ent1 "internal entity data">
<!ENTITY ent2 SYSTEM "external.ent">
]>
<doc>
&#60; <!-- this is a character reference -->
&lt; <!-- this is a reference to a predefined entity -->
&ent1; <!-- this is reference to internal general entity -->
&ent2; <!-- this is reference to external general entity -->
</doc>
An example that includes common xhtml entity declarations into document:
<!DOCTYPE doc [
<!ENTITY % latin1 SYSTEM "xhtml-lat1.ent">
%latin1; <!-- this is a reference to external parameter entity -->
]>
<doc>
Copyright &copy; 2004 Sir Elmo Von Eltonzon
</doc>
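Because callback output is UTF-8, a resolved character reference or an entity that expands to one, such as the copyright sign (U+00A9) in the example above, reaches your handler as a UTF-8 byte sequence rather than a single byte. The sketch below shows the mapping from a code point to those bytes; encode_utf8 is a hypothetical helper written for illustration and is not taken from the Parsifal sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-8.
   Writes up to 4 bytes to out and returns the byte count, or 0 if cp
   is a surrogate or lies above U+10FFFF. */
size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80UL) {                       /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)    /* surrogates are not characters */
        return 0;
    if (cp < 0x10000UL) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For code point 60 this yields the single byte '<', while U+00A9 yields the two bytes 0xC2 0xA9, so a handler that compares or prints character data should treat it as a byte sequence.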
Parsifal is re-entrant: you can run multiple parsers concurrently, and the parser works in threaded environments as long as you handle thread synchronization etc. when needed.
Copyright © 2002-2008 Toni Uusitalo.
Send mail, suggestions and bug reports to
Last modified: 04.10.2008 00:00