rapidxml
Classes | Enumerations | Functions | Variables
rapidxml Namespace Reference

Classes

Enumerations

Functions

Variables


Detailed Description

Copyright (C) 2006, 2009 Marcin Kalicinski
See accompanying file license.txt for license information.


Table of Contents


What is RapidXml?

RapidXml is an attempt to create the fastest XML DOM parser possible, while retaining useability, portability and reasonable W3C compatibility. It is an in-situ parser written in C++, with parsing speed approaching that of strlen() function executed on the same data.

Entire parser is contained in a single header file, so no building or linking is neccesary. To use it you just need to copy rapidxml.hpp file to a convenient place (such as your project directory), and include it where needed. You may also want to use printing functions contained in header rapidxml_print.hpp.

Dependencies And Compatibility

RapidXml has no dependencies other than a very small subset of standard C++ library (<cassert>, <cstdlib>, <new> and <exception>, unless exceptions are disabled). It should compile on any reasonably conformant compiler, and was tested on Visual C++ 2003, Visual C++ 2005, Visual C++ 2008, gcc 3, gcc 4, and Comeau 4.3.3. Care was taken that no warnings are produced on these compilers, even with highest warning levels enabled.

Character Types And Encodings

RapidXml is character type agnostic, and can work both with narrow and wide characters. Current version does not fully support UTF-16 or UTF-32, so use of wide characters is somewhat incapacitated. However, it should succesfully parse wchar_t strings containing UTF-16 or UTF-32 if endianness of the data matches that of the machine. UTF-8 is fully supported, including all numeric character references, which are expanded into appropriate UTF-8 byte sequences (unless you enable parse_no_utf8 flag).

Note that RapidXml performs no decoding - strings returned by name() and value() functions will contain text encoded using the same encoding as source file. Rapidxml understands and expands the following character references: &apos; &amp; &quot; &lt; &gt; &#...; Other character references are not expanded.

Error Handling

By default, RapidXml uses C++ exceptions to report errors. If this behaviour is undesirable, RAPIDXML_NO_EXCEPTIONS can be defined to suppress exception code. See parse_error class and parse_error_handler() function for more information.

Memory Allocation

RapidXml uses a special memory pool object to allocate nodes and attributes, because direct allocation using new operator would be far too slow. Underlying memory allocations performed by the pool can be customized by use of memory_pool::set_allocator() function. See class memory_pool for more information.

W3C Compliance

RapidXml is not a W3C compliant parser, primarily because it ignores DOCTYPE declarations. There is a number of other, minor incompatibilities as well. Still, it can successfully parse and produce complete trees of all valid XML files in W3C conformance suite (over 1000 files specially designed to find flaws in XML processors). In destructive mode it performs whitespace normalization and character entity substitution for a small set of built-in entities.

API Design

RapidXml API is minimalistic, to reduce code size as much as possible, and facilitate use in embedded environments. Additional convenience functions are provided in separate headers: rapidxml_utils.hpp and rapidxml_print.hpp. Contents of these headers is not an essential part of the library, and is currently not documented (otherwise than with comments in code).

Reliability

RapidXml is very robust and comes with a large harness of unit tests. Special care has been taken to ensure stability of the parser no matter what source text is thrown at it. One of the unit tests produces 100,000 randomly corrupted variants of XML document, which (when uncorrupted) contains all constructs recognized by RapidXml. RapidXml passes this test when it correctly recognizes that errors have been introduced, and does not crash or loop indefinitely.

Another unit test puts RapidXml head-to-head with another, well estabilished XML parser, and verifies that their outputs match across a wide variety of small and large documents.

Yet another test feeds RapidXml with over 1000 test files from W3C compliance suite, and verifies that correct results are obtained. There are also additional tests that verify each API function separately, and test that various parsing modes work as expected.

Acknowledgements

I would like to thank Arseny Kapoulkine for his work on pugixml, which was an inspiration for this project. Additional thanks go to Kristen Wegner for creating pugxml, from which pugixml was derived. Janusz Wohlfeil kindly ran RapidXml speed tests on hardware that I did not have access to, allowing me to expand performance comparison table.

Two Minute Tutorial

Parsing

The following code causes RapidXml to parse a zero-terminated string named text:

using namespace rapidxml;
xml_document<> doc;    // character type defaults to char
doc.parse<0>(text);    // 0 means default parse flags

doc object is now a root of DOM tree containing representation of the parsed XML. Because all RapidXml interface is contained inside namespace rapidxml, users must either bring contents of this namespace into scope, or fully qualify all the names. Class xml_document represents a root of the DOM hierarchy. By means of public inheritance, it is also an xml_node and a memory_pool. Template parameter of xml_document::parse() function is used to specify parsing flags, with which you can fine-tune behaviour of the parser. Note that flags must be a compile-time constant.

Accessing The DOM Tree

To access the DOM tree, use methods of xml_node and xml_attribute classes:

cout << "Name of my first node is: " << doc.first_node()->name() << "\n";
xml_node<> *node = doc.first_node("foobar");
cout << "Node foobar has value " << node->value() << "\n";
for (xml_attribute<> *attr = node->first_attribute();
     attr; attr = attr->next_attribute())
{
    cout << "Node foobar has attribute " << attr->name() << " ";
    cout << "with value " << attr->value() << "\n";
}

Modifying The DOM Tree

DOM tree produced by the parser is fully modifiable. Nodes and attributes can be added/removed, and their contents changed. The below example creates a HTML document, whose sole contents is a link to google.com website:

xml_document<> doc;
xml_node<> *node = doc.allocate_node(node_element, "a", "Google");
doc.append_node(node);
xml_attribute<> *attr = doc.allocate_attribute("href", "google.com");
node->append_attribute(attr);

One quirk is that nodes and attributes do not own the text of their names and values. This is because normally they only store pointers to the source text. So, when assigning a new name or value to the node, care must be taken to ensure proper lifetime of the string. The easiest way to achieve it is to allocate the string from the xml_document memory pool. In the above example this is not necessary, because we are only assigning character constants. But the code below uses memory_pool::allocate_string() function to allocate node name (which will have the same lifetime as the document), and assigns it to a new node:

xml_document<> doc;
char *node_name = doc.allocate_string(name);        // Allocate string and copy name into it
xml_node<> *node = doc.allocate_node(node_element, node_name);  // Set node name to node_name

Check Reference section for description of the entire interface.

Printing XML

You can print xml_document and xml_node objects into an XML string. Use print() function or operator <<, which are defined in rapidxml_print.hpp header.

using namespace rapidxml;
xml_document<> doc;    // character type defaults to char
// ... some code to fill the document

// Print to stream using operator <<
std::cout << doc;   

// Print to stream using print function, specifying printing flags
print(std::cout, doc, 0);   // 0 means default printing flags

// Print to string using output iterator
std::string s;
print(std::back_inserter(s), doc, 0);

// Print to memory buffer using output iterator
char buffer[4096];                      // You are responsible for making the buffer large enough!
char *end = print(buffer, doc, 0);      // end contains pointer to character after last printed character
*end = 0;                               // Add string terminator after XML

Differences From Regular XML Parsers

RapidXml is an in-situ parser, which allows it to achieve very high parsing speed. In-situ means that parser does not make copies of strings. Instead, it places pointers to the source text in the DOM hierarchy.

Lifetime Of Source Text

In-situ parsing requires that source text lives at least as long as the document object. If source text is destroyed, names and values of nodes in DOM tree will become destroyed as well. Additionally, whitespace processing, character entity translation, and zero-termination of strings require that source text be modified during parsing (but see non-destructive mode). This makes the text useless for further processing once it was parsed by RapidXml.

In many cases however, these are not serious issues.

Ownership Of Strings

Nodes and attributes produced by RapidXml do not own their name and value strings. They merely hold the pointers to them. This means you have to be careful when setting these values manually, by using xml_base::name(const Ch *) or xml_base::value(const Ch *) functions. Care must be taken to ensure that lifetime of the string passed is at least as long as lifetime of the node/attribute. The easiest way to achieve it is to allocate the string from memory_pool owned by the document. Use memory_pool::allocate_string() function for this purpose.

Destructive Vs Non-Destructive Mode

By default, the parser modifies source text during the parsing process. This is required to achieve character entity translation, whitespace normalization, and zero-termination of strings.

In some cases this behaviour may be undesirable, for example if source text resides in read only memory, or is mapped to memory directly from file. By using appropriate parser flags (parse_non_destructive), source text modifications can be disabled. However, because RapidXml does in-situ parsing, it obviously has the following side-effects:

Performance

RapidXml achieves its speed through use of several techniques:

This results in a very small and fast code: a parser which is custom tailored to exact needs with each invocation.

Comparison With Other Parsers

The table below compares speed of RapidXml to some other parsers, and to strlen() function executed on the same data. On a modern CPU (as of 2007), you can expect parsing throughput to be close to 1 GB/s. As a rule of thumb, parsing speed is about 50-100x faster than Xerces DOM, 30-60x faster than TinyXml, 3-12x faster than pugxml, and about 5% - 30% faster than pugixml, the fastest XML parser I know of.

Platform
Compiler
strlen() RapidXml pugixml 0.3 pugxml TinyXml
Pentium 4
MSVC 8.0
2.5
5.4
7.0
61.7
298.8
Pentium 4
gcc 4.1.1
0.8
6.1
9.5
67.0
413.2
Core 2
MSVC 8.0
1.0
4.5
5.0
24.6
154.8
Core 2
gcc 4.1.1
0.6
4.6
5.4
28.3
229.3
Athlon XP
MSVC 8.0
3.1
7.7
8.0
25.5
182.6
Athlon XP
gcc 4.1.1
0.9
8.2
9.2
33.7
265.2
Pentium 3
MSVC 8.0
2.0
6.3
7.0
30.9
211.9
Pentium 3
gcc 4.1.1
1.0
6.7
8.9
35.3
316.0

(*) All results are in CPU cycles per character of source text

Reference

This section lists all classes, functions, constants etc. and describes them in detail.