http://xml.apache.org/http://www.apache.org/http://www.w3.org/

Home

Readme
Installation
Download
CVS Repository

Samples
API JavaDoc
FAQs

Features
Properties

XNI Manual
XML Schema
SAX
DOM
Limitations

Release Info
Report a Bug

Questions
 

Answers
 
How do I create a SAX parser?
 

You can create a SAX parser by using the Java APIs for XML Processing (JAXP). The following source code shows how:

import java.io.IOException; 
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

  ...

String xmlFile = "file:///xerces-2_6_2/data/personal.xml"; 
try {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser parser = factory.newSAXParser();
    DefaultHandler handler = /* custom handler class */;
    parser.parse(xmlFile, handler);
} 
catch (FactoryConfigurationError e) {
    // unable to get a document builder factory
} 
catch (ParserConfigurationException e) {
    // parser was unable to be configured
catch (SAXException e) {
    // parsing error
} 
catch (IOException e) {
    // i/o error
}

Why does the SAX parser lose some character data or why is the data split into several chunks?
 

If you read the SAX documentation, you will find that SAX may deliver contiguous text as multiple calls to characters, for reasons having to do with parser efficiency and input buffering. It is the programmer's responsibility to deal with that appropriately, e.g. by accumulating text until the next non-characters event.

Xerces will split calls to characters at the end of an internal buffer, at a new line and also at a few other boundaries. You can never rely on contiguous text to be passed in a single characters callback.


Why doesn't the SAX parser report ignorable whitespace for XML Schemas?
 

SAX is very clear that ignorableWhitespace is only called for element content whitespace, which is defined in the context of a DTD. The result of schema validation is the Post-Schema-Validation Infoset (PSVI). Schema processors augment the base Infoset by adding new properties to element and attribute information items, but not character information items. Schemas do not change whether a character is element content whitespace.


Why does the SAX parser report that xmlns attributes have no namespace?
 

An erratum for the Namespaces in XML recommendation put namespace declaration attributes in the namespace "http://www.w3.org/2000/xmlns/". SAX2 (SAX 2.0.1) does not agree with this change so conforming parsers must report that these attributes have no namespace. Xerces behaves according to SAX2. Your code must handle this discrepancy when interacting with APIs such as DOM and applications which expect a namespace for xmlns attributes.


Is there any way I can determine what encoding an entity was written in, or what XML version the document conformed to, if I'm using SAX?
 

The answer to this question is that, yes there is a way, but it's not particularly beautiful. There is no way in SAX 2.0.0 or 2.0.1 to get hold of these pieces of information; the SAX Locator2 interface from the 1.1 extensions--still in Beta at the time of writing--does provide methods to accomplish this, but since Xerces is required to support precisely SAX 2.0.1 by Sun TCK rules, we cannot ship this interface. However, we can still support the appropriate methods on the objects we provide to implement the SAX Locator interface. Therefore, assuming Locator is an instance of the SAX Locator interface that Xerces has passed back in a setDocumentLocator call, you can use a method like this to determine the encoding of the entity currently being parsed:

import java.lang.reflect.Method;
private String getEncoding(Locator locator) {
    String encoding = null;
    Method getEncoding = null;
    try {
        getEncoding = 
            locator.getClass().getMethod("getEncoding", new Class[]{});
        if(getEncoding != null) {
            encoding = (String)getEncoding.invoke(locator, null);
        }
    } catch (Exception e) {
        // either this locator object doesn't have this
        // method, or we're on an old JDK
    }
    return encoding;
}

This code has the advantage that it will compile on JDK 1.1.8, though it will only produce non-null results on 1.2.x JDK's and later. Substituting getXMLVersion for getEncoding will enable you to determine the version of XML to which the instance document conforms.




Copyright © 1999-2004 The Apache Software Foundation. All Rights Reserved.