org.cyberneko.html.filters
Class Purifier

java.lang.Object
  extended by org.cyberneko.html.filters.DefaultFilter
      extended by org.cyberneko.html.filters.Purifier
All Implemented Interfaces:
HTMLComponent, XMLComponent, XMLDocumentFilter, XMLDocumentHandler, XMLDocumentSource

public class Purifier
extends DefaultFilter

This filter purifies the HTML input to ensure XML well-formedness. The purification process includes:

Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".

In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.

The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
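
The comment and CDATA rules above can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of this API. The sketch operates on plain strings rather than XMLString and only shows the replacement rule as described.

```java
// Illustrative sketch only; not the Purifier source.
final class CommentCdataSketch {

    // Follow every '-' with a space so "--" can never occur in comment text.
    static String escapeComment(String text) {
        return text.replace("-", "- ");
    }

    // Follow every ']' with a space so "]]" can never occur in a CDATA section.
    static String escapeCdata(String text) {
        return text.replace("]", "] ");
    }

    public static void main(String[] args) {
        System.out.println(escapeComment("a--b"));  // prints: a- - b
        System.out.println(escapeCdata("x[0]]>"));  // prints: x[0] ] >
    }
}
```

Inserting a space after each occurrence, instead of only between adjacent pairs, keeps the rule independent of how the text is split across callbacks.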

Version:
$Id$
Author:
Andy Clark

Field Summary
protected static String AUGMENTATIONS
          Include infoset augmentations.
protected  boolean fAugmentations
          True if infoset augmentations are included.
protected  boolean fInCDATASection
          True if inside a CDATA section.
protected  NamespaceContext fNamespaceContext
          Namespace information.
protected  boolean fNamespaces
          True if namespace processing is enabled.
protected  String fPublicId
          Public identifier of doctype declaration.
protected  boolean fSeenDoctype
          True if the doctype declaration was seen.
protected  boolean fSeenRootElement
          True if root element was seen.
protected  int fSynthesizedNamespaceCount
          Synthesized namespace binding count.
protected  String fSystemId
          System identifier of doctype declaration.
protected static String NAMESPACES
          Namespaces feature identifier.
protected static HTMLEventInfo SYNTHESIZED_ITEM
          Synthesized event info item.
static String SYNTHESIZED_NAMESPACE_PREFX
          Synthesized namespace binding prefix.
 
Fields inherited from class org.cyberneko.html.filters.DefaultFilter
fDocumentHandler, fDocumentSource
 
Constructor Summary
Purifier()
           
 
Method Summary
 void characters(XMLString text, Augmentations augs)
          Characters.
 void comment(XMLString text, Augmentations augs)
          Comment.
 void doctypeDecl(String root, String pubid, String sysid, Augmentations augs)
          Doctype declaration.
 void emptyElement(QName element, XMLAttributes attrs, Augmentations augs)
          Empty element.
 void endCDATA(Augmentations augs)
          End CDATA section.
 void endElement(QName element, Augmentations augs)
          End element.
protected  void handleStartDocument()
          Handle start document.
protected  void handleStartElement(QName element, XMLAttributes attrs)
          Handle start element.
 void processingInstruction(String target, XMLString data, Augmentations augs)
          Processing instruction.
protected  String purifyName(String name, boolean localpart)
          Purify name.
protected  QName purifyQName(QName qname)
          Purify qualified name.
protected  XMLString purifyText(XMLString text)
          Purify content.
 void reset(XMLComponentManager manager)
          Resets the component.
 void startCDATA(Augmentations augs)
          Start CDATA section.
 void startDocument(XMLLocator locator, String encoding, Augmentations augs)
          Start document.
 void startDocument(XMLLocator locator, String encoding, NamespaceContext nscontext, Augmentations augs)
          Start document.
 void startElement(QName element, XMLAttributes attrs, Augmentations augs)
          Start element.
protected  void synthesizeBinding(XMLAttributes attrs, String ns)
          Synthesize namespace binding.
protected  Augmentations synthesizedAugs()
          Returns an augmentations object with a synthesized item added.
protected static String toHexString(int c, int padlen)
          Returns a padded hexadecimal string for the given value.
 void xmlDecl(String version, String encoding, String standalone, Augmentations augs)
          XML declaration.
 
Methods inherited from class org.cyberneko.html.filters.DefaultFilter
endDocument, endGeneralEntity, endPrefixMapping, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, ignorableWhitespace, merge, setDocumentHandler, setDocumentSource, setFeature, setProperty, startGeneralEntity, startPrefixMapping, textDecl
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SYNTHESIZED_NAMESPACE_PREFX

public static final String SYNTHESIZED_NAMESPACE_PREFX
Synthesized namespace binding prefix.

See Also:
Constant Field Values

NAMESPACES

protected static final String NAMESPACES
Namespaces feature identifier.

See Also:
Constant Field Values

AUGMENTATIONS

protected static final String AUGMENTATIONS
Include infoset augmentations.

See Also:
Constant Field Values

SYNTHESIZED_ITEM

protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.


fNamespaces

protected boolean fNamespaces
True if namespace processing is enabled.


fAugmentations

protected boolean fAugmentations
True if infoset augmentations are included.


fSeenDoctype

protected boolean fSeenDoctype
True if the doctype declaration was seen.


fSeenRootElement

protected boolean fSeenRootElement
True if root element was seen.


fInCDATASection

protected boolean fInCDATASection
True if inside a CDATA section.


fPublicId

protected String fPublicId
Public identifier of doctype declaration.


fSystemId

protected String fSystemId
System identifier of doctype declaration.


fNamespaceContext

protected NamespaceContext fNamespaceContext
Namespace information.


fSynthesizedNamespaceCount

protected int fSynthesizedNamespaceCount
Synthesized namespace binding count.

Constructor Detail

Purifier

public Purifier()

Method Detail

reset

public void reset(XMLComponentManager manager)
           throws XMLConfigurationException
Description copied from class: DefaultFilter
Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.

Specified by:
reset in interface XMLComponent
Overrides:
reset in class DefaultFilter
Parameters:
manager - The component manager.
Throws:
XMLConfigurationException

startDocument

public void startDocument(XMLLocator locator,
                          String encoding,
                          Augmentations augs)
                   throws XNIException
Start document.

Overrides:
startDocument in class DefaultFilter
Throws:
XNIException

startDocument

public void startDocument(XMLLocator locator,
                          String encoding,
                          NamespaceContext nscontext,
                          Augmentations augs)
                   throws XNIException
Start document.

Specified by:
startDocument in interface XMLDocumentHandler
Overrides:
startDocument in class DefaultFilter
Throws:
XNIException

xmlDecl

public void xmlDecl(String version,
                    String encoding,
                    String standalone,
                    Augmentations augs)
             throws XNIException
XML declaration.

Specified by:
xmlDecl in interface XMLDocumentHandler
Overrides:
xmlDecl in class DefaultFilter
Throws:
XNIException

comment

public void comment(XMLString text,
                    Augmentations augs)
             throws XNIException
Comment.

Specified by:
comment in interface XMLDocumentHandler
Overrides:
comment in class DefaultFilter
Throws:
XNIException

processingInstruction

public void processingInstruction(String target,
                                  XMLString data,
                                  Augmentations augs)
                           throws XNIException
Processing instruction.

Specified by:
processingInstruction in interface XMLDocumentHandler
Overrides:
processingInstruction in class DefaultFilter
Throws:
XNIException

doctypeDecl

public void doctypeDecl(String root,
                        String pubid,
                        String sysid,
                        Augmentations augs)
                 throws XNIException
Doctype declaration.

Specified by:
doctypeDecl in interface XMLDocumentHandler
Overrides:
doctypeDecl in class DefaultFilter
Throws:
XNIException

startElement

public void startElement(QName element,
                         XMLAttributes attrs,
                         Augmentations augs)
                  throws XNIException
Start element.

Specified by:
startElement in interface XMLDocumentHandler
Overrides:
startElement in class DefaultFilter
Throws:
XNIException

emptyElement

public void emptyElement(QName element,
                         XMLAttributes attrs,
                         Augmentations augs)
                  throws XNIException
Empty element.

Specified by:
emptyElement in interface XMLDocumentHandler
Overrides:
emptyElement in class DefaultFilter
Throws:
XNIException

startCDATA

public void startCDATA(Augmentations augs)
                throws XNIException
Start CDATA section.

Specified by:
startCDATA in interface XMLDocumentHandler
Overrides:
startCDATA in class DefaultFilter
Throws:
XNIException

endCDATA

public void endCDATA(Augmentations augs)
              throws XNIException
End CDATA section.

Specified by:
endCDATA in interface XMLDocumentHandler
Overrides:
endCDATA in class DefaultFilter
Throws:
XNIException

characters

public void characters(XMLString text,
                       Augmentations augs)
                throws XNIException
Characters.

Specified by:
characters in interface XMLDocumentHandler
Overrides:
characters in class DefaultFilter
Throws:
XNIException

endElement

public void endElement(QName element,
                       Augmentations augs)
                throws XNIException
End element.

Specified by:
endElement in interface XMLDocumentHandler
Overrides:
endElement in class DefaultFilter
Throws:
XNIException

handleStartDocument

protected void handleStartDocument()
Handle start document.


handleStartElement

protected void handleStartElement(QName element,
                                  XMLAttributes attrs)
Handle start element.


synthesizeBinding

protected void synthesizeBinding(XMLAttributes attrs,
                                 String ns)
Synthesize namespace binding.
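
As a rough illustration of how unique synthesized URIs of the documented form "http://cyberneko.org/html/ns/synthesized/number" could be produced, the sketch below appends an increasing counter. The class name, the method name, and the counter's starting value are assumptions. How the real method records the binding in the attribute list is not shown.

```java
// Illustrative sketch only; not the Purifier source.
final class SynthesizedUriSketch {

    // Base URI taken from the class description.
    private static final String BASE = "http://cyberneko.org/html/ns/synthesized/";

    // Number of URIs handed out so far (starting at 0 is an assumption).
    private int count = 0;

    // Returns a URI that differs from every URI returned earlier by this instance.
    String nextUri() {
        return BASE + (count++);
    }

    public static void main(String[] args) {
        SynthesizedUriSketch sketch = new SynthesizedUriSketch();
        System.out.println(sketch.nextUri());
        System.out.println(sketch.nextUri());
    }
}
```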


synthesizedAugs

protected final Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.


purifyQName

protected QName purifyQName(QName qname)
Purify qualified name.


purifyName

protected String purifyName(String name,
                            boolean localpart)
Purify name.
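
The "_u####_" rule from the class description might look roughly like the following. This is a sketch under stated assumptions: the name-character checks are simplified ASCII-only stand-ins for the full XML name rules, the hexadecimal digits are written in upper case, and the localpart flag is omitted because its exact effect is not described here. All names are hypothetical.

```java
// Illustrative sketch only; not the Purifier source.
final class PurifyNameSketch {

    // Simplified stand-in (assumption): ASCII letters, '_' and ':' may start a name.
    static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || c == ':';
    }

    // Simplified stand-in (assumption): name-start characters plus digits, '-' and '.'.
    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    // Replaces each disallowed character with "_u####_", where #### is the
    // character's UTF-16 code unit written as four hexadecimal digits.
    static String purifyName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean allowed = (i == 0) ? isNameStart(c) : isNameChar(c);
            if (allowed) {
                out.append(c);
            } else {
                out.append("_u").append(String.format("%04X", (int) c)).append('_');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(purifyName("my tag"));  // prints: my_u0020_tag
        System.out.println(purifyName("1st"));     // prints: _u0031_st
    }
}
```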


purifyText

protected XMLString purifyText(XMLString text)
Purify content.
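
A minimal sketch of the content rule from the class description follows. It assumes that "illegal" means "outside the XML 1.0 Char production" and reads the documented "\\u####" as a Java string literal, that is, a single backslash followed by "u" and four hexadecimal digits. It also works on String instead of XMLString and uses upper-case digits. None of these choices are confirmed by this page, and the names are hypothetical.

```java
// Illustrative sketch only; not the Purifier source.
final class PurifyTextSketch {

    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    // | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies legal characters through and replaces each illegal one with a
    // backslash, the letter 'u', and the code point as four hexadecimal digits.
    static String purifyText(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("\\u").append(String.format("%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Form feed (0x0C) is not a legal XML 1.0 character.
        String input = "a" + (char) 0x0C + "b";
        System.out.println(purifyText(input));
    }
}
```

With the input above, the form feed is replaced by the six visible characters \u000C, so the output contains only characters that are legal in XML.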


toHexString

protected static String toHexString(int c,
                                    int padlen)
Returns a padded hexadecimal string for the given value.
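
One plausible reading of "padded hexadecimal string" is left-padding with zeros up to padlen digits, sketched below. The upper-case digits and the choice not to truncate values longer than padlen are assumptions; the class name is hypothetical.

```java
// Illustrative sketch only; not the Purifier source.
final class HexSketch {

    // Left-pads the hexadecimal form of c with '0' until it has padlen digits.
    // Values that already need more than padlen digits are returned unpadded.
    static String toHexString(int c, int padlen) {
        StringBuilder sb = new StringBuilder(Integer.toHexString(c).toUpperCase());
        while (sb.length() < padlen) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexString(0x20, 4));     // prints: 0020
        System.out.println(toHexString(0x1F600, 4));  // prints: 1F600
    }
}
```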



(C) Copyright 2002-2004, Andy Clark. All rights reserved.