The application can set a variety of NekoHTML settings to control the
behavior of the parser more precisely. These settings can be set
directly on the `HTMLConfiguration` class or on the supplied parser
classes by calling the `setFeature` and `setProperty` methods.

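
As a minimal sketch of these calls: the standard `org.xml.sax.XMLReader` interface declares both `setFeature` and `setProperty`, so a SAX-style parser can be configured through it. The helper class `NekoSettings` and its `apply` method are illustrative names, not part of NekoHTML; the two setting identifiers are taken from the tables in this document.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper for illustration; NekoHTML does not ship this class.
final class NekoSettings {
    static final String AUGMENTATIONS =
        "http://cyberneko.org/html/features/augmentations";
    static final String NAMES_ELEMS =
        "http://cyberneko.org/html/properties/names/elems";

    private NekoSettings() {}

    /** Turns on event augmentations and requests lower-case element names. */
    static void apply(XMLReader reader) throws SAXException {
        reader.setFeature(AUGMENTATIONS, true);   // features take a boolean
        reader.setProperty(NAMES_ELEMS, "lower"); // properties take an Object
    }
}
```

Assuming NekoHTML is on the classpath, this could be called as `NekoSettings.apply(new org.cyberneko.html.parsers.SAXParser())`. A reader that does not recognize an identifier is expected to throw `SAXNotRecognizedException`.
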
| Feature Id | Description | Default |
|---|---|---|
| http://cyberneko.org/html/features/balance-tags | Specifies whether the NekoHTML parser should attempt to balance the tags in the parsed document. Balancing the tags fixes many common mistakes by adding missing parent elements, automatically closing elements with optional end tags, and correcting unbalanced inline element tags. To process HTML documents as XML, this feature should not be turned off. The ability to turn it off is provided as a performance enhancement for applications that only care about the appearance of specific elements, attributes, and/or content regardless of the document's ill-formed structure. | true |
| http://cyberneko.org/html/features/balance-tags/ignore-outside-content | Specifies whether the NekoHTML parser should ignore content after the end of the document root element. If this feature is set to true, all elements and character content appearing outside of the document body are consumed. If set to false, the end tags for the `<body>` and `<html>` elements are ignored, allowing content appearing outside of the document to be parsed and communicated to the application. | false |
| http://cyberneko.org/html/features/balance-tags/document-fragment | Specifies whether the tag balancer should operate as if a fragment of HTML is being parsed. With this feature set, the tag balancer will not attempt to insert missing body elements around content and markup. However, proper parents for elements contained within the `<body>` element will still be inserted. This feature should not be used with the `DOMParser` class. To parse a DOM `DocumentFragment`, use the `DOMFragmentParser` class. | false |
| http://apache.org/xml/features/scanner/notify-char-refs | Specifies whether character entity references (e.g. `&#x20;`, `&#32;`, etc.) should be reported to the registered document handler. The name of the entity reported will contain the leading pound sign and optional 'x' character. For example, the name of the character entity reference `&#x20;` will be reported as "#x20". | false |
| http://apache.org/xml/features/scanner/notify-builtin-refs | Specifies whether the XML built-in entity references (e.g. `&amp;`, `&lt;`, etc.) should be reported to the registered document handler. This only applies to the five pre-defined XML general entities: "amp", "lt", "gt", "quot", and "apos". This is done for compatibility with the Xerces feature. To be notified of the built-in entity references in HTML, set the http://cyberneko.org/html/features/scanner/notify-builtin-refs feature to true. | false |
| http://cyberneko.org/html/features/scanner/notify-builtin-refs | Specifies whether the HTML built-in entity references (e.g. `&nbsp;`, `&copy;`, etc.) should be reported to the registered document handler. This includes the five pre-defined XML general entities. | false |
| http://cyberneko.org/html/features/scanner/ignore-specified-charset | Specifies whether to ignore the character encoding specified within the `<meta http-equiv='Content-Type' content='text/html;charset=...'>` tag. By default, NekoHTML checks this tag for a charset and changes the character encoding of the scanning reader object. Setting this feature to true allows the application to override this behavior. | false |
| http://cyberneko.org/html/features/scanner/script/strip-comment-delims | Specifies whether the scanner should strip HTML comment delimiters (i.e. `<!--` and `-->`) from `<script>` element content. | false |
| http://cyberneko.org/html/features/scanner/style/strip-comment-delims | Specifies whether the scanner should strip HTML comment delimiters (i.e. `<!--` and `-->`) from `<style>` element content. | false |
| http://cyberneko.org/html/features/augmentations | Specifies whether infoset items that correspond to the HTML events are included in the parsing pipeline. If included, the augmented item will implement the `HTMLEventInfo` interface found in the `org.cyberneko.html` package. The augmentations can be queried in XNI by calling the `getItem` method with the key "http://cyberneko.org/html/features/augmentations". Currently, the HTML event info augmentation can report event character boundaries and whether the event is synthesized. | false |
| http://cyberneko.org/html/features/report-errors | Specifies whether errors should be reported to the registered error handler. Since HTML applications are supposed to permit the liberal use (and abuse) of HTML documents, errors should normally be handled silently. However, if the application wants to know about errors in the parsed HTML document, this feature can be set to true. | false |

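
The naming rule described for the notify-char-refs feature can be sketched as follows. This only illustrates the documented rule (the reported name keeps the pound sign and the optional 'x'); it is not NekoHTML's scanner code, and the class `CharRefNames` and its methods are hypothetical names introduced here.

```java
// Hypothetical illustration of the documented naming rule; not NekoHTML code.
final class CharRefNames {
    private CharRefNames() {}

    /**
     * Returns the name reported for a numeric character reference such as
     * "&#x20;" or "&#32;": the text between the '&' and the ';'.
     */
    static String reportedName(String reference) {
        if (reference.length() < 4
                || !reference.startsWith("&#")
                || !reference.endsWith(";")) {
            throw new IllegalArgumentException(
                "Not a numeric character reference: " + reference);
        }
        return reference.substring(1, reference.length() - 1);
    }

    /** Decodes a reported name such as "#x20" or "#32" to its code point. */
    static int codePoint(String name) {
        boolean hex = name.startsWith("#x") || name.startsWith("#X");
        String digits = name.substring(hex ? 2 : 1);
        return Integer.parseInt(digits, hex ? 16 : 10);
    }
}
```

A document handler that receives such a name could use a helper like `codePoint` to recover the referenced character.
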
| Property Id | Description | Values | Default |
|---|---|---|---|
| http://cyberneko.org/html/properties/filters | This property allows applications to append custom document processing components to the end of the default NekoHTML parser pipeline. The value of this property must be an array of type `org.apache.xerces.xni.parser.XMLDocumentFilter`, and no element of the array may be null. The document filters are appended to the parser pipeline in array order. Please refer to the filters documentation for more information. | `XMLDocumentFilter[]` | null |
| http://cyberneko.org/html/properties/default-encoding | Sets the default encoding the NekoHTML scanner should use when parsing documents. This setting matters when the source document has no http-equiv directive, because the parser has no support for auto-detecting the encoding. | IANA encoding names | "Windows-1252" |
| http://cyberneko.org/html/properties/names/elems | Specifies how the NekoHTML components should modify recognized element names. Names can be converted to upper-case, converted to lower-case, or left as-is. The value of "match" specifies that element names are to be left as-is but the end tag name will be modified to match the start tag name. This is required to ensure that the parser generates a well-formed XML document. | "upper", "lower", "match" | "upper" |
| http://cyberneko.org/html/properties/names/attrs | Specifies how the NekoHTML components should modify attribute names of recognized elements. Names can be converted to upper-case, converted to lower-case, or left as-is. | "upper", "lower", "no-change" | "lower" |

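
The effect of the names/elems and names/attrs values can be sketched as below. This is an illustration of the behavior described in the table, not NekoHTML's implementation; the class `NameCase` and its methods are hypothetical names.

```java
import java.util.Locale;

// Hypothetical illustration of the documented name handling; not NekoHTML code.
final class NameCase {
    private NameCase() {}

    /** Applies one of the documented values to an element or attribute name. */
    static String modify(String name, String mode) {
        switch (mode) {
            case "upper":
                return name.toUpperCase(Locale.ENGLISH);
            case "lower":
                return name.toLowerCase(Locale.ENGLISH);
            case "match":      // element names: left as-is
            case "no-change":  // attribute names: left as-is
                return name;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    /** With "match", the end tag takes the spelling of its start tag. */
    static String endTagName(String startName, String endName, String mode) {
        return "match".equals(mode) ? startName : modify(endName, mode);
    }
}
```

For instance, with "match" a start tag written as `<Div>` and closed by `</DIV>` would yield the same name, "Div", for both events, which is what keeps the output well-formed.
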
(C) Copyright 2002-2003, Andy Clark. All rights reserved.