java.lang.Object
  org.w3c.tidy.Lexer
Lexer for html parser.
Given a file stream fp, it returns a sequence of tokens.
GetToken(fp) gets the next token.
UngetToken(fp) provides one level undo.
The tags include an attribute list:
- linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values
White space is compacted if not in preformatted mode. If not in preformatted mode, then leading white space is discarded and subsequent white space sequences are compacted to single space chars.
If XmlTags is no, then tag names are folded to upper case and attribute names to lower case.
Not yet done:
- Doctype subset and marked sections
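As an illustration of the white space rule above, the following standalone sketch discards leading white space and collapses each later run to a single space. It is not JTidy code: the class WhitespaceSketch and the method compact are hypothetical names, and the wasWhite flag is only modeled on the waswhite field listed in the Field Summary.

```java
// Hypothetical sketch, not part of org.w3c.tidy.
final class WhitespaceSketch {

    /** Drops leading white space and collapses every later run to one space. */
    static String compact(String text) {
        StringBuilder out = new StringBuilder();
        boolean wasWhite = true; // start "inside" white space so a leading run is discarded
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                if (!wasWhite) {
                    out.append(' '); // first white space char of a run becomes one space
                    wasWhite = true;
                }
            } else {
                out.append(c);
                wasWhite = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "[Hello world ]": the trailing run is compacted, not removed
        System.out.println("[" + compact("  Hello \n\t world  ") + "]");
    }
}
```

In preformatted mode a lexer would skip this step and keep the text as is.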
Nested Class Summary | |
private static class |
Lexer.W3CVersionInfo
document type. |
Field Summary | |
protected short |
badAccess
for accessibility errors. |
protected short |
badChars
for bad char encodings. |
protected boolean |
badDoctype
set if html or PUBLIC is missing. |
protected short |
badForm
for mismatched/mispositioned form tags. |
protected short |
badLayout
for bad style errors. |
protected int |
columns
at start of current token. |
protected Configuration |
configuration
configuration. |
protected int |
doctype
version as given by doctype (if any). |
protected short |
errors
count of errors. |
protected java.io.PrintWriter |
errout
error output stream. |
protected boolean |
excludeBlocks
Netscape compatibility. |
protected boolean |
exiled
true if moved out of table. |
static short |
IGNORE_MARKUP
state: ignore markup. |
static short |
IGNORE_WHITESPACE
state: ignore whitespace. |
protected StreamIn |
in
file stream. |
protected Node |
inode
Inline stack for compatibility with Mosaic. |
protected int |
insert
for inferring inline tags. |
protected boolean |
insertspace
when space is moved after end tag. |
protected java.util.Stack |
istack
stack. |
protected int |
istackbase
start of frame. |
protected boolean |
isvoyager
true if xmlns attribute on html element. |
private static short |
LEX_ASP
getToken state: asp. |
private static short |
LEX_CDATA
getToken state: cdata. |
private static short |
LEX_COMMENT
getToken state: comment. |
private static short |
LEX_CONTENT
getToken state: content. |
private static short |
LEX_DOCTYPE
getToken state: doctype. |
private static short |
LEX_ENDTAG
getToken state: endtag. |
private static short |
LEX_GT
getToken state: gt. |
private static short |
LEX_JSTE
getToken state: jste. |
private static short |
LEX_PHP
getToken state: php. |
private static short |
LEX_PROCINSTR
getToken state: procinstr. |
private static short |
LEX_SECTION
getToken state: section. |
private static short |
LEX_STARTTAG
getToken state: start tag. |
private static short |
LEX_XMLDECL
getToken state: xml declaration. |
protected byte[] |
lexbuf
Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. |
protected int |
lexlength
allocated. |
protected int |
lexsize
used. |
protected int |
lines
lines seen. |
static short |
MIXED_CONTENT
state: mixed content. |
private java.util.List |
nodeList
node list. |
static short |
PREFORMATTED
state: preformatted. |
protected boolean |
pushed
true after token has been pushed back. |
protected Report |
report
report. |
protected Node |
root
Root node is saved here. |
protected boolean |
seenEndBody
already seen end body tag? |
protected boolean |
seenEndHtml
already seen end html tag? |
protected short |
state
state of lexer's finite state machine. |
protected Style |
styles
used for cleaning up presentation markup. |
protected Node |
token
current node. |
protected int |
txtend
end of current node. |
protected int |
txtstart
start of current node. |
protected short |
versions
bit vector of HTML versions. |
private static java.lang.String |
VOYAGER_11
URI for XHTML 1.1. |
private static java.lang.String |
VOYAGER_FRAMESET
URI for XHTML 1.0 frameset DTD. |
private static java.lang.String |
VOYAGER_LOOSE
URI for XHTML 1.0 transitional DTD. |
private static java.lang.String |
VOYAGER_STRICT
URI for XHTML 1.0 strict DTD. |
private static Lexer.W3CVersionInfo[] |
W3CVERSION
lists all the known versions. |
protected short |
warnings
count of warnings in this document. |
protected boolean |
waswhite
used to collapse contiguous white space. |
private static java.lang.String |
XHTML_NAMESPACE
xhtml namespace. |
Constructor Summary | |
Lexer(StreamIn in,
Configuration configuration,
Report report)
Instantiates a new Lexer. |
Method Summary | |
void |
addByte(int c)
Adds a byte to lexer buffer. |
void |
addCharToLexer(int c)
Store char c as UTF-8 encoded byte stream. |
boolean |
addGenerator(Node root)
Add meta element for Tidy. |
void |
addStringLiteral(java.lang.String str)
Calls addCharToLexer for each char in the string. |
(package private) void |
addStringLiteralLen(java.lang.String str,
int len)
Calls addCharToLexer for each char in the string until len is reached. |
void |
addStringToLexer(java.lang.String str)
Adds a string to lexer buffer. |
short |
apparentVersion()
Return the html version used in document. |
boolean |
canPrune(Node element)
Can the given element be removed? |
void |
changeChar(byte c)
Substitute the last char in buffer. |
boolean |
checkDocTypeKeyWords(Node doctype)
Check system keywords (keywords should be uppercase). |
AttVal |
cloneAttributes(AttVal attrs)
Clones an attribute value and adds any resulting asp or php node to the node list. |
Node |
cloneNode(Node node)
Clones a node and adds it to the node list. |
(package private) void |
constrainVersion(int vers)
Constrain the html version in the document to the given one. |
void |
deferDup()
Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated. |
boolean |
endOfInput()
Has end of input stream been reached? |
short |
findGivenVersion(Node doctype)
Examine DOCTYPE to identify version. |
boolean |
fixDocType(Node root)
Fixup doctype if missing. |
void |
fixHTMLNameSpace(Node root,
java.lang.String profile)
Fix xhtml namespace. |
void |
fixId(Node node)
Duplicate name attribute as an id and check if id and name match. |
boolean |
fixXmlDecl(Node root)
Ensure XML document starts with <?xml version="1.0"?>. |
Node |
getCDATA(Node container)
Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some foo. |
Node |
getToken(short mode)
Gets a token. |
short |
htmlVersion()
Choose what version to use for new doctype. |
java.lang.String |
htmlVersionName()
Choose what version to use for new doctype. |
Node |
inferredTag(java.lang.String name)
Generates and inserts a new node. |
int |
inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. |
Node |
insertedToken()
|
static boolean |
isCSS1Selector(java.lang.String buf)
In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). |
boolean |
isPushed(Node node)
Is the node in the stack? |
static boolean |
isValidAttrName(java.lang.String attr)
Check if attr is a valid name. |
Node |
newLineNode()
Adds a new line node. |
Node |
newNode()
Creates a new node and adds it to nodelist. |
Node |
newNode(short type,
byte[] textarray,
int start,
int end)
Creates a new node and adds it to nodelist. |
Node |
newNode(short type,
byte[] textarray,
int start,
int end,
java.lang.String element)
Creates a new node and adds it to nodelist. |
(package private) Node |
newXhtmlDocTypeNode(Node root)
Put DOCTYPE declaration between the <?xml version "1.0" ... |
Node |
parseAsp()
Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value. |
java.lang.String |
parseAttribute(boolean[] isempty,
Node[] asp,
Node[] php)
consumes the '>' terminating start tags. |
AttVal |
parseAttrs(boolean[] isempty)
Parse tag attributes. |
void |
parseEntity(short mode)
Parse an html entity. |
Node |
parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>. |
int |
parseServerInstruction()
Invoked when < is seen in place of attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings. |
char |
parseTagName()
Parses a tag name. |
java.lang.String |
parseValue(java.lang.String name,
boolean foldCase,
boolean[] isempty,
int[] pdelim)
Parse an attribute value. |
void |
popInline(Node node)
Pop a copy of an inline node from the stack. |
protected boolean |
preContent(Node node)
Is content acceptable for pre elements? |
void |
pushInline(Node node)
Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. |
boolean |
setXHTMLDocType(Node root)
Adds a new xhtml doctype to the document. |
void |
ungetToken()
|
protected void |
updateNodeTextArrays(byte[] oldtextarray,
byte[] newtextarray)
Update oldtextarray in the current nodes. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static final short IGNORE_WHITESPACE
public static final short MIXED_CONTENT
public static final short PREFORMATTED
public static final short IGNORE_MARKUP
private static final java.lang.String VOYAGER_LOOSE
private static final java.lang.String VOYAGER_STRICT
private static final java.lang.String VOYAGER_FRAMESET
private static final java.lang.String VOYAGER_11
private static final java.lang.String XHTML_NAMESPACE
private static final Lexer.W3CVersionInfo[] W3CVERSION
private static final short LEX_CONTENT
private static final short LEX_GT
private static final short LEX_ENDTAG
private static final short LEX_STARTTAG
private static final short LEX_COMMENT
private static final short LEX_DOCTYPE
private static final short LEX_PROCINSTR
private static final short LEX_CDATA
private static final short LEX_SECTION
private static final short LEX_ASP
private static final short LEX_JSTE
private static final short LEX_PHP
private static final short LEX_XMLDECL
protected StreamIn in
protected java.io.PrintWriter errout
protected short badAccess
protected short badLayout
protected short badChars
protected short badForm
protected short warnings
protected short errors
protected int lines
protected int columns
protected boolean waswhite
protected boolean pushed
protected boolean insertspace
protected boolean excludeBlocks
protected boolean exiled
protected boolean isvoyager
protected short versions
protected int doctype
protected boolean badDoctype
protected int txtstart
protected int txtend
protected short state
protected Node token
protected byte[] lexbuf
protected int lexlength
protected int lexsize
protected Node inode
protected int insert
protected java.util.Stack istack
protected int istackbase
protected Style styles
protected Configuration configuration
protected boolean seenEndBody
protected boolean seenEndHtml
protected Report report
protected Node root
private java.util.List nodeList
Constructor Detail |
public Lexer(StreamIn in, Configuration configuration, Report report)
in - StreamIn
configuration - configuration instance
report - report instance, for reporting errors
Method Detail |
public Node newNode()
public Node newNode(short type, byte[] textarray, int start, int end)
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
public Node newNode(short type, byte[] textarray, int start, int end, java.lang.String element)
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
element - tag name
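The textarray, start and end parameters reflect the lexbuf description in the Field Summary: nodes do not copy their text but span onto a shared byte buffer. A minimal sketch of that idea follows. The Span class and the repoint helper are hypothetical, the exclusive end index is an assumption, and repoint only mirrors what updateNodeTextArrays is documented to do (swap the old array for the new one in existing nodes).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of org.w3c.tidy.
final class SpanNodeSketch {

    /** Stand-in for a node: it owns no text, only a window onto a shared buffer. */
    static final class Span {
        byte[] textarray;
        final int start; // first byte of the node's text
        final int end;   // assumed exclusive: one past the last byte

        Span(byte[] textarray, int start, int end) {
            this.textarray = textarray;
            this.start = start;
            this.end = end;
        }

        String text() {
            return new String(textarray, start, end - start, StandardCharsets.UTF_8);
        }
    }

    /** Repoints every span that still references the old buffer. */
    static void repoint(List<Span> spans, byte[] oldArray, byte[] newArray) {
        for (Span s : spans) {
            if (s.textarray == oldArray) {
                s.textarray = newArray;
            }
        }
    }

    public static void main(String[] args) {
        byte[] buf = "titlebody".getBytes(StandardCharsets.UTF_8);
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(buf, 0, 5));
        spans.add(new Span(buf, 5, 9));

        byte[] grown = Arrays.copyOf(buf, buf.length * 2); // simulate the buffer being reallocated
        repoint(spans, buf, grown);

        System.out.println(spans.get(0).text() + " / " + spans.get(1).text()); // title / body
    }
}
```

Because offsets stay valid when the contents are copied to a larger array, only the array reference has to change.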
public Node cloneNode(Node node)
node - Node
public AttVal cloneAttributes(AttVal attrs)
attrs - original AttVal
protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
Update oldtextarray in the current nodes.
oldtextarray - previous text array
newtextarray - new text array
public Node newLineNode()
public boolean endOfInput()
true if end of input stream has been reached
public void addByte(int c)
c - byte to add
public void changeChar(byte c)
c - new char
public void addCharToLexer(int c)
c - char to store
public void addStringToLexer(java.lang.String str)
str - String to add
public void parseEntity(short mode)
mode - mode
public char parseTagName()
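The buffer methods above (addByte, changeChar, addCharToLexer, addStringToLexer) can be pictured with a small growable byte buffer. This is a sketch under stated assumptions, not the JTidy implementation: the class LexBufferSketch, the doubling growth policy and the helper names contents and size are hypothetical. Only the idea that a char is stored as a UTF-8 byte sequence comes from the method summary.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not part of org.w3c.tidy.
final class LexBufferSketch {
    private byte[] buf = new byte[4]; // deliberately tiny so growth is exercised
    private int size;                 // bytes used so far

    /** Appends one byte, doubling the buffer when it is full (assumed policy). */
    void addByte(int c) {
        if (size == buf.length) {
            buf = Arrays.copyOf(buf, buf.length * 2);
        }
        buf[size++] = (byte) c;
    }

    /** Overwrites the last byte in the buffer, if there is one. */
    void changeChar(byte c) {
        if (size > 0) {
            buf[size - 1] = c;
        }
    }

    /** Stores code point c as a UTF-8 encoded byte sequence. */
    void addCharToLexer(int c) {
        if (c < 0x80) {
            addByte(c);
        } else if (c < 0x800) {
            addByte(0xC0 | (c >> 6));
            addByte(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            addByte(0xE0 | (c >> 12));
            addByte(0x80 | ((c >> 6) & 0x3F));
            addByte(0x80 | (c & 0x3F));
        } else {
            addByte(0xF0 | (c >> 18));
            addByte(0x80 | ((c >> 12) & 0x3F));
            addByte(0x80 | ((c >> 6) & 0x3F));
            addByte(0x80 | (c & 0x3F));
        }
    }

    /** Adds every code point of the string in turn. */
    void addStringToLexer(String str) {
        str.codePoints().forEach(this::addCharToLexer);
    }

    String contents() {
        return new String(buf, 0, size, StandardCharsets.UTF_8);
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        LexBufferSketch lexer = new LexBufferSketch();
        lexer.addStringToLexer("a\u00e9\u20ac"); // 1 + 2 + 3 bytes in UTF-8
        System.out.println(lexer.size());        // 6
    }
}
```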
public void addStringLiteral(java.lang.String str)
str - input String
void addStringLiteralLen(java.lang.String str, int len)
str - input String
len - length of the substring to be added
public short htmlVersion()
public java.lang.String htmlVersionName()
public boolean addGenerator(Node root)
root - root node
true if the tag has been added
public boolean checkDocTypeKeyWords(Node doctype)
doctype - doctype node
public short findGivenVersion(Node doctype)
doctype - doctype node
public void fixHTMLNameSpace(Node root, java.lang.String profile)
root - root Node
profile - current profile
Node newXhtmlDocTypeNode(Node root)
Put DOCTYPE declaration between the <?xml version "1.0" ... ?> declaration, if any, and the html tag. Should also work for any comments, etc. that may precede the html tag.
root - root node
public boolean setXHTMLDocType(Node root)
root - root node
true if a doctype has been added
public short apparentVersion()
public boolean fixDocType(Node root)
root - root node
false if current version has not been identified
public boolean fixXmlDecl(Node root)
Ensure XML document starts with <?xml version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
root - root node
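The rule stated for fixXmlDecl can be sketched at the string level. The real method works on the node tree rooted at root, so everything below is an assumption made for illustration: the class XmlDeclSketch, the method ensureXmlDecl, the encoding names compared against, and the simplified startsWith check.

```java
// Hypothetical sketch, not part of org.w3c.tidy.
final class XmlDeclSketch {

    /**
     * Returns the document unchanged if it already starts with an XML declaration,
     * otherwise prepends one. The encoding attribute is only written when the
     * output encoding is neither ASCII nor UTF-8.
     */
    static String ensureXmlDecl(String document, String outputEncoding) {
        if (document.startsWith("<?xml ")) { // simplified check, assumes a single space after "xml"
            return document;
        }
        StringBuilder decl = new StringBuilder("<?xml version=\"1.0\"");
        boolean isDefault = outputEncoding.equalsIgnoreCase("US-ASCII")
                || outputEncoding.equalsIgnoreCase("UTF-8");
        if (!isDefault) {
            decl.append(" encoding=\"").append(outputEncoding).append('"');
        }
        decl.append("?>\n");
        return decl + document;
    }

    public static void main(String[] args) {
        System.out.println(ensureXmlDecl("<html/>", "ISO-8859-1"));
    }
}
```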
public Node inferredTag(java.lang.String name)
name - tag name
public Node getCDATA(Node container)
container - container node
public void ungetToken()
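The class description says that ungetToken provides one level of undo, and the Field Summary lists a pushed flag that is "true after token has been pushed back". A minimal sketch of that pattern follows. The class PushbackSketch and its String tokens are hypothetical stand-ins for the lexer and its Node tokens.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not part of org.w3c.tidy.
final class PushbackSketch {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, otherwise the next token or null at end of input. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** One level of undo: the next getToken call returns the same token again. */
    void ungetToken() {
        pushed = true;
    }

    public static void main(String[] args) {
        PushbackSketch lexer = new PushbackSketch(Arrays.asList("a", "b"));
        String first = lexer.getToken();
        lexer.ungetToken();
        System.out.println(first + " " + lexer.getToken() + " " + lexer.getToken()); // a a b
    }
}
```

Calling ungetToken twice in a row has no additional effect in this sketch, which is what "one level" implies.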
public Node getToken(short mode)
mode - one of the following:
  MixedContent -- for elements which don't accept PCDATA
  Preformatted -- white space preserved as is
  IgnoreMarkup -- for CDATA elements such as script, style
public Node parseAsp()
href='<%=rsSchool.Fields("ID").Value%>'
where the ASP that generates the attribute value is masked from Tidy by the quote marks.
public Node parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
public java.lang.String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
isempty - flag is passed as array so it can be modified
asp - asp Node, passed as array so it can be modified
php - php Node, passed as array so it can be modified
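The parameters above are arrays because Java passes references by value, so a one-element array is a simple way for a method to hand back extra results alongside its return value. The sketch below shows only that calling convention. The class OutParamSketch and the method parseName are hypothetical and far simpler than parseAttribute.

```java
// Hypothetical sketch, not part of org.w3c.tidy.
final class OutParamSketch {

    /**
     * Returns the leading attribute name of tagText and reports through
     * isempty[0] whether the text ends with "/>" (an empty tag).
     */
    static String parseName(String tagText, boolean[] isempty) {
        isempty[0] = tagText.endsWith("/>"); // written through the array, visible to the caller
        int end = 0;
        while (end < tagText.length() && Character.isLetterOrDigit(tagText.charAt(end))) {
            end++;
        }
        return tagText.substring(0, end);
    }

    public static void main(String[] args) {
        boolean[] isempty = new boolean[1]; // caller allocates the slot for the flag
        String name = parseName("checked/>", isempty);
        System.out.println(name + " " + isempty[0]); // checked true
    }
}
```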
public int parseServerInstruction()
public java.lang.String parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)
name - attribute name
foldCase - fold case?
isempty - is attribute empty? Passed as an array reference to allow modification
pdelim - delimiter, passed as an array reference to allow modification
public static boolean isValidAttrName(java.lang.String attr)
attr - String to check, must be non-null
true if attr is a valid name.
public static boolean isCSS1Selector(java.lang.String buf)
buf - css selector name
true if the given string is a valid css1 selector name
public AttVal parseAttrs(boolean[] isempty)
isempty - is tag empty?
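public void pushInline(Node node)
Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance:
<p><em> text <p><em> more text
shouldn't be mapped to
<p><em> text </em></p><p><em><em> more text </em></em>
node - Node to be pushed
public void popInline(Node node)
node - Node to be popped
public boolean isPushed(Node node)
node - Node
true if the node is found in the stack
public int inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. For example:
<i><h1>italic heading</h1></i>
is then treated as equivalent to
<h1><i>italic heading</i></h1>
This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
node - original node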
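The inline stack methods above can be sketched as a token source that replays the open inline tags before it continues with the input. This is a simplified model, not the JTidy implementation: the class InlineStackSketch is hypothetical, tokens are plain strings, popInline removes by name, and the meaning given to insert (index of the next entry to replay) is an assumption based only on its "for inferring inline tags" description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not part of org.w3c.tidy.
final class InlineStackSketch {
    private final List<String> istack = new ArrayList<>(); // open inline tags, oldest first
    private final Deque<String> input;                      // tokens still to be read
    private int insert = -1;                                // next stack entry to replay, -1 = read input

    InlineStackSketch(List<String> tokens) {
        this.input = new ArrayDeque<>(tokens);
    }

    boolean isPushed(String tag) {
        return istack.contains(tag);
    }

    /** Pushes a tag unless it is already on the stack, which avoids doubled inlines. */
    void pushInline(String tag) {
        if (!isPushed(tag)) {
            istack.add(tag);
        }
    }

    void popInline(String tag) {
        istack.remove(tag);
    }

    /** Switches to replay mode and returns how many inline tags will be duplicated. */
    int inlineDup() {
        if (istack.isEmpty()) {
            return 0;
        }
        insert = 0;
        return istack.size();
    }

    /** In replay mode tokens come from the inline stack, otherwise from the input. */
    String getToken() {
        if (insert >= 0) {
            String tag = istack.get(insert++);
            if (insert >= istack.size()) {
                insert = -1; // replay finished, go back to the input
            }
            return tag;
        }
        return input.poll();
    }

    public static void main(String[] args) {
        InlineStackSketch lexer =
                new InlineStackSketch(Arrays.asList("<h1>", "italic heading", "</h1>"));
        lexer.pushInline("<i>");              // <i> was opened before the heading
        System.out.print(lexer.getToken());   // <h1>
        lexer.inlineDup();                    // entering the block: replay open inlines
        System.out.print(lexer.getToken());   // <i>
        System.out.println(lexer.getToken()); // italic heading
    }
}
```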
public Node insertedToken()
public boolean canPrune(Node element)
element - node
true if the element can be removed
public void fixId(Node node)
node - Node to check for name/id attributes
public void deferDup()
void constrainVersion(int vers)
vers - html version code
protected boolean preContent(Node node)
node - content
true if node is acceptable in pre elements