org.htmlparser.beans
public class StringBean extends NodeVisitor implements Serializable
Text within <SCRIPT></SCRIPT> tags is removed.
The text within <PRE></PRE> tags is not altered.
The Strings property, which is the output property, is null until a URL is set. A typical usage is:
    StringBean sb = new StringBean ();
    sb.setLinks (false);
    sb.setReplaceNonBreakingSpaces (true);
    sb.setCollapse (true);
    sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
    String s = sb.getStrings ();

You can also use the StringBean as a NodeVisitor on your own parser, in which case you have to refetch your page if you change one of the properties because it resets the Strings property:
    StringBean sb = new StringBean ();
    Parser parser = new Parser ("http://cbc.ca");
    parser.visitAllNodesWith (sb);
    String s = sb.getStrings ();
    // changing a property resets Strings, so rewind the parser and visit again
    sb.setLinks (true);
    parser.reset ();
    parser.visitAllNodesWith (sb);
    String sl = sb.getStrings ();

According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, either because you already have it or because it's not on a website.
Field Summary | |
---|---|
protected StringBuffer | mBuffer
The buffer text is stored in while traversing the HTML. |
protected boolean | mCollapse
If true sequences of whitespace characters are replaced
with a single space character. |
protected int | mCollapseState
The state of the collapse processing state machine. |
protected boolean | mIsPre
Set true when traversing a PRE tag. |
protected boolean | mIsScript
Set true when traversing a SCRIPT tag. |
protected boolean | mIsStyle
Set true when traversing a STYLE tag. |
protected boolean | mLinks
If true the link URLs are embedded in the text output. |
protected Parser | mParser
The parser used to extract strings. |
protected PropertyChangeSupport | mPropertySupport
Bound property support. |
protected boolean | mReplaceSpace
If true regular space characters are substituted for
non-breaking spaces in the text output. |
protected String | mStrings
The strings extracted from the URL. |
static String | PROP_COLLAPSE_PROPERTY
Property name in event where the 'collapse whitespace' state changes. |
static String | PROP_CONNECTION_PROPERTY
Property name in event where the connection changes. |
static String | PROP_LINKS_PROPERTY
Property name in event where the 'embed links' state changes. |
static String | PROP_REPLACE_SPACE_PROPERTY
Property name in event where the 'replace non-breaking spaces'
state changes. |
static String | PROP_STRINGS_PROPERTY
Property name in event where the URL contents changes. |
static String | PROP_URL_PROPERTY
Property name in event where the URL changes. |
Constructor Summary | |
---|---|
StringBean()
Create a StringBean object.
|
Method Summary | |
---|---|
void | addPropertyChangeListener(PropertyChangeListener listener)
Add a PropertyChangeListener to the listener list.
|
protected void | carriageReturn()
Appends a newline to the buffer if there isn't one there already.
|
protected void | collapse(StringBuffer buffer, String string)
Add the given text collapsing whitespace.
|
protected String | extractStrings()
Extract the text from a page. |
boolean | getCollapse()
Get the current 'collapse whitespace' state.
|
URLConnection | getConnection()
Get the current connection. |
boolean | getLinks()
Get the current 'include links' state. |
boolean | getReplaceNonBreakingSpaces()
Get the current 'replace non breaking spaces' state. |
String | getStrings()
Return the textual contents of the URL.
|
String | getURL()
Get the current URL. |
static void | main(String[] args)
Unit test. |
void | removePropertyChangeListener(PropertyChangeListener listener)
Remove a PropertyChangeListener from the listener list.
|
void | setCollapse(boolean collapse)
Set the current 'collapse whitespace' state.
|
void | setConnection(URLConnection connection)
Set the parser's connection.
|
void | setLinks(boolean links)
Set the 'include links' state.
|
void | setReplaceNonBreakingSpaces(boolean replace)
Set the 'replace non breaking spaces' state.
|
protected void | setStrings()
Fetch the URL contents.
|
void | setURL(String url)
Set the URL to extract strings from.
|
protected void | updateStrings(String strings)
Assign the Strings property, firing the property change. |
void | visitEndTag(Tag tag)
Resets the state of the PRE and SCRIPT flags. |
void | visitStringNode(Text string)
Appends the text to the output. |
void | visitTag(Tag tag)
Appends a NEWLINE to the output if the tag breaks flow, and
possibly sets the state of the PRE and SCRIPT flags. |
If true, sequences of whitespace characters are replaced with a single space character.
Set true when traversing a PRE tag.
Set true when traversing a SCRIPT tag.
Set true when traversing a STYLE tag.
If true, the link URLs are embedded in the text output.
If true, regular space characters are substituted for non-breaking spaces in the text output.
Links is set false, so text appears like a browser would display it, albeit without the colour or underline clues normally associated with a link.
ReplaceNonBreakingSpaces is set true, so that printing the text works, but the extra information regarding these formatting marks is available if you set it false.
Collapse is set true, so text appears compact like a browser would display it.
Parameters: listener The PropertyChangeListener to be added.
state 0: whitespace was last emitted character
state 1: in whitespace
state 2: in word
A whitespace character moves us to state 1 and any other character moves us to state 2, except that state 0 stays in state 0 until a non-whitespace character arrives, and on going from whitespace to word we emit a space before the character:
state\next input | whitespace | other-character
---|---|---
0 | 0 | 2
1 | 1 | space then 2
2 | 1 | 2
Parameters: buffer The buffer to append to. string The string to append.
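The collapse state machine described above can be sketched in a few lines of Java. This is an illustrative reimplementation, not StringBean's source: the class name CollapseSketch, the state constant names and the collapseAll helper are assumptions made for the example, as is starting in state 0 so that leading whitespace is dropped. The whitespace set is the one listed for getCollapse().

```java
final class CollapseSketch {
    // The three states of the collapse state machine (names are illustrative).
    static final int AFTER_WHITESPACE = 0; // treated as if whitespace was the last emitted character
    static final int IN_WHITESPACE = 1;    // pending whitespace, not yet emitted
    static final int IN_WORD = 2;          // last character was non-whitespace

    // Assumption: a fresh machine starts in state 0, so leading whitespace is dropped.
    private int state = AFTER_WHITESPACE;

    // Space, tab, form feed, zero-width space, carriage return and newline.
    static boolean isHtmlWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\f' || c == '\u200B' || c == '\r' || c == '\n';
    }

    // Append text to buffer, emitting one space for each run of whitespace between words.
    void collapse(StringBuilder buffer, String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isHtmlWhitespace(c)) {
                // State 0 stays in state 0; states 1 and 2 move to state 1.
                if (state != AFTER_WHITESPACE) {
                    state = IN_WHITESPACE;
                }
            } else {
                // Going from whitespace to word emits the single separating space.
                if (state == IN_WHITESPACE) {
                    buffer.append(' ');
                }
                state = IN_WORD;
                buffer.append(c);
            }
        }
    }

    // Convenience helper for the example: collapse one string with a fresh machine.
    static String collapseAll(String text) {
        StringBuilder buffer = new StringBuilder();
        new CollapseSketch().collapse(buffer, text);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapseAll("  Hello \t\n  world  ") + "]"); // prints [Hello world]
    }
}
```

Because the state lives in the object rather than in the method, a run of whitespace that spans two successive calls to collapse is still reduced to one space in this sketch.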
Returns: The textual contents of the page.
Throws: ParserException If a parse error occurs.
If true, this emulates the operation of browsers in interpreting text, where user agents should collapse input white space sequences when producing output inter-word space. See HTML specification section 9.1, White space: http://www.w3.org/TR/html4/struct/text.html#h-9.1.
Returns: true
if sequences of whitespace (space '\u0020',
tab '\u0009', form feed '\u000C', zero-width space '\u200B',
carriage-return '\r' and NEWLINE '\n') are to be replaced with a single
space.
Returns: The connection that the parser has or null
if it
hasn't been set or the parser hasn't been constructed yet.
Returns: true
if link text is included in the text extracted
from the URL, false
otherwise.
Returns: true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').
Returns: The user visible (what would be seen in a browser) text.
Returns: The URL from which text has been extracted, or null
if this property has not been set yet.
Parameters: args Pass args[0] as the URL to process.
Parameters: listener The PropertyChangeListener to be removed.
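For readers unfamiliar with bound properties, the listener mechanics can be sketched with the standard java.beans classes. This is a minimal illustration, not StringBean's code: the class BoundPropertySketch, the property name "links" and the demo helper are hypothetical, and only the general PropertyChangeSupport pattern is shown.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.util.ArrayList;
import java.util.List;

final class BoundPropertySketch {
    // Hypothetical property name used in the fired events.
    static final String PROP_LINKS = "links";

    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private boolean links;

    void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener);
    }

    void removePropertyChangeListener(PropertyChangeListener listener) {
        support.removePropertyChangeListener(listener);
    }

    // A bound setter: store the value, then notify listeners with old and new values.
    void setLinks(boolean newValue) {
        boolean oldValue = links;
        links = newValue;
        // No event is delivered when oldValue equals newValue.
        support.firePropertyChange(PROP_LINKS, oldValue, newValue);
    }

    // Registers a listener, changes the property, unregisters, changes it again.
    static List<String> demo() {
        BoundPropertySketch bean = new BoundPropertySketch();
        List<String> seen = new ArrayList<>();
        PropertyChangeListener listener =
                event -> seen.add(event.getPropertyName() + "=" + event.getNewValue());
        bean.addPropertyChangeListener(listener);
        bean.setLinks(true);   // delivered to the listener
        bean.removePropertyChangeListener(listener);
        bean.setLinks(false);  // not delivered, the listener was removed
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [links=true]
    }
}
```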
setCollapse (getCollapse ());
Parameters: collapse If true, sequences of whitespace will be reduced to a single space.
Parameters: connection New value of property Connection.
Parameters: links Use true
if link text is to be included in the
text extracted from the URL, false
otherwise.
Parameters: replace true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').
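The substitution itself is a one-character replacement. The sketch below is not StringBean's code: the class and method names are hypothetical, and it assumes the text has already been decoded, so that a numeric character reference or entity reference for a non-breaking space has become the character U+00A0.

```java
final class NbspSketch {
    // The non-breaking space character.
    static final char NBSP = '\u00a0';

    // Replace every non-breaking space in already-decoded text with a regular space.
    static String replaceNonBreakingSpaces(String text) {
        return text.replace(NBSP, ' ');
    }

    public static void main(String[] args) {
        System.out.println(replaceNonBreakingSpaces("10\u00a0km")); // prints 10 km
    }
}
```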
Parameters: url The URL that text should be fetched from.
Assign the Strings property, firing the property change.
Parameters: strings The new value of the Strings property.
Parameters: tag The end tag to process.
Parameters: string The text node.
Parameters: tag The tag to examine.
HTML Parser is an open source library released under LGPL.