org.htmlparser.beans

Class StringBean

public class StringBean extends NodeVisitor implements Serializable

Extract strings from a URL.

Text within <SCRIPT></SCRIPT> tags is removed.

The text within <PRE></PRE> tags is not altered.

The Strings property, which is the output property, is null until a URL is set. A typical usage is:

     StringBean sb = new StringBean ();
     sb.setLinks (false);
     sb.setReplaceNonBreakingSpaces (true);
     sb.setCollapse (true);
     sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
     String s = sb.getStrings ();
 
You can also use the StringBean as a NodeVisitor on your own parser. In that case you have to refetch your page if you change one of the properties, because doing so resets the Strings property:

     StringBean sb = new StringBean ();
     Parser parser = new Parser ("http://cbc.ca");
     parser.visitAllNodesWith (sb);
     String s = sb.getStrings ();
     sb.setLinks (true);
     parser.reset ();
     parser.visitAllNodesWith (sb);
     String sl = sb.getStrings ();
 
According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, either because you already have it or because it's not on a website.
Field Summary
protected StringBuffer mBuffer
The buffer text is stored in while traversing the HTML.
protected boolean mCollapse
If true sequences of whitespace characters are replaced with a single space character.
protected int mCollapseState
The state of the collapse processing state machine.
protected boolean mIsPre
Set true when traversing a PRE tag.
protected boolean mIsScript
Set true when traversing a SCRIPT tag.
protected boolean mIsStyle
Set true when traversing a STYLE tag.
protected boolean mLinks
If true the link URLs are embedded in the text output.
protected Parser mParser
The parser used to extract strings.
protected PropertyChangeSupport mPropertySupport
Bound property support.
protected boolean mReplaceSpace
If true regular space characters are substituted for non-breaking spaces in the text output.
protected String mStrings
The strings extracted from the URL.
static String PROP_COLLAPSE_PROPERTY
Property name in event where the 'collapse whitespace' state changes.
static String PROP_CONNECTION_PROPERTY
Property name in event where the connection changes.
static String PROP_LINKS_PROPERTY
Property name in event where the 'embed links' state changes.
static String PROP_REPLACE_SPACE_PROPERTY
Property name in event where the 'replace non-breaking spaces' state changes.
static String PROP_STRINGS_PROPERTY
Property name in event where the URL contents changes.
static String PROP_URL_PROPERTY
Property name in event where the URL changes.
Constructor Summary
StringBean()
Create a StringBean object.
Method Summary
void addPropertyChangeListener(PropertyChangeListener listener)
Add a PropertyChangeListener to the listener list.
protected void carriageReturn()
Appends a newline to the buffer if there isn't one there already.
protected void collapse(StringBuffer buffer, String string)
Add the given text collapsing whitespace.
protected String extractStrings()
Extract the text from a page.
boolean getCollapse()
Get the current 'collapse whitespace' state.
URLConnection getConnection()
Get the current connection.
boolean getLinks()
Get the current 'include links' state.
boolean getReplaceNonBreakingSpaces()
Get the current 'replace non breaking spaces' state.
String getStrings()
Return the textual contents of the URL.
String getURL()
Get the current URL.
static void main(String[] args)
Unit test.
void removePropertyChangeListener(PropertyChangeListener listener)
Remove a PropertyChangeListener from the listener list.
void setCollapse(boolean collapse)
Set the current 'collapse whitespace' state.
void setConnection(URLConnection connection)
Set the parser's connection.
void setLinks(boolean links)
Set the 'include links' state.
void setReplaceNonBreakingSpaces(boolean replace)
Set the 'replace non breaking spaces' state.
protected void setStrings()
Fetch the URL contents.
void setURL(String url)
Set the URL to extract strings from.
protected void updateStrings(String strings)
Assign the Strings property, firing the property change.
void visitEndTag(Tag tag)
Resets the state of the PRE and SCRIPT flags.
void visitStringNode(Text string)
Appends the text to the output.
void visitTag(Tag tag)
Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.

Field Detail

mBuffer

protected StringBuffer mBuffer
The buffer text is stored in while traversing the HTML.

mCollapse

protected boolean mCollapse
If true sequences of whitespace characters are replaced with a single space character.

mCollapseState

protected int mCollapseState
The state of the collapse processing state machine.

mIsPre

protected boolean mIsPre
Set true when traversing a PRE tag.

mIsScript

protected boolean mIsScript
Set true when traversing a SCRIPT tag.

mIsStyle

protected boolean mIsStyle
Set true when traversing a STYLE tag.

mLinks

protected boolean mLinks
If true the link URLs are embedded in the text output.

mParser

protected Parser mParser
The parser used to extract strings.

mPropertySupport

protected PropertyChangeSupport mPropertySupport
Bound property support.

mReplaceSpace

protected boolean mReplaceSpace
If true regular space characters are substituted for non-breaking spaces in the text output.

mStrings

protected String mStrings
The strings extracted from the URL.

PROP_COLLAPSE_PROPERTY

public static final String PROP_COLLAPSE_PROPERTY
Property name in event where the 'collapse whitespace' state changes.

PROP_CONNECTION_PROPERTY

public static final String PROP_CONNECTION_PROPERTY
Property name in event where the connection changes.

PROP_LINKS_PROPERTY

public static final String PROP_LINKS_PROPERTY
Property name in event where the 'embed links' state changes.

PROP_REPLACE_SPACE_PROPERTY

public static final String PROP_REPLACE_SPACE_PROPERTY
Property name in event where the 'replace non-breaking spaces' state changes.

PROP_STRINGS_PROPERTY

public static final String PROP_STRINGS_PROPERTY
Property name in event where the URL contents changes.

PROP_URL_PROPERTY

public static final String PROP_URL_PROPERTY
Property name in event where the URL changes.

Constructor Detail

StringBean

public StringBean()
Create a StringBean object. Default property values are set to 'do the right thing':

Links is set false so text appears like a browser would display it, albeit without the colour or underline clues normally associated with a link.

ReplaceNonBreakingSpaces is set true, so that printing the text works, but the extra information regarding these formatting marks is available if you set it false.

Collapse is set true, so text appears compact like a browser would display it.

Method Detail

addPropertyChangeListener

public void addPropertyChangeListener(PropertyChangeListener listener)
Add a PropertyChangeListener to the listener list. The listener is registered for all properties.

Parameters: listener The PropertyChangeListener to be added.
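
The bound-property mechanism behind this method can be illustrated with the standard java.beans classes. The following is a minimal sketch, not StringBean's actual source: the class BoundPropertySketch, the property name "links" and the demo method are hypothetical, chosen only to show a listener being registered for all properties, notified on a change, and then removed.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bean showing bound-property support; not the StringBean source.
final class BoundPropertySketch {
    // Assumed property name, for illustration only.
    static final String PROP_LINKS = "links";

    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private boolean links;

    void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener);
    }

    void removePropertyChangeListener(PropertyChangeListener listener) {
        support.removePropertyChangeListener(listener);
    }

    void setLinks(boolean value) {
        boolean old = links;
        links = value;
        // PropertyChangeSupport fires nothing when old and new values are equal.
        support.firePropertyChange(PROP_LINKS, old, value);
    }

    // Records every event a listener sees while it is registered.
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        BoundPropertySketch bean = new BoundPropertySketch();
        PropertyChangeListener listener =
            event -> seen.add(event.getPropertyName() + "=" + event.getNewValue());
        bean.addPropertyChangeListener(listener);
        bean.setLinks(true);   // change: one event
        bean.setLinks(true);   // no change: no event
        bean.removePropertyChangeListener(listener);
        bean.setLinks(false);  // listener already removed: not recorded
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a real StringBean, the same listener type would presumably be passed to addPropertyChangeListener and would filter on the PROP_* names listed in the field summary.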

carriageReturn

protected void carriageReturn()
Appends a newline to the buffer if there isn't one there already, unless the buffer is empty.
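
The rule described here can be sketched as follows. This is an illustration under assumptions, not the library's code: the class name CarriageReturnSketch is hypothetical, the buffer is passed in as a StringBuilder rather than held in a field, and '\n' is assumed as the newline (the real method may use a platform line separator).

```java
// Hypothetical helper illustrating the "newline unless already present or empty" rule.
final class CarriageReturnSketch {
    static void carriageReturn(StringBuilder buffer) {
        int length = buffer.length();
        // Skip when the buffer is empty or already ends with a newline.
        if (length != 0 && buffer.charAt(length - 1) != '\n') {
            buffer.append('\n');
        }
    }
}
```

Calling the helper twice in a row therefore leaves a single newline, which keeps consecutive flow-breaking tags from producing runs of blank lines.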

collapse

protected void collapse(StringBuffer buffer, String string)
Add the given text collapsing whitespace. Use a little finite state machine:
 state 0: whitespace was last emitted character
 state 1: in whitespace
 state 2: in word
 A whitespace character moves us to state 1 and any other character
 moves us to state 2, except that state 0 stays in state 0 until
 a non-whitespace character is seen, and when going from whitespace
 to word we emit a space before the character:
    input:     whitespace   other-character
 state\next
    0               0             2
    1               1        space then 2
    2               1             2
 

Parameters: buffer The buffer to append to. string The string to append.
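
The transition table above can be written out as a small standalone class. This is a sketch of the described state machine rather than the library's implementation: the names CollapseSketch, isCollapsible and collapseAll are hypothetical, a StringBuilder stands in for the StringBuffer in the signature, and the whitespace set is taken from the list given under getCollapse.

```java
// Hypothetical re-creation of the three-state whitespace collapser described above.
final class CollapseSketch {
    // 0: whitespace was last emitted, 1: in whitespace, 2: in word
    private int state = 0;

    // Characters treated as collapsible whitespace (assumed from the getCollapse description).
    static boolean isCollapsible(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\f'
            || ch == '\u200B' || ch == '\r' || ch == '\n';
    }

    // Appends text to buffer, carrying the state across calls.
    void collapse(StringBuilder buffer, String text) {
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (isCollapsible(ch)) {
                if (state != 0) {
                    state = 1;          // states 1 and 2 move to 1; state 0 stays put
                }
            } else {
                if (state == 1) {
                    buffer.append(' '); // whitespace to word: emit one space first
                }
                state = 2;
                buffer.append(ch);
            }
        }
    }

    // Convenience wrapper that collapses one string from a fresh state.
    static String collapseAll(String text) {
        StringBuilder buffer = new StringBuilder();
        new CollapseSketch().collapse(buffer, text);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseAll("  Hello \t\n world  ")); // prints "Hello world"
    }
}
```

Because a pending run of whitespace is only emitted when a word character follows, leading and trailing whitespace disappear, and the state survives between calls so that text split across several nodes is still joined by a single space.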

extractStrings

protected String extractStrings()
Extract the text from a page.

Returns: The textual contents of the page.

Throws: ParserException If a parse error occurs.

getCollapse

public boolean getCollapse()
Get the current 'collapse whitespace' state. If set to true this emulates the operation of browsers in interpreting text, where user agents should collapse input white space sequences when producing output inter-word space. See HTML specification section 9.1 White space http://www.w3.org/TR/html4/struct/text.html#h-9.1.

Returns: true if sequences of whitespace (space '\u0020', tab '\u0009', form feed '\u000C', zero-width space '\u200B', carriage-return '\r' and NEWLINE '\n') are to be replaced with a single space.

getConnection

public URLConnection getConnection()
Get the current connection.

Returns: The connection that the parser has or null if it hasn't been set or the parser hasn't been constructed yet.

getLinks

public boolean getLinks()
Get the current 'include links' state.

Returns: true if link text is included in the text extracted from the URL, false otherwise.

getReplaceNonBreakingSpaces

public boolean getReplaceNonBreakingSpaces()
Get the current 'replace non breaking spaces' state.

Returns: true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').
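
The substitution this property controls amounts to a single character replacement. The sketch below assumes that the references &#160; and &nbsp; have already been decoded to the character '\u00a0' before the replacement runs; the class and method names are hypothetical and this is not the library's code.

```java
// Hypothetical helper: swap non-breaking spaces for ordinary spaces.
final class NbspSketch {
    static String replaceNonBreakingSpaces(String text) {
        // Only the non-breaking space character is touched; all other text is kept.
        return text.replace('\u00a0', ' ');
    }
}
```

Leaving the property false would correspond to skipping this step, so the non-breaking spaces remain visible to code that wants to distinguish them.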

getStrings

public String getStrings()
Return the textual contents of the URL. This is the primary output of the bean.

Returns: The user visible (what would be seen in a browser) text.

getURL

public String getURL()
Get the current URL.

Returns: The URL from which text has been extracted, or null if this property has not been set yet.

main

public static void main(String[] args)
Unit test.

Parameters: args Pass args[0] as the URL to process.

removePropertyChangeListener

public void removePropertyChangeListener(PropertyChangeListener listener)
Remove a PropertyChangeListener from the listener list. This removes a registered PropertyChangeListener.

Parameters: listener The PropertyChangeListener to be removed.

setCollapse

public void setCollapse(boolean collapse)
Set the current 'collapse whitespace' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive. The internal state of the collapse state machine can be reset with code like this: setCollapse (getCollapse ());

Parameters: collapse If true, sequences of whitespace will be reduced to a single space.
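
One plausible reading of the behaviour described above is a setter that always resets the state machine but only refetches when the value actually changes. The sketch below illustrates that reading only; it is an assumption, not the StringBean source, and the class CollapseSetterSketch with its refetchCount counter (standing in for reacquiring the text) is hypothetical.

```java
// Hypothetical setter pattern: unconditional state reset, refetch only on change.
final class CollapseSetterSketch {
    private boolean collapse = true;
    private int collapseState = 2; // pretend we stopped in the middle of a word
    private int refetchCount = 0;  // stands in for re-extracting the page text

    boolean getCollapse() {
        return collapse;
    }

    int getCollapseState() {
        return collapseState;
    }

    int getRefetchCount() {
        return refetchCount;
    }

    void setCollapse(boolean value) {
        collapseState = 0;         // reset regardless of the new value
        if (value != collapse) {
            collapse = value;
            refetchCount++;        // the possibly expensive step
        }
    }
}
```

Under this reading, setCollapse (getCollapse ()) is a cheap way to reset the state machine, while an actual change of value pays the cost of reacquiring the text.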

setConnection

public void setConnection(URLConnection connection)
Set the parser's connection. The text from the URL will be fetched, which may be expensive, so this property should be set last.

Parameters: connection New value of property Connection.

setLinks

public void setLinks(boolean links)
Set the 'include links' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive.

Parameters: links Use true if link text is to be included in the text extracted from the URL, false otherwise.

setReplaceNonBreakingSpaces

public void setReplaceNonBreakingSpaces(boolean replace)
Set the 'replace non breaking spaces' state. If the setting is changed after the URL has been set, the text from the URL will be reacquired, which is possibly expensive.

Parameters: replace true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').

setStrings

protected void setStrings()
Fetch the URL contents. Only do work if there is a valid parser with its URL set.

setURL

public void setURL(String url)
Set the URL to extract strings from. The text from the URL will be fetched, which may be expensive, so this property should be set last.

Parameters: url The URL that text should be fetched from.

updateStrings

protected void updateStrings(String strings)
Assign the Strings property, firing the property change.

Parameters: strings The new value of the Strings property.

visitEndTag

public void visitEndTag(Tag tag)
Resets the state of the PRE and SCRIPT flags.

Parameters: tag The end tag to process.

visitStringNode

public void visitStringNode(Text string)
Appends the text to the output.

Parameters: string The text node.

visitTag

public void visitTag(Tag tag)
Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.

Parameters: tag The tag to examine.

HTML Parser is an open source library released under LGPL.