org.htmlparser.parserapplications

Class SiteCapturer

public class SiteCapturer extends Object

Save a web site locally. Illustrative program to save a web site's contents locally. It was created to demonstrate URL rewriting in its simplest form. It uses customized tags in the NodeFactory to alter the URLs. This program has a number of limitations:
Field Summary
protected boolean mCaptureResources
If true, save resources locally too; otherwise, leave resource links pointing to the original page.
protected HashSet mCopied
The set of resources already copied.
protected NodeFilter mFilter
The filter to apply to the nodes retrieved.
protected HashSet mFinished
The set of pages already captured.
protected ArrayList mImages
The list of resources to copy.
protected ArrayList mPages
The list of pages to capture.
protected Parser mParser
The parser to use for processing.
protected String mSource
The web site to capture.
protected String mTarget
The local directory to capture to.
protected int TRANSFER_SIZE
Copy buffer size.
Constructor Summary
SiteCapturer()
Create a web site capturer.
Method Summary
void capture()
Perform the capture.
protected void copy()
Copy a resource (image) locally.
protected String decode(String raw)
Unescape a URL to form a file name.
boolean getCaptureResources()
Getter for property captureResources.
NodeFilter getFilter()
Getter for property filter.
String getSource()
Getter for property source.
String getTarget()
Getter for property target.
protected boolean isHtml(String link)
Returns true if the link contains text/html content.
protected boolean isToBeCaptured(String link)
Returns true if the link is one we are interested in.
static void main(String[] args)
Mainline to capture a web site locally.
protected String makeLocalLink(String link, String current)
Converts a link to local.
protected void process(NodeFilter filter)
Process a single page.
void setCaptureResources(boolean capture)
Setter for property captureResources.
void setFilter(NodeFilter filter)
Setter for property filter.
void setSource(String source)
Setter for property source.
void setTarget(String target)
Setter for property target.

Field Detail

mCaptureResources

protected boolean mCaptureResources
If true, save resources locally too; otherwise, leave resource links pointing to the original page.

mCopied

protected HashSet mCopied
The set of resources already copied. Used to avoid repeated acquisition of the same images and other resources.

mFilter

protected NodeFilter mFilter
The filter to apply to the nodes retrieved.

mFinished

protected HashSet mFinished
The set of pages already captured. Used to avoid repeated acquisition of the same page.

mImages

protected ArrayList mImages
The list of resources to copy. Images and other resources are added to this list as they are discovered.

mPages

protected ArrayList mPages
The list of pages to capture. Links are added to this list as they are discovered and removed in sequential order (a FIFO queue), leading to a breadth-first traversal of the web site space.
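
The queue discipline described for mPages can be illustrated with a minimal sketch. The class BreadthFirstSketch, the crawl method and the links map are hypothetical stand-ins, not part of SiteCapturer; the map only simulates the links discovered on each page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the SiteCapturer source: a FIFO work list plus a
// "finished" set yields a breadth-first walk that visits each page once.
class BreadthFirstSketch {
    // links simulates "the links found on each page" (an assumption for the demo).
    static List<String> crawl(String start, Map<String, List<String>> links) {
        ArrayList<String> pages = new ArrayList<>();   // plays the role of mPages
        HashSet<String> finished = new HashSet<>();    // plays the role of mFinished
        List<String> order = new ArrayList<>();
        pages.add(start);
        while (!pages.isEmpty()) {
            String page = pages.remove(0);             // take from the front: FIFO
            if (!finished.add(page)) {
                continue;                              // already captured
            }
            order.add(page);
            List<String> found = links.getOrDefault(page, Collections.<String>emptyList());
            for (String link : found) {
                // queue only pages that are neither captured nor already waiting
                if (!finished.contains(link) && !pages.contains(link)) {
                    pages.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d", "a"));
        links.put("c", Arrays.asList("d"));
        System.out.println(crawl("a", links)); // prints [a, b, c, d]
    }
}
```

Removing from the front of the list is what makes the traversal breadth first; removing from the end instead would turn it into a depth-first walk.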

mParser

protected Parser mParser
The parser to use for processing.

mSource

protected String mSource
The web site to capture. This is used as the base URL in deciding whether to adjust a link and whether to capture a page or not.

mTarget

protected String mTarget
The local directory to capture to. This is used as a base prefix for files saved locally.

TRANSFER_SIZE

protected final int TRANSFER_SIZE
Copy buffer size. Resources are moved to disk in chunks this size or less.
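
Copying in chunks of at most TRANSFER_SIZE bytes might look like the following sketch. The class ChunkedCopySketch and the value 4096 are assumptions made for illustration; this page does not state the actual buffer size or show the copy loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch of a chunked stream copy, not the SiteCapturer source.
class ChunkedCopySketch {
    // Assumed value; the documentation does not give the real one.
    static final int TRANSFER_SIZE = 4096;

    // Copies everything from in to out and returns the number of bytes moved.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[TRANSFER_SIZE];
        long total = 0;
        int read;
        // read() may return fewer bytes than requested, so each chunk
        // is "this size or less"; -1 signals end of stream.
        while ((read = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10000];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out);
        System.out.println(copied); // prints 10000
    }
}
```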

Constructor Detail

SiteCapturer

public SiteCapturer()
Create a web site capturer.

Method Detail

capture

public void capture()
Perform the capture.

copy

protected void copy()
Copy a resource (image) locally. Removes one element from the 'to be copied' list and saves the resource it points to locally as a file.

decode

protected String decode(String raw)
Unescape a URL to form a file name. Very crude.

Parameters: raw - The escaped URI.

Returns: The native URI.
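
One way to unescape a URL into a file name is sketched below using the standard java.net.URLDecoder. This is an assumption about a possible approach, not the actual decode implementation, and the class name DecodeSketch is hypothetical. Note that URLDecoder follows form-encoding rules, so it also turns '+' into a space, which a strict URI unescape would not do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical sketch, not the SiteCapturer source.
class DecodeSketch {
    // Replaces %XX escapes with the characters they stand for (UTF-8 assumed).
    static String decode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("images/my%20logo.png")); // prints images/my logo.png
    }
}
```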

getCaptureResources

public boolean getCaptureResources()
Getter for property captureResources. If true, the images and other resources referenced by the site and within the base URL tree are also copied locally to the target directory. If false, the image links are left 'as is', still referring to the original site.

Returns: Value of property captureResources.

getFilter

public NodeFilter getFilter()
Getter for property filter.

Returns: Value of property filter.

getSource

public String getSource()
Getter for property source.

Returns: Value of property source.

getTarget

public String getTarget()
Getter for property target.

Returns: Value of property target.

isHtml

protected boolean isHtml(String link)
Returns true if the link contains text/html content.

Parameters: link - The URL to check for content type.

Returns: true if the HTTP header indicates the type is "text/html".

Throws: ParserException - If the supplied URL can't be read from.
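
A content-type check of this kind can be sketched with the standard java.net classes. The class ContentTypeSketch and the helper isHtmlType are hypothetical, and this version throws IOException rather than ParserException; it is an illustration under those assumptions, not the library's code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical sketch, not the SiteCapturer source.
class ContentTypeSketch {
    // True if a Content-Type header value denotes HTML, e.g. "text/html; charset=UTF-8".
    static boolean isHtmlType(String contentType) {
        return contentType != null
            && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html");
    }

    // Opens a connection to the link and inspects the reported content type.
    static boolean isHtml(String link) throws IOException {
        URLConnection connection = URI.create(link).toURL().openConnection();
        return isHtmlType(connection.getContentType()); // null if the type is unknown
    }

    public static void main(String[] args) {
        System.out.println(isHtmlType("text/html; charset=UTF-8")); // prints true
        System.out.println(isHtmlType("image/png"));                // prints false
    }
}
```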

isToBeCaptured

protected boolean isToBeCaptured(String link)
Returns true if the link is one we are interested in.

Parameters: link - The link to be checked.

Returns: true if the link has the source URL as a prefix and doesn't contain '?' or '#'. Links with '?' are excluded because server-side queries can't be handled in the static target directory structure; links with '#' are excluded because the full page with that reference has presumably already been captured. The prefix test is a case-insensitive comparison, which is cheating really, but it's cheap.
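
The rule stated above can be sketched as a small predicate. The class CaptureFilterSketch is hypothetical, and the source URL is passed as a parameter here instead of being read from mSource, so this is an illustration of the rule rather than the actual method.

```java
// Hypothetical sketch, not the SiteCapturer source.
class CaptureFilterSketch {
    // True if link starts with source (ignoring case) and has no query or fragment.
    static boolean isToBeCaptured(String source, String link) {
        boolean underSource = link.regionMatches(true, 0, source, 0, source.length());
        return underSource
            && link.indexOf('?') == -1   // server-side queries can't be stored statically
            && link.indexOf('#') == -1;  // the page itself is captured without the fragment
    }

    public static void main(String[] args) {
        String source = "http://example.com/docs/";
        System.out.println(isToBeCaptured(source, "HTTP://EXAMPLE.COM/docs/a.html"));     // prints true
        System.out.println(isToBeCaptured(source, "http://example.com/docs/a.html?x=1")); // prints false
    }
}
```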

main

public static void main(String[] args)
Mainline to capture a web site locally.

Parameters: args - The command line arguments. There are three arguments: the web site to capture, the local directory to save it to, and a flag (true or false) to indicate whether resources such as images and video are to be captured as well. These are requested via dialog boxes if not supplied.

Throws: MalformedURLException - If the supplied URL is invalid. IOException - If an error occurs reading the page or resources.

makeLocalLink

protected String makeLocalLink(String link, String current)
Converts a link to local. A relative link can be used to construct both a URL and a file name. Basically, the operation is to strip off the base URL, if any, and then prepend as many dot-dots as necessary to make it relative to the current page. A bit of a kludge handles the root page specially by calling it index.html, even though that probably isn't its real file name. This isn't pretty, but it works for me.

Parameters: link - The link to make relative. current - The current page URL, or empty if it's an absolute URL that needs to be converted.

Returns: The URL relative to the current page.
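
The strip-and-prepend operation can be sketched as follows. The class LocalLinkSketch, the helper isUnder and the extra source parameter are assumptions for illustration; the sketch also returns links outside the source unchanged, which simplifies whatever the real method does with them.

```java
// Hypothetical sketch, not the SiteCapturer source.
class LocalLinkSketch {
    // True if url has source as a prefix, ignoring case.
    static boolean isUnder(String source, String url) {
        return url.regionMatches(true, 0, source, 0, source.length());
    }

    // Removes the source prefix and any leading slash; "" if url is not under source.
    static String strip(String source, String url) {
        if (!isUnder(source, url)) {
            return "";
        }
        String rest = url.substring(source.length());
        return rest.startsWith("/") ? rest.substring(1) : rest;
    }

    static String makeLocalLink(String source, String link, String current) {
        if (!isUnder(source, link)) {
            return link;                       // outside the site: leave as is
        }
        String relative = strip(source, link);
        if (relative.isEmpty()) {
            relative = "index.html";           // the root page kludge
        }
        String page = strip(source, current);
        StringBuilder local = new StringBuilder();
        for (int i = 0; i < page.length(); i++) {
            if (page.charAt(i) == '/') {
                local.append("../");           // one dot-dot per directory level
            }
        }
        return local.append(relative).toString();
    }

    public static void main(String[] args) {
        String source = "http://example.com/site/";
        System.out.println(makeLocalLink(source,
            "http://example.com/site/img/logo.png",
            "http://example.com/site/docs/api/page.html")); // prints ../../img/logo.png
    }
}
```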

process

protected void process(NodeFilter filter)
Process a single page.

Parameters: filter - The filter to apply to the collected nodes.

Throws: ParserException - If a parse error occurs.

setCaptureResources

public void setCaptureResources(boolean capture)
Setter for property captureResources.

Parameters: capture - New value of property captureResources.

setFilter

public void setFilter(NodeFilter filter)
Setter for property filter.

Parameters: filter - New value of property filter.

setSource

public void setSource(String source)
Setter for property source. This is the base URL to capture. URLs that don't start with this prefix are ignored (left as is), while the ones with this URL as a base are re-homed to the local target.

Parameters: source - New value of property source.

setTarget

public void setTarget(String target)
Setter for property target. This is the local directory under which to save the site's pages.

Parameters: target - New value of property target.

HTML Parser is an open source library released under LGPL.