The scrape tag library can scrape or extract content from web documents and display the content in your JSP. For example, you could scrape stock quotes from other web sites and display them in your pages.
After your JSP scrapes a document for the first time, the results of the scrape are cached for subsequent JSP requests. These results are returned unless the JSP determines that the document must be rescraped. Rescraping is determined by the following logic:
This custom tag library requires a servlet container that supports the JavaServer Pages Specification, version 1.1 or higher. It also requires an up-to-date version of the jakarta-oro package.
Follow these steps to configure your web application with this tag library:
<taglib>
  <taglib-uri>http://jakarta.apache.org/taglibs/scrape-1.0</taglib-uri>
  <taglib-location>/WEB-INF/scrape.tld</taglib-location>
</taglib>
To use the tags from this library in your JSP pages, add the following directive at the top of each page:
<%@ taglib uri="http://jakarta.apache.org/taglibs/scrape-1.0" prefix="scrp" %>
where "scrp" is the tag name prefix you wish to use for tags from this library. You can change this value to any prefix you like.
page | Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped. |
url | Specify the URL of the document that contains the content to be scraped. Use this tag as an alternative to the page tag's url attribute when the URL must be generated dynamically. |
header | Set an HTTP header for the request. |
scrape | Specify the text anchors that mark the beginning and end of the content to be scraped. |
result | Retrieve the content from a scrape. |
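The tags in this summary are meant to be combined on one page. The following is a minimal sketch of how they might fit together, assuming the scrp prefix from the directive above; the URL, the anchor strings, and the id "quote" are hypothetical placeholders, not values taken from this library's documentation.

```jsp
<%@ taglib uri="http://jakarta.apache.org/taglibs/scrape-1.0" prefix="scrp" %>

<%-- Hypothetical URL, anchors, and id; replace them with values that match the target document. --%>
<scrp:page url="http://www.example.com/quotes.html" time="15">
  <scrp:scrape id="quote" begin="Last trade:" end="Change:" />
</scrp:page>

<%-- Output the content found between the two anchors. --%>
<p>Latest quote: <scrp:result scrape="quote" /></p>
```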
page | Availability: 1.0 | ||||
Specify the URL of the document to be scraped and the minimum time that must pass before the document is rescraped. |
|||||
Tag Body | JSP | ||||
Restrictions |
None |
||||
Attributes | Name | Required | Runtime Expression Evaluation | Availability | |
url | No | No | 1.0 | ||
The fully qualified URL of the document that is to be scraped, such as: |
|||||
time | No | No | 1.0 | ||
The minimum time, in minutes, that the JSP waits before attempting to rescrape the document. The minimum value is 10, and this minimum is also used when no time attribute is specified. |
|||||
useProxy | No | No | 1.0 | ||
Tells the taglib to use a proxy for the connection. The name and port of the proxy server are retrieved from the system properties http.proxyHost and http.proxyPort. This attribute is not necessary if the name and port are set with the proxyServer and proxyPort attributes. |
|||||
proxyServer | No | No | 1.0 | ||
The name of the proxy server to use. |
|||||
proxyPort | No | No | 1.0 | ||
The number of the port to use to connect to the proxy server. Defaults to 3128. |
|||||
proxyName | No | No | 1.0 | ||
The username for authentication to the proxy server. |
|||||
proxyPass | No | No | 1.0 | ||
The password for authentication to the proxy server. |
|||||
charset | No | No | 1.0 | ||
The charset used by the scraped page. This attribute is useful when the page being scraped uses a different charset from the web server. |
|||||
Variables | None | ||||
Examples | Specify a document to be scraped with a rescrape time of 20 minutes. Note that a scrape tag must be nested within the body of the page tag. | ||||
|
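A sketch of what this example might look like, assuming the scrp prefix. The URL, the id "forecast", and the anchor strings are hypothetical.

```jsp
<%-- time="20" asks for at least 20 minutes between rescrapes. URL, id, and anchors are hypothetical. --%>
<scrp:page url="http://www.example.com/weather.html" time="20">
  <scrp:scrape id="forecast" begin="Forecast:" end="Updated:" />
</scrp:page>
```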
|||||
Examples | Specify a document to be scraped with a connection that must be made through a proxy on a port other than the default 3128. Note that a scrape tag must be nested within the body of the page tag. | ||||
|
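A sketch of this case, assuming the scrp prefix. The proxy host name, the port 8080, the URL, and the anchors are hypothetical. useProxy is left out because the proxy is named explicitly with proxyServer and proxyPort.

```jsp
<%-- Hypothetical proxy host and port; proxyPort overrides the default of 3128. --%>
<scrp:page url="http://www.example.com/weather.html" time="20"
           proxyServer="proxy.example.com" proxyPort="8080">
  <scrp:scrape id="forecast" begin="Forecast:" end="Updated:" />
</scrp:page>
```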
|||||
Examples | Specify a document to be scraped with a connection that must be made through a proxy. Use the Java system properties http.proxyHost and http.proxyPort. Note that a scrape tag must be nested within the body of the page tag. | ||||
|
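A sketch of this case, assuming the scrp prefix and assuming that useProxy accepts the value "true" (the attribute description does not list its accepted values). The URL and anchors are hypothetical. The system properties must already be set for the JVM, for example with -Dhttp.proxyHost and -Dhttp.proxyPort on the command line.

```jsp
<%-- Assumes useProxy="true" is the accepted form; proxy host and port come from JVM system properties. --%>
<scrp:page url="http://www.example.com/weather.html" time="20" useProxy="true">
  <scrp:scrape id="forecast" begin="Forecast:" end="Updated:" />
</scrp:page>
```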
|||||
Examples | Specify a document to be scraped with a connection that must be made through a proxy on a port other than the default 3128. The proxy server requires authentication. Note that a scrape tag must be nested within the body of the page tag. | ||||
|
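A sketch of this case, assuming the scrp prefix. The proxy host, port, user name, password, URL, and anchors are all hypothetical placeholders.

```jsp
<%-- Hypothetical proxy settings and credentials; avoid hard-coding real passwords in a page. --%>
<scrp:page url="http://www.example.com/weather.html" time="20"
           proxyServer="proxy.example.com" proxyPort="8080"
           proxyName="proxyuser" proxyPass="secret">
  <scrp:scrape id="forecast" begin="Forecast:" end="Updated:" />
</scrp:page>
```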
url | Availability: 1.0 | ||||
Specify the URL of the document that contains the content to be scraped. Use this tag as an alternative to the page tag's url attribute when the URL must be generated dynamically. |
|||||
Tag Body | JSP | ||||
Restrictions |
Must be nested within a page tag. |
||||
Attributes | None | ||||
Variables | None | ||||
Examples | Specify a document to be scraped. Note that a url tag must be nested within the body of the page tag. | ||||
|
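A sketch of a dynamically generated URL, assuming the scrp prefix. Because the page tag's url attribute does not accept runtime expressions, the expression is placed in the body of the url tag instead. The request parameter "symbol", the fallback value, the URL, and the anchors are hypothetical.

```jsp
<%
  // Hypothetical request parameter; fall back to a fixed symbol when it is absent.
  String symbol = request.getParameter("symbol");
  if (symbol == null) {
    symbol = "ACME";
  }
%>
<scrp:page time="20">
  <%-- The url tag body supplies the URL in place of the page tag's url attribute. --%>
  <scrp:url>http://www.example.com/quote?symbol=<%= java.net.URLEncoder.encode(symbol, "UTF-8") %></scrp:url>
  <scrp:scrape id="quote" begin="Last trade:" end="Change:" />
</scrp:page>
```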
header | Availability: 1.1 | ||||
Set an HTTP header for the request. |
|||||
Tag Body | JSP | ||||
Restrictions |
Must be nested within a page tag. |
||||
Attributes | Name | Required | Runtime Expression Evaluation | Availability | |
name | Yes | 1.1 | |||
The name of the HTTP header to be sent in the HTTP request. |
|||||
value | No | 1.1 | |||
The value of the HTTP header to be sent in the HTTP request. |
|||||
Variables | None | ||||
Examples | Specify that the HTTP request for the scrape set the User-Agent and Referer headers. The User-Agent header is set using the name and value attributes. The Referer header is set using the name attribute and the body of the header tag. Note that a header tag must be nested within the body of the page tag. | ||||
|
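A sketch of both ways of setting a header, assuming the scrp prefix. The header values, the URL, and the anchors are hypothetical.

```jsp
<scrp:page url="http://www.example.com/news.html" time="20">
  <%-- Header value given with the value attribute (hypothetical user agent string). --%>
  <scrp:header name="User-Agent" value="Mozilla/4.0 (compatible; example)" />
  <%-- Header value given in the tag body (hypothetical referring URL). --%>
  <scrp:header name="Referer">http://www.example.com/</scrp:header>
  <scrp:scrape id="headline" begin="Top story:" end="More news" />
</scrp:page>
```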
scrape | Availability: 1.0 | ||||
Specify the text anchors that mark the beginning and end of the content to be scraped. |
|||||
Tag Body | JSP | ||||
Restrictions |
Must be nested within a page tag. |
||||
Attributes | Name | Required | Runtime Expression Evaluation | Availability | |
id | Yes | No | 1.0 | ||
A unique identifier that distinguishes this scrape from all others. Each scrape is unique and accessible only by this id. |
|||||
begin | Yes | No | 1.0 | ||
The text anchor that marks the beginning of the content to be scraped from the document. |
|||||
end | Yes | No | 1.0 | ||
The text anchor that marks the end of the content to be scraped from the document. |
|||||
strip | No | No | 1.0 | ||
If strip is set to true, markup tags (HTML, XML, DHTML, and so on) are stripped from the output of the result tag. That is, nothing within < > is included in the scrape result. The default value is false. Note that strip can be used in conjunction with the anchors attribute. |
|||||
anchors | No | No | 1.0 | ||
If anchors is set to true, the begin and end text anchors are included in the scrape result. The default value is false. Note that anchors can be used in conjunction with the strip attribute. |
|||||
Variables | Name | Scope | Availability | ||
id attribute value | Start of tag to end of page | 1.0 | |||
Name used to retrieve the scrape later in the page. |
|||||
Properties | None | ||||
Examples | Set a scrape on a page with anchors included. Note that the page tag is first and the scrape tag is nested. | ||||
|
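A sketch of a scrape that keeps its anchors, assuming the scrp prefix. The URL, the id "quote", and the anchor strings are hypothetical. With anchors="true", the result would be expected to include the begin and end anchor text along with the content between them.

```jsp
<scrp:page url="http://www.example.com/quotes.html" time="20">
  <%-- anchors="true" keeps the begin and end anchor text in the result. --%>
  <scrp:scrape id="quote" begin="Last trade:" end="Change:" anchors="true" />
</scrp:page>
```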
|||||
Examples | Set a scrape on a page with markup tags stripped from the result. | ||||
|
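A sketch of a stripped scrape, assuming the scrp prefix. The URL, the id "headline", and the anchor strings are hypothetical.

```jsp
<scrp:page url="http://www.example.com/news.html" time="20">
  <%-- strip="true" removes anything within < > from the scraped content. --%>
  <scrp:scrape id="headline" begin="Top story:" end="More news" strip="true" />
</scrp:page>
```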
result | Availability: 1.0 | ||||
Retrieve the content from a scrape. |
|||||
Tag Body | Empty | ||||
Restrictions |
None |
||||
Attributes | Name | Required | Runtime Expression Evaluation | Availability | |
scrape | Yes | No | 1.0 | ||
The id of a previously performed scrape whose results you would like to retrieve. |
|||||
Variables | None | ||||
Examples | Get the results of a previously performed scrape. | ||||
|
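A sketch of retrieving a result, assuming the scrp prefix and assuming that a scrape with the hypothetical id "quote" was set earlier on the same page.

```jsp
<%-- "quote" must match the id attribute of a scrape tag that appears earlier on the page. --%>
<p>Latest quote: <scrp:result scrape="quote" /></p>
```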
See the example application scrape-examples.war for examples of the usage of the tags from this custom tag library.
Java programmers can view the Java class documentation for this tag library as Javadoc.
Review the complete revision history of this tag library.