org.apache.lucene.ant

Class HtmlDocument


public class HtmlDocument
extends Object

The HtmlDocument class creates a Lucene Document from an HTML document.

It does this by using the JTidy package. It can take input from a File or an InputStream.
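The idea of reading HTML from an InputStream and exposing its title and body text can be sketched with the JDK alone. The class below is a hypothetical stand-in, not the real HtmlDocument: the name HtmlTextSketch is an assumption, and it uses simple regular expressions in place of JTidy, so it only copes with well-formed, simple markup. It does not build a Lucene Document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration only; regexes replace JTidy here. */
final class HtmlTextSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY = Pattern.compile(
        "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final String title;
    private final String body;

    /** Reads the whole stream as UTF-8 (an assumption) and extracts both parts. */
    HtmlTextSketch(InputStream is) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = is.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        String html = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        this.title = textOf(TITLE, html);
        this.body = textOf(BODY, html);
    }

    /** Returns the element's content with tags removed and whitespace collapsed. */
    private static String textOf(Pattern element, String html) {
        Matcher m = element.matcher(html);
        if (!m.find()) {
            return "";
        }
        return m.group(1).replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    String getTitle() { return title; }

    String getBody() { return body; }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Hello</title></head>"
                    + "<body><p>Some <b>bold</b> text.</p></body></html>";
        HtmlTextSketch doc = new HtmlTextSketch(
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        System.out.println("Title = " + doc.getTitle());
        System.out.println("Body  = " + doc.getBody());
    }
}
```

A real HTML parser such as JTidy tolerates malformed markup, entities, and nested structures that this regex-based sketch does not handle.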

Author:
Erik Hatcher

Constructor Summary

HtmlDocument(File file)
Constructs an HtmlDocument from a File.
HtmlDocument(InputStream is)
Constructs an HtmlDocument from an InputStream.

Method Summary

static Document
Document(File file)
Creates a Lucene Document from a File.
String
getBody()
Gets the bodyText attribute of the HtmlDocument object.
static Document
getDocument(InputStream is)
Creates a Lucene Document from an InputStream.
String
getTitle()
Gets the title attribute of the HtmlDocument object.
static void
main(String[] args)
Runs HtmlDocument on the files specified on the command line.

Constructor Details

HtmlDocument

public HtmlDocument(File file)
            throws IOException
Parameters:
file - the File containing the HTML to parse

HtmlDocument

public HtmlDocument(InputStream is)
Parameters:
is - the InputStream containing the HTML

Method Details

Document

public static Document Document(File file)
            throws IOException
Creates a Lucene Document from a File.
Parameters:
file - the File containing the HTML to parse
Returns:
a Lucene Document built from the parsed HTML

getBody

public String getBody()
Gets the bodyText attribute of the HtmlDocument object.
Returns:
the bodyText value

getDocument

public static Document getDocument(InputStream is)
Creates a Lucene Document from an InputStream.
Parameters:
is - the InputStream containing the HTML
Returns:
a Lucene Document built from the parsed HTML

getTitle

public String getTitle()
Gets the title attribute of the HtmlDocument object.
Returns:
the title value

main

public static void main(String[] args)
            throws Exception
Runs HtmlDocument on the files specified on the command line.
Parameters:
args - the files to process, as specified on the command line
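A command-line driver that processes each file named in its arguments might look roughly like the sketch below. It is a hypothetical illustration using only the JDK: the class name TitleDriverSketch, the helper titleOf, the usage message, and the UTF-8 assumption are not part of HtmlDocument, and a regular expression stands in for JTidy parsing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical driver for illustration; not the real HtmlDocument.main. */
final class TitleDriverSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed title text of one HTML file, or "" if there is none. */
    static String titleOf(String path) throws IOException {
        String html = new String(
            Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java TitleDriverSketch file.html [more.html ...]");
            return;
        }
        // Process every file named on the command line, one line of output each.
        for (String path : args) {
            System.out.println(path + ": " + titleOf(path));
        }
    }
}
```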

Copyright © 2000-2006 Apache Software Foundation. All Rights Reserved.