org.apache.lucene.wikipedia.analysis
Class WikipediaTokenizer
java.lang.Object
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.wikipedia.analysis.WikipediaTokenizer
public class WikipediaTokenizer
extends Tokenizer
Extension of StandardTokenizer that is aware of Wikipedia syntax. It is based on the
Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but its coverage of the syntax may be incomplete.
EXPERIMENTAL
NOTE: This Tokenizer is considered experimental, and the grammar is subject to change in the trunk and in follow-up releases.
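To give a rough idea of what "aware of Wikipedia syntax" can mean in practice, the following is a minimal, self-contained sketch. It does not use Lucene and is not the JFlex grammar behind this class: the class name WikiMarkupSketch, the regular expression, and the type labels (word, link, bold, italics) are all assumptions made for illustration. It recognizes three markup forms and tags each word inside them with the type of the enclosing markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy markup-aware tokenizer; hypothetical, not the WikipediaTokenizer grammar. */
final class WikiMarkupSketch {
    // Alternatives are tried left to right, so ''' (bold) must come before '' (italics).
    private static final Pattern TOKEN = Pattern.compile(
        "\\[\\[(?<link>[^\\]|]+)\\]\\]"   // [[Internal link]]
      + "|'''(?<bold>[^']+)'''"           // '''bold'''
      + "|''(?<italics>[^']+)''"          // ''italics''
      + "|(?<word>[\\p{L}\\p{N}]+)");     // plain alphanumeric run

    /** Returns entries of the form "term/type"; the type labels are made up for this sketch. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String type;
            if (m.group("link") != null) type = "link";
            else if (m.group("bold") != null) type = "bold";
            else if (m.group("italics") != null) type = "italics";
            else type = "word";
            // Each word inside the markup becomes its own token carrying the markup type.
            for (String w : m.group(type).trim().split("\\s+")) {
                if (!w.isEmpty()) out.add(w + "/" + type);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("See [[Apache Lucene]] for '''fast''' search"));
    }
}
```

The real tokenizer distinguishes many more constructs (see the type constants listed below), and its exact output for a given input is determined by its grammar, not by this sketch.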
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
INTERNAL_LINK
public static final String INTERNAL_LINK
- See Also:
- Constant Field Values
EXTERNAL_LINK
public static final String EXTERNAL_LINK
- See Also:
- Constant Field Values
EXTERNAL_LINK_URL
public static final String EXTERNAL_LINK_URL
- See Also:
- Constant Field Values
CITATION
public static final String CITATION
- See Also:
- Constant Field Values
CATEGORY
public static final String CATEGORY
- See Also:
- Constant Field Values
BOLD
public static final String BOLD
- See Also:
- Constant Field Values
ITALICS
public static final String ITALICS
- See Also:
- Constant Field Values
BOLD_ITALICS
public static final String BOLD_ITALICS
- See Also:
- Constant Field Values
HEADING
public static final String HEADING
- See Also:
- Constant Field Values
SUB_HEADING
public static final String SUB_HEADING
- See Also:
- Constant Field Values
ALPHANUM_ID
public static final int ALPHANUM_ID
- See Also:
- Constant Field Values
APOSTROPHE_ID
public static final int APOSTROPHE_ID
- See Also:
- Constant Field Values
ACRONYM_ID
public static final int ACRONYM_ID
- See Also:
- Constant Field Values
COMPANY_ID
public static final int COMPANY_ID
- See Also:
- Constant Field Values
EMAIL_ID
public static final int EMAIL_ID
- See Also:
- Constant Field Values
HOST_ID
public static final int HOST_ID
- See Also:
- Constant Field Values
NUM_ID
public static final int NUM_ID
- See Also:
- Constant Field Values
CJ_ID
public static final int CJ_ID
- See Also:
- Constant Field Values
INTERNAL_LINK_ID
public static final int INTERNAL_LINK_ID
- See Also:
- Constant Field Values
EXTERNAL_LINK_ID
public static final int EXTERNAL_LINK_ID
- See Also:
- Constant Field Values
CITATION_ID
public static final int CITATION_ID
- See Also:
- Constant Field Values
CATEGORY_ID
public static final int CATEGORY_ID
- See Also:
- Constant Field Values
BOLD_ID
public static final int BOLD_ID
- See Also:
- Constant Field Values
ITALICS_ID
public static final int ITALICS_ID
- See Also:
- Constant Field Values
BOLD_ITALICS_ID
public static final int BOLD_ITALICS_ID
- See Also:
- Constant Field Values
HEADING_ID
public static final int HEADING_ID
- See Also:
- Constant Field Values
SUB_HEADING_ID
public static final int SUB_HEADING_ID
- See Also:
- Constant Field Values
EXTERNAL_LINK_URL_ID
public static final int EXTERNAL_LINK_URL_ID
- See Also:
- Constant Field Values
TOKEN_TYPES
public static final String[] TOKEN_TYPES
- String token types that correspond to token type int constants
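The relationship between int type constants and a String[] of type names is commonly a plain index lookup. The sketch below shows that pattern in isolation. It is self-contained and does not reference Lucene; the class TokenTypeLookup, its ids, and its labels are hypothetical, and the actual values of the constants on this page are not reproduced here.

```java
/** Toy illustration of an int-id-to-type-name lookup table; not Lucene code. */
final class TokenTypeLookup {
    // Hypothetical ids, chosen only for this sketch.
    static final int WORD_ID = 0;
    static final int LINK_ID = 1;
    static final int BOLD_ID = 2;

    // Index i holds the string type for the token whose int id is i.
    static final String[] TYPES = { "word", "link", "bold" };

    /** Returns the string type for an id, or "unknown" if the id is out of range. */
    static String typeOf(int id) {
        return (id >= 0 && id < TYPES.length) ? TYPES[id] : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(typeOf(LINK_ID));
        System.out.println(typeOf(42));
    }
}
```

Keeping the ids dense and zero-based is what lets the array double as the lookup table; an out-of-range id is handled explicitly rather than by catching an exception.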
tokenImage
public static final String[] tokenImage
- Deprecated. Please use TOKEN_TYPES instead.
TOKENS_ONLY
public static final int TOKENS_ONLY
- See Also:
- Constant Field Values
UNTOKENIZED_ONLY
public static final int UNTOKENIZED_ONLY
- See Also:
- Constant Field Values
BOTH
public static final int BOTH
- See Also:
- Constant Field Values
WikipediaTokenizer
public WikipediaTokenizer(Reader input)
- Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner.
- Parameters:
input - The input Reader
next
public Token next(Token result)
throws IOException
- Overrides:
next in class TokenStream
- Throws:
IOException
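A method shaped like next(Token result) is typically driven by a loop that passes in one reusable token and stops when the stream signals the end of input. The sketch below illustrates that calling pattern with stand-in types so that it runs without Lucene: SimpleToken and WhitespaceStream are hypothetical, and the convention that next returns null at end of input is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of a reusable-token consume loop; the types here are not Lucene's. */
final class ReusableTokenLoop {
    /** Hypothetical stand-in for a token: mutable so one instance can be reused. */
    static final class SimpleToken {
        final StringBuilder text = new StringBuilder();
    }

    /** Hypothetical stream that splits on whitespace and fills the caller's token. */
    static final class WhitespaceStream {
        private final Reader input;

        WhitespaceStream(Reader input) { this.input = input; }

        /** Fills and returns result, or returns null at end of input. */
        SimpleToken next(SimpleToken result) throws IOException {
            result.text.setLength(0);
            int c;
            // Skip leading whitespace.
            while ((c = input.read()) != -1 && Character.isWhitespace(c)) { /* skip */ }
            if (c == -1) return null;
            // Accumulate characters until whitespace or end of input.
            do {
                result.text.append((char) c);
            } while ((c = input.read()) != -1 && !Character.isWhitespace(c));
            return result;
        }
    }

    /** Drains the stream, copying each term because the token is overwritten on the next call. */
    static List<String> collect(String s) {
        WhitespaceStream stream = new WhitespaceStream(new StringReader(s));
        List<String> out = new ArrayList<>();
        SimpleToken reusable = new SimpleToken();
        try {
            for (SimpleToken t = stream.next(reusable); t != null; t = stream.next(reusable)) {
                out.add(t.text.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect("a  bc"));
    }
}
```

The loop always works with the token returned by next rather than assuming it is the instance that was passed in, and it copies the text out before the next call, since a reused token is overwritten.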
reset
public void reset()
throws IOException
- Overrides:
reset in class TokenStream
- Throws:
IOException
reset
public void reset(Reader reader)
throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
Copyright © 2000-2008 Apache Software Foundation. All Rights Reserved.