org.apache.lucene.analysis

Class CharTokenizer

Known Direct Subclasses:
LetterTokenizer, RussianLetterTokenizer, WhitespaceTokenizer

public abstract class CharTokenizer
extends Tokenizer

An abstract base class for simple, character-oriented tokenizers.

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer

input

Constructor Summary

CharTokenizer(Reader input)

Method Summary

protected abstract boolean
isTokenChar(char c)
Returns true iff a character should be included in a token.
Token
next()
Returns the next token in the stream, or null at end of stream (EOS).
protected char
normalize(char c)
Called on each token character to normalize it before it is added to the token.

Methods inherited from class org.apache.lucene.analysis.Tokenizer

close

Methods inherited from class org.apache.lucene.analysis.TokenStream

close, next

Constructor Details

CharTokenizer

public CharTokenizer(Reader input)

Method Details

isTokenChar

protected abstract boolean isTokenChar(char c)
Returns true iff a character should be included in a token. This tokenizer emits each maximal run of adjacent characters that satisfy this predicate as a token. Characters for which the predicate is false mark token boundaries and are not included in any token.
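The splitting rule described above can be sketched in plain Java. This is a minimal standalone illustration, not Lucene code: the class TokenRunSketch and its tokenize method are hypothetical names, and the choice of Character.isLetter as the predicate is an assumption made only for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the run-splitting behaviour described above.
// It does not use any Lucene classes.
public class TokenRunSketch {

    // Plays the role of isTokenChar: here, only letters belong to tokens.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    // Collects each maximal run of characters accepted by isTokenChar.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);               // extend the current run
            } else if (current.length() > 0) {
                tokens.add(current.toString());  // boundary character ends the run
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());      // flush a run that reaches the end
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, big world!")); // prints [Hello, big, world]
    }
}
```

In this sketch the comma, spaces, and exclamation mark only separate tokens; none of them appears in the output.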

next

public final Token next()
            throws IOException
Returns the next token in the stream, or null at end of stream (EOS).
Specified by:
next in class TokenStream
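The pull-style contract of next(), where a caller keeps requesting tokens until null signals end of stream, might be consumed as in the following sketch. It is a standalone illustration under stated assumptions: WhitespaceSplitter and joinTokens are hypothetical names, tokens are modelled as plain strings rather than Token objects, and no Lucene classes are used.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration of a "next() until null" loop; not Lucene code.
public class NextLoopSketch {

    // Pull-style splitter: next() returns one token, or null at end of stream.
    static class WhitespaceSplitter {
        private final Reader input;

        WhitespaceSplitter(Reader input) {
            this.input = input;
        }

        String next() throws IOException {
            StringBuilder token = new StringBuilder();
            int ch;
            while ((ch = input.read()) != -1) {
                char c = (char) ch;
                if (!Character.isWhitespace(c)) {
                    token.append(c);             // character belongs to the token
                } else if (token.length() > 0) {
                    return token.toString();     // boundary after a non-empty run
                }
            }
            // Input exhausted: emit a trailing token if any, otherwise signal EOS.
            return token.length() > 0 ? token.toString() : null;
        }
    }

    // Drains the splitter and joins the tokens with '|' for display.
    static String joinTokens(String text) throws IOException {
        WhitespaceSplitter splitter = new WhitespaceSplitter(new StringReader(text));
        StringBuilder out = new StringBuilder();
        for (String t = splitter.next(); t != null; t = splitter.next()) {
            if (out.length() > 0) {
                out.append('|');
            }
            out.append(t);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(joinTokens("  to be  or not ")); // prints to|be|or|not
    }
}
```

Because the underlying Reader can fail, the sketch declares IOException on its next() just as the documented method does.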

normalize

protected char normalize(char c)
Called on each token character to normalize it before it is added to the token. The default implementation returns the character unchanged. Subclasses may override this to, for example, lowercase tokens.
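How the two hooks interact can be sketched with a small standalone model. The base class SimpleCharSplitter below is a hypothetical stand-in that mirrors only the isTokenChar and normalize hooks; it is not the Lucene implementation, and the lowercasing subclass is an assumed example of the use case mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the isTokenChar/normalize hooks; not Lucene code.
public class NormalizeSketch {

    static abstract class SimpleCharSplitter {
        // Decides whether a character belongs to a token.
        protected abstract boolean isTokenChar(char c);

        // Default normalization: return the character unchanged.
        protected char normalize(char c) {
            return c;
        }

        // Splits text into runs of token characters, normalizing each one.
        List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (isTokenChar(c)) {
                    current.append(normalize(c)); // normalize before adding
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        }
    }

    // Letters form tokens; each letter is lowercased as it is added.
    static class LowerCaseLetterSplitter extends SimpleCharSplitter {
        @Override
        protected boolean isTokenChar(char c) {
            return Character.isLetter(c);
        }

        @Override
        protected char normalize(char c) {
            return Character.toLowerCase(c);
        }
    }

    public static void main(String[] args) {
        System.out.println(new LowerCaseLetterSplitter().tokenize("Quick BROWN Fox")); // prints [quick, brown, fox]
    }
}
```

Keeping normalization in a separate hook means a subclass can change how characters are stored without touching the boundary logic.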

Copyright © 2000-2006 Apache Software Foundation. All Rights Reserved.