Package org.apache.lucene.analysis.th
Class ThaiTokenizer
- java.lang.Object
-
- org.apache.lucene.util.AttributeSource
-
- org.apache.lucene.analysis.TokenStream
-
- org.apache.lucene.analysis.Tokenizer
-
- org.apache.lucene.analysis.util.SegmentingTokenizerBase
-
- org.apache.lucene.analysis.th.ThaiTokenizer
-
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public class ThaiTokenizer extends SegmentingTokenizerBase
Tokenizer that uses BreakIterator to tokenize Thai text.
WARNING: this tokenizer may not be supported by all JREs. It is known to work with Sun/Oracle and Harmony JREs. If your application needs to be fully portable, consider using ICUTokenizer instead, which uses an ICU Thai BreakIterator that will always be available.
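The following is a minimal sketch of the underlying technique, not the source of ThaiTokenizer. It uses only the JDK's java.text.BreakIterator to split text at the word boundaries reported for the Thai locale. The class name ThaiBreakSketch and the method segment are hypothetical, and the sample input is the Thai phrase for "Thai language" written with Unicode escapes. How the Thai part is segmented depends on the JRE, as the warning above notes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration; it is not part of Lucene.
public class ThaiBreakSketch {

    /** Splits text at the word boundaries reported by the JRE's BreakIterator for the Thai locale. */
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end);
            // Skip segments that begin with whitespace or punctuation.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Thai phrase followed by two Latin words; the Thai split varies by JRE.
        String sample = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22 hello world";
        System.out.println(segment(sample));
    }
}
```

On a JRE without dictionary-based Thai support, the Thai phrase would likely come back as a single segment, which is the portability problem the warning describes.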
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields
Modifier and Type, Field, Description
static boolean DBBI_AVAILABLE
True if the JRE supports a working dictionary-based breakiterator for Thai.
private OffsetAttribute offsetAtt
private static java.text.BreakIterator proto
(package private) int sentenceEnd
private static java.text.BreakIterator sentenceProto
used for breaking the text into sentences
(package private) int sentenceStart
private CharTermAttribute termAtt
private java.text.BreakIterator wordBreaker
private CharArrayIterator wrapper
-
Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
buffer, BUFFERMAX, offset
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
Constructor, Description
ThaiTokenizer()
Creates a new ThaiTokenizer
ThaiTokenizer(AttributeFactory factory)
Creates a new ThaiTokenizer, supplying the AttributeFactory
-
Method Summary
Modifier and Type, Method, Description
protected boolean incrementWord()
Returns true if another word is available
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Provides the next input sentence for analysis
-
Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
end, incrementToken, isSafeEnd, reset
-
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
DBBI_AVAILABLE
public static final boolean DBBI_AVAILABLE
True if the JRE supports a working dictionary-based breakiterator for Thai. If this is false, this tokenizer will not work at all!
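As an illustration of what such a flag could be based on, the sketch below asks the JRE's Thai word BreakIterator whether it reports a boundary inside a known two-word Thai phrase. This is an assumption about one plausible probe, not a statement of how DBBI_AVAILABLE is computed. The class ThaiSupportProbe, both method names, the probe phrase (the Thai for "Thai language"), and the offset 4 are all hypothetical choices for this example.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical probe for illustration; it is not part of Lucene.
public class ThaiSupportProbe {

    /** Returns true if the word BreakIterator for the locale reports a boundary at the offset. */
    static boolean hasBoundaryAt(Locale locale, String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        return words.isBoundary(offset);
    }

    /**
     * Assumed check: the phrase is two Thai words of 4 and 3 characters with no
     * space between them, so only a dictionary-aware iterator should find a
     * boundary at offset 4.
     */
    static boolean thaiDictionaryAvailable() {
        String phrase = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        return hasBoundaryAt(Locale.forLanguageTag("th"), phrase, 4);
    }

    public static void main(String[] args) {
        System.out.println("Thai dictionary breaking available: " + thaiDictionaryAvailable());
    }
}
```

Applications that need to decide at startup whether to fall back to ICUTokenizer can read DBBI_AVAILABLE directly rather than running a probe like this one.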
-
proto
private static final java.text.BreakIterator proto
-
sentenceProto
private static final java.text.BreakIterator sentenceProto
used for breaking the text into sentences
-
wordBreaker
private final java.text.BreakIterator wordBreaker
-
wrapper
private final CharArrayIterator wrapper
-
sentenceStart
int sentenceStart
-
sentenceEnd
int sentenceEnd
-
termAtt
private final CharTermAttribute termAtt
-
offsetAtt
private final OffsetAttribute offsetAtt
-
-
Constructor Detail
-
ThaiTokenizer
public ThaiTokenizer()
Creates a new ThaiTokenizer
-
ThaiTokenizer
public ThaiTokenizer(AttributeFactory factory)
Creates a new ThaiTokenizer, supplying the AttributeFactory
-
-
Method Detail
-
setNextSentence
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Description copied from class: SegmentingTokenizerBase
Provides the next input sentence for analysis
- Specified by:
setNextSentence in class SegmentingTokenizerBase
-
incrementWord
protected boolean incrementWord()
Description copied from class: SegmentingTokenizerBase
Returns true if another word is available
- Specified by:
incrementWord in class SegmentingTokenizerBase
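The interplay of the two methods above can be sketched with JDK classes alone. A driver splits the input into sentences, hands each sentence range to setNextSentence, and then calls incrementWord until it returns false. The class SentenceWordSketch, its fields, and the tokenize driver are hypothetical stand-ins: in Lucene the driver role belongs to SegmentingTokenizerBase and words are exposed through attributes rather than a String field. Treat this as a rough model of the contract, not the actual implementation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the sentence/word contract; it is not Lucene code.
public class SentenceWordSketch {
    private final BreakIterator wordBreaker =
            BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
    private String text;
    private int sentenceStart;
    private String currentWord;

    /** Plays the role of setNextSentence: restricts word breaking to [start, end). */
    void setNextSentence(int start, int end) {
        sentenceStart = start;
        wordBreaker.setText(text.substring(start, end));
    }

    /** Plays the role of incrementWord: advances to the next word, or returns false. */
    boolean incrementWord() {
        int start = wordBreaker.current();
        int end = wordBreaker.next();
        while (end != BreakIterator.DONE) {
            String piece = text.substring(sentenceStart + start, sentenceStart + end);
            // Segments that begin with whitespace or punctuation are skipped.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                currentWord = piece;
                return true;
            }
            start = end;
            end = wordBreaker.next();
        }
        return false; // the current sentence is exhausted
    }

    /** Stand-in for the base class driver: sentence by sentence, word by word. */
    static List<String> tokenize(String text) {
        SentenceWordSketch sketch = new SentenceWordSketch();
        sketch.text = text;
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        List<String> words = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            sketch.setNextSentence(start, end);
            while (sketch.incrementWord()) {
                words.add(sketch.currentWord);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("One two. Three four."));
    }
}
```

Breaking into sentences first keeps each word-breaking pass over a small, self-contained range of the input.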
-
-