org.apache.lucene.analysis.cjk

Class CJKTokenizer


public final class CJKTokenizer
extends Tokenizer

CJKTokenizer was modified from StopTokenizer, which does a decent job for most European languages. It handles double-byte characters differently: it returns a token for every two adjacent characters, so consecutive tokens overlap by one character.
Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4". Zero-length tokens ("") also need to be filtered out.
Digits, '+' and '#' are tokenized as letters.
For more information on Asian-language (Chinese, Japanese, Korean) text segmentation, please search Google.
Author:
Che, Dong
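
The overlapping two-character segmentation described above can be illustrated with a minimal, standalone sketch. It does not use Lucene and is not the CJKTokenizer source: the class CjkBigramSketch and its methods are hypothetical names, and treating every letter at or above code point 128 as "double-byte" and emitting a lone such character as its own token are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene and not the real implementation.
final class CjkBigramSketch {

    // ASCII letters and digits, plus '+' and '#', are kept together as one token.
    private static boolean isSingleByteTokenChar(char c) {
        return c < 128 && (Character.isLetterOrDigit(c) || c == '+' || c == '#');
    }

    // Assumption: any letter at or above code point 128 counts as "double-byte".
    private static boolean isDoubleByteTokenChar(char c) {
        return c >= 128 && Character.isLetter(c);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isSingleByteTokenChar(c)) {
                int start = i;
                while (i < text.length() && isSingleByteTokenChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else if (isDoubleByteTokenChar(c)) {
                int start = i;
                while (i < text.length() && isDoubleByteTokenChar(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // Assumption: a lone double-byte character becomes its own token.
                    tokens.add(text.substring(start, i));
                } else {
                    // Slide a two-character window over the run, one step at a time.
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                i++; // whitespace and punctuation separate tokens
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters written as Unicode escapes to stay encoding-safe.
        System.out.println(tokenize("java \u4e2d\u6587\u5206\u8bcd"));
    }
}
```

Because the sketch only starts a token when it sees a token character, it never produces a zero-length token, so no separate filtering step is needed here.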

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer

input

Constructor Summary

CJKTokenizer(Reader in)
Construct a token stream processing the given input.

Method Summary

Token
next()
Returns the next token in the stream, or null at EOS.

Methods inherited from class org.apache.lucene.analysis.Tokenizer

close

Methods inherited from class org.apache.lucene.analysis.TokenStream

close, next

Constructor Details

CJKTokenizer

public CJKTokenizer(Reader in)
Construct a token stream processing the given input.
Parameters:
in - I/O reader

Method Details

next

public final Token next()
            throws IOException
Returns the next token in the stream, or null at EOS. See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details on Unicode blocks.
Overrides:
next in class TokenStream
Returns:
Token
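
As a rough illustration of the Character.UnicodeBlock API referenced above, the following sketch classifies a few characters by Unicode block using only the JDK. The class UnicodeBlockSketch and its helper methods are hypothetical, and how CJKTokenizer itself uses these blocks is not specified on this page, so this should not be read as its actual logic.

```java
// Hypothetical illustration of the JDK API only; not taken from CJKTokenizer.
final class UnicodeBlockSketch {

    // True for characters in the Basic Latin block (U+0000 to U+007F).
    static boolean isBasicLatin(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    // True for fullwidth forms such as U+FF21 (fullwidth 'A').
    static boolean isFullwidthForm(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    // True for characters in the main CJK ideograph block.
    static boolean isCjkIdeograph(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        char[] samples = {'a', '\uff21', '\u4e2d', '\u3042', '\ud55c'};
        for (char c : samples) {
            // Print the code point and the name of its Unicode block.
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()
                    + " -> " + Character.UnicodeBlock.of(c));
        }
    }
}
```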

Copyright © 2000-2006 Apache Software Foundation. All Rights Reserved.