com.ibm.icu.text
Class DictionaryBasedBreakIterator

java.lang.Object
  extended by com.ibm.icu.text.BreakIterator
      extended by com.ibm.icu.text.RuleBasedBreakIterator
          extended by com.ibm.icu.text.RuleBasedBreakIterator_Old
              extended by com.ibm.icu.text.DictionaryBasedBreakIterator
All Implemented Interfaces:
java.lang.Cloneable

public class DictionaryBasedBreakIterator
extends RuleBasedBreakIterator_Old

A subclass of RuleBasedBreakIterator_Old that can use a dictionary to subdivide ranges of text beyond what the state-table-based algorithm alone can do. This is necessary, for example, to handle word and line breaking in Thai, which does not use spaces between words. The state-table-based algorithm of RuleBasedBreakIterator_Old first divides up the text as far as possible; contiguous ranges of letters are then repeatedly compared against a list of known words (the dictionary) to divide them into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old, but adds one more special substitution name: _dictionary_. This substitution name identifies the characters that occur in dictionary words. If the iterator passes over a chunk of text that includes two or more characters in a row that belong to _dictionary_, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with a dictionary. The constructor takes the dictionary as a java.io.InputStream rather than a file name, so the caller is responsible for locating and opening the dictionary data (for example, as a class resource). The dictionary is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
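
The dictionary pass described above can be pictured with a small, self-contained sketch. This is not ICU's implementation: the class name ToyDictionarySegmenter, the in-memory word set, and the greedy longest-match strategy are all assumptions made for illustration, and the real class reads a serialized binary dictionary and may match words differently. The sketch only shows the general idea of taking a run of letters that the rule pass could not split and deriving extra break positions from a word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of ICU and not ICU's algorithm.
final class ToyDictionarySegmenter {
    private final Set<String> words;
    private final int maxLen; // length of the longest dictionary word

    ToyDictionarySegmenter(Set<String> words) {
        this.words = words;
        int longest = 0;
        for (String w : words) {
            longest = Math.max(longest, w.length());
        }
        this.maxLen = longest;
    }

    /**
     * Returns break offsets for text[start, end), always including start and end.
     * At each position the longest dictionary word is taken (greedy matching).
     * If no word matches, the rest of the range is left undivided.
     */
    List<Integer> boundaries(String text, int start, int end) {
        List<Integer> out = new ArrayList<>();
        out.add(start);
        int pos = start;
        while (pos < end) {
            int next = -1;
            for (int len = Math.min(maxLen, end - pos); len >= 1; len--) {
                if (words.contains(text.substring(pos, pos + len))) {
                    next = pos + len;
                    break;
                }
            }
            if (next < 0) {
                next = end; // no dictionary word here: stop subdividing
            }
            out.add(next);
            pos = next;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        ToyDictionarySegmenter seg = new ToyDictionarySegmenter(dict);
        // A run of letters with no spaces, as the rule pass might hand over.
        System.out.println(seg.boundaries("thecatsat", 0, 9)); // prints [0, 3, 6, 9]
    }
}
```

Greedy matching is chosen here only because it is short; it can pick a wrong split when one dictionary word is a prefix of another, which is one reason a production segmenter needs a more careful strategy.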

Status:
Stable ICU 2.0.

Nested Class Summary
protected  class DictionaryBasedBreakIterator.Builder
          The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator_Old, but extends it with extra logic to handle the DICTIONARY_VAR token.
 
Field Summary
 
Fields inherited from class com.ibm.icu.text.RuleBasedBreakIterator_Old
IGNORE
 
Fields inherited from class com.ibm.icu.text.RuleBasedBreakIterator
WORD_IDEO, WORD_IDEO_LIMIT, WORD_KANA, WORD_KANA_LIMIT, WORD_LETTER, WORD_LETTER_LIMIT, WORD_NONE, WORD_NONE_LIMIT, WORD_NUMBER, WORD_NUMBER_LIMIT
 
Fields inherited from class com.ibm.icu.text.BreakIterator
DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD
 
Constructor Summary
DictionaryBasedBreakIterator(java.lang.String description, java.io.InputStream dictionaryStream)
          Constructs a DictionaryBasedBreakIterator.
 
Method Summary
 int first()
          Sets the current iteration position to the beginning of the text.
 int following(int offset)
          Sets the current iteration position to the first boundary position after the specified position.
protected  int handleNext()
          This is the implementation function for next().
 int last()
          Sets the current iteration position to the end of the text.
protected  int lookupCategory(char c)
          Looks up a character category for a character.
protected  RuleBasedBreakIterator_Old.Builder makeBuilder()
          Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
 int preceding(int offset)
          Sets the current iteration position to the last boundary position before the specified position.
 int previous()
          Advances the iterator one step backwards.
 void setText(java.text.CharacterIterator newText)
          Set the iterator to analyze a new piece of text.
 void writeTablesToFile(java.io.FileOutputStream file, boolean littleEndian)
          Write the RBBI runtime engine state transition tables to a file.
 
Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator_Old
checkOffset, clone, current, debugDumpTables, debugPrintln, equals, getRuleStatus, getRuleStatusVec, getText, handlePrevious, hashCode, isBoundary, lookupBackwardState, lookupState, next, next, toString, writeSwappedInt, writeSwappedShort
 
Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator
getInstanceFromCompiledRules
 
Methods inherited from class com.ibm.icu.text.BreakIterator
getAvailableLocales, getAvailableULocales, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, registerInstance, registerInstance, setText, unregister
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(java.lang.String description,
                                    java.io.InputStream dictionaryStream)
                             throws java.io.IOException
Constructs a DictionaryBasedBreakIterator.

Parameters:
description - Same as the description parameter on RuleBasedBreakIterator_Old, except for the special meaning of DICTIONARY_VAR. This parameter is just passed through to RuleBasedBreakIterator_Old's constructor.
dictionaryStream - the stream containing the dictionary data
Status:
Stable ICU 2.0.
Method Detail

makeBuilder

protected RuleBasedBreakIterator_Old.Builder makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator. This is the same as RuleBasedBreakIterator_Old.Builder, except for the extra code to handle the DICTIONARY_VAR tag.

Overrides:
makeBuilder in class RuleBasedBreakIterator_Old
Status:
Internal. This API is ICU internal only.

writeTablesToFile

public void writeTablesToFile(java.io.FileOutputStream file,
                              boolean littleEndian)
                       throws java.io.IOException
Description copied from class: RuleBasedBreakIterator_Old
Write the RBBI runtime engine state transition tables to a file. This method was formerly used to export the tables to the C++ RBBI implementation. It is now obsolete, as C++ builds its own tables.

Overrides:
writeTablesToFile in class RuleBasedBreakIterator_Old
Throws:
java.io.IOException
Status:
Internal. This API is ICU internal only.

setText

public void setText(java.text.CharacterIterator newText)
Description copied from class: RuleBasedBreakIterator_Old
Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.

Overrides:
setText in class RuleBasedBreakIterator_Old
Parameters:
newText - An iterator over the text to analyze.
Status:
Stable ICU 2.0.
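
The reset behavior of setText can be sketched as follows. Constructing a DictionaryBasedBreakIterator requires rule and dictionary data that this page does not provide, so the sketch uses the JDK's java.text.BreakIterator as a stand-in; its setText(CharacterIterator) has the same documented contract of resetting the position to the start of the text. The class name SetTextResetSketch and the sample strings are assumptions for illustration.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

// Stand-in sketch using the JDK BreakIterator, not the ICU class.
final class SetTextResetSketch {
    /** Moves the iterator away from the start, swaps the text, and reports the position. */
    static int positionAfterSetText(String firstText, String secondText) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(new StringCharacterIterator(firstText));
        it.last(); // position is now at the end of firstText
        it.setText(new StringCharacterIterator(secondText));
        return it.current(); // back at the beginning of the new text
    }

    public static void main(String[] args) {
        System.out.println(positionAfterSetText("Hello world", "Another text")); // prints 0
    }
}
```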

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

Overrides:
first in class RuleBasedBreakIterator_Old
Returns:
The offset of the beginning of the text.
Status:
Stable ICU 2.0.

last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

Overrides:
last in class RuleBasedBreakIterator_Old
Returns:
The text's past-the-end offset.
Status:
Stable ICU 2.0.

previous

public int previous()
Advances the iterator one step backwards.

Overrides:
previous in class RuleBasedBreakIterator_Old
Returns:
The position of the last boundary before the current iteration position.
Status:
Stable ICU 2.0.
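
A typical way to combine first(), last(), and previous() is a boundary walk in either direction. The sketch below uses the JDK's java.text.BreakIterator as a stand-in, since it exposes the same first/next/last/previous/DONE protocol and needs no dictionary data; the class name BoundaryWalkSketch and the sample text are assumptions. With a DictionaryBasedBreakIterator the loops would have the same shape, but the boundaries returned would depend on its rules and dictionary.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in sketch using the JDK BreakIterator, not the ICU class.
final class BoundaryWalkSketch {
    /** Collects boundaries from the beginning of the text to the end. */
    static List<Integer> forward(BreakIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    /** Collects boundaries from the end of the text back to the beginning. */
    static List<Integer> backward(BreakIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int b = it.last(); b != BreakIterator.DONE; b = it.previous()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText("Hello world");
        System.out.println(forward(it));  // starts at 0 and ends at the text length
        System.out.println(backward(it)); // the same offsets in reverse order
    }
}
```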

preceding

public int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.

Overrides:
preceding in class RuleBasedBreakIterator_Old
Parameters:
offset - The position to begin searching from
Returns:
The position of the last boundary before "offset"
Status:
Stable ICU 2.0.

following

public int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.

Overrides:
following in class RuleBasedBreakIterator_Old
Parameters:
offset - The position to begin searching forward from
Returns:
The position of the first boundary after "offset"
Status:
Stable ICU 2.0.
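
preceding() and following() are often used together to find the segment that contains a given offset. The sketch below again uses the JDK's java.text.BreakIterator as a stand-in for the ICU class, because both define these two methods with the same meaning; the class name NearestBoundarySketch and the sample text are assumptions for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Stand-in sketch using the JDK BreakIterator, not the ICU class.
final class NearestBoundarySketch {
    /** Returns {last boundary before offset, first boundary after offset}. */
    static int[] around(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int before = it.preceding(offset); // also moves the iteration position
        int after = it.following(offset);  // moves it again, to the later boundary
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        // Offset 8 falls inside "world", which spans offsets 6 to 11.
        int[] span = around("Hello world", 8);
        System.out.println(span[0] + ".." + span[1]); // prints 6..11
    }
}
```

Both calls change the current iteration position, so code that needs the old position afterwards should save current() first.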

handleNext

protected int handleNext()
This is the implementation function for next().

Overrides:
handleNext in class RuleBasedBreakIterator_Old
Status:
Internal. This API is ICU internal only.

lookupCategory

protected int lookupCategory(char c)
Looks up a character category for a character.

Overrides:
lookupCategory in class RuleBasedBreakIterator_Old
Status:
Internal. This API is ICU internal only.


Copyright (c) 2006 IBM Corporation and others.