Class ScriptIterator


  • final class ScriptIterator
    extends java.lang.Object
    An iterator that locates ISO 15924 script boundaries in text.

    This is not the same as simply looking at the Unicode block, or even the Script property. Some characters are 'common' across multiple scripts, and some 'inherit' the script value of text surrounding them.
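    A minimal sketch of why the raw Script property is not enough on its own. It uses the JDK's java.lang.Character.UnicodeScript as a stand-in for ICU's UScript, which is an assumption made so the example needs no extra dependency; the class and method names are hypothetical.

```java
import java.lang.Character.UnicodeScript;

/** Illustrative helper; not part of the documented class. */
final class ScriptPropertyDemo {

  /** Returns the raw Script property value of every code point in s. */
  static UnicodeScript[] rawScripts(String s) {
    return s.codePoints()
        .mapToObj(UnicodeScript::of)
        .toArray(UnicodeScript[]::new);
  }
}
```

    For the input "a 1\u0301", this yields LATIN, COMMON, COMMON, INHERITED: the space and digit are shared across scripts, and the combining acute accent has no script of its own, so a run-based resolution step is still needed.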

    This is similar to ICU's (internal-only) UScriptRun, with the following differences:

    • Does not attempt to match paired punctuation. This is unnecessary for tokenization purposes, and it is also quite expensive.
    • Non-spacing marks inherit the script of their base character, following recommendations from UTR #24.
    • Constructor Summary

      Constructors 
      Constructor Description
      ScriptIterator​(boolean combineCJ)  
    • Method Summary

      Modifier and Type Method Description
      private int getScript​(int codepoint)
      Fast version of UScript.getScript().
      (package private) int getScriptCode()
      Get the UScript script code for this script run
      (package private) int getScriptLimit()
      Get the index of the first character after the end of this script run
      (package private) int getScriptStart()
      Get the start of this script run
      private static boolean isSameScript​(int scriptOne, int scriptTwo)
      Determine if two scripts are compatible.
      (package private) boolean next()
      Iterates to the next script run, returning true if one exists.
      (package private) void setText​(char[] text, int start, int length)
      Set a new region of text to be examined by this iterator
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • text

        private char[] text
      • start

        private int start
      • limit

        private int limit
      • index

        private int index
      • scriptStart

        private int scriptStart
      • scriptLimit

        private int scriptLimit
      • scriptCode

        private int scriptCode
      • combineCJ

        private final boolean combineCJ
      • basicLatin

        private static final int[] basicLatin
        Linear fast path for the Basic Latin case.
    • Constructor Detail

      • ScriptIterator

        ScriptIterator​(boolean combineCJ)
        Parameters:
        combineCJ - if true, Han, Hiragana, and Katakana are all reported as UScript.JAPANESE
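        The effect of combineCJ can be sketched roughly as follows. This is an illustration under stated assumptions, not the class's actual code: it uses the JDK's Character.UnicodeScript, which has no JAPANESE value, so a hypothetical string label stands in for UScript.JAPANESE.

```java
import java.lang.Character.UnicodeScript;

/** Illustrative sketch; class, method, and label names are hypothetical. */
final class CombineCjSketch {

  /** Stand-in for UScript.JAPANESE, which the JDK enum does not define. */
  static final String JAPANESE = "JAPANESE";

  /** Returns a script label, folding Han, Hiragana, and Katakana together when asked. */
  static String scriptLabel(int codePoint, boolean combineCJ) {
    UnicodeScript sc = UnicodeScript.of(codePoint);
    boolean cj = sc == UnicodeScript.HAN
        || sc == UnicodeScript.HIRAGANA
        || sc == UnicodeScript.KATAKANA;
    return (combineCJ && cj) ? JAPANESE : sc.name();
  }
}
```

        With combineCJ set, a mixed run of kanji and kana resolves to a single label and is therefore not split at every Han/Hiragana/Katakana transition.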
    • Method Detail

      • getScriptStart

        int getScriptStart()
        Get the start of this script run
        Returns:
        start position of script run
      • getScriptLimit

        int getScriptLimit()
        Get the index of the first character after the end of this script run
        Returns:
        position of the first character after this script run
      • getScriptCode

        int getScriptCode()
        Get the UScript script code for this script run
        Returns:
        code for the script of the current run
      • next

        boolean next()
        Iterates to the next script run, returning true if one exists.
        Returns:
        true if there is another script run, false otherwise.
      • isSameScript

        private static boolean isSameScript​(int scriptOne,
                                            int scriptTwo)
        Determine if two scripts are compatible.
      • setText

        void setText​(char[] text,
                     int start,
                     int length)
        Set a new region of text to be examined by this iterator
        Parameters:
        text - text buffer to examine
        start - offset into buffer
        length - maximum length to examine
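        Because ScriptIterator is package-private, it cannot be called from application code. The stand-in below mirrors the documented setText/next/getScriptStart/getScriptLimit protocol so that the calling pattern and the run-resolution rules described in the class overview can be tried out. It is a sketch under assumptions: it uses the JDK's Character.UnicodeScript instead of ICU's UScript, it assumes offsets are absolute positions in the buffer, and the class name and getRunScript() (standing in for getScriptCode()) are hypothetical.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical stand-in; not the documented implementation. */
final class ScriptRunSketch {
  private char[] text;
  private int limit;
  private int index;
  private int scriptStart;
  private int scriptLimit;
  private UnicodeScript script = UnicodeScript.UNKNOWN;

  /** Examine text[start, start + length). Assumption: offsets are absolute. */
  void setText(char[] text, int start, int length) {
    this.text = text;
    this.index = start;
    this.limit = start + length;
    this.scriptStart = start;
    this.scriptLimit = start;
    this.script = UnicodeScript.UNKNOWN;
  }

  /** Advances to the next run; returns false when the region is exhausted. */
  boolean next() {
    if (scriptLimit >= limit) {
      return false;
    }
    script = UnicodeScript.COMMON;
    scriptStart = scriptLimit;
    while (index < limit) {
      int cp = Character.codePointAt(text, index, limit);
      UnicodeScript sc = UnicodeScript.of(cp);
      // A non-spacing mark stays with its base character, whatever its script.
      boolean mark = Character.getType(cp) == Character.NON_SPACING_MARK;
      if (!isSameScript(script, sc) && !mark) {
        break;
      }
      index += Character.charCount(cp);
      // COMMON/INHERITED text takes on the first concrete script seen in the run.
      if (isNeutral(script) && !isNeutral(sc)) {
        script = sc;
      }
    }
    scriptLimit = index;
    return true;
  }

  int getScriptStart() { return scriptStart; }

  int getScriptLimit() { return scriptLimit; }

  UnicodeScript getRunScript() { return script; }

  private static boolean isNeutral(UnicodeScript s) {
    return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
  }

  /** Two scripts are compatible if either is neutral or they are equal. */
  private static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
    return isNeutral(a) || isNeutral(b) || a == b;
  }
}
```

        A caller would typically invoke setText once per buffer region and then loop with while (it.next()), reading the start, limit, and script of each run inside the loop. In this sketch, trailing spaces and punctuation attach to the preceding run because they are compatible with any script.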
      • getScript

        private int getScript​(int codepoint)
        Fast version of UScript.getScript(). Basic Latin is handled with an array lookup.
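        The array-lookup fast path might look roughly like the sketch below. It is an illustration only: the table is built with the JDK's Character.UnicodeScript rather than ICU's UScript, the 128-entry size (the Basic Latin block, U+0000 to U+007F) is an assumption about the table's extent, and the class name is hypothetical.

```java
import java.lang.Character.UnicodeScript;

/** Illustrative sketch of a table-backed script lookup. */
final class BasicLatinLookupSketch {

  /** Precomputed scripts for U+0000..U+007F (assumed table size). */
  private static final UnicodeScript[] BASIC_LATIN = new UnicodeScript[128];

  static {
    for (int i = 0; i < BASIC_LATIN.length; i++) {
      BASIC_LATIN[i] = UnicodeScript.of(i);
    }
  }

  /** Array lookup for Basic Latin, general property lookup otherwise. */
  static UnicodeScript getScript(int codePoint) {
    if (0 <= codePoint && codePoint < BASIC_LATIN.length) {
      return BASIC_LATIN[codePoint];
    }
    return UnicodeScript.of(codePoint);
  }
}
```

        The table trades a small fixed amount of memory for skipping the general property lookup on the most common characters in Latin-script text.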