Class JapaneseIterationMarkCharFilter

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, java.lang.Readable

    public class JapaneseIterationMarkCharFilter
    extends CharFilter
    Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.

    Sequences of iteration marks are supported. If an illegal sequence of iteration marks is encountered, the implementation emits the illegal source character as is, without considering its script. For example, the input "?ゝ" yields "??" even though the question mark is not hiragana.

    Note that a full stop punctuation character "。" (U+3002) cannot be iterated (see below). Iteration marks themselves are emitted unchanged when they are illegal, i.e. when they refer back past the beginning of the character stream.

    The implementation buffers input until a full stop punctuation character (U+3002) or EOF is reached, so that it does not have to keep a copy of the entire character stream in memory. Vertical iteration marks, which are even rarer than horizontal iteration marks in contemporary Japanese, are unsupported.
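
    The expansion rule described above can be illustrated with a minimal, standalone sketch. It is not the class's actual implementation: the class name IterationMarkExpansionSketch and its expand method are hypothetical, it works on a whole String instead of a buffered Reader, and it leaves out voicing, the full stop boundary, and overlapping spans. Each mark in a run of n marks copies the character n positions before it, so 時々 becomes 時時, and a mark that would refer back past the start of the input is emitted unchanged.

```java
final class IterationMarkExpansionSketch {
    // Horizontal iteration marks, given as Unicode code points.
    static final char KANJI_MARK = '\u3005';
    static final char HIRAGANA_MARK = '\u309D';
    static final char HIRAGANA_VOICED_MARK = '\u309E';
    static final char KATAKANA_MARK = '\u30FD';
    static final char KATAKANA_VOICED_MARK = '\u30FE';

    static boolean isIterationMark(char c) {
        return c == KANJI_MARK
            || c == HIRAGANA_MARK || c == HIRAGANA_VOICED_MARK
            || c == KATAKANA_MARK || c == KATAKANA_VOICED_MARK;
    }

    /** Hypothetical helper: expands runs of iteration marks in a string. */
    static String expand(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (!isIterationMark(c)) {
                out.append(c);
                i++;
                continue;
            }
            // Measure the run (span) of consecutive iteration marks.
            int start = i;
            while (i < input.length() && isIterationMark(input.charAt(i))) {
                i++;
            }
            int span = i - start;
            // Each mark copies the character `span` positions before it.
            // A mark that points before the start of the input is kept as is.
            for (int p = start; p < i; p++) {
                int source = p - span;
                out.append(source >= 0 ? input.charAt(source) : input.charAt(p));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+6642 followed by the kanji iteration mark U+3005.
        System.out.println(expand("\u6642\u3005"));
    }
}
```

    Because the source character is copied without a script check, the sketch also reproduces the "?ゝ" case from the description.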

    • Field Detail

      • NORMALIZE_KANJI_DEFAULT

        public static final boolean NORMALIZE_KANJI_DEFAULT
        Normalize kanji iteration marks by default
        See Also:
        Constant Field Values
      • NORMALIZE_KANA_DEFAULT

        public static final boolean NORMALIZE_KANA_DEFAULT
        Normalize kana iteration marks by default
        See Also:
        Constant Field Values
      • HIRAGANA_ITERATION_MARK

        private static final char HIRAGANA_ITERATION_MARK
        See Also:
        Constant Field Values
      • HIRAGANA_VOICED_ITERATION_MARK

        private static final char HIRAGANA_VOICED_ITERATION_MARK
        See Also:
        Constant Field Values
      • KATAKANA_ITERATION_MARK

        private static final char KATAKANA_ITERATION_MARK
        See Also:
        Constant Field Values
      • KATAKANA_VOICED_ITERATION_MARK

        private static final char KATAKANA_VOICED_ITERATION_MARK
        See Also:
        Constant Field Values
      • h2d

        private static char[] h2d
      • k2d

        private static char[] k2d
      • bufferPosition

        private int bufferPosition
      • iterationMarksSpanSize

        private int iterationMarksSpanSize
      • iterationMarkSpanEndPosition

        private int iterationMarkSpanEndPosition
      • normalizeKanji

        private boolean normalizeKanji
      • normalizeKana

        private boolean normalizeKana
    • Constructor Detail

      • JapaneseIterationMarkCharFilter

        public JapaneseIterationMarkCharFilter(java.io.Reader input)
        Constructor. Normalizes both kanji and kana iteration marks by default.
        Parameters:
        input - char stream
      • JapaneseIterationMarkCharFilter

        public JapaneseIterationMarkCharFilter(java.io.Reader input,
                                               boolean normalizeKanji,
                                               boolean normalizeKana)
        Constructor
        Parameters:
        input - char stream
        normalizeKanji - indicates whether kanji iteration marks should be normalized
        normalizeKana - indicates whether kana iteration marks should be normalized
    • Method Detail

      • read

        public int read(char[] buffer,
                        int offset,
                        int length)
                 throws java.io.IOException
        Specified by:
        read in class java.io.Reader
        Throws:
        java.io.IOException
      • read

        public int read()
                 throws java.io.IOException
        Overrides:
        read in class java.io.Reader
        Throws:
        java.io.IOException
      • normalizeIterationMark

        private char normalizeIterationMark(char c)
                                     throws java.io.IOException
        Normalizes the iteration mark character c
        Parameters:
        c - iteration mark character to normalize
        Returns:
        normalized iteration mark
        Throws:
        java.io.IOException - If there is a low-level I/O error.
      • nextIterationMarkSpanSize

        private int nextIterationMarkSpanSize()
                                       throws java.io.IOException
        Finds the number of consecutive iteration marks starting at the current buffer position
        Returns:
        number of iteration marks starting at the current buffer position
        Throws:
        java.io.IOException - If there is a low-level I/O error.
      • sourceCharacter

        private char sourceCharacter(int position,
                                     int spanSize)
                              throws java.io.IOException
        Returns the source character for a given position and iteration mark span size
        Parameters:
        position - buffer position (should not exceed bufferPosition)
        spanSize - iteration mark span size
        Returns:
        source character
        Throws:
        java.io.IOException - If there is a low-level I/O error.
      • normalize

        private char normalize(char c,
                               char m)
        Normalize a character
        Parameters:
        c - character to normalize
        m - repetition mark referring to c
        Returns:
        normalized character, or c itself if the iteration mark is illegal
      • normalizedHiragana

        private char normalizedHiragana(char c,
                                        char m)
        Normalize hiragana character
        Parameters:
        c - hiragana character
        m - repetition mark referring to c
        Returns:
        normalized character, or c itself if the iteration mark is illegal
      • normalizedKatakana

        private char normalizedKatakana(char c,
                                        char m)
        Normalize katakana character
        Parameters:
        c - katakana character
        m - repetition mark referring to c
        Returns:
        normalized character, or c itself if the iteration mark is illegal
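
        As an illustration of how a kana character and its repetition mark might combine, here is a small standalone sketch for hiragana. The class KanaVoicingSketch, its VOICEABLE table, and the method names are hypothetical and not taken from this class. It assumes that the plain mark ゝ (U+309D) yields the unvoiced form, that the voiced mark ゞ (U+309E) yields the dakuten form, and it relies on the Unicode layout in which a voiced hiragana sits one code point after its unvoiced counterpart.

```java
final class KanaVoicingSketch {
    static final char HIRAGANA_MARK = '\u309D';        // plain iteration mark
    static final char HIRAGANA_VOICED_MARK = '\u309E'; // voiced iteration mark

    // Unvoiced hiragana whose dakuten form is the next code point:
    // ka ki ku ke ko, sa shi su se so, ta chi tsu te to, ha hi fu he ho.
    static final String VOICEABLE =
        "\u304B\u304D\u304F\u3051\u3053"
      + "\u3055\u3057\u3059\u305B\u305D"
      + "\u305F\u3061\u3064\u3066\u3068"
      + "\u306F\u3072\u3075\u3078\u307B";

    /** Returns the dakuten form of c, or c itself if it has none. */
    static char voiced(char c) {
        return VOICEABLE.indexOf(c) >= 0 ? (char) (c + 1) : c;
    }

    /** True if c is the dakuten form of one of the VOICEABLE characters. */
    static boolean isVoiced(char c) {
        return c > 0 && VOICEABLE.indexOf((char) (c - 1)) >= 0;
    }

    /** Combines source character c with repetition mark m. */
    static char normalizeHiragana(char c, char m) {
        if (m == HIRAGANA_MARK) {
            // Plain mark: repeat the unvoiced form.
            return isVoiced(c) ? (char) (c - 1) : c;
        }
        if (m == HIRAGANA_VOICED_MARK) {
            // Voiced mark: repeat the dakuten form when one exists.
            return voiced(c);
        }
        // Not a hiragana iteration mark: return the source character.
        return c;
    }

    public static void main(String[] args) {
        // "ka" (U+304B) combined with the voiced mark.
        System.out.println(normalizeHiragana('\u304B', HIRAGANA_VOICED_MARK));
    }
}
```

        Under these assumptions, か with ゞ gives が, and が with ゝ gives か. A katakana variant would follow the same pattern with the katakana marks and code points.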
      • isIterationMark

        private boolean isIterationMark(char c)
        Iteration mark character predicate
        Parameters:
        c - character to test
        Returns:
        true if c is an iteration mark character. Otherwise false.
      • isHiraganaIterationMark

        private boolean isHiraganaIterationMark(char c)
        Hiragana iteration mark character predicate
        Parameters:
        c - character to test
        Returns:
        true if c is a hiragana iteration mark character. Otherwise false.
      • isKatakanaIterationMark

        private boolean isKatakanaIterationMark(char c)
        Katakana iteration mark character predicate
        Parameters:
        c - character to test
        Returns:
        true if c is a katakana iteration mark character. Otherwise false.
      • isKanjiIterationMark

        private boolean isKanjiIterationMark(char c)
        Kanji iteration mark character predicate
        Parameters:
        c - character to test
        Returns:
        true if c is a kanji iteration mark character. Otherwise false.
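
        The predicates above could be written roughly as follows. This is a sketch under assumptions, not the class's source: IterationMarkPredicates is a hypothetical name, and gating each predicate on the normalizeKanji and normalizeKana flags is one plausible way to make the constructor options take effect. The code points are the Unicode iteration marks 々 (U+3005), ゝ (U+309D), ゞ (U+309E), ヽ (U+30FD) and ヾ (U+30FE).

```java
final class IterationMarkPredicates {
    private final boolean normalizeKanji;
    private final boolean normalizeKana;

    IterationMarkPredicates(boolean normalizeKanji, boolean normalizeKana) {
        this.normalizeKanji = normalizeKanji;
        this.normalizeKana = normalizeKana;
    }

    /** Kanji iteration mark, recognized only when kanji normalization is on. */
    boolean isKanjiIterationMark(char c) {
        return normalizeKanji && c == '\u3005';
    }

    /** Plain or voiced hiragana mark, recognized only when kana normalization is on. */
    boolean isHiraganaIterationMark(char c) {
        return normalizeKana && (c == '\u309D' || c == '\u309E');
    }

    /** Plain or voiced katakana mark, recognized only when kana normalization is on. */
    boolean isKatakanaIterationMark(char c) {
        return normalizeKana && (c == '\u30FD' || c == '\u30FE');
    }

    /** True if c is any iteration mark that is enabled for normalization. */
    boolean isIterationMark(char c) {
        return isKanjiIterationMark(c)
            || isHiraganaIterationMark(c)
            || isKatakanaIterationMark(c);
    }

    public static void main(String[] args) {
        IterationMarkPredicates kanjiOnly = new IterationMarkPredicates(true, false);
        System.out.println(kanjiOnly.isIterationMark('\u3005')); // kanji mark
        System.out.println(kanjiOnly.isIterationMark('\u309D')); // hiragana mark
    }
}
```

        With this design, a disabled mark is simply treated as an ordinary character and passes through the filter untouched.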
      • lookupHiraganaDakuten

        private char lookupHiraganaDakuten(char c)
        Look up hiragana dakuten
        Parameters:
        c - character to look up
        Returns:
        hiragana dakuten variant of c or c itself if no dakuten variant exists
      • lookupKatakanaDakuten

        private char lookupKatakanaDakuten(char c)
        Look up katakana dakuten. Only full-width katakana are supported.
        Parameters:
        c - character to look up
        Returns:
        katakana dakuten variant of c or c itself if no dakuten variant exists
      • isHiraganaDakuten

        private boolean isHiraganaDakuten(char c)
        Hiragana dakuten predicate
        Parameters:
        c - character to check
        Returns:
        true if c is a hiragana dakuten and otherwise false
      • isKatakanaDakuten

        private boolean isKatakanaDakuten(char c)
        Katakana dakuten predicate
        Parameters:
        c - character to check
        Returns:
        true if c is a katakana dakuten and otherwise false
      • lookup

        private char lookup(char c,
                            char[] map,
                            char offset)
        Looks up a character in the dakuten map and returns its dakuten variant if one exists. Otherwise returns the character being looked up.
        Parameters:
        c - character to look up
        map - dakuten map
        offset - code point offset from c
        Returns:
        mapped character or c if no mapping exists
      • inside

        private boolean inside(char c,
                               char[] map,
                               char offset)
        Predicate indicating if the lookup character is within dakuten map range
        Parameters:
        c - character to look up
        map - dakuten map
        offset - code point offset from c
        Returns:
        true if c is mapped by map and otherwise false
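
        A table-driven lookup of this kind can be sketched as below. The sketch is illustrative only: DakutenLookupSketch, its K_ROW_MAP table and K_ROW_OFFSET constant are hypothetical, the table covers just the hiragana k-row, and it assumes that offset is the code point of the first character the map covers, so that c - offset is the index into the map.

```java
final class DakutenLookupSketch {
    // Code point of the first character covered by the map: hiragana "ka".
    static final char K_ROW_OFFSET = '\u304B';

    // Entry i holds the dakuten form of the character (K_ROW_OFFSET + i).
    // Already voiced characters map to themselves.
    static final char[] K_ROW_MAP = {
        '\u304C', '\u304C', // ka, ga -> ga
        '\u304E', '\u304E', // ki, gi -> gi
        '\u3050', '\u3050', // ku, gu -> gu
        '\u3052', '\u3052', // ke, ge -> ge
        '\u3054', '\u3054', // ko, go -> go
    };

    /** True if c falls inside the range covered by map. */
    static boolean inside(char c, char[] map, char offset) {
        return c >= offset && c < offset + map.length;
    }

    /** Returns the mapped character, or c itself if c is outside the map. */
    static char lookup(char c, char[] map, char offset) {
        return inside(c, map, offset) ? map[c - offset] : c;
    }

    public static void main(String[] args) {
        // Looks up hiragana "ki" (U+304D) in the k-row table.
        System.out.println(lookup('\u304D', K_ROW_MAP, K_ROW_OFFSET));
    }
}
```

        Indexing by c - offset keeps the lookup constant-time and lets the range check in inside double as the "no mapping exists" test.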
      • correct

        protected int correct(int currentOff)
        Description copied from class: CharFilter
        Subclasses override to correct the current offset.
        Specified by:
        correct in class CharFilter
        Parameters:
        currentOff - current offset
        Returns:
        corrected offset