Package org.apache.lucene.analysis.in
Class IndicNormalizer
java.lang.Object
  org.apache.lucene.analysis.in.IndicNormalizer
public class IndicNormalizer extends java.lang.Object
Normalizes the Unicode representation of text in Indian languages. Follows guidelines from Unicode 5.2, chapter 6, South Asian Scripts I and graphical decompositions from http://ldc.upenn.edu/myl/IndianScriptsUnicode.html
-
-
Nested Class Summary
private static class IndicNormalizer.ScriptData
-
Field Summary
private static int[][] decompositions
Decompositions according to Unicode 5.2 and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html, most of which are not handled by Unicode normalization anyway.
private static java.util.IdentityHashMap<java.lang.Character.UnicodeBlock,IndicNormalizer.ScriptData> scripts
-
Constructor Summary
IndicNormalizer()
-
Method Summary
private int compose(int ch0, java.lang.Character.UnicodeBlock block0, IndicNormalizer.ScriptData sd, char[] text, int pos, int len)
Compose into standard form any compositions in the decompositions table.
private static int flag(java.lang.Character.UnicodeBlock ub)
int normalize(char[] text, int len)
Normalizes input text, and returns the new length.
-
-
-
Field Detail
-
scripts
private static final java.util.IdentityHashMap<java.lang.Character.UnicodeBlock,IndicNormalizer.ScriptData> scripts
-
decompositions
private static final int[][] decompositions
Decompositions according to Unicode 5.2 and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html, most of which are not handled by Unicode normalization anyway. The numbers here represent offsets into the respective codepages, with -1 representing null and 0xFF representing the zero-width joiner. The columns are: ch1, ch2, ch3, res, flags. ch1, ch2, and ch3 are the decomposition, res is the composition, and flags are the scripts to which the row applies.
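The row layout described above can be illustrated with a small standalone sketch. This is not the class's actual compose implementation, and the table contents are not shown on this page. The class name DecompositionRowSketch, the helper names, the single-bit script flag, and the sample row are all assumptions made for illustration. The sketch only shows how a row of codepage offsets (with -1 as null and 0xFF as zero-width joiner) could be matched against a buffer and collapsed into the composed character.

```java
final class DecompositionRowSketch {
    // U+200D ZERO WIDTH JOINER, which the table encodes as 0xFF.
    static final int ZWJ = 0x200D;

    /** Converts one table cell to a char value; -1 means "no character". */
    static int cellToChar(int cell, int base) {
        if (cell == -1) return -1;
        if (cell == 0xFF) return ZWJ;
        return base + cell; // offset into the script's codepage
    }

    /**
     * Tries to match one row {ch1, ch2, ch3, res, flags} at text[pos].
     * On a match, writes the composed character at pos, shifts the tail
     * left, and returns the new length; otherwise returns len unchanged.
     * scriptFlag is a hypothetical bit identifying the current script.
     */
    static int applyRow(int[] row, int base, int scriptFlag,
                        char[] text, int pos, int len) {
        if ((row[4] & scriptFlag) == 0) return len; // row not for this script
        int consumed = 0;
        for (int i = 0; i < 3; i++) {
            int expected = cellToChar(row[i], base);
            if (expected == -1) break; // null cell ends the decomposition
            if (pos + i >= len || text[pos + i] != expected) return len;
            consumed++;
        }
        if (consumed == 0) return len;
        text[pos] = (char) (base + row[3]); // res column: the composition
        System.arraycopy(text, pos + consumed, text, pos + 1,
                         len - (pos + consumed));
        return len - (consumed - 1);
    }
}
```

With an assumed Devanagari base of 0x0900 and an illustrative row {0x05, 0x3E, 0x45, 0x11, 1}, the three-character sequence U+0905 U+093E U+0945 collapses to the single character U+0911, and the returned length shrinks by two.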
-
-
Method Detail
-
flag
private static int flag(java.lang.Character.UnicodeBlock ub)
-
normalize
public int normalize(char[] text, int len)
Normalizes input text, and returns the new length. The length will always be less than or equal to the existing length.
Parameters:
text - input text
len - valid length
Returns:
normalized length
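Because normalize works in place and returns a new length, callers must treat only the first returned-length chars of the buffer as valid. The following is a minimal sketch of that calling pattern. The interface InPlaceNormalizer, the class NormalizeContractSketch, and the toy dropZwj method are hypothetical stand-ins so the example runs without Lucene on the classpath; dropZwj does not reproduce what IndicNormalizer does.

```java
final class NormalizeContractSketch {
    /** Hypothetical stand-in matching the normalize(char[], int) signature. */
    interface InPlaceNormalizer {
        int normalize(char[] text, int len);
    }

    /** Copies the input into a buffer, normalizes it, and keeps the valid prefix. */
    static String apply(InPlaceNormalizer normalizer, String input) {
        char[] buffer = input.toCharArray();
        int newLen = normalizer.normalize(buffer, buffer.length);
        // Chars at index newLen and beyond are stale and must be ignored.
        return new String(buffer, 0, newLen);
    }

    /** Toy normalizer (assumption, not Lucene's logic): removes U+200D in place. */
    static int dropZwj(char[] text, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            if (text[i] != 0x200D) {
                text[out++] = text[i];
            }
        }
        return out; // never greater than len
    }
}
```

With Lucene available, the same helper could presumably be called as NormalizeContractSketch.apply(new IndicNormalizer()::normalize, input), since the class exposes a public no-argument constructor and a public normalize(char[], int) method.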
-
compose
private int compose(int ch0, java.lang.Character.UnicodeBlock block0, IndicNormalizer.ScriptData sd, char[] text, int pos, int len)
Compose into standard form any compositions in the decompositions table.
-
-