#include <normlzr.h>
[Figure: inheritance diagram for Normalizer]
Public Types
enum { DONE = 0xffff }
If DONE is returned from an iteration function that returns a code point, then there are no more normalization results available.
Public Methods
Normalizer (const UnicodeString &str, UNormalizationMode mode)
Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer (const UChar *str, int32_t length, UNormalizationMode mode)
Creates a new Normalizer object for iterating over the normalized form of a given string.
Normalizer (const CharacterIterator &iter, UNormalizationMode mode)
Creates a new Normalizer object for iterating over the normalized form of the given text.
Normalizer (const Normalizer &copy)
Copy constructor.
~Normalizer ()
Destructor.
UChar32 current (void)
Return the current character in the normalized text.
UChar32 first (void)
Return the first character in the normalized text.
UChar32 last (void)
Return the last character in the normalized text.
UChar32 next (void)
Return the next character in the normalized text.
UChar32 previous (void)
Return the previous character in the normalized text and decrement.
void setIndexOnly (int32_t index)
Set the iteration position in the input text that is being normalized, without any immediate normalization.
void reset (void)
Reset the index to the beginning of the text.
int32_t getIndex (void) const
Retrieve the current iteration position in the input text that is being normalized.
int32_t startIndex (void) const
Retrieve the index of the start of the input text.
int32_t endIndex (void) const
Retrieve the index of the end of the input text.
UBool operator== (const Normalizer &that) const
Returns TRUE when both iterators refer to the same character in the same input text.
UBool operator!= (const Normalizer &that) const
Returns FALSE when both iterators refer to the same character in the same input text.
Normalizer * clone (void) const
Returns a pointer to a new Normalizer that is a clone of this one.
int32_t hashCode (void) const
Generates a hash code for this iterator.
void setMode (UNormalizationMode newMode)
Set the normalization mode for this object.
UNormalizationMode getUMode (void) const
Return the normalization mode for this object.
void setOption (int32_t option, UBool value)
Set options that affect this Normalizer's operation.
UBool getOption (int32_t option) const
Determine whether an option is turned on or off.
void setText (const UnicodeString &newText, UErrorCode &status)
Set the input text over which this Normalizer will iterate.
void setText (const CharacterIterator &newText, UErrorCode &status)
Set the input text over which this Normalizer will iterate.
void setText (const UChar *newText, int32_t length, UErrorCode &status)
Set the input text over which this Normalizer will iterate.
void getText (UnicodeString &result)
Copies the input text into the UnicodeString argument.
virtual UClassID getDynamicClassID () const
ICU "poor man's RTTI", returns a UClassID for the actual class.
Static Public Methods
void normalize (const UnicodeString &source, UNormalizationMode mode, int32_t options, UnicodeString &result, UErrorCode &status)
Normalizes a UnicodeString according to the specified normalization mode.
void compose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status)
Compose a UnicodeString.
void decompose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status)
Static method to decompose a UnicodeString.
UNormalizationCheckResult quickCheck (const UnicodeString &source, UNormalizationMode mode, UErrorCode &status)
Performs a quick check on a string to determine whether it is in a particular normalization form.
UNormalizationCheckResult quickCheck (const UnicodeString &source, UNormalizationMode mode, int32_t options, UErrorCode &status)
Performs a quick check on a string; same as the other version of quickCheck but takes an extra options parameter like most normalization functions.
UBool isNormalized (const UnicodeString &src, UNormalizationMode mode, UErrorCode &errorCode)
Test if a string is in a given normalization form.
UBool isNormalized (const UnicodeString &src, UNormalizationMode mode, int32_t options, UErrorCode &errorCode)
Test if a string is in a given normalization form; same as the other version of isNormalized but takes an extra options parameter like most normalization functions.
UnicodeString & concatenate (UnicodeString &left, UnicodeString &right, UnicodeString &result, UNormalizationMode mode, int32_t options, UErrorCode &errorCode)
Concatenate normalized strings, making sure that the result is normalized as well.
int32_t compare (const UnicodeString &s1, const UnicodeString &s2, uint32_t options, UErrorCode &errorCode)
Compare two strings for canonical equivalence.
UClassID getStaticClassID ()
ICU "poor man's RTTI", returns a UClassID for this class.
Private Methods
Normalizer ()
Normalizer & operator= (const Normalizer &that)
UBool nextNormalize ()
UBool previousNormalize ()
void init (CharacterIterator *iter)
void clearBuffer (void)
Private Attributes
UNormalizationMode fUMode
int32_t fOptions
UCharIterator * text
int32_t currentIndex
int32_t nextIndex
UnicodeString buffer
int32_t bufferPos
The Normalizer class consists of two parts:
- The static functions are basically wrappers around the C implementation, using UnicodeString instead of UChar*. For basic information about normalization forms and details about the C API, please see the documentation in unorm.h.
- The iterator API, with the Normalizer constructors and the non-static functions, uses a CharacterIterator as input. It is possible to pass a string, which is then internally wrapped in a CharacterIterator. The input text is not normalized all at once, but incrementally where needed (providing efficient random access). This allows passing in a large text while spending only a small amount of time normalizing a small part of it. However, if the entire text is normalized, then the iterator will be slower than normalizing the entire text at once and iterating over the result. Another possible use of the Normalizer iterator is to report an index into the original text that is close to where the normalized characters come from.
Important: The iterator API was cleaned up significantly for ICU 2.0. The earlier implementation reported the getIndex() inconsistently, and previous() could not be used after setIndex(), next(), first(), and current().
Normalizer allows normalization to start anywhere in the input text by calling setIndexOnly(), first(), or last(). If none of these is called, the iterator starts at the beginning of the text.
At any time, next() returns the next normalized code point (UChar32), with post-increment semantics (like CharacterIterator::next32PostInc()). previous() returns the previous normalized code point (UChar32), with pre-decrement semantics (like CharacterIterator::previous32()).
current() returns the current code point (or the one at a newly set index) without moving the getIndex(). Note that if the text at the current position needs to be normalized, then these functions will do that. (This is why current() is not const.) It is more efficient to call setIndexOnly() instead, which does not normalize.
getIndex() always refers to the position in the input text where the normalized code points are returned from. It does not always change with each returned code point. The code point that is returned from any of the functions corresponds to text at or after getIndex(), according to the function's iteration semantics (post-increment or pre-decrement).
next() returns a code point from at or after the getIndex() from before the next() call. After the next() call, the getIndex() might have moved to where the next code point will be returned from (from a next() or current() call). This is semantically equivalent to array access with array[index++] (post-increment semantics).
previous() returns a code point from at or after the getIndex() from after the previous() call. This is semantically equivalent to array access with array[--index] (pre-decrement semantics).
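The two iteration semantics can be illustrated with a small stand-alone model. This is only a sketch, not ICU code: ToyCursor is a hypothetical class that walks a vector of already-normalized code points, so its index is a plain array index rather than a position in un-normalized input text.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical toy cursor over already-normalized code points (not ICU).
class ToyCursor {
public:
    enum { DONE = 0xffff };  // same sentinel value as Normalizer::DONE

    explicit ToyCursor(std::vector<std::int32_t> codePoints)
        : codePoints_(std::move(codePoints)), index_(0) {}

    // Post-increment semantics: behaves like array[index++].
    std::int32_t next() {
        if (index_ >= codePoints_.size()) {
            return DONE;  // already at the end
        }
        return codePoints_[index_++];
    }

    // Pre-decrement semantics: behaves like array[--index].
    std::int32_t previous() {
        if (index_ == 0) {
            return DONE;  // already at the beginning
        }
        return codePoints_[--index_];
    }

    std::size_t getIndex() const { return index_; }

private:
    std::vector<std::int32_t> codePoints_;
    std::size_t index_;
};
```

In this model a next() immediately followed by previous() returns the same code point twice and leaves the index where it started, which is the behavior the array[index++] and array[--index] analogy predicts.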
Internally, the Normalizer iterator normalizes a small piece of text starting at the getIndex() and ending at a following "safe" index. The normalized result is stored in an internal string buffer, and the code points are iterated from there. With multiple iteration calls, this is repeated until the next piece of text needs to be normalized, and the getIndex() needs to be moved.
The following "safe" index, the internal buffer, and the secondary iteration index into that buffer are not exposed on the API. This also means that it is currently not practical to return to a particular, arbitrary position in the text because one would need to know, and be able to set, in addition to the getIndex(), at least also the current index into the internal buffer. It is currently only possible to observe when getIndex() changes (with careful consideration of the iteration semantics), at which time the internal index will be 0. For example, if getIndex() is different after next() than before it, then the internal index is 0 and one can return to this getIndex() later with setIndexOnly().
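The buffering described above can be modeled in a few lines. The sketch below is a toy, not ICU's implementation: ToySegment and ToyChunkedIterator are hypothetical names, and each segment carries a precomputed, non-empty result in place of real normalization. The member names echo the private attributes currentIndex, nextIndex, buffer and bufferPos, but the way they are updated here is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One "safe" piece of input: how many input units it spans and what it
// normalizes to. Hypothetical stand-in for real normalization.
struct ToySegment {
    int inputLength;
    std::vector<std::int32_t> normalized;
};

class ToyChunkedIterator {
public:
    enum { DONE = 0xffff };

    explicit ToyChunkedIterator(std::vector<ToySegment> segments)
        : segments_(std::move(segments)) {}

    // While buffered code points remain, report the start of the buffered
    // piece; once the buffer is used up, report the following safe index.
    int getIndex() const {
        return bufferPos_ < buffer_.size() ? currentIndex_ : nextIndex_;
    }

    // Post-increment iteration over the buffer, refilling it when empty.
    std::int32_t next() {
        if (bufferPos_ >= buffer_.size() && !fillBuffer()) {
            return DONE;
        }
        return buffer_[bufferPos_++];
    }

private:
    // "Normalize" the next piece into the buffer; false at end of input.
    bool fillBuffer() {
        buffer_.clear();
        bufferPos_ = 0;
        currentIndex_ = nextIndex_;
        if (segment_ >= segments_.size()) {
            return false;
        }
        buffer_ = segments_[segment_].normalized;
        nextIndex_ += segments_[segment_].inputLength;
        ++segment_;
        return !buffer_.empty();
    }

    std::vector<ToySegment> segments_;
    std::size_t segment_ = 0;            // next segment to "normalize"
    int currentIndex_ = 0;               // input index where the buffer starts
    int nextIndex_ = 0;                  // following "safe" input index
    std::vector<std::int32_t> buffer_;   // normalized piece
    std::size_t bufferPos_ = 0;          // secondary index into buffer_
};
```

With three one-unit segments, where the middle one stands for U+00C5 decomposing to U+0041 U+030A, the first next() inside the middle segment leaves getIndex() at 1 and the second moves it to 2. Only the second call changes getIndex(), so only then is the buffer position back at 0 and the index safe to return to later.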
Definition at line 115 of file normlzr.h.
enum { DONE = 0xffff }
If DONE is returned from an iteration function that returns a code point, then there are no more normalization results available.
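Normalizer (const UnicodeString &str, UNormalizationMode mode)
Creates a new Normalizer object for iterating over the normalized form of a given string.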
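Normalizer (const UChar *str, int32_t length, UNormalizationMode mode)
Creates a new Normalizer object for iterating over the normalized form of a given string.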
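Normalizer (const CharacterIterator &iter, UNormalizationMode mode)
Creates a new Normalizer object for iterating over the normalized form of the given text.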
Normalizer (const Normalizer &copy)
Copy constructor.
~Normalizer ()
Destructor.
Normalizer * clone (void) const
Returns a pointer to a new Normalizer that is a clone of this one. The caller is responsible for deleting the new clone.
int32_t compare (const UnicodeString &s1, const UnicodeString &s2, uint32_t options, UErrorCode &errorCode)
Compare two strings for canonical equivalence. Further options include case-insensitive comparison and code point order (as opposed to code unit order).
Canonical equivalence between two strings is defined as their normalized forms (NFD or NFC) being identical. This function compares strings incrementally instead of normalizing (and optionally case-folding) both strings entirely, improving performance significantly.
Bulk normalization is only necessary if the strings do not fulfill the FCD conditions. Only in this case, and only if the strings are relatively long, is memory allocated temporarily. For FCD strings and short non-FCD strings there is no memory allocation.
Semantically, this is equivalent to strcmp[CodePointOrder](NFD(foldCase(s1)), NFD(foldCase(s2))) where code point order and foldCase are all optional.
UAX 21 2.5 Caseless Matching specifies that for a canonical caseless match the case folding must be performed first, then the normalization.
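void compose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status)
Compose a UnicodeString. This is equivalent to normalize() with mode UNORM_NFC or UNORM_NFKC. This is a wrapper for unorm_normalize(), using UnicodeStrings.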
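UnicodeString & concatenate (UnicodeString &left, UnicodeString &right, UnicodeString &result, UNormalizationMode mode, int32_t options, UErrorCode &errorCode)
Concatenate normalized strings, making sure that the result is normalized as well. If both the left and the right strings are in the normalization form according to "mode/options", then the result will be
    dest=normalize(left+right, mode, options)
For details see unorm_concatenate in unorm.h.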
UChar32 current (void)
Return the current character in the normalized text. current() may need to normalize some text at getIndex(). The getIndex() is not changed.
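void decompose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status)
Static method to decompose a UnicodeString. This is equivalent to normalize() with mode UNORM_NFD or UNORM_NFKD. This is a wrapper for unorm_normalize(), using UnicodeStrings.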
int32_t endIndex (void) const
Retrieve the index of the end of the input text.
UChar32 first (void)
Return the first character in the normalized text. This is equivalent to setIndexOnly(startIndex()) followed by next(). (Post-increment semantics.)
virtual UClassID getDynamicClassID () const
ICU "poor man's RTTI", returns a UClassID for the actual class.
Reimplemented from UObject.
int32_t getIndex (void) const
Retrieve the current iteration position in the input text that is being normalized. A following call to next() will return a normalized code point from the input text at or after this index. After a call to previous(), getIndex() will point at or before the position in the input text where the normalized code point was returned from with previous().
UBool getOption (int32_t option) const
Determine whether an option is turned on or off. If multiple options are specified, then the result is TRUE if any of them are set.
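The "TRUE if any of them are set" rule is what a bitmask test gives naturally. The following is a minimal sketch under that assumption, not the ICU source: ToyOptions, TOY_OPTION_A and TOY_OPTION_B are hypothetical names, and the flag values are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical single-bit option flags (not real ICU option values).
const std::int32_t TOY_OPTION_A = 1 << 0;
const std::int32_t TOY_OPTION_B = 1 << 1;

// Toy holder for an options word, modeled on setOption()/getOption().
class ToyOptions {
public:
    // Turn every bit in `option` on (value == true) or off (value == false).
    void setOption(std::int32_t option, bool value) {
        if (value) {
            options_ |= option;
        } else {
            options_ &= ~option;
        }
    }

    // True if at least one of the bits in `option` is currently set.
    bool getOption(std::int32_t option) const {
        return (options_ & option) != 0;
    }

private:
    std::int32_t options_ = 0;
};
```

Passing TOY_OPTION_A | TOY_OPTION_B to getOption() in this model reports true as soon as either flag is on, so a caller who needs to know that both are set has to query them one at a time.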
UClassID getStaticClassID ()
ICU "poor man's RTTI", returns a UClassID for this class.
void getText (UnicodeString &result)
Copies the input text into the UnicodeString argument.
UNormalizationMode getUMode (void) const
Return the normalization mode for this object. This is an unusual name because there used to be a getMode() that returned a different type.
int32_t hashCode (void) const
Generates a hash code for this iterator.
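UBool isNormalized (const UnicodeString &src, UNormalizationMode mode, int32_t options, UErrorCode &errorCode)
Test if a string is in a given normalization form; same as the other version of isNormalized but takes an extra options parameter like most normalization functions.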
UBool isNormalized (const UnicodeString &src, UNormalizationMode mode, UErrorCode &errorCode)
Test if a string is in a given normalization form. This is semantically equivalent to source.equals(normalize(source, mode)). Unlike unorm_quickCheck(), this function returns a definitive result, never a "maybe". For NFD, NFKD, and FCD, both functions work exactly the same. For NFC and NFKC where quickCheck may return "maybe", this function will perform further tests to arrive at a TRUE/FALSE result.
UChar32 last (void)
Return the last character in the normalized text. This is equivalent to setIndexOnly(endIndex()) followed by previous(). (Pre-decrement semantics.)
UChar32 next (void)
Return the next character in the normalized text. (Post-increment semantics.) If the end of the text has already been reached, DONE is returned. The DONE value could be confused with a U+FFFF non-character code point in the text. If this is possible, you can test getIndex()<endIndex() before calling next(), or (getIndex()<endIndex() || last()!=DONE) after calling next(). (Calling last() will change the iterator state!) The C API unorm_next() is more efficient and does not have this ambiguity.
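The DONE ambiguity and the index-based guard can be shown with a stand-alone model. This is a sketch and not ICU code: SentinelCursor, countNaive and countGuarded are hypothetical names, and the toy treats input and output positions as identical, which is not true of a real Normalizer.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical toy cursor that signals the end with a 0xffff sentinel.
class SentinelCursor {
public:
    enum { DONE = 0xffff };

    explicit SentinelCursor(std::vector<std::int32_t> text)
        : text_(std::move(text)), index_(0) {}

    // Post-increment iteration; DONE once the end has been reached.
    std::int32_t next() {
        if (index_ >= text_.size()) {
            return DONE;
        }
        return text_[index_++];
    }

    std::size_t getIndex() const { return index_; }
    std::size_t endIndex() const { return text_.size(); }

private:
    std::vector<std::int32_t> text_;
    std::size_t index_;
};

// Naive loop: stops at the first DONE, so a real U+FFFF in the text ends
// the iteration early.
std::size_t countNaive(SentinelCursor it) {
    std::size_t count = 0;
    while (it.next() != SentinelCursor::DONE) {
        ++count;
    }
    return count;
}

// Guarded loop: tests getIndex() < endIndex() before each next(), so the
// returned value is never used to detect the end.
std::size_t countGuarded(SentinelCursor it) {
    std::size_t count = 0;
    while (it.getIndex() < it.endIndex()) {
        it.next();
        ++count;
    }
    return count;
}
```

For a text that contains U+FFFF between two other code points, the naive loop in this model stops after the first code point, while the guarded loop visits all three.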
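void normalize (const UnicodeString &source, UNormalizationMode mode, int32_t options, UnicodeString &result, UErrorCode &status)
Normalizes a UnicodeString according to the specified normalization mode. This is a wrapper for unorm_normalize(), using UnicodeStrings.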
UBool operator!= (const Normalizer &that) const
Returns FALSE when both iterators refer to the same character in the same input text.
UBool operator== (const Normalizer &that) const
Returns TRUE when both iterators refer to the same character in the same input text.
Referenced by operator!=().
UChar32 previous (void)
Return the previous character in the normalized text and decrement. (Pre-decrement semantics.) If the beginning of the text has already been reached, DONE is returned. The DONE value could be confused with a U+FFFF non-character code point in the text. If this is possible, you can test (getIndex()>startIndex() || first()!=DONE). (Calling first() will change the iterator state!) The C API unorm_previous() is more efficient and does not have this ambiguity.
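UNormalizationCheckResult quickCheck (const UnicodeString &source, UNormalizationMode mode, int32_t options, UErrorCode &status)
Performs a quick check on a string; same as the other version of quickCheck but takes an extra options parameter like most normalization functions.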
UNormalizationCheckResult quickCheck (const UnicodeString &source, UNormalizationMode mode, UErrorCode &status)
Performs a quick check on a string to determine whether it is in a particular normalization form. This is a wrapper for unorm_quickCheck(), using a UnicodeString. Three results can be returned: UNORM_YES, UNORM_NO or UNORM_MAYBE. UNORM_YES indicates that the argument string is in the desired normalization form, and UNORM_NO indicates that it is not. UNORM_MAYBE indicates that a more thorough check is required; the user may have to put the string into its normalized form and compare the results.
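The three-valued result leads to a simple decision flow: accept on "yes", reject on "no", and fall back to normalize-and-compare on "maybe". The sketch below models only that flow and does not use ICU. ToyCheck, toyNormalize, toyQuickCheck and toyIsNormalized are hypothetical, and the "normal form" is just ASCII lowercase, chosen so that the example stays self-contained.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for UNORM_NO / UNORM_YES / UNORM_MAYBE.
enum class ToyCheck { No, Yes, Maybe };

// Toy "normalization": ASCII lowercase. Not a Unicode normalization form.
std::string toyNormalize(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Toy quick check: looks only at the first character, so it can often
// give no definitive answer.
ToyCheck toyQuickCheck(const std::string& s) {
    if (s.empty()) {
        return ToyCheck::Yes;
    }
    if (std::isupper(static_cast<unsigned char>(s[0]))) {
        return ToyCheck::No;
    }
    return ToyCheck::Maybe;
}

// Definitive test: trust Yes/No, and resolve Maybe by normalizing the
// string and comparing the result with the original.
bool toyIsNormalized(const std::string& s) {
    switch (toyQuickCheck(s)) {
        case ToyCheck::Yes:
            return true;
        case ToyCheck::No:
            return false;
        case ToyCheck::Maybe:
            break;
    }
    return toyNormalize(s) == s;
}
```

The expensive normalize-and-compare step runs only when the cheap check cannot decide.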
void reset (void)
Reset the index to the beginning of the text. This is equivalent to setIndexOnly(startIndex()).
void setIndexOnly (int32_t index)
Set the iteration position in the input text that is being normalized, without any immediate normalization. After setIndexOnly(), getIndex() will return the same index that is specified here.
void setMode (UNormalizationMode newMode)
Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating over a string, calls to next and previous may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call setIndexOnly, reset, setText, first, last, etc. after calling setMode.
void setOption (int32_t option, UBool value)
Set options that affect this Normalizer's operation. Options do not change the basic composition or decomposition operation that is being performed, but they control whether certain optional portions of the operation are done. Currently the only available option is obsolete. It is possible to specify multiple options that are all turned on or off.
void setText (const UnicodeString &newText, UErrorCode &status)
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning.
void setText (const CharacterIterator &newText, UErrorCode &status)
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning.
void setText (const UChar *newText, int32_t length, UErrorCode &status)
Set the input text over which this Normalizer will iterate. The iteration position is set to the beginning.
int32_t startIndex (void) const
Retrieve the index of the start of the input text.