module Uunf: sig .. end
Uunf normalizes Unicode text. It supports all Unicode normalization
forms. The module is independent from any IO mechanism or Unicode
text data structure and it can process text without a complete
in-memory representation of the data.
Consult the basics, limitations and examples of use.
Release 0.9.2 — Unicode version 6.3.0 — Daniel Bünzli <daniel.buenzl i@erratique.ch>
type uchar = int
The type for Unicode scalar values, that is integers in the ranges
0x0000...0xD7FF and 0xE000...0x10FFFF.
val is_scalar_value : int -> bool
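The two ranges above can be checked directly. The following is a
minimal sketch of such a predicate. It only illustrates the test and is
not necessarily how Uunf.is_scalar_value is implemented; the name
is_scalar_value_sketch is hypothetical.

```ocaml
(* Hypothetical stand-in for Uunf.is_scalar_value: true iff [i] lies in
   0x0000...0xD7FF or 0xE000...0x10FFFF, that is, [i] is neither a
   surrogate nor out of range. *)
let is_scalar_value_sketch i =
  (0x0000 <= i && i <= 0xD7FF) || (0xE000 <= i && i <= 0x10FFFF)
```

The surrogate range 0xD800...0xDFFF and anything above 0x10FFFF are
rejected; negative integers are rejected as well.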
type form = [ `NFC | `NFD | `NFKC | `NFKD ]
  `NFD normalization form D, canonical decomposition.
  `NFC normalization form C, canonical decomposition followed by
  canonical composition (recommended for the www).
  `NFKD normalization form KD, compatibility decomposition.
  `NFKC normalization form KC, compatibility decomposition,
  followed by canonical composition.
type t
val create : [< form ] -> t
create nf is a Unicode text normalizer for the normal form nf.
val form : t -> form
form n is the normalization form of n.
val add : t ->
  [ `Await | `End | `Uchar of uchar ] -> [ `Await | `Uchar of uchar ]
add n v is:
  `Uchar u if u is the next character in the normalized sequence. The
  client must then call add with `Await until `Await is returned.
  `Await when the normalizer is ready to add a new `Uchar or `End.
For v use `Uchar u to add a new character to the sequence to normalize
and `End to signal the end of sequence. After adding one of these two
values, always call add with `Await until `Await is returned.
Raises Invalid_argument if `Uchar or `End is added directly after a
`Uchar was returned by the normalizer or if a `Uchar is added after
`End was added.
Warning. add deals with Unicode scalar values. If you are handling
foreign data you must check it beforehand with Uunf.is_scalar_value.
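One way to honour this warning is to check each integer before it
reaches add. The sketch below assumes a replacement policy
(substituting U+FFFD for invalid integers) that is our choice and not
something the module prescribes. The helper name checked is
hypothetical, and the predicate is passed as an argument so that
Uunf.is_scalar_value can be plugged in.

```ocaml
(* Hypothetical helper, not part of Uunf. [checked is_scalar_value i]
   is [`Uchar i] when [i] passes the check and [`Uchar 0xFFFD] (the
   replacement character U+FFFD) otherwise. *)
let checked is_scalar_value i =
  if is_scalar_value i then `Uchar i else `Uchar 0xFFFD
```

With a normalizer n this would be used as
Uunf.add n (checked Uunf.is_scalar_value i). Raising an exception
instead of substituting is an equally reasonable policy.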
val reset : t -> unit
reset n resets the normalizer to a state equivalent to the state of
Uunf.create (Uunf.form n).
val copy : t -> t
These properties are used internally to implement the normalizers.
They are not needed to use the module but are exposed as they may
be useful to implement other algorithms.
val unicode_version : string
unicode_version is the Unicode version supported by the module.
val ccc : uchar -> int
val decomp : uchar -> int array
decomp u is u's decomposition mapping. If the empty array is returned,
u decomposes to itself.
The first number in the array contains additional information; it
cannot be used as a Uunf.uchar. Use Uunf.d_uchar on the number to get
the actual character and Uunf.d_compatibility to find out if this is
a compatibility decomposition. All other characters of the array
are guaranteed to be of type Uunf.uchar.
Warning. Do not mutate the array.
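As a sketch of how the three functions fit together, the hypothetical
functor Decomp below decodes a decomposition mapping into plain
characters. It is written against a small signature, Props (also
hypothetical), that lists only the functions documented here with
uchar taken to be int, so that Uunf itself can be passed as the
argument.

```ocaml
(* The subset of the interface used below, as documented in this
   section (with uchar = int). *)
module type Props = sig
  val decomp : int -> int array
  val d_uchar : int -> int
  val d_compatibility : int -> bool
end

(* Hypothetical helper module, not part of Uunf. *)
module Decomp (P : Props) = struct
  (* [mapping u] is [None] if [u] decomposes to itself, otherwise
     [Some (compat, us)] where [compat] is [true] for a compatibility
     decomposition and [us] are the characters of the mapping. *)
  let mapping u =
    let d = P.decomp u in
    if Array.length d = 0 then None else
    let us = Array.copy d in      (* work on a copy, [d] is not mutated *)
    us.(0) <- P.d_uchar d.(0);    (* drop the additional information *)
    Some (P.d_compatibility d.(0), Array.to_list us)
end
```

Assuming the interface documented here, module D = Decomp (Uunf)
instantiates the helper for the real properties and D.mapping u can
then be called on any scalar value u.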
val d_uchar : int -> uchar
See Uunf.decomp.
val d_compatibility : int -> bool
See Uunf.decomp.
val composite : uchar -> uchar -> uchar option
A Uunf normalizer consumes only a small bounded amount of memory on
ordinary, meaningful text. However, on legal but degenerate text like
a starter followed by 10'000 combining non-spacing marks it will have
to buffer all the marks (a workaround is to first convert your input
to stream-safe text format).
A normalizer is a stateful filter that inputs a sequence of characters and outputs an equivalent sequence in the requested normal form.
The function Uunf.create returns a new normalizer for a given normal
form:
let nfd = Uunf.create `NFD;;
To add characters to the sequence to normalize, call Uunf.add on nfd
with `Uchar _. To end the sequence, call Uunf.add on nfd with `End.
The normalized sequence of characters is returned, character by
character, by the successive calls to Uunf.add.
The client and the normalizer must wait on each other to limit
internal buffering: each time the client adds to the sequence by
calling Uunf.add with `Uchar or `End it must continue to call
Uunf.add with `Await until the normalizer returns `Await. In
practice this leads to the following kind of control flow:
let rec add acc v = match Uunf.add nfd v with
| `Uchar u -> add (u :: acc) `Await
| `Await -> acc
For example, to normalize the character U+00E9 (é) with nfd to a list
of characters we can write:
let e_acute_nfd = List.rev (add (add [] (`Uchar 0x00E9)) `End)
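The same control flow extends to a whole sequence. As a sketch, the
hypothetical function normalize_list below normalizes a list of
characters. It takes the partially applied adder (for instance
Uunf.add n for a fresh normalizer n) as an argument, so nothing in it
is specific to a normal form.

```ocaml
(* Hypothetical helper, not part of Uunf. [add] is expected to be
   [Uunf.add n] for a normalizer [n] in its initial state. *)
let normalize_list add us =
  (* Add [v], then keep calling with `Await until `Await is returned,
     accumulating the returned characters in reverse order. *)
  let rec drain acc v = match add v with
  | `Uchar u -> drain (u :: acc) `Await
  | `Await -> acc
  in
  let acc = List.fold_left (fun acc u -> drain acc (`Uchar u)) [] us in
  List.rev (drain acc `End)
```

Since a `Uchar may not be added after `End, each call needs a
normalizer that is fresh or has been reset, for example
normalize_list (Uunf.add (Uunf.create `NFD)) [0x00E9].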
The next section has more examples.
utf_8_normalize nf s is the UTF-8 encoded normal form nf of the UTF-8
encoded string s. This example uses Uutf to fold over the characters
of s and to encode the normalized sequence in a standard OCaml buffer.
let utf_8_normalize nf s =
  let b = Buffer.create (String.length s * 3) in
  let n = Uunf.create nf in
  (* Add v to the normalizer and encode everything it returns. *)
  let rec add v = match Uunf.add n v with
  | `Uchar u -> Uutf.Buffer.add_utf_8 b u; add `Await
  | `Await -> ()
  in
  (* Malformed input is replaced by U+FFFD (Uutf.u_rep). *)
  let add_uchar _ _ = function
  | `Malformed _ -> add (`Uchar Uutf.u_rep)
  | `Uchar _ as u -> add u
  in
  Uutf.String.fold_utf_8 add_uchar () s; add `End; Buffer.contents b
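A common use of such a function is to compare strings up to
equivalence under a normal form. As a sketch, the hypothetical function
equivalent below takes the normalization function as an argument, so
that utf_8_normalize applied to a form can be passed in.

```ocaml
(* Hypothetical helper, not part of Uunf. [equivalent normalize a b] is
   [true] iff [a] and [b] have the same image under [normalize], for
   instance [utf_8_normalize `NFC]. *)
let equivalent normalize a b = String.equal (normalize a) (normalize b)
```

For example, equivalent (utf_8_normalize `NFC) "\xC3\xA9" "e\xCC\x81"
compares U+00E9 with the sequence U+0065 U+0301, which are canonically
equivalent.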