Tokenizer

A text segmenter

Downloading
tokenizer

Compilation

Caution: you need lexed >= 4.3.3 to build.
To install under Unix, type:
./configure [--prefix=<directory>] [--with-amalgam] [--with-composition] (run ./configure --help for help)
make
make install
make clean
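
For example, to build and install under a hypothetical prefix such as /usr/local, the sequence might look like this:

./configure --prefix=/usr/local
make
make install
make clean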

Use

For help

tokenizer -h

To build an automaton

lexed [ -d <directory> ] [ -p <filename prefix> ] <lexicon1> <lexicon2> ...
Each line of a lexicon contains a word followed by its associated information, separated by a delimiter character (a tab or a space by default).
The default directory is ".".
The default filename prefix is "lexicon".

To configure the segmenter

You have to edit tokenizer.ll and rebuild.
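
After editing, rebuilding presumably follows the same steps as in the Compilation section, for example:

make
make install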

To use the segmenter

tokenizer [ -d <directory> ] [ -p <filename prefix> ] [ --encode <encoding> ] < inputfile > outputfile
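
For instance, assuming the automaton built above with the default directory and prefix, and hypothetical file names input.txt and output.txt (the exact encoding names accepted by --encode may differ):

tokenizer -d . -p lexicon --encode UTF-8 < input.txt > output.txt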