A LetterTokenizer is a tokenizer that divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the regular expression _/[[:alpha:]]+/_ where [:alpha:] matches all letters in your current locale.
"Dave's résumé, at http://www.davebalmain.com/ 1234" => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
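The splitting rule can be approximated in plain Ruby without loading Ferret. This is a minimal sketch, not Ferret's implementation: `letter_tokens` is a hypothetical helper, and Ruby's `[[:alpha:]]` matches Unicode letters in a UTF-8 string rather than consulting the C locale.

```ruby
# Hypothetical helper, not part of Ferret: approximates the
# "maximal runs of letters" rule with Ruby's own regexp engine.
def letter_tokens(text)
  text.scan(/[[:alpha:]]+/)
end

p letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
```

Under those assumptions the call should yield the same eight tokens as the example above; the apostrophe, punctuation, and digits act only as separators.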
Create a new LetterTokenizer which optionally downcases tokens. Downcasing is done according to the current locale.
lower: set to false if you don’t wish to downcase tokens
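The effect of the lower flag can be illustrated the same way. Again a sketch under stated assumptions: `letter_tokens_with_case` is a hypothetical name, the flag is passed explicitly so no default is implied, and `String#downcase` applies Unicode case mapping (Ruby 2.4 or later) instead of the locale-based mapping Ferret uses.

```ruby
# Hypothetical helper, not part of Ferret: split on non-letters,
# then downcase each token only when `lower` is truthy.
def letter_tokens_with_case(text, lower)
  tokens = text.scan(/[[:alpha:]]+/)
  lower ? tokens.map(&:downcase) : tokens
end

p letter_tokens_with_case("Dave's Résumé", true)
p letter_tokens_with_case("Dave's Résumé", false)
```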
static VALUE
frb_letter_tokenizer_init(int argc, VALUE *argv, VALUE self)
{
    TS_ARGS(false);
#ifndef POSH_OS_WIN32
    if (!frb_locale) frb_locale = setlocale(LC_CTYPE, "");
#endif
    return get_wrapped_ts(self, rstr, mb_letter_tokenizer_new(lower));
}