Module langid
Sam Huston 2007
This module reproduces the approach of the article "Evaluation of a
language identification system for mono- and multilingual text
documents" by Artemenko, O.; Mandl, T.; Shramko, M.; Womser-Hacker, C.,
presented at Applied Computing 2006, the 21st Annual ACM Symposium on
Applied Computing, 23-27 April 2006.
This implementation is intended for monolingual documents only;
however, it covers a much larger range of languages than the article.
Additionally, three supervised classification methods are explored:
cosine distance, Naive Bayes, and Spearman's rho.
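To make the cosine-distance method concrete, here is a minimal sketch of classifying a text by comparing character-bigram counts against per-language profiles. It does not use the module's own detect or udhr helpers; the names char_bigrams, cosine_similarity, and classify, and the two sample sentences, are assumptions made for illustration only.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return every overlapping two-character substring of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile shares nothing with any other
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Pick the language whose profile is most similar to the text."""
    query = Counter(char_bigrams(text.lower()))
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


# Two short sample sentences stand in for real training data (assumption).
samples = {
    "English": "all human beings are born free and equal in dignity and rights",
    "French": "tous les êtres humains naissent libres et égaux en dignité et en droits",
}
profiles = {lang: Counter(char_bigrams(s)) for lang, s in samples.items()}
```

Calling classify("human beings are born free", profiles) selects the profile with the highest cosine similarity. With realistic training data, each profile would be built from a full document per language rather than a single sentence.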
fd = detect.feature({"char-bigrams": lambda t: [string.join(t)...
training_data = udhr.langs(['English-Latin1', 'French_Francais...
gold_data = {}
fd
- Value:
detect.feature({"char-bigrams": lambda t: [string.join(t)[n:n+2] for n in range(len(t)-1)]})
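The value above relies on the Python 2 function string.join, which joins a list of tokens with single spaces. Note that range(len(t)-1) is bounded by the number of tokens rather than by the length of the joined string, so as written only the first len(t)-1 bigrams are produced. The sketch below is a Python 3 rendering that iterates over the whole joined string; that choice is an assumption about the intent, and the plain dict stands in for detect.feature, whose interface is not shown here.

```python
def char_bigrams(tokens):
    """Join tokens with single spaces and return every overlapping character bigram."""
    text = " ".join(tokens)
    return [text[n:n + 2] for n in range(len(text) - 1)]


# Feature map keyed like the module's, but a plain dict (assumption),
# not the module's detect.feature object.
features = {"char-bigrams": char_bigrams}

print(features["char-bigrams"](["ab", "c"]))  # ['ab', 'b ', ' c']
```

Bigrams that span a token boundary include the joining space, so word-initial and word-final characters are captured as features too.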
training_data
- Value:
udhr.langs(['English-Latin1', 'French_Francais-Latin1', 'Indonesian-Latin1', 'Zapoteco-Latin1'])