# Natural Language Toolkit: Indian Language POS-Tagged Corpus Reader
#
# Copyright (C) 2001-2007 University of Pennsylvania
# Author: Steven Bird <sb@ldc.upenn.edu>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://nltk.sf.net>
# For license information, see LICENSE.TXT

"""
Indian Language POS-Tagged Corpus
Collected by A Kumaran, Microsoft Research, India
Distributed with permission

Contents:
  - Bangla: IIT Kharagpur
  - Hindi: Microsoft Research India
  - Marathi: IIT Bombay
  - Telugu: IIIT Hyderabad
"""

from nltk_lite.corpora import get_basedir
from nltk_lite import tokenize
from nltk_lite.tag import string2tags, string2words
import os

items = list(['bangla', 'hindi', 'marathi', 'telugu'])

# NOTE: the "def" line of this function is collapsed in the listing. The name
# _read is an assumption; the parameter names are taken from the body.
def _read(files, conversion_function):
    if type(files) is str: files = (files,)

    for file in files:
        path = os.path.join(get_basedir(), "indian", file + ".pos")
        f = open(path).read()
        for sent in tokenize.line(f):
            # Skip empty lines and markup lines that start with "<".
            if sent and sent[0] != "<":
                yield conversion_function(sent)

# NOTE: the "def" line is collapsed here as well. The name xreadlines and the
# absence of a default argument are assumptions.
def xreadlines(files):
    if type(files) is str: files = (files,)
    for file in files:
        path = os.path.join(get_basedir(), "indian", file + ".pos")
        for line in open(path):
            yield line

# The remaining definitions (including demo(), which is called below) are
# collapsed in this listing and are not reproduced here.

if __name__ == '__main__':
    demo()
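The reading pattern in the listing above (accept one file name or several, skip empty lines and lines starting with "<", pass every other line to a conversion function) can be sketched without nltk_lite. This is a minimal illustration, not the module's own code: `read_sentences`, `split_tagged`, and `demo_read` are hypothetical names, the `/` separator is an assumed default, and the sample text is invented for the demonstration. Unlike the listing, the sketch strips surrounding whitespace from each line before testing it.

```python
import os
import tempfile


def split_tagged(sent, sep="/"):
    """Turn 'word/TAG word/TAG' into [(word, tag), ...].

    The separator is an assumption; pass sep="_" for word_TAG files.
    """
    pairs = []
    for token in sent.split():
        word, _, tag = token.rpartition(sep)
        # A token with no word before the separator is kept whole, tag None.
        pairs.append((word, tag) if word else (token, None))
    return pairs


def read_sentences(paths, convert):
    """Yield convert(line) for each non-empty line that is not markup."""
    if isinstance(paths, str):
        paths = (paths,)
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                sent = line.strip()
                # Same filter as the listing: skip blanks and "<...>" lines.
                if sent and not sent.startswith("<"):
                    yield convert(sent)


def demo_read():
    """Write an invented sample file and read it back."""
    sample = "<Sentence id=1>\nthe/DT dog/NN barks/VBZ\n\n</Sentence>\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.pos")
        with open(path, "w", encoding="utf-8") as out:
            out.write(sample)
        return list(read_sentences(path, split_tagged))
```

Calling `demo_read()` returns one sentence as a list of `(word, tag)` pairs. Because `read_sentences` is a generator, a large file can be consumed lazily by iterating over it instead of wrapping it in `list()`.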
Generated by Epydoc 3.0beta1 on Wed May 16 22:47:52 2007 | http://epydoc.sourceforge.net |