>>> x = 'Python'; y = 'NLTK'; z = 'Natural Language Processing'
>>> x + '/' + y
'Python/NLTK'
>>> 'LT' in y
True
>>> x[2:]
'thon'
>>> x[::-1]
'nohtyP'
>>> len(x)
6
>>> z.count('a')
4
>>> z.endswith('ing')
True
>>> z.index('Language')
8
>>> '; '.join([x, y, z])
'Python; NLTK; Natural Language Processing'
>>> y.lower()
'nltk'
>>> z.replace(' ', '\n')
'Natural\nLanguage\nProcessing'
>>> print z.replace(' ', '\n')
Natural
Language
Processing
>>> z.split()
['Natural', 'Language', 'Processing']
For more information, type help(str) at the Python prompt.
>>> x = ['Natural', 'Language']; y = ['Processing']
>>> x[0]
'Natural'
>>> list(x[0])
['N', 'a', 't', 'u', 'r', 'a', 'l']
>>> x + y
['Natural', 'Language', 'Processing']
>>> 'Language' in x
True
>>> len(x)
2
>>> x.index('Language')
1
The following functions modify the list in-place:
>>> x.append('Toolkit')
>>> x
['Natural', 'Language', 'Toolkit']
>>> x.insert(0, 'Python')
>>> x
['Python', 'Natural', 'Language', 'Toolkit']
>>> x.reverse()
>>> x
['Toolkit', 'Language', 'Natural', 'Python']
>>> x.sort()
>>> x
['Language', 'Natural', 'Python', 'Toolkit']
For more information, type help(list) at the Python prompt.
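A common pitfall with these in-place methods is that they return None rather than the modified list. The following illustration (ordinary Python, using modern print() syntax; not part of NLTK) contrasts the in-place sort() method with the built-in sorted() function, which leaves the original list untouched:

```python
words = ['Natural', 'Language', 'Toolkit']

# sort() mutates the list and returns None:
result = words.sort()
print(result)    # None
print(words)     # ['Language', 'Natural', 'Toolkit']

# sorted() returns a new sorted list; the argument is unchanged:
original = ['Toolkit', 'Language']
print(sorted(original))   # ['Language', 'Toolkit']
print(original)           # ['Toolkit', 'Language']
```

Writing `x = x.sort()` by mistake therefore silently replaces the list with None.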
>>> d = {'natural': 'adj', 'language': 'noun'}
>>> d['natural']
'adj'
>>> d['toolkit'] = 'noun'
>>> d
{'natural': 'adj', 'toolkit': 'noun', 'language': 'noun'}
>>> 'language' in d
True
>>> d.items()
[('natural', 'adj'), ('toolkit', 'noun'), ('language', 'noun')]
>>> d.keys()
['natural', 'toolkit', 'language']
>>> d.values()
['adj', 'noun', 'noun']
For more information, type help(dict) at the Python prompt.
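A typical use of the dictionary type in language processing is counting word frequencies. The following sketch (illustrative only, using modern print() syntax; not part of the NLTK API) uses the get() method, which supplies a default value for keys not yet in the dictionary:

```python
# Count how often each word occurs, using get() to default unseen words to 0.
freq = {}
for word in ['natural', 'language', 'natural']:
    freq[word] = freq.get(word, 0) + 1

print(freq['natural'])   # 2
print(freq['language'])  # 1
```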
Note
to be written
>>> text = '''NLTK, the Natural Language Toolkit, is a suite of program
... modules, data sets and tutorials supporting research and teaching in
... computational linguistics and natural language processing.'''
>>> from nltk_lite import tokenize
>>> list(tokenize.line(text))
['NLTK, the Natural Language Toolkit, is a suite of program',
 'modules, data sets and tutorials supporting research and teaching in',
 'computational linguistics and natural language processing.']
>>> list(tokenize.whitespace(text))
['NLTK,', 'the', 'Natural', 'Language', 'Toolkit,', 'is', 'a', 'suite', 'of',
 'program', 'modules,', 'data', 'sets', 'and', 'tutorials', 'supporting',
 'research', 'and', 'teaching', 'in', 'computational', 'linguistics', 'and',
 'natural', 'language', 'processing.']
>>> list(tokenize.wordpunct(text))
['NLTK', ',', 'the', 'Natural', 'Language', 'Toolkit', ',', 'is', 'a',
 'suite', 'of', 'program', 'modules', ',', 'data', 'sets', 'and',
 'tutorials', 'supporting', 'research', 'and', 'teaching', 'in',
 'computational', 'linguistics', 'and', 'natural', 'language',
 'processing', '.']
>>> list(tokenize.regexp(text, ', ', gaps=True))
['NLTK', 'the Natural Language Toolkit', 'is a suite of program\nmodules',
 'data sets and tutorials supporting research and teaching in\ncomputational linguistics and natural language processing.']
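The behavior of the whitespace and wordpunct tokenizers can be approximated with Python's standard re module. This is a sketch of the underlying idea (in modern Python syntax), not the nltk_lite implementation:

```python
import re

text = "NLTK, the Natural Language Toolkit"

# Splitting on runs of whitespace keeps punctuation attached to words,
# like tokenize.whitespace():
print(text.split())
# -> ['NLTK,', 'the', 'Natural', 'Language', 'Toolkit']

# Matching alternating runs of word characters and punctuation separates
# punctuation into its own tokens, like tokenize.wordpunct():
print(re.findall(r'\w+|[^\w\s]+', text))
# -> ['NLTK', ',', 'the', 'Natural', 'Language', 'Toolkit']
```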
>>> tokens = list(tokenize.wordpunct(text))
>>> from nltk_lite import stem
>>> stemmer = stem.Regexp('ing$|s$|e$')
>>> for token in tokens:
...     print stemmer.stem(token),
NLTK , th Natural Languag Toolkit , i a suit of program module , data set
and tutorial support research and teach in computational linguistic and
natural languag process .
>>> stemmer = stem.Porter()
>>> for token in tokens:
...     print stemmer.stem(token),
NLTK , the Natur Languag Toolkit , is a suit of program modul , data set
and tutori support research and teach in comput linguist and natur languag
process .
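The regular-expression stemmer above simply strips one matching suffix from each token. A minimal sketch of that idea in plain Python (in modern syntax, with a hypothetical helper function; not the nltk_lite stem.Regexp implementation) uses re.sub with the same suffix pattern:

```python
import re

def regexp_stem(token, pattern='ing$|s$|e$'):
    """Strip at most one matching suffix from the end of the token."""
    return re.sub(pattern, '', token, count=1)

for token in ['processing', 'modules', 'language']:
    print(regexp_stem(token))
# -> process, module, languag
```

Note how crude this is compared with the Porter stemmer: any word ending in "e" or "s" is truncated, whether or not the ending is a suffix.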
Note
to be written
About this document...
This chapter is a draft from Introduction to Natural Language Processing, by Steven Bird, James Curran, Ewan Klein and Edward Loper, Copyright © 2006 the authors. It is distributed with the Natural Language Toolkit [http://nltk.sourceforge.net], under the terms of the Creative Commons Attribution-ShareAlike License [http://creativecommons.org/licenses/by-sa/2.5/].