
Source Code for Module nltk_lite.corpora.timit

# Natural Language Toolkit: TIMIT Corpus Reader
#
# Copyright (C) 2001-2007 University of Pennsylvania
# Author: Haejoong Lee <haejoong@ldc.upenn.edu>
#         Steven Bird <sb@ldc.upenn.edu>
# URL: <http://nltk.sf.net>
# For license information, see LICENSE.TXT

"""
Read tokens, phonemes and audio data from the NLTK TIMIT Corpus.

This corpus contains a selected portion of the TIMIT corpus:

* 16 speakers from 8 dialect regions
* 1 male and 1 female from each dialect region
* 130 sentences in total (10 sentences per speaker; some sentences
  are shared among speakers, and sa1 and sa2 in particular are
  spoken by all speakers)
* 160 sentence recordings in total (10 recordings per speaker)
* audio format: NIST Sphere, single channel, 16kHz sampling,
  16 bit sample, PCM encoding


Module contents
---------------

The timit module provides 4 functions and 4 data items.

* items

  List of items in the corpus.  There are 160 items in total, each of which
  corresponds to a unique utterance of a speaker.  Here's an example of an
  item in the list:

      dr1-fvmh0:sx206
        - _----  _---
        | |  |   | |
        | |  |   | |
        | |  |   | `--- sentence number
        | |  |   `----- sentence type (a:all, i:shared, x:exclusive)
        | |  `--------- speaker ID
        | `------------ sex (m:male, f:female)
        `-------------- dialect region (1..8)

* speakers

  List of speaker IDs.  An example of a speaker ID:

      dr1-fvmh0

  Note that if you split an item ID at the colon and take the first element
  of the result, you get a speaker ID.

      >>> itemid = 'dr1-fvmh0:sx206'
      >>> spkrid,sentid = itemid.split(':')
      >>> spkrid
      'dr1-fvmh0'

  The second element of the result is a sentence ID.

* dictionary

  Phonetic dictionary of words contained in this corpus.  This is a Python
  dictionary from words to phoneme lists.

* spkrinfo

  Speaker information table.  It's a Python dictionary from speaker IDs to
  records of 10 fields.  Speaker IDs are the same as the ones in
  timit.speakers.  Each record is a dictionary from field names to values,
  and the fields are as follows:

    id         speaker ID as defined in the original TIMIT speaker info table
    sex        speaker gender (M:male, F:female)
    dr         speaker dialect region (1:new england, 2:northern,
               3:north midland, 4:south midland, 5:southern, 6:new york city,
               7:western, 8:army brat (moved around))
    use        corpus type (TRN:training, TST:test)
               in this sample corpus only TRN is available
    recdate    recording date
    birthdate  speaker birth date
    ht         speaker height
    race       speaker race (WHT:white, BLK:black, AMR:american indian,
               SPN:spanish-american, ORN:oriental, ???:unknown)
    edu        speaker education level (HS:high school, AS:associate degree,
               BS:bachelor's degree (BS or BA), MS:master's degree (MS or MA),
               PHD:doctorate degree (PhD,JD,MD), ??:unknown)
    comments   comments by the recorder

The 4 functions are as follows.

* raw(sentences=items, offset=False)

  Given a list of items, returns an iterator over word lists, each of
  which corresponds to an item (sentence).  If offset is set to True,
  each element of the word list is a tuple of word (string), start offset
  and end offset, where an offset is a number of 16kHz samples.

* phonetic(sentences=items, offset=False)

  Given a list of items, returns an iterator over phoneme lists, each of
  which corresponds to an item (sentence).  If offset is set to True,
  each element of the phoneme list is a tuple of phoneme (string), start
  offset and end offset, where an offset is a number of 16kHz samples.

* audiodata(item, start=0, end=None)

  Given an item, returns a chunk of audio samples formatted into a string.
  If start and end are both omitted, all samples of the recording are
  returned.  If only end is omitted, samples from the start offset to the
  end of the recording are returned.

* play(data)

  Play the given audio samples. The audio samples can be obtained from the
  timit.audiodata function.

"""

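The item ID layout described in the module docstring can be sketched as a small parser. This is a standalone Python 3 illustration, not part of the module: `ItemID` and `parse_item` are hypothetical names, and the regular expression is an assumption inferred from the example `dr1-fvmh0:sx206` and the field diagram.

```python
import re
from collections import namedtuple

# Hypothetical record type; the field names follow the docstring diagram.
ItemID = namedtuple(
    "ItemID", "dialect_region sex speaker sentence_type sentence_number")

# Assumed layout: dr<region>-<sex><speaker>:s<type><number>
_ITEM_RE = re.compile(r"^dr([1-8])-([mf])([a-z]{3}[0-9]):s([aix])([0-9]+)$")


def parse_item(itemid):
    """Split an item ID such as 'dr1-fvmh0:sx206' into its fields."""
    m = _ITEM_RE.match(itemid)
    if m is None:
        raise ValueError("not a TIMIT item ID: %r" % (itemid,))
    region, sex, speaker, stype, number = m.groups()
    return ItemID(int(region), sex, speaker, stype, int(number))


item = parse_item("dr1-fvmh0:sx206")
```

Here `item.dialect_region` is the integer region, `item.sex` is `'f'`, and `item.sentence_type` is `'x'` (a sentence exclusive to this speaker).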
from nltk_lite.corpora import get_basedir
from nltk_lite import tokenize
from itertools import islice
import time
import sys, os, re

# Audio playback goes through OSS, which exists only on Linux and FreeBSD.
# Import ossaudiodev only there so the module still loads on other platforms.
if sys.platform.startswith('linux') or sys.platform.startswith('freebsd'):
    import ossaudiodev
    PLAY_ENABLED = True
else:
    PLAY_ENABLED = False

__all__ = ["items", "raw", "phonetic", "speakers", "dictionary", "spkrinfo",
           "audiodata", "play"]

PREFIX = os.path.join(get_basedir(),"timit")

speakers = []
items = []
dictionary = {}
spkrinfo = {}

# collect speaker directories (e.g. dr1-fvmh0) and one item per .txt transcript
for f in os.listdir(PREFIX):
    if re.match("^dr[0-9]-[a-z]{4}[0-9]$", f):
        speakers.append(f)
        for g in os.listdir(os.path.join(PREFIX,f)):
            if g.endswith(".txt"):
                items.append(f+':'+g[:-4])
speakers.sort()
items.sort()

# read dictionary (lines starting with ';' are comments)
for l in open(os.path.join(PREFIX,"timitdic.txt")):
    if l[0] == ';': continue
    a = l.strip().split('  ')
    dictionary[a[0]] = a[1].strip('/').split()

# read spkrinfo: nine whitespace-separated fields in the first 54 columns,
# followed by a free-text comment
header = ['id','sex','dr','use','recdate','birthdate','ht','race','edu',
          'comments']
for l in open(os.path.join(PREFIX,"spkrinfo.txt")):
    if l[0] == ';': continue
    rec = l[:54].split() + [l[54:].strip()]
    key = "dr%s-%s%s" % (rec[2],rec[1].lower(),rec[0].lower())
    spkrinfo[key] = dict((header[i],rec[i]) for i in range(10))

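The way a speaker-info line becomes a `spkrinfo` entry can be tried out in isolation. The following is a standalone Python 3 sketch under stated assumptions: `parse_spkrinfo_line` is a hypothetical helper that mirrors the loop above, and the sample line is made up for illustration rather than taken from the corpus.

```python
HEADER = ['id', 'sex', 'dr', 'use', 'recdate', 'birthdate', 'ht', 'race',
          'edu', 'comments']


def parse_spkrinfo_line(line, comment_col=54):
    """Return (speaker key, record dict) for one fixed-width line."""
    # Nine whitespace-separated fields, then the comment from column 54 on.
    fields = line[:comment_col].split() + [line[comment_col:].strip()]
    if len(fields) != len(HEADER):
        raise ValueError("expected %d fields, got %d"
                         % (len(HEADER), len(fields)))
    record = dict(zip(HEADER, fields))
    # Same key scheme as the module: dr<region>-<sex><id>, lower-cased.
    key = "dr%s-%s%s" % (record['dr'], record['sex'].lower(),
                         record['id'].lower())
    return key, record


# Made-up sample line, padded so the comment starts at column 54.
sample = ("XYZ0  F  1  TRN  01/02/86  03/04/60  5'06\"  WHT  BS".ljust(54)
          + "made-up comment\n")
key, record = parse_spkrinfo_line(sample)
```

The resulting `key` has the same shape as the entries in `speakers`, which is what lets an item's speaker part be used directly as a `spkrinfo` lookup key.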
def _prim(ext, sentences=items, offset=False):
    # a single item ID may be passed as a plain string
    if isinstance(sentences,str):
        sentences = [sentences]
    for sent in sentences:
        fnam = os.path.sep.join([PREFIX] + sent.split(':')) + ext
        r = []
        for l in open(fnam):
            if not l.strip(): continue
            # each line reads: <start offset> <end offset> <label>
            a = l.split()
            if offset:
                r.append((a[2],int(a[0]),int(a[1])))
            else:
                r.append(a[2])
        yield r

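The line format that `_prim` expects (start offset, end offset, label) can be exercised without the corpus on disk. This standalone Python 3 sketch uses a hypothetical `read_segments` helper and made-up offsets; it reads from any iterable of lines instead of opening a `.wrd` or `.phn` file.

```python
import io


def read_segments(lines, offset=False):
    """Parse '<start> <end> <label>' lines into labels or offset tuples."""
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, as _prim does
        start, end, label = line.split()[:3]
        if offset:
            result.append((label, int(start), int(end)))
        else:
            result.append(label)
    return result


# Made-up transcript; the offsets are sample counts.
sample = "3050 5723 she\n5723 10337 had\n\n10337 11517 your\n"
words = read_segments(io.StringIO(sample))
timed = read_segments(io.StringIO(sample), offset=True)
```

With `offset=True` each element carries its label first and the two integer offsets after it, matching the tuple order that `raw` and `phonetic` document.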
def raw(sentences=items, offset=False):
    """
    Given a list of items, returns an iterator over word lists, each of
    which corresponds to an item (sentence).  If offset is set to True,
    each element of the word list is a tuple of word (string), start offset
    and end offset, where an offset is a number of 16kHz samples.

    @param sentences: List of items (sentences) for which tokenized word list
        will be returned.  In case there is only one item, it is possible to
        pass the item id as a string.
    @type sentences: list of strings or a string
    @param offset: If True, the start and end offsets accompany each
        word in the returned list.  Note that here, an offset is represented
        by the number of 16kHz samples.
    @type offset: bool
    @return: Iterator over lists of strings (words) if offset is False, or
        over lists of tuples (word, start offset, end offset) if offset is
        True.
    """
    return _prim(".wrd", sentences, offset)


def phonetic(sentences=items, offset=False):
    """
    Given a list of items, returns an iterator over phoneme lists, each of
    which corresponds to an item (sentence).  If offset is set to True,
    each element of the phoneme list is a tuple of phoneme (string), start
    offset and end offset, where an offset is a number of 16kHz samples.

    @param sentences: List of items (sentences) for which phoneme list
        will be returned.  In case there is only one item, it is possible to
        pass the item id as a string.
    @type sentences: list of strings or a string
    @param offset: If True, the start and end offsets accompany each
        phoneme in the returned list.  Note that here, an offset is
        represented by the number of 16kHz samples.
    @type offset: bool
    @return: Iterator over lists of strings (phonemes) if offset is False, or
        over lists of tuples (phoneme, start offset, end offset) if offset is
        True.
    """
    return _prim(".phn", sentences, offset)

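Because offsets are counted in 16kHz samples, converting them to time is a single division. A minimal standalone Python 3 sketch follows; `to_seconds` and `with_times` are hypothetical helpers, and the segment values are invented.

```python
SAMPLE_RATE = 16000  # samples per second, as stated in the docstrings


def to_seconds(n_samples, rate=SAMPLE_RATE):
    """Convert a sample offset into seconds."""
    return n_samples / float(rate)


def with_times(segments, rate=SAMPLE_RATE):
    """Map (label, start, end) sample tuples to (label, start_s, end_s)."""
    return [(label, to_seconds(start, rate), to_seconds(end, rate))
            for label, start, end in segments]


# Invented segments in the shape produced with offset=True.
timed = with_times([("she", 8000, 24000), ("had", 24000, 32000)])
```

The same division gives a segment's duration when applied to `end - start`.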
def audiodata(item, start=0, end=None):
    """
    Given an item, returns a chunk of audio samples formatted into a string.
    If start and end are both omitted, all samples of the recording are
    returned.  If only end is omitted, samples from the start offset to the
    end of the recording are returned.

    @param item: item ID, e.g. dr1-fvmh0:sx206
    @type item: string
    @param start: start offset
    @type start: integer (number of 16kHz frames)
    @param end: end offset
    @type end: integer (number of 16kHz frames) or None to indicate
        the end of file
    @return: string of sequence of bytes of audio samples
    """
    assert(end is None or end > start)
    headersize = 44
    fnam = os.path.join(PREFIX,item.replace(':',os.path.sep)) + '.wav'
    # open in binary mode: the file holds 16-bit samples, 2 bytes each
    if end is None:
        data = open(fnam,'rb').read()
    else:
        data = open(fnam,'rb').read(headersize+end*2)
    return data[headersize+start*2:]

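The offset arithmetic in `audiodata` (a 44-byte header followed by 2 bytes per sample) can be checked on an in-memory buffer. This is a standalone Python 3 sketch; `slice_samples` is a hypothetical helper and the buffer below is fabricated test data, not a real recording.

```python
import struct

HEADER_SIZE = 44       # header length assumed by audiodata
BYTES_PER_SAMPLE = 2   # 16-bit samples


def slice_samples(blob, start=0, end=None):
    """Return the bytes for samples start..end (end exclusive)."""
    if end is not None and end <= start:
        raise ValueError("end must be greater than start")
    first = HEADER_SIZE + start * BYTES_PER_SAMPLE
    last = None if end is None else HEADER_SIZE + end * BYTES_PER_SAMPLE
    return blob[first:last]


# Fabricated buffer: a zeroed header plus six little-endian samples 0..5.
fake = b"\x00" * HEADER_SIZE + struct.pack("<6h", 0, 1, 2, 3, 4, 5)
middle = struct.unpack("<2h", slice_samples(fake, 2, 4))
tail = struct.unpack("<2h", slice_samples(fake, 4))
```

Omitting both offsets returns every sample after the header, which corresponds to the whole-recording case in the docstring.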
def play(data):
    """
    Play the given audio samples.

    @param data: audio samples
    @type data: string of bytes of audio samples
    """
    if not PLAY_ENABLED:
        print >>sys.stderr, "sorry, currently we don't support audio playback on this platform:", sys.platform
        return

    try:
        dsp = ossaudiodev.open('w')
    except IOError, e:
        print >>sys.stderr, "can't acquire the audio device; please activate your audio device."
        print >>sys.stderr, "system error message:", str(e)
        return

    dsp.setfmt(ossaudiodev.AFMT_S16_LE)
    dsp.channels(1)
    dsp.speed(16000)
    dsp.write(data)
    dsp.close()

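Since `play` only works where OSS is available, one possible workaround (an alternative, not what this module does) is to write the samples into a WAV container with the standard `wave` module and open the result in any audio player. A standalone Python 3 sketch, where `samples_to_wav` is a hypothetical helper and the parameters mirror the ones `play` passes to the audio device (signed 16-bit little-endian, one channel, 16000 Hz):

```python
import io
import wave


def samples_to_wav(data, fileobj, rate=16000):
    """Write raw 16-bit little-endian mono samples as a WAV stream."""
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(1)      # single channel
        w.setsampwidth(2)      # 2 bytes = 16-bit samples
        w.setframerate(rate)   # 16kHz sampling
        w.writeframes(data)


# Three fabricated samples written to an in-memory buffer.
buf = io.BytesIO()
samples_to_wav(b"\x01\x00\x02\x00\x03\x00", buf)
```

Passing an open binary file instead of `buf` would produce a file on disk.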
def demo():
    from nltk_lite.corpora import timit

    print "6th item (timit.items[5])"
    print "-------------------------"
    itemid = timit.items[5]
    spkrid, sentid = itemid.split(':')
    print " item id: ", itemid
    print " speaker id: ", spkrid
    print " sentence id:", sentid
    print
    record = timit.spkrinfo[spkrid]
    print " speaker information:"
    print " TIMIT speaker id: ", record['id']
    print " speaker sex: ", record['sex']
    print " dialect region: ", record['dr']
    print " data type: ", record['use']
    print " recording date: ", record['recdate']
    print " date of birth: ", record['birthdate']
    print " speaker height: ", record['ht']
    print " speaker race: ", record['race']
    print " speaker education:", record['edu']
    print " comments: ", record['comments']
    print

    print " words of the sentence:"
    print " ", timit.raw(sentences=itemid).next()
    print

    print " words of the sentence with offsets (first 3):"
    print " ", timit.raw(sentences=itemid, offset=True).next()[:3]
    print

    print " phonemes of the sentence (first 10):"
    print " ", timit.phonetic(sentences=itemid).next()[:10]
    print

    print " phonemes of the sentence with offsets (first 3):"
    print " ", timit.phonetic(sentences=itemid, offset=True).next()[:3]
    print

    print " looking up dictionary for words of the sentence..."
    words = timit.raw(sentences=itemid).next()
    for word in words:
        print " %-5s:" % word, timit.dictionary[word]
    print


    print "audio playback:"
    print "---------------"
    print " playing sentence", sentid, "by speaker", spkrid, "(a.k.a. %s)"%record["id"], "..."
    data = timit.audiodata(itemid)
    timit.play(data)
    print
    print " playing words:"
    words = timit.raw(sentences=itemid, offset=True).next()
    for word, start, end in words:
        print " playing %-10s in 1.5 seconds ..." % `word`
        time.sleep(1.5)
        data = timit.audiodata(itemid, start, end)
        timit.play(data)
    print
    print " playing phonemes (first 10):"
    phones = timit.phonetic(sentences=itemid, offset=True).next()
    for phone, start, end in phones[:10]:
        print " playing %-10s in 1.5 seconds ..." % `phone`
        time.sleep(1.5)
        data = timit.audiodata(itemid, start, end)
        timit.play(data)
    print

    # play sentence sa1 of all female speakers
    sentid = 'sa1'
    for spkr in timit.speakers:
        if timit.spkrinfo[spkr]['sex'] == 'F':
            itemid = spkr + ':' + sentid
            print " playing sentence %s of speaker %s ..." % (sentid, spkr)
            data = timit.audiodata(itemid)
            timit.play(data)
    print

if __name__ == '__main__':
    demo()