Package Bio :: Package AlignIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.AlignIO

  1  # Copyright 2008 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Multiple sequence alignment input/output as Alignment objects. 
  7   
  8  The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in 
  9  fact the two are connected internally.  Both modules use the same set of file 
 10  format names (lower case strings).  From the user's perspective, you can read 
 11  in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you 
 12  can read in the sequences within these alignmenta using Bio.SeqIO. 
 13   
 14  Input 
 15  ===== 
 16  For the typical special case when your file or handle contains one and only 
 17  one alignment, use the function Bio.AlignIO.read().  This takes an input file 
 18  handle, format string and optional number of sequences per alignment.  It will 
 19  return a single Alignment object (or raise an exception if there isn't just 
 20  one alignment): 
 21   
 22      from Bio import AlignIO 
 23      handle = open("example.aln", "rU") 
 24      align = AlignIO.read(handle, "clustal") 
 25      handle.close() 
 26      print align 
 27   
 28  For the general case, when the handle could contain any number of alignments, 
 29  use the function Bio.AlignIO.parse(...) which takes the same arguments, but 
 30  returns an iterator giving Alignment objects.  For example, using the output 
 31  from the EMBOSS water or needle pairwise alignment prorams: 
 32   
 33      from Bio import AlignIO 
 34      handle = open("example.txt", "rU") 
 35      for alignment in AlignIO.parse(handle, "emboss") : 
 36          print alignment 
 37   
 38  If you want random access to the alignments by number, turn this into a list: 
 39   
 40      from Bio import AlignIO 
 41      handle = open("example.aln", "rU") 
 42      alignments = list(AlignIO.parse(handle, "clustal")) 
 43      print alignments[0] 
 44   
 45  Most alignment file formats can be concatenated so as to hold as many 
 46  different multiple sequence alignments as possible.  One common example 
 47  is the output of the tool seqboot in the PHLYIP suite.  Sometimes there 
 48  can be a file header and footer, as seen in the EMBOSS alignment output. 
 49   
 50  There is an optional argument for the number of sequences per alignment which 
 51  is usually only needed with the alignments stored in the FASTA format. 
 52  Without this information, there is no clear way to tell if you have say a 
 53  single alignment of 20 sequences, or four alignments of 5 sequences.  e.g. 
 54   
 55      from Bio import AlignIO 
 56      handle = open("example.faa", "rU") 
 57      for alignment in AlignIO.parse(handle, "fasta", seq_count=5) : 
 58          print alignment 
 59   
 60  The above code would split up the FASTA files, and try and batch every five 
 61  sequences into an alignment. 
 62   
 63  Output 
 64  ====== 
 65  Use the function Bio.AlignIO.write(...), which takes a complete set of 
 66  Alignment objects (either as a list, or an iterator), an output file handle 
 67  and of course the file format. 
 68   
 69      from Bio import AlignIO 
 70      alignments = ... 
 71      handle = open("example.faa", "w") 
 72      alignment = SeqIO.write(alignments, handle, "fasta") 
 73      handle.close() 
 74   
 75  In general, you are expected to call this function once (with all your 
 76  alignments) and then close the file handle.  However, for file formats 
 77  like PHYLIP where multiple alignments are stored sequentially (with no file 
 78  header and footer), then multiple calls to the write function should work as 
 79  expected. 
 80   
 81  File Formats 
 82  ============ 
 83  When specifying the file format, use lowercase strings.  The same format 
 84  names are also used in Bio.SeqIO and include the following: 
 85   
 86  clustal   - Ouput from Clustal W or X, see also the module Bio.Clustalw 
 87              which can be used to run the command line tool from Biopython. 
 88  emboss    - The "pairs" and "simple" alignment format from the EMBOSS tools. 
 89  fasta     - The generic sequence file format where each record starts with a 
 90              identifer line starting with a ">" character, followed by lines 
 91              of sequence. 
 92  fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA 
 93              tools when used with the -m 10 command line option for machine 
 94              readable output. 
 95  ig        - The IntelliGenetics file format, apparently the same as the 
 96              MASE alignment format. 
 97  nexus     - Output from NEXUS, see also the module Bio.Nexus which can also 
 98              read any phylogenetic trees in these files. 
 99  phylip    - Used by the PHLIP tools. 
100  stockholm - A richly annotated alignment file format used by PFAM. 
101   
102  Note that while Bio.AlignIO can read all the above file formats, it cannot 
103  write to all of them. 
104   
105  You can also use any file format supported by Bio.SeqIO, such as "fasta" or 
106  "ig" (which are listed above), PROVIDED the sequences in your file are all the 
107  same length. 
108   
109  Further Information 
110  =================== 
111  See the wiki page http://biopython.org/wiki/AlignIO and also the Bio.AlignIO 
112  chapter in the Biopython Tutorial and Cookbook which is also available online: 
113   
114  http://biopython.org/DIST/docs/tutorial/Tutorial.html 
115  http://biopython.org/DIST/docs/tutorial/Tutorial.pdf 
116  """ 
117   
118  #TODO 
119  # - define policy on reading aligned sequences with gaps in 
120  #   (e.g. - and . characters) including how the alphabet interacts 
121  # 
122  # - Can we build the to_alignment(...) functionality 
123  #   into the generic Alignment class instead? 
124  # 
125  # - How best to handle unique/non unique record.id when writing. 
126  #   For most file formats reading such files is fine; The stockholm 
127  #   parser would fail. 
128  # 
129  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
130  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format  
131   
132  import os 
133  #from cStringIO import StringIO 
134  from StringIO import StringIO 
135  from Bio.Seq import Seq 
136  from Bio.SeqRecord import SeqRecord 
137  from Bio.Align.Generic import Alignment 
138  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
139   
140  import StockholmIO 
141  import ClustalIO 
142  import NexusIO 
143  import PhylipIO 
144  import EmbossIO 
145  import FastaIO 
146   
147  #Convention for format names is "mainname-subtype" in lower case. 
148  #Please use the same names as BioPerl and EMBOSS where possible. 
149   
150  _FormatToIterator ={#"fasta" is done via Bio.SeqIO 
151                      "clustal" : ClustalIO.ClustalIterator, 
152                      "emboss" : EmbossIO.EmbossIterator, 
153                      "fasta-m10" : FastaIO.FastaM10Iterator, 
154                      "nexus" : NexusIO.NexusIterator, 
155                      "phylip" : PhylipIO.PhylipIterator, 
156                      "stockholm" : StockholmIO.StockholmIterator, 
157                      } 
158   
159  _FormatToWriter ={#"fasta" is done via Bio.SeqIO 
160                    #"emboss" : EmbossIO.EmbossWriter, (unfinished) 
161                    "nexus" : NexusIO.NexusWriter, 
162                    "phylip" : PhylipIO.PhylipWriter, 
163                    "stockholm" : StockholmIO.StockholmWriter, 
164                    "clustal" : ClustalIO.ClustalWriter, 
165                    } 
166   
167 -def write(alignments, handle, format) :
168 """Write complete set of alignments to a file. 169 170 sequences - A list (or iterator) of Alignment objects 171 handle - File handle object to write to 172 format - lower case string describing the file format to write. 173 174 You should close the handle after calling this function. 175 176 Returns the number of alignments written (as an integer). 177 """ 178 from Bio import SeqIO 179 180 #Try and give helpful error messages: 181 if isinstance(handle, basestring) : 182 raise TypeError("Need a file handle, not a string (i.e. not a filename)") 183 if not isinstance(format, basestring) : 184 raise TypeError("Need a string for the file format (lower case)") 185 if not format : 186 raise ValueError("Format required (lower case string)") 187 if format != format.lower() : 188 raise ValueError("Format string '%s' should be lower case" % format) 189 if isinstance(alignments, Alignment) : 190 raise TypeError("Need an Alignment list/iterator, not just a single Alignment") 191 192 #Map the file format to a writer class 193 if format in _FormatToIterator : 194 writer_class = _FormatToWriter[format] 195 count = writer_class(handle).write_file(alignments) 196 elif format in SeqIO._FormatToWriter : 197 #Exploit the existing SeqIO parser to the dirty work! 198 #TODO - Can we make one call to SeqIO.write() and count the alignments? 199 count = 0 200 for alignment in alignments : 201 SeqIO.write(alignment, handle, format) 202 count += 1 203 elif format in _FormatToIterator or format in SeqIO._FormatToIterator : 204 raise ValueError("Reading format '%s' is supported, but not writing" \ 205 % format) 206 else : 207 raise ValueError("Unknown format '%s'" % format) 208 209 assert isinstance(count, int), "Internal error - the underlying writer " \ 210 + " should have returned the alignment count, not %s" % repr(count) 211 return count
212 213 #This is a generator function!
214 -def _SeqIO_to_alignment_iterator(handle, format, alphabet=None, seq_count=None) :
215 """Uses Bio.SeqIO to create an Alignment iterator (PRIVATE). 216 217 handle - handle to the file. 218 format - string describing the file format. 219 alphabet - optional Alphabet object, useful when the sequence type cannot 220 be automatically inferred from the file itself (e.g. fasta) 221 seq_count- Optional integer, number of sequences expected in 222 each alignment. Recommended for fasta format files. 223 224 If count is omitted (default) then all the sequences in 225 the file are combined into a single Alignment. 226 """ 227 from Bio import SeqIO 228 assert format in SeqIO._FormatToIterator 229 230 if seq_count : 231 #Use the count to split the records into batches. 232 seq_record_iterator = SeqIO.parse(handle, format, alphabet) 233 234 records = [] 235 for record in seq_record_iterator : 236 records.append(record) 237 if len(records) == seq_count : 238 yield SeqIO.to_alignment(records) 239 records = [] 240 if len(records) > 0 : 241 raise ValueError("Check seq_count argument, not enough sequences?") 242 else : 243 #Must assume that there is a single alignment using all 244 #the SeqRecord objects: 245 records = list(SeqIO.parse(handle, format, alphabet)) 246 if records : 247 yield SeqIO.to_alignment(records) 248 else : 249 #No alignment found! 250 pass
251
252 -def _force_alphabet(alignment_iterator, alphabet) :
253 """Iterate over alignments, over-riding the alphabet (PRIVATE).""" 254 #Assume the alphabet argument has been pre-validated 255 given_base_class = _get_base_alphabet(alphabet).__class__ 256 for align in alignment_iterator : 257 if not isinstance(_get_base_alphabet(align._alphabet), 258 given_base_class) : 259 raise ValueError("Specified alphabet %s clashes with "\ 260 "that determined from the file, %s" \ 261 % (repr(alphabet), repr(align._alphabet))) 262 for record in align : 263 if not isinstance(_get_base_alphabet(record.seq.alphabet), 264 given_base_class) : 265 raise ValueError("Specified alphabet %s clashes with "\ 266 "that determined from the file, %s" \ 267 % (repr(alphabet), repr(record.seq.alphabet))) 268 record.seq.alphabet = alphabet 269 align._alphabet = alphabet 270 yield align
271
272 -def parse(handle, format, seq_count=None, alphabet=None) :
273 """Turns a sequence file into an iterator returning Alignment objects. 274 275 handle - handle to the file. 276 format - string describing the file format. 277 alphabet - optional Alphabet object, useful when the sequence type cannot 278 be automatically inferred from the file itself (e.g. phylip) 279 seq_count- Optional integer, number of sequences expected in 280 each alignment. Recommended for fasta format files. 281 282 If you have the file name in a string 'filename', use: 283 284 >>> from Bio import AlignIO 285 >>> filename = "Emboss/needle.txt" 286 >>> format = "emboss" 287 >>> for alignment in AlignIO.parse(open(filename,"rU"), format) : 288 ... print "Alignment of length", alignment.get_alignment_length() 289 Alignment of length 124 290 Alignment of length 119 291 Alignment of length 120 292 Alignment of length 118 293 Alignment of length 125 294 295 If you have a string 'data' containing the file contents, use: 296 297 from Bio import AlignIO 298 from StringIO import StringIO 299 my_iterator = AlignIO.parse(StringIO(data), format) 300 301 Use the Bio.AlignIO.read() function when you expect a single record only. 302 """ 303 from Bio import SeqIO 304 305 #Try and give helpful error messages: 306 if isinstance(handle, basestring) : 307 raise TypeError("Need a file handle, not a string (i.e. not a filename)") 308 if not isinstance(format, basestring) : 309 raise TypeError("Need a string for the file format (lower case)") 310 if not format : 311 raise ValueError("Format required (lower case string)") 312 if format != format.lower() : 313 raise ValueError("Format string '%s' should be lower case" % format) 314 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \ 315 isinstance(alphabet, AlphabetEncoder)) : 316 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 317 318 #Map the file format to a sequence iterator: 319 if format in _FormatToIterator : 320 iterator_generator = _FormatToIterator[format] 321 if alphabet is None : 322 return iterator_generator(handle, seq_count) 323 try : 324 #Initially assume the optional alphabet argument is supported 325 return iterator_generator(handle, seq_count, alphabet=alphabet) 326 except TypeError : 327 #It isn't supported. 328 return _force_alphabet(iterator_generator(handle, seq_count), alphabet) 329 330 elif format in SeqIO._FormatToIterator : 331 #Exploit the existing SeqIO parser to the dirty work! 332 return _SeqIO_to_alignment_iterator(handle, format, 333 alphabet=alphabet, 334 seq_count=seq_count) 335 else : 336 raise ValueError("Unknown format '%s'" % format)
337
338 -def read(handle, format, seq_count=None, alphabet=None) :
339 """Turns an alignment file into a single Alignment object. 340 341 handle - handle to the file. 342 format - string describing the file format. 343 alphabet - optional Alphabet object, useful when the sequence type cannot 344 be automatically inferred from the file itself (e.g. phylip) 345 seq_count- Optional interger, number of sequences expected in 346 the alignment to check you got what you expected. 347 348 If the handle contains no alignments, or more than one alignment, 349 an exception is raised. For example, using a PFAM/Stockholm file 350 containing one alignment: 351 352 >>> from Bio import AlignIO 353 >>> filename = "Clustalw/protein.aln" 354 >>> format = "clustal" 355 >>> alignment = AlignIO.read(open(filename, "rU"), format) 356 >>> print "Alignment of length", alignment.get_alignment_length() 357 Alignment of length 411 358 359 If however you want the first alignment from a file containing 360 multiple alignments this function would raise an exception. 361 Instead use: 362 363 >>> from Bio import AlignIO 364 >>> filename = "Emboss/needle.txt" 365 >>> format = "emboss" 366 >>> alignment = AlignIO.parse(open(filename, "rU"), format).next() 367 >>> print "First alignment has length", alignment.get_alignment_length() 368 First alignment has length 124 369 370 Use the Bio.AlignIO.parse() function if you want to read multiple 371 records from the handle. 372 """ 373 iterator = parse(handle, format, seq_count, alphabet) 374 try : 375 first = iterator.next() 376 except StopIteration : 377 first = None 378 if first is None : 379 raise ValueError("No records found in handle") 380 try : 381 second = iterator.next() 382 except StopIteration : 383 second = None 384 if second is not None : 385 raise ValueError("More than one record found in handle") 386 if seq_count : 387 assert len(first.get_all_seqs())==seq_count 388 return first
389
390 -def _test():
391 """Run the Bio.AlignIO module's doctests. 392 393 This will try and locate the unit tests directory, and run the doctests 394 from there in order that the relative paths used in the examples work. 395 """ 396 import doctest 397 import os 398 if os.path.isdir(os.path.join("..","..","Tests")) : 399 print "Runing doctests..." 400 cur_dir = os.path.abspath(os.curdir) 401 os.chdir(os.path.join("..","..","Tests")) 402 doctest.testmod() 403 os.chdir(cur_dir) 404 del cur_dir 405 print "Done"
406 407 if __name__ == "__main__" : 408 _test() 409