1
2
3
4
5
6
7
8
9 """Sequence input/output as SeqRecord objects.
10
11 The Bio.SeqIO module is also documented by a whole chapter in the Biopython
12 tutorial, and by the wiki http://biopython.org/wiki/SeqIO on the website.
13 The approach is designed to be similar to the bioperl SeqIO design.
14
15 Input
16 =====
17 The main function is Bio.SeqIO.parse(...) which takes an input file handle,
18 and format string. This returns an iterator giving SeqRecord objects.
19
20 from Bio import SeqIO
21 handle = open("example.fasta", "rU")
22 for record in SeqIO.parse(handle, "fasta") :
23 print record
24 handle.close()
25
26 Note that the parse() function will all invoke the relevant parser for the
27 format with its default settings. You may want more control, in which case
28 you need to create a format specific sequence iterator directly.
29
30 For non-interlaced files (e.g. Fasta, GenBank, EMBL) with multiple records
31 using a sequence iterator can save you a lot of memory (RAM). There is
32 less benefit for interlaced file formats (e.g. most multiple alignment file
33 formats). However, an iterator only lets you access the records one by one.
34
35 If you want random access to the records by number, turn this into a list:
36
37 from Bio import SeqIO
38 handle = open("example.fasta", "rU")
39 records = list(SeqIO.parse(handle, "fasta"))
40 handle.close()
41 print records[0]
42
43 If you want random access to the records by a key such as the record id,
44 turn the iterator into a dictionary:
45
46 from Bio import SeqIO
47 handle = open("example.fasta", "rU")
48 record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
49 handle.close()
50 print record["gi:12345678"]
51
52 If you expect your file to contain one-and-only-one record, then we provide
53 the following 'helper' function which will return a single SeqRecord, or
54 raise an exception if there are no records or more than one record:
55
56 from Bio import SeqIO
57 handle = open("example.fasta", "rU")
58 record = SeqIO.read(handle, "fasta")
59 handle.close()
60 print record
61
62 This style is useful when you expect a single record only (and would
63 consider multiple records an error). For example, when dealing with GenBank
64 files for bacterial genomes or chromosomes, there is normally only a single
65 record. Alternatively, use this with a handle when download a single record
66 from the internet.
67
68 However, if you just want the first record from a file containing multiple
69 record, use the iterator's next() method:
70
71 from Bio import SeqIO
72 handle = open("example.fasta", "rU")
73 record = SeqIO.parse(handle, "fasta").next()
74 handle.close()
75 print record
76
77 The above code will work as long as the file contains at least one record.
78 Note that if there is more than one record, the remaining records will be
79 silently ignored.
80
81 Input - Alignments
82 ==================
83 You can read in alignment files as Alignment objects using Bio.AlignIO.
84 Alternatively, reading in an alignment file format via Bio.SeqIO will give
85 you a SeqRecord for each row of each alignment.
86
87 Output
88 ======
89 Use the function Bio.SeqIO.write(...), which takes a complete set of
90 SeqRecord objects (either as a list, or an iterator), an output file handle
91 and of course the file format.
92
93 from Bio import SeqIO
94 records = ...
95 handle = open("example.faa", "w")
96 SeqIO.write(records, handle, "fasta")
97 handle.close()
98
99 In general, you are expected to call this function once (with all your
100 records) and then close the file handle.
101
102 Output - Advanced
103 =================
104 The effect of calling write() multiple times on a single file will vary
105 depending on the file format, and is best avoided unless you have a strong
106 reason to do so.
107
108 Trying this for certain alignment formats (e.g. phylip, clustal, stockholm)
109 would have the effect of concatenating several multiple sequence alignments
110 together. Such files are created by the PHYLIP suite of programs for
111 bootstrap analysis.
112
113 For sequential files formats (e.g. fasta, genbank) each "record block" holds
114 a single sequence. For these files it would probably be safe to call
115 write() multiple times.
116
117 File Formats
118 ============
119 When specifying the file format, use lowercase strings. The same format
120 names are also used in Bio.AlignIO and include the following:
121
122 ace - Reads the contig sequences from an ACE assembly file.
123 embl - The EMBL flat file format. Uses Bio.GenBank internally.
124 fasta - The generic sequence file format where each record starts with
125 an identifer line starting with a ">" character, followed by
126 lines of sequence.
127 genbank - The GenBank or GenPept flat file format.
128 ig - The IntelliGenetics file format, apparently the same as the
129 MASE alignment format.
130 phd - Output from PHRED, used by PHRAP and CONSED for input.
131 pir - A "FASTA like" format introduced by the National Biomedical
132 Research Foundation (NBRF) for the Protein Information Resource
133 (PIR) database, now part of UniProt.
134 swiss - Plain text Swiss-Prot aka UniProt format.
135 tab - Simple two column tab separated sequence files, where each
136 line holds a record's identifier and sequence. For example,
137 this is used as by Aligent's eArray software when saving
138 microarray probes in a minimal tab delimited text file.
139
140 Note that while Bio.SeqIO can read all the above file formats, it cannot write
141 to all of them.
142
143 You can also use any file format supported by Bio.AlignIO, such as "nexus",
144 "phlip" and "stockholm", which gives you access to the individual sequences
145 making up each alignment as SeqRecords.
146
147 Further Information
148 ===================
149 See the wiki page http://biopython.org/wiki/SeqIO and also the Bio.SeqIO
150 chapter in the Biopython Tutorial and Cookbook which is also available online:
151
152 http://biopython.org/DIST/docs/tutorial/Tutorial.html
153 http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
154 """
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170 """
171 FAO BioPython Developers
172 ========================
173 The way I envision this SeqIO system working as that for any sequence file
174 format we have an iterator that returns SeqRecord objects.
175
176 This also applies to interlaced fileformats (like clustal - although that
177 is now handled via Bio.AlignIO instead) where the file cannot be read record
178 by record. You should still return an iterator, even if the implementation
179 could just as easily return a list.
180
181 These file format specific sequence iterators may be implemented as:
182 * Classes which take a handle for __init__ and provide the __iter__ method
183 * Functions that take a handle, and return an iterator object
184 * Generator functions that take a handle, and yield SeqRecord objects
185
186 It is then trivial to turn this iterator into a list of SeqRecord objects,
187 an in memory dictionary, or a multiple sequence alignment object.
188
189 For building the dictionary by default the id propery of each SeqRecord is
190 used as the key. You should always populate the id property, and it should
191 be unique in most cases. For some file formats the accession number is a good
192 choice. If the file itself contains ambiguous identifiers, don't try and
193 dis-ambiguate them - return them as is.
194
195 When adding a new file format, please use the same lower case format name
196 as BioPerl, or if they have not defined one, try the names used by EMBOSS.
197
198 See also http://biopython.org/wiki/SeqIO_dev
199
200 --Peter
201 """
202
203 import os
204 from StringIO import StringIO
205 from Bio.Seq import Seq
206 from Bio.SeqRecord import SeqRecord
207 from Bio.Align.Generic import Alignment
208 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
209
210 import AceIO
211 import FastaIO
212 import IgIO
213 import InsdcIO
214 import PhdIO
215 import PirIO
216 import SwissIO
217 import TabIO
218
219
220
221
222
223
224
225
226
227
228 _FormatToIterator ={"fasta" : FastaIO.FastaIterator,
229 "genbank" : InsdcIO.GenBankIterator,
230 "genbank-cds" : InsdcIO.GenBankCdsFeatureIterator,
231 "embl" : InsdcIO.EmblIterator,
232 "embl-cds" : InsdcIO.EmblCdsFeatureIterator,
233 "ig" : IgIO.IgIterator,
234 "swiss" : SwissIO.SwissIterator,
235 "phd" : PhdIO.PhdIterator,
236 "ace" : AceIO.AceIterator,
237 "tab" : TabIO.TabIterator,
238 "pir" : PirIO.PirIterator,
239 }
240
241 _FormatToWriter ={"fasta" : FastaIO.FastaWriter,
242 "genbank" : InsdcIO.GenBankWriter,
243 "tab" : TabIO.TabWriter,
244 }
245
246 -def write(sequences, handle, format) :
247 """Write complete set of sequences to a file.
248
249 sequences - A list (or iterator) of SeqRecord objects.
250 handle - File handle object to write to.
251 format - lower case string describing the file format to write.
252
253 You should close the handle after calling this function.
254
255 Returns the number of records written (as an integer).
256 """
257 from Bio import AlignIO
258
259
260 if isinstance(handle, basestring) :
261 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
262 if not isinstance(format, basestring) :
263 raise TypeError("Need a string for the file format (lower case)")
264 if not format :
265 raise ValueError("Format required (lower case string)")
266 if format != format.lower() :
267 raise ValueError("Format string '%s' should be lower case" % format)
268 if isinstance(sequences,SeqRecord):
269 raise ValueError("Use a SeqRecord list/iterator, not just a single SeqRecord")
270
271
272 if format in _FormatToWriter :
273 writer_class = _FormatToWriter[format]
274 count = writer_class(handle).write_file(sequences)
275 elif format in AlignIO._FormatToWriter :
276
277
278 alignment = to_alignment(sequences)
279 alignment_count = AlignIO.write([alignment], handle, format)
280 assert alignment_count == 1, "Internal error - the underlying writer " \
281 + " should have returned 1, not %s" % repr(alignment_count)
282 count = len(alignment.get_all_seqs())
283 del alignment_count, alignment
284 elif format in _FormatToIterator or format in AlignIO._FormatToIterator :
285 raise ValueError("Reading format '%s' is supported, but not writing" \
286 % format)
287 else :
288 raise ValueError("Unknown format '%s'" % format)
289
290 assert isinstance(count, int), "Internal error - the underlying writer " \
291 + " should have returned the record count, not %s" % repr(count)
292 return count
293
294 -def parse(handle, format, alphabet=None) :
295 """Turns a sequence file into an iterator returning SeqRecords.
296
297 handle - handle to the file.
298 format - lower case string describing the file format.
299 alphabet - optional Alphabet object, useful when the sequence type cannot
300 be automatically inferred from the file itself (e.g. fasta)
301
302 Typical usage, opening a file to read in, and looping over the record(s):
303
304 >>> from Bio import SeqIO
305 >>> filename = "Nucleic/sweetpea.nu"
306 >>> for record in SeqIO.parse(open(filename,"rU"), "fasta") :
307 ... print "ID", record.id
308 ... print "Sequence length", len(record)
309 ... print "Sequence alphabet", record.seq.alphabet
310 ID gi|3176602|gb|U78617.1|LOU78617
311 Sequence length 309
312 Sequence alphabet SingleLetterAlphabet()
313
314 For file formats like FASTA where the alphabet cannot be determined, it
315 may be useful to specify the alphabet explicitly:
316
317 >>> from Bio import SeqIO
318 >>> from Bio.Alphabet import generic_dna
319 >>> filename = "Nucleic/sweetpea.nu"
320 >>> for record in SeqIO.parse(open(filename,"rU"), "fasta", generic_dna) :
321 ... print "ID", record.id
322 ... print "Sequence length", len(record)
323 ... print "Sequence alphabet", record.seq.alphabet
324 ID gi|3176602|gb|U78617.1|LOU78617
325 Sequence length 309
326 Sequence alphabet DNAAlphabet()
327
328 If you have a string 'data' containing the file contents, you must
329 first turn this into a handle in order to parse it:
330
331 from Bio import SeqIO
332 from StringIO import StringIO
333 my_iterator = SeqIO.parse(StringIO(data), format)
334
335 Use the Bio.SeqIO.read(handle, format) function when you expect a single
336 record only.
337 """
338 from Bio import AlignIO
339
340
341 if isinstance(handle, basestring) :
342 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
343 if not isinstance(format, basestring) :
344 raise TypeError("Need a string for the file format (lower case)")
345 if not format :
346 raise ValueError("Format required (lower case string)")
347 if format != format.lower() :
348 raise ValueError("Format string '%s' should be lower case" % format)
349 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \
350 isinstance(alphabet, AlphabetEncoder)) :
351 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
352
353
354 if format in _FormatToIterator :
355 iterator_generator = _FormatToIterator[format]
356 if alphabet is None :
357 return iterator_generator(handle)
358 try :
359 return iterator_generator(handle, alphabet=alphabet)
360 except :
361 return _force_alphabet(iterator_generator(handle), alphabet)
362 elif format in AlignIO._FormatToIterator :
363
364
365
366 return _iterate_via_AlignIO(handle, format, alphabet)
367 else :
368 raise ValueError("Unknown format '%s'" % format)
369
370
377
391
392 -def read(handle, format, alphabet=None) :
393 """Turns a sequence file into a single SeqRecord.
394
395 handle - handle to the file.
396 format - string describing the file format.
397 alphabet - optional Alphabet object, useful when the sequence type cannot
398 be automatically inferred from the file itself (e.g. fasta)
399
400 This function is for use parsing sequence files containing
401 exactly one record. For example, reading a GenBank file:
402
403 >>> from Bio import SeqIO
404 >>> record = SeqIO.read(open("GenBank/arab1.gb", "rU"), "genbank")
405 >>> print "ID", record.id
406 ID AC007323.5
407 >>> print "Sequence length", len(record)
408 Sequence length 86436
409 >>> print "Sequence alphabet", record.seq.alphabet
410 Sequence alphabet IUPACAmbiguousDNA()
411
412 If the handle contains no records, or more than one record,
413 an exception is raised. For example:
414
415 >>> from Bio import SeqIO
416 >>> record = SeqIO.read(open("GenBank/cor6_6.gb", "rU"), "genbank")
417 Traceback (most recent call last):
418 ...
419 ValueError: More than one record found in handle
420
421 If however you want the first record from a file containing
422 multiple records this function would raise an exception (as
423 shown in the example above). Instead use:
424
425 >>> from Bio import SeqIO
426 >>> record = SeqIO.parse(open("GenBank/cor6_6.gb", "rU"), "genbank").next()
427 >>> print "First record's ID", record.id
428 First record's ID X55053.1
429
430 Use the Bio.SeqIO.parse(handle, format) function if you want
431 to read multiple records from the handle.
432 """
433 iterator = parse(handle, format, alphabet)
434 try :
435 first = iterator.next()
436 except StopIteration :
437 first = None
438 if first is None :
439 raise ValueError("No records found in handle")
440 try :
441 second = iterator.next()
442 except StopIteration :
443 second = None
444 if second is not None :
445 raise ValueError("More than one record found in handle")
446 return first
447
448 -def to_dict(sequences, key_function=None) :
449 """Turns a sequence iterator or list into a dictionary.
450
451 sequences - An iterator that returns SeqRecord objects,
452 or simply a list of SeqRecord objects.
453 key_function - Optional function which when given a SeqRecord
454 returns a unique string for the dictionary key.
455
456 e.g. key_function = lambda rec : rec.name
457 or, key_function = lambda rec : rec.description.split()[0]
458
459 If key_function is ommitted then record.id is used, on the
460 assumption that the records objects returned are SeqRecords
461 with a unique id field.
462
463 If there are duplicate keys, an error is raised.
464
465 Example usage, defaulting to using the record.id as key:
466
467 >>> from Bio import SeqIO
468 >>> handle = open("GenBank/cor6_6.gb", "rU")
469 >>> format = "genbank"
470 >>> id_dict = SeqIO.to_dict(SeqIO.parse(handle, format))
471 >>> print id_dict.keys()
472 ['L31939.1', 'AJ237582.1', 'X62281.1', 'AF297471.1', 'X55053.1', 'M81224.1']
473 >>> print id_dict["L31939.1"].description
474 Brassica rapa (clone bif72) kin mRNA, complete cds.
475
476 A more complex example, using the key_function argument in order to use
477 a sequence checksum as the dictionary key:
478
479 >>> from Bio import SeqIO
480 >>> from Bio.SeqUtils.CheckSum import seguid
481 >>> handle = open("GenBank/cor6_6.gb", "rU")
482 >>> format = "genbank"
483 >>> seguid_dict = SeqIO.to_dict(SeqIO.parse(handle, format), \
484 key_function = lambda rec : seguid(rec.seq))
485 >>> for key, record in seguid_dict.iteritems() :
486 ... print key, record.id
487 SabZaA4V2eLE9/2Fm5FnyYy07J4 X55053.1
488 l7gjJFE6W/S1jJn5+1ASrUKW/FA X62281.1
489 /wQvmrl87QWcm9llO4/efg23Vgg AJ237582.1
490 TtWsXo45S3ZclIBy4X/WJc39+CY M81224.1
491 uVEYeAQSV5EDQOnFoeMmVea+Oow AF297471.1
492 BUg6YxXSKWEcFFH0L08JzaLGhQs L31939.1
493
494 """
495 if key_function is None :
496 key_function = lambda rec : rec.id
497
498 d = dict()
499 for record in sequences :
500 key = key_function(record)
501 if key in d :
502 raise ValueError("Duplicate key '%s'" % key)
503 d[key] = record
504 return d
505
506
508 """Returns a multiple sequence alignment (OBSOLETE).
509
510 sequences -An iterator that returns SeqRecord objects,
511 or simply a list of SeqRecord objects.
512 All the record sequences must be the same length.
513 alphabet - Optional alphabet. Stongly recommended.
514 strict - Optional, defaults to True. Should error checking
515 be done?
516
517 Using this function is now discouraged. Rather doing this:
518
519 from Bio import SeqIO
520 alignment = SeqIO.to_alignment(SeqIO.parse(handle, format))
521
522 You are now encouraged to use Bio.AlignIO instead, e.g.
523
524 from Bio import AlignIO
525 alignment = AlignIO.read(handle, format)
526 """
527
528 from Bio.Alphabet import generic_alphabet
529 from Bio.Alphabet import _consensus_alphabet
530 if alphabet is None :
531 sequences = list(sequences)
532 alphabet = _consensus_alphabet([rec.seq.alphabet for rec in sequences])
533
534 if not (isinstance(alphabet, Alphabet) or isinstance(alphabet, AlphabetEncoder)) :
535 raise ValueError("Invalid alphabet")
536
537 alignment_length = None
538 alignment = Alignment(alphabet)
539 for record in sequences :
540 if strict :
541 if alignment_length is None :
542 alignment_length = len(record.seq)
543 elif alignment_length != len(record.seq) :
544 raise ValueError("Sequences must all be the same length")
545
546 assert isinstance(record.seq.alphabet, Alphabet) \
547 or isinstance(record.seq.alphabet, AlphabetEncoder), \
548 "Sequence does not have a valid alphabet"
549
550
551
552
553 if isinstance(record.seq.alphabet, Alphabet) \
554 and isinstance(alphabet, Alphabet) :
555
556 if not isinstance(record.seq.alphabet, alphabet.__class__) :
557 raise ValueError("Incompatible sequence alphabet " \
558 + "%s for %s alignment" \
559 % (record.seq.alphabet, alphabet))
560 elif isinstance(record.seq.alphabet, AlphabetEncoder) \
561 and isinstance(alphabet, Alphabet) :
562 raise ValueError("Sequence has a gapped alphabet, alignment does not")
563 elif isinstance(record.seq.alphabet, Alphabet) \
564 and isinstance(alphabet, Gapped) :
565
566 if not isinstance(record.seq.alphabet, alphabet.alphabet.__class__) :
567 raise ValueError("Incompatible sequence alphabet " \
568 + "%s for %s alignment" \
569 % (record.seq.alphabet, alphabet))
570 else :
571
572 if not isinstance(record.seq.alphabet, alphabet.__class__) :
573 raise ValueError("Incompatible sequence alphabet " \
574 + "%s for %s alignment" \
575 % (record.seq.alphabet, alphabet))
576 if record.seq.alphabet.gap_char != alphabet.gap_char :
577 raise ValueError("Sequence gap characters != alignment gap char")
578
579
580
581
582
583
584 alignment._records.append(record)
585 return alignment
586
588 """Run the Bio.SeqIO module's doctests.
589
590 This will try and locate the unit tests directory, and run the doctests
591 from there in order that the relative paths used in the examples work.
592 """
593 import doctest
594 import os
595 if os.path.isdir(os.path.join("..","..","Tests")) :
596 print "Runing doctests..."
597 cur_dir = os.path.abspath(os.curdir)
598 os.chdir(os.path.join("..","..","Tests"))
599 doctest.testmod()
600 os.chdir(cur_dir)
601 del cur_dir
602 print "Done"
603
604 if __name__ == "__main__":
605
606 _test()
607