1
2
3
4
5
6 """Multiple sequence alignment input/output as Alignment objects.
7
8 The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in
9 fact the two are connected internally. Both modules use the same set of file
10 format names (lower case strings). From the user's perspective, you can read
11 in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you
12 can read in the sequences within these alignmenta using Bio.SeqIO.
13
14 Input
15 =====
16 For the typical special case when your file or handle contains one and only
17 one alignment, use the function Bio.AlignIO.read(). This takes an input file
18 handle, format string and optional number of sequences per alignment. It will
19 return a single Alignment object (or raise an exception if there isn't just
20 one alignment):
21
22 from Bio import AlignIO
23 handle = open("example.aln", "rU")
24 align = AlignIO.read(handle, "clustal")
25 handle.close()
26 print align
27
28 For the general case, when the handle could contain any number of alignments,
29 use the function Bio.AlignIO.parse(...) which takes the same arguments, but
30 returns an iterator giving Alignment objects. For example, using the output
31 from the EMBOSS water or needle pairwise alignment prorams:
32
33 from Bio import AlignIO
34 handle = open("example.txt", "rU")
35 for alignment in AlignIO.parse(handle, "emboss") :
36 print alignment
37
38 If you want random access to the alignments by number, turn this into a list:
39
40 from Bio import AlignIO
41 handle = open("example.aln", "rU")
42 alignments = list(AlignIO.parse(handle, "clustal"))
43 print alignments[0]
44
45 Most alignment file formats can be concatenated so as to hold as many
46 different multiple sequence alignments as possible. One common example
47 is the output of the tool seqboot in the PHLYIP suite. Sometimes there
48 can be a file header and footer, as seen in the EMBOSS alignment output.
49
50 There is an optional argument for the number of sequences per alignment which
51 is usually only needed with the alignments stored in the FASTA format.
52 Without this information, there is no clear way to tell if you have say a
53 single alignment of 20 sequences, or four alignments of 5 sequences. e.g.
54
55 from Bio import AlignIO
56 handle = open("example.faa", "rU")
57 for alignment in AlignIO.parse(handle, "fasta", seq_count=5) :
58 print alignment
59
60 The above code would split up the FASTA files, and try and batch every five
61 sequences into an alignment.
62
63 Output
64 ======
65 Use the function Bio.AlignIO.write(...), which takes a complete set of
66 Alignment objects (either as a list, or an iterator), an output file handle
67 and of course the file format.
68
69 from Bio import AlignIO
70 alignments = ...
71 handle = open("example.faa", "w")
72 alignment = SeqIO.write(alignments, handle, "fasta")
73 handle.close()
74
75 In general, you are expected to call this function once (with all your
76 alignments) and then close the file handle. However, for file formats
77 like PHYLIP where multiple alignments are stored sequentially (with no file
78 header and footer), then multiple calls to the write function should work as
79 expected.
80
81 File Formats
82 ============
83 When specifying the file format, use lowercase strings. The same format
84 names are also used in Bio.SeqIO and include the following:
85
86 clustal - Ouput from Clustal W or X, see also the module Bio.Clustalw
87 which can be used to run the command line tool from Biopython.
88 emboss - The "pairs" and "simple" alignment format from the EMBOSS tools.
89 fasta - The generic sequence file format where each record starts with a
90 identifer line starting with a ">" character, followed by lines
91 of sequence.
92 fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA
93 tools when used with the -m 10 command line option for machine
94 readable output.
95 ig - The IntelliGenetics file format, apparently the same as the
96 MASE alignment format.
97 nexus - Output from NEXUS, see also the module Bio.Nexus which can also
98 read any phylogenetic trees in these files.
99 phylip - Used by the PHLIP tools.
100 stockholm - A richly annotated alignment file format used by PFAM.
101
102 Note that while Bio.AlignIO can read all the above file formats, it cannot
103 write to all of them.
104
105 You can also use any file format supported by Bio.SeqIO, such as "fasta" or
106 "ig" (which are listed above), PROVIDED the sequences in your file are all the
107 same length.
108
109 Further Information
110 ===================
111 See the wiki page http://biopython.org/wiki/AlignIO and also the Bio.AlignIO
112 chapter in the Biopython Tutorial and Cookbook which is also available online:
113
114 http://biopython.org/DIST/docs/tutorial/Tutorial.html
115 http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
116 """
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132 import os
133
134 from StringIO import StringIO
135 from Bio.Seq import Seq
136 from Bio.SeqRecord import SeqRecord
137 from Bio.Align.Generic import Alignment
138 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
139
140 import StockholmIO
141 import ClustalIO
142 import NexusIO
143 import PhylipIO
144 import EmbossIO
145 import FastaIO
146
147
148
149
150 _FormatToIterator ={
151 "clustal" : ClustalIO.ClustalIterator,
152 "emboss" : EmbossIO.EmbossIterator,
153 "fasta-m10" : FastaIO.FastaM10Iterator,
154 "nexus" : NexusIO.NexusIterator,
155 "phylip" : PhylipIO.PhylipIterator,
156 "stockholm" : StockholmIO.StockholmIterator,
157 }
158
159 _FormatToWriter ={
160
161 "nexus" : NexusIO.NexusWriter,
162 "phylip" : PhylipIO.PhylipWriter,
163 "stockholm" : StockholmIO.StockholmWriter,
164 "clustal" : ClustalIO.ClustalWriter,
165 }
166
167 -def write(alignments, handle, format) :
168 """Write complete set of alignments to a file.
169
170 sequences - A list (or iterator) of Alignment objects
171 handle - File handle object to write to
172 format - lower case string describing the file format to write.
173
174 You should close the handle after calling this function.
175
176 Returns the number of alignments written (as an integer).
177 """
178 from Bio import SeqIO
179
180
181 if isinstance(handle, basestring) :
182 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
183 if not isinstance(format, basestring) :
184 raise TypeError("Need a string for the file format (lower case)")
185 if not format :
186 raise ValueError("Format required (lower case string)")
187 if format != format.lower() :
188 raise ValueError("Format string '%s' should be lower case" % format)
189 if isinstance(alignments, Alignment) :
190 raise TypeError("Need an Alignment list/iterator, not just a single Alignment")
191
192
193 if format in _FormatToIterator :
194 writer_class = _FormatToWriter[format]
195 count = writer_class(handle).write_file(alignments)
196 elif format in SeqIO._FormatToWriter :
197
198
199 count = 0
200 for alignment in alignments :
201 SeqIO.write(alignment, handle, format)
202 count += 1
203 elif format in _FormatToIterator or format in SeqIO._FormatToIterator :
204 raise ValueError("Reading format '%s' is supported, but not writing" \
205 % format)
206 else :
207 raise ValueError("Unknown format '%s'" % format)
208
209 assert isinstance(count, int), "Internal error - the underlying writer " \
210 + " should have returned the alignment count, not %s" % repr(count)
211 return count
212
213
215 """Uses Bio.SeqIO to create an Alignment iterator (PRIVATE).
216
217 handle - handle to the file.
218 format - string describing the file format.
219 alphabet - optional Alphabet object, useful when the sequence type cannot
220 be automatically inferred from the file itself (e.g. fasta)
221 seq_count- Optional integer, number of sequences expected in
222 each alignment. Recommended for fasta format files.
223
224 If count is omitted (default) then all the sequences in
225 the file are combined into a single Alignment.
226 """
227 from Bio import SeqIO
228 assert format in SeqIO._FormatToIterator
229
230 if seq_count :
231
232 seq_record_iterator = SeqIO.parse(handle, format, alphabet)
233
234 records = []
235 for record in seq_record_iterator :
236 records.append(record)
237 if len(records) == seq_count :
238 yield SeqIO.to_alignment(records)
239 records = []
240 if len(records) > 0 :
241 raise ValueError("Check seq_count argument, not enough sequences?")
242 else :
243
244
245 records = list(SeqIO.parse(handle, format, alphabet))
246 if records :
247 yield SeqIO.to_alignment(records)
248 else :
249
250 pass
251
271
272 -def parse(handle, format, seq_count=None, alphabet=None) :
273 """Turns a sequence file into an iterator returning Alignment objects.
274
275 handle - handle to the file.
276 format - string describing the file format.
277 alphabet - optional Alphabet object, useful when the sequence type cannot
278 be automatically inferred from the file itself (e.g. phylip)
279 seq_count- Optional integer, number of sequences expected in
280 each alignment. Recommended for fasta format files.
281
282 If you have the file name in a string 'filename', use:
283
284 >>> from Bio import AlignIO
285 >>> filename = "Emboss/needle.txt"
286 >>> format = "emboss"
287 >>> for alignment in AlignIO.parse(open(filename,"rU"), format) :
288 ... print "Alignment of length", alignment.get_alignment_length()
289 Alignment of length 124
290 Alignment of length 119
291 Alignment of length 120
292 Alignment of length 118
293 Alignment of length 125
294
295 If you have a string 'data' containing the file contents, use:
296
297 from Bio import AlignIO
298 from StringIO import StringIO
299 my_iterator = AlignIO.parse(StringIO(data), format)
300
301 Use the Bio.AlignIO.read() function when you expect a single record only.
302 """
303 from Bio import SeqIO
304
305
306 if isinstance(handle, basestring) :
307 raise TypeError("Need a file handle, not a string (i.e. not a filename)")
308 if not isinstance(format, basestring) :
309 raise TypeError("Need a string for the file format (lower case)")
310 if not format :
311 raise ValueError("Format required (lower case string)")
312 if format != format.lower() :
313 raise ValueError("Format string '%s' should be lower case" % format)
314 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \
315 isinstance(alphabet, AlphabetEncoder)) :
316 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
317
318
319 if format in _FormatToIterator :
320 iterator_generator = _FormatToIterator[format]
321 if alphabet is None :
322 return iterator_generator(handle, seq_count)
323 try :
324
325 return iterator_generator(handle, seq_count, alphabet=alphabet)
326 except TypeError :
327
328 return _force_alphabet(iterator_generator(handle, seq_count), alphabet)
329
330 elif format in SeqIO._FormatToIterator :
331
332 return _SeqIO_to_alignment_iterator(handle, format,
333 alphabet=alphabet,
334 seq_count=seq_count)
335 else :
336 raise ValueError("Unknown format '%s'" % format)
337
338 -def read(handle, format, seq_count=None, alphabet=None) :
339 """Turns an alignment file into a single Alignment object.
340
341 handle - handle to the file.
342 format - string describing the file format.
343 alphabet - optional Alphabet object, useful when the sequence type cannot
344 be automatically inferred from the file itself (e.g. phylip)
345 seq_count- Optional interger, number of sequences expected in
346 the alignment to check you got what you expected.
347
348 If the handle contains no alignments, or more than one alignment,
349 an exception is raised. For example, using a PFAM/Stockholm file
350 containing one alignment:
351
352 >>> from Bio import AlignIO
353 >>> filename = "Clustalw/protein.aln"
354 >>> format = "clustal"
355 >>> alignment = AlignIO.read(open(filename, "rU"), format)
356 >>> print "Alignment of length", alignment.get_alignment_length()
357 Alignment of length 411
358
359 If however you want the first alignment from a file containing
360 multiple alignments this function would raise an exception.
361 Instead use:
362
363 >>> from Bio import AlignIO
364 >>> filename = "Emboss/needle.txt"
365 >>> format = "emboss"
366 >>> alignment = AlignIO.parse(open(filename, "rU"), format).next()
367 >>> print "First alignment has length", alignment.get_alignment_length()
368 First alignment has length 124
369
370 Use the Bio.AlignIO.parse() function if you want to read multiple
371 records from the handle.
372 """
373 iterator = parse(handle, format, seq_count, alphabet)
374 try :
375 first = iterator.next()
376 except StopIteration :
377 first = None
378 if first is None :
379 raise ValueError("No records found in handle")
380 try :
381 second = iterator.next()
382 except StopIteration :
383 second = None
384 if second is not None :
385 raise ValueError("More than one record found in handle")
386 if seq_count :
387 assert len(first.get_all_seqs())==seq_count
388 return first
389
391 """Run the Bio.AlignIO module's doctests.
392
393 This will try and locate the unit tests directory, and run the doctests
394 from there in order that the relative paths used in the examples work.
395 """
396 import doctest
397 import os
398 if os.path.isdir(os.path.join("..","..","Tests")) :
399 print "Runing doctests..."
400 cur_dir = os.path.abspath(os.curdir)
401 os.chdir(os.path.join("..","..","Tests"))
402 doctest.testmod()
403 os.chdir(cur_dir)
404 del cur_dir
405 print "Done"
406
407 if __name__ == "__main__" :
408 _test()
409