1
2
3
4
5
6
7
8
9 """Bio.SeqIO support for the FASTQ and QUAL file formats.
10
11 Note that you are expected to use this code via the Bio.SeqIO interface, as
12 shown below.
13
14 The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
15 to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
16 90). Rather than using a single FASTQ file, often paired FASTA and QUAL files
17 are used containing the sequence and the quality information separately.
18
19 The PHRED software reads DNA sequencing trace files, calls bases, and
20 assigns a non-negative quality value to each called base using a logged
21 transformation of the error probability, Q = -10 log10( Pe ), for example::
22
23 Pe = 1.0, Q = 0
24 Pe = 0.1, Q = 10
25 Pe = 0.01, Q = 20
26 ...
27 Pe = 0.00000001, Q = 80
28 Pe = 0.000000001, Q = 90
29
30 In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
31 In the QUAL format these quality values are held as space separated text in
32 a FASTA like file format. In the FASTQ format, each quality values is encoded
33 with a single ASCI character using chr(Q+33), meaning zero maps to the
34 character "!" and for example 80 maps to "q". For the Sanger FASTQ standard
35 the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
36 quality are then stored in pairs in a FASTA like format.
37
38 Unfortunately there is no official document describing the FASTQ file format,
39 and worse, several related but different variants exist. For more details,
40 please read this open access publication::
41
42 The Sanger FASTQ file format for sequences with quality scores, and the
43 Solexa/Illumina FASTQ variants.
44 P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),
45 M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).
46 Nucleic Acids Research 2010 38(6):1767-1771
47 http://dx.doi.org/10.1093/nar/gkp1137
48
49 The good news is that Roche 454 sequencers can output files in the QUAL format,
50 and sensibly they use PHREP style scores like Sanger. Converting a pair of
51 FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
52 files from a Roche 454 SFF binary file, use the Roche off instrument command
53 line tool "sffinfo" with the -q or -qual argument. You can extract a matching
54 FASTA file using the -s or -seq argument instead.
55
56 The bad news is that Solexa/Illumina did things differently - they have their
57 own scoring system AND their own incompatible versions of the FASTQ format.
58 Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
59 be negative. PHRED scores and Solexa scores are NOT interchangeable (but a
60 reasonable mapping can be achieved between them, and they are approximately
61 equal for higher quality reads).
62
63 Confusingly early Solexa pipelines produced a FASTQ like file but using their
64 own score mapping and an ASCII offset of 64. To make things worse, for the
65 Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
66 FASTQ file format, this time using PHRED scores (which is more consistent) but
67 with an ASCII offset of 64.
68
69 i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
70 file format: The original Sanger PHRED standard, and two from Solexa/Illumina.
71
72 You are expected to use this module via the Bio.SeqIO functions, with the
73 following format names:
74
75 - "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
76 - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
77 offset of 33 (e.g. from the NCBI Short Read Archive). These can hold PHRED
78 scores from 0 to 93.
79 - "fastq-sanger" is an alias for "fastq".
80 - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
81 files, using Solexa scores with an ASCII offset 64. These can hold Solexa
82 scores from -5 to 62.
83 - "fastq-illumina" means new Illumina 1.3+ style FASTQ files, using PHRED
84 scores but with an ASCII offset 64, allowing PHRED scores from 0 to 62.
85
86 We could potentially add support for "qual-solexa" meaning QUAL files which
87 contain Solexa scores, but thus far there isn't any reason to use such files.
88
89 For example, consider the following short FASTQ file::
90
91 @EAS54_6_R1_2_1_413_324
92 CCCTTCTTGTCTTCAGCGTTTCTCC
93 +
94 ;;3;;;;;;;;;;;;7;;;;;;;88
95 @EAS54_6_R1_2_1_540_792
96 TTGGCAGGCCAAGGCCGATGGATCA
97 +
98 ;;;;;;;;;;;7;;;;;-;;;3;83
99 @EAS54_6_R1_2_1_443_348
100 GTTGCTTCTGGCGTGGGTGGGGGGG
101 +
102 ;;;;;;;;;;;9;7;;.7;393333
103
104 This contains three reads of length 25. From the read length these were
105 probably originally from an early Solexa/Illumina sequencer but this file
106 follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
107 offet of 33). This means we can parse this file using Bio.SeqIO using
108 "fastq" as the format name:
109
110 >>> from Bio import SeqIO
111 >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"):
112 ... print record.id, record.seq
113 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
114 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
115 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
116
117 The qualities are held as a list of integers in each record's annotation:
118
119 >>> print record
120 ID: EAS54_6_R1_2_1_443_348
121 Name: EAS54_6_R1_2_1_443_348
122 Description: EAS54_6_R1_2_1_443_348
123 Number of features: 0
124 Per letter annotation for: phred_quality
125 Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
126 >>> print record.letter_annotations["phred_quality"]
127 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
128
129 You can use the SeqRecord format method to show this in the QUAL format:
130
131 >>> print record.format("qual")
132 >EAS54_6_R1_2_1_443_348
133 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
134 24 18 18 18 18
135 <BLANKLINE>
136
137 Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):
138
139 >>> print record.format("fastq")
140 @EAS54_6_R1_2_1_443_348
141 GTTGCTTCTGGCGTGGGTGGGGGGG
142 +
143 ;;;;;;;;;;;9;7;;.7;393333
144 <BLANKLINE>
145
146 Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
147 of 64):
148
149 >>> print record.format("fastq-illumina")
150 @EAS54_6_R1_2_1_443_348
151 GTTGCTTCTGGCGTGGGTGGGGGGG
152 +
153 ZZZZZZZZZZZXZVZZMVZRXRRRR
154 <BLANKLINE>
155
156 You can also get Biopython to convert the scores and show a Solexa style
157 FASTQ file:
158
159 >>> print record.format("fastq-solexa")
160 @EAS54_6_R1_2_1_443_348
161 GTTGCTTCTGGCGTGGGTGGGGGGG
162 +
163 ZZZZZZZZZZZXZVZZMVZRXRRRR
164 <BLANKLINE>
165
166 Notice that this is actually the same output as above using "fastq-illumina"
167 as the format! The reason for this is all these scores are high enough that
168 the PHRED and Solexa scores are almost equal. The differences become apparent
169 for poor quality reads. See the functions solexa_quality_from_phred and
170 phred_quality_from_solexa for more details.
171
172 If you wanted to trim your sequences (perhaps to remove low quality regions,
173 or to remove a primer sequence), try slicing the SeqRecord objects. e.g.
174
175 >>> sub_rec = record[5:15]
176 >>> print sub_rec
177 ID: EAS54_6_R1_2_1_443_348
178 Name: EAS54_6_R1_2_1_443_348
179 Description: EAS54_6_R1_2_1_443_348
180 Number of features: 0
181 Per letter annotation for: phred_quality
182 Seq('TTCTGGCGTG', SingleLetterAlphabet())
183 >>> print sub_rec.letter_annotations["phred_quality"]
184 [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
185 >>> print sub_rec.format("fastq")
186 @EAS54_6_R1_2_1_443_348
187 TTCTGGCGTG
188 +
189 ;;;;;;9;7;
190 <BLANKLINE>
191
192 If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
193
194 >>> from Bio import SeqIO
195 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
196 >>> out_handle = open("Quality/temp.qual", "w")
197 >>> SeqIO.write(record_iterator, out_handle, "qual")
198 3
199 >>> out_handle.close()
200
201 You can of course read in a QUAL file, such as the one we just created:
202
203 >>> from Bio import SeqIO
204 >>> for record in SeqIO.parse("Quality/temp.qual", "qual"):
205 ... print record.id, record.seq
206 EAS54_6_R1_2_1_413_324 ?????????????????????????
207 EAS54_6_R1_2_1_540_792 ?????????????????????????
208 EAS54_6_R1_2_1_443_348 ?????????????????????????
209
210 Notice that QUAL files don't have a proper sequence present! But the quality
211 information is there:
212
213 >>> print record
214 ID: EAS54_6_R1_2_1_443_348
215 Name: EAS54_6_R1_2_1_443_348
216 Description: EAS54_6_R1_2_1_443_348
217 Number of features: 0
218 Per letter annotation for: phred_quality
219 UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
220 >>> print record.letter_annotations["phred_quality"]
221 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
222
223 Just to keep things tidy, if you are following this example yourself, you can
224 delete this temporary file now:
225
226 >>> import os
227 >>> os.remove("Quality/temp.qual")
228
229 Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
230 files. Because the Bio.SeqIO system is designed for reading single files, you
231 would have to read the two in separately and then combine the data. However,
232 since this is such a common thing to want to do, there is a helper iterator
233 defined in this module that does this for you - PairedFastaQualIterator.
234
235 Alternatively, if you have enough RAM to hold all the records in memory at once,
236 then a simple dictionary approach would work:
237
238 >>> from Bio import SeqIO
239 >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta"))
240 >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"):
241 ... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
242
243 You can then access any record by its key, and get both the sequence and the
244 quality scores.
245
246 >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq")
247 @EAS54_6_R1_2_1_540_792
248 TTGGCAGGCCAAGGCCGATGGATCA
249 +
250 ;;;;;;;;;;;7;;;;;-;;;3;83
251 <BLANKLINE>
252
253 It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
254 using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
255 "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
256 for the more recent variant), as this cannot be detected reliably
257 automatically.
258
259 To illustrate this problem, let's consider an artifical example:
260
261 >>> from Bio.Seq import Seq
262 >>> from Bio.Alphabet import generic_dna
263 >>> from Bio.SeqRecord import SeqRecord
264 >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test",
265 ... description="Made up!")
266 >>> print test.format("fasta")
267 >Test Made up!
268 NACGTACGTA
269 <BLANKLINE>
270 >>> print test.format("fastq")
271 Traceback (most recent call last):
272 ...
273 ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).
274
275 We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
276 or FASTQ format we need to provide some quality scores. These are held as a
277 list of integers (one for each base) in the letter_annotations dictionary:
278
279 >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40]
280 >>> print test.format("qual")
281 >Test Made up!
282 0 1 2 3 4 5 10 20 30 40
283 <BLANKLINE>
284 >>> print test.format("fastq")
285 @Test Made up!
286 NACGTACGTA
287 +
288 !"#$%&+5?I
289 <BLANKLINE>
290
291 We can check this FASTQ encoding - the first PHRED quality was zero, and this
292 mapped to a exclamation mark, while the final score was 40 and this mapped to
293 the letter "I":
294
295 >>> ord('!') - 33
296 0
297 >>> ord('I') - 33
298 40
299 >>> [ord(letter)-33 for letter in '!"#$%&+5?I']
300 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
301
302 Similarly, we could produce an Illumina 1.3+ style FASTQ file using PHRED
303 scores with an offset of 64:
304
305 >>> print test.format("fastq-illumina")
306 @Test Made up!
307 NACGTACGTA
308 +
309 @ABCDEJT^h
310 <BLANKLINE>
311
312 And we can check this too - the first PHRED score was zero, and this mapped to
313 "@", while the final score was 40 and this mapped to "h":
314
315 >>> ord("@") - 64
316 0
317 >>> ord("h") - 64
318 40
319 >>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
320 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
321
322 Notice how different the standard Sanger FASTQ and the Illumina 1.3+ style
323 FASTQ files look for the same data! Then we have the older Solexa/Illumina
324 format to consider which encodes Solexa scores instead of PHRED scores.
325
326 First let's see what Biopython says if we convert the PHRED scores into Solexa
327 scores (rounding to one decimal place):
328
329 >>> for q in [0,1,2,3,4,5,10,20,30,40]:
330 ... print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))
331 PHRED 0 maps to Solexa -5.0
332 PHRED 1 maps to Solexa -5.0
333 PHRED 2 maps to Solexa -2.3
334 PHRED 3 maps to Solexa -0.0
335 PHRED 4 maps to Solexa 1.8
336 PHRED 5 maps to Solexa 3.3
337 PHRED 10 maps to Solexa 9.5
338 PHRED 20 maps to Solexa 20.0
339 PHRED 30 maps to Solexa 30.0
340 PHRED 40 maps to Solexa 40.0
341
342 Now here is the record using the old Solexa style FASTQ file:
343
344 >>> print test.format("fastq-solexa")
345 @Test Made up!
346 NACGTACGTA
347 +
348 ;;>@BCJT^h
349 <BLANKLINE>
350
351 Again, this is using an ASCII offset of 64, so we can check the Solexa scores:
352
353 >>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
354 [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]
355
356 This explains why the last few letters of this FASTQ output matched that using
357 the Illumina 1.3+ format - high quality PHRED scores and Solexa scores are
358 approximately equal.
359
360 """
361 __docformat__ = "epytext en"
362
363 from Bio.Alphabet import single_letter_alphabet
364 from Bio.Seq import Seq, UnknownSeq
365 from Bio.SeqRecord import SeqRecord
366 from Bio.SeqIO.Interfaces import SequentialSequenceWriter
367 from math import log
368 import warnings
369
370
371
372 SANGER_SCORE_OFFSET = 33
373 SOLEXA_SCORE_OFFSET = 64
374
376 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
377
378 PHRED and Solexa quality scores are both log transformations of a
379 probality of error (high score = low probability of error). This function
380 takes a PHRED score, transforms it back to a probability of error, and
381 then re-expresses it as a Solexa score. This assumes the error estimates
382 are equivalent.
383
384 How does this work exactly? Well the PHRED quality is minus ten times the
385 base ten logarithm of the probability of error::
386
387 phred_quality = -10*log(error,10)
388
389 Therefore, turning this round::
390
391 error = 10 ** (- phred_quality / 10)
392
393 Now, Solexa qualities use a different log transformation::
394
395 solexa_quality = -10*log(error/(1-error),10)
396
397 After substitution and a little manipulation we get::
398
399 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
400
401 However, real Solexa files use a minimum quality of -5. This does have a
402 good reason - a random a random base call would be correct 25% of the time,
403 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
404 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
405 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
406
407 Taken literally, this logarithic formula would map a PHRED quality of zero
408 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
409 score of zero means a probability of error of one (i.e. the base call is
410 definitely wrong), which is worse than random! In practice, a PHRED quality
411 of zero usually means a default value, or perhaps random - and therefore
412 mapping it to the minimum Solexa score of -5 is reasonable.
413
414 In conclusion, we follow EMBOSS, and take this logarithmic formula but also
415 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
416 quality of zero to -5.0 as well.
417
418 Note this function will return a floating point number, it is up to you to
419 round this to the nearest integer if appropriate. e.g.
420
421 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
422 80.00
423 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
424 50.00
425 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
426 19.96
427 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
428 9.54
429 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2)
430 3.35
431 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2)
432 1.80
433 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2)
434 -0.02
435 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2)
436 -2.33
437 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
438 -5.00
439 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2)
440 -5.00
441
442 Notice that for high quality reads PHRED and Solexa scores are numerically
443 equal. The differences are important for poor quality reads, where PHRED
444 has a minimum of zero but Solexa scores can be negative.
445
446 Finally, as a special case where None is used for a "missing value", None
447 is returned:
448
449 >>> print solexa_quality_from_phred(None)
450 None
451 """
452 if phred_quality is None:
453
454
455 return None
456 elif phred_quality > 0:
457
458
459 return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10))
460 elif phred_quality == 0:
461
462 return -5.0
463 else:
464 raise ValueError("PHRED qualities must be positive (or zero), not %s" \
465 % repr(phred_quality))
466
468 """Convert a Solexa quality (which can be negative) to a PHRED quality.
469
470 PHRED and Solexa quality scores are both log transformations of a
471 probality of error (high score = low probability of error). This function
472 takes a Solexa score, transforms it back to a probability of error, and
473 then re-expresses it as a PHRED score. This assumes the error estimates
474 are equivalent.
475
476 The underlying formulas are given in the documentation for the sister
477 function solexa_quality_from_phred, in this case the operation is::
478
479 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
480
481 This will return a floating point number, it is up to you to round this to
482 the nearest integer if appropriate. e.g.
483
484 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
485 80.00
486 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
487 20.04
488 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
489 10.41
490 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
491 3.01
492 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2)
493 1.19
494
495 Note that a solexa_quality less then -5 is not expected, will trigger a
496 warning, but will still be converted as per the logarithmic mapping
497 (giving a number between 0 and 1.19 back).
498
499 As a special case where None is used for a "missing value", None is
500 returned:
501
502 >>> print phred_quality_from_solexa(None)
503 None
504 """
505 if solexa_quality is None:
506
507 return None
508 if solexa_quality < -5:
509 import warnings
510 warnings.warn("Solexa quality less than -5 passed, %s" \
511 % repr(solexa_quality))
512 return 10*log(10**(solexa_quality/10.0) + 1, 10)
513
515 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
516
517 If there are no PHRED qualities, but there are Solexa qualities, those are
518 used instead after conversion.
519 """
520 try:
521 return record.letter_annotations["phred_quality"]
522 except KeyError:
523 pass
524 try:
525 return [phred_quality_from_solexa(q) for \
526 q in record.letter_annotations["solexa_quality"]]
527 except KeyError:
528 raise ValueError("No suitable quality scores found in "
529 "letter_annotations of SeqRecord (id=%s)." \
530 % record.id)
531
532
533 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp+SANGER_SCORE_OFFSET))) \
534 for qp in range(0, 93+1))
535
536 _solexa_to_sanger_quality_str = dict( \
537 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \
538 for qs in range(-5, 93+1))
540 """Returns a Sanger FASTQ encoded quality string (PRIVATE).
541
542 >>> from Bio.Seq import Seq
543 >>> from Bio.SeqRecord import SeqRecord
544 >>> r = SeqRecord(Seq("ACGTAN"), id="Test",
545 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]})
546 >>> _get_sanger_quality_str(r)
547 'SI?5+!'
548
549 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
550 the PHRED qualities are integers, this function is able to use a very fast
551 pre-cached mapping. However, if they are floats which differ slightly, then
552 it has to do the appropriate rounding - which is slower:
553
554 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
555 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]})
556 >>> _get_sanger_quality_str(r2)
557 'SI?5+!'
558
559 If your scores include a None value, this raises an exception:
560
561 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
562 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]})
563 >>> _get_sanger_quality_str(r3)
564 Traceback (most recent call last):
565 ...
566 TypeError: A quality value of None was found
567
568 If (strangely) your record has both PHRED and Solexa scores, then the PHRED
569 scores are used in preference:
570
571 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
572 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0],
573 ... "solexa_quality":[-5,-4,0,None,0,40]})
574 >>> _get_sanger_quality_str(r4)
575 'SI?5+!'
576
577 If there are no PHRED scores, but there are Solexa scores, these are used
578 instead (after the approriate conversion):
579
580 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
581 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]})
582 >>> _get_sanger_quality_str(r5)
583 'I?5+$"'
584
585 Again, integer Solexa scores can be looked up in a pre-cached mapping making
586 this very fast. You can still use approximate floating point scores:
587
588 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
589 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]})
590 >>> _get_sanger_quality_str(r6)
591 'I?5+$"'
592
593 Notice that due to the limited range of printable ASCII characters, a
594 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
595 file (using ASCII 126, the tilde). This function will issue a warning
596 in this situation.
597 """
598
599
600
601 try:
602
603 qualities = record.letter_annotations["phred_quality"]
604 except KeyError:
605
606 pass
607 else:
608
609 try:
610 return "".join([_phred_to_sanger_quality_str[qp] \
611 for qp in qualities])
612 except KeyError:
613
614 pass
615 if None in qualities:
616 raise TypeError("A quality value of None was found")
617 if max(qualities) >= 93.5:
618 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ")
619
620 return "".join([chr(min(126, int(round(qp))+SANGER_SCORE_OFFSET)) \
621 for qp in qualities])
622
623 try:
624 qualities = record.letter_annotations["solexa_quality"]
625 except KeyError:
626 raise ValueError("No suitable quality scores found in "
627 "letter_annotations of SeqRecord (id=%s)." \
628 % record.id)
629
630 try:
631 return "".join([_solexa_to_sanger_quality_str[qs] \
632 for qs in qualities])
633 except KeyError:
634
635 pass
636 if None in qualities:
637 raise TypeError("A quality value of None was found")
638
639
640 if max(qualities) >= 93.5:
641 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ")
642
643 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \
644 for qs in qualities])
645
646
647 assert 62+SOLEXA_SCORE_OFFSET == 126
648 _phred_to_illumina_quality_str = dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \
649 for qp in range(0, 62+1))
650
651 _solexa_to_illumina_quality_str = dict( \
652 (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
653 for qs in range(-5, 62+1))
655 """Returns an Illumina 1.3+ FASTQ encoded quality string (PRIVATE).
656
657 Notice that due to the limited range of printable ASCII characters, a
658 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
659 file (using ASCII 126, the tilde). This function will issue a warning
660 in this situation.
661 """
662
663
664
665 try:
666
667 qualities = record.letter_annotations["phred_quality"]
668 except KeyError:
669
670 pass
671 else:
672
673 try:
674 return "".join([_phred_to_illumina_quality_str[qp] \
675 for qp in qualities])
676 except KeyError:
677
678 pass
679 if None in qualities:
680 raise TypeError("A quality value of None was found")
681 if max(qualities) >= 62.5:
682 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ")
683
684 return "".join([chr(min(126, int(round(qp))+SOLEXA_SCORE_OFFSET)) \
685 for qp in qualities])
686
687 try:
688 qualities = record.letter_annotations["solexa_quality"]
689 except KeyError:
690 raise ValueError("No suitable quality scores found in "
691 "letter_annotations of SeqRecord (id=%s)." \
692 % record.id)
693
694 try:
695 return "".join([_solexa_to_illumina_quality_str[qs] \
696 for qs in qualities])
697 except KeyError:
698
699 pass
700 if None in qualities:
701 raise TypeError("A quality value of None was found")
702
703
704 if max(qualities) >= 62.5:
705 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ")
706
707 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
708 for qs in qualities])
709
710
711 assert 62+SOLEXA_SCORE_OFFSET == 126
712 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs+SOLEXA_SCORE_OFFSET))) \
713 for qs in range(-5, 62+1))
714
715 _phred_to_solexa_quality_str = dict(\
716 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \
717 for qp in range(0, 62+1))
719 """Returns a Solexa FASTQ encoded quality string (PRIVATE).
720
721 Notice that due to the limited range of printable ASCII characters, a
722 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
723 file (using ASCII 126, the tilde). This function will issue a warning
724 in this situation.
725 """
726
727
728
729 try:
730
731 qualities = record.letter_annotations["solexa_quality"]
732 except KeyError:
733
734 pass
735 else:
736
737 try:
738 return "".join([_solexa_to_solexa_quality_str[qs] \
739 for qs in qualities])
740 except KeyError:
741
742 pass
743 if None in qualities:
744 raise TypeError("A quality value of None was found")
745 if max(qualities) >= 62.5:
746 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ")
747
748 return "".join([chr(min(126, int(round(qs))+SOLEXA_SCORE_OFFSET)) \
749 for qs in qualities])
750
751 try:
752 qualities = record.letter_annotations["phred_quality"]
753 except KeyError:
754 raise ValueError("No suitable quality scores found in "
755 "letter_annotations of SeqRecord (id=%s)." \
756 % record.id)
757
758 try:
759 return "".join([_phred_to_solexa_quality_str[qp] \
760 for qp in qualities])
761 except KeyError:
762
763
764 pass
765 if None in qualities:
766 raise TypeError("A quality value of None was found")
767
768
769 if max(qualities) >= 62.5:
770 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ")
771 return "".join([chr(min(126,
772 int(round(solexa_quality_from_phred(qp))) + \
773 SOLEXA_SCORE_OFFSET)) \
774 for qp in qualities])
775
776
778 """Iterate over Fastq records as string tuples (not as SeqRecord objects).
779
780 This code does not try to interpret the quality string numerically. It
781 just returns tuples of the title, sequence and quality as strings. For
782 the sequence and quality, any whitespace (such as new lines) is removed.
783
784 Our SeqRecord based FASTQ iterators call this function internally, and then
785 turn the strings into a SeqRecord objects, mapping the quality string into
786 a list of numerical scores. If you want to do a custom quality mapping,
787 then you might consider calling this function directly.
788
789 For parsing FASTQ files, the title string from the "@" line at the start
790 of each record can optionally be omitted on the "+" lines. If it is
791 repeated, it must be identical.
792
793 The sequence string and the quality string can optionally be split over
794 multiple lines, although several sources discourage this. In comparison,
795 for the FASTA file format line breaks between 60 and 80 characters are
796 the norm.
797
798 WARNING - Because the "@" character can appear in the quality string,
799 this can cause problems as this is also the marker for the start of
800 a new sequence. In fact, the "+" sign can also appear as well. Some
801 sources recommended having no line breaks in the quality to avoid this,
802 but even that is not enough, consider this example::
803
804 @071113_EAS56_0053:1:1:998:236
805 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
806 +071113_EAS56_0053:1:1:998:236
807 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
808 @071113_EAS56_0053:1:1:182:712
809 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
810 +
811 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
812 @071113_EAS56_0053:1:1:153:10
813 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
814 +
815 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
816 @071113_EAS56_0053:1:3:990:501
817 TGGGAGGTTTTATGTGGA
818 AAGCAGCAATGTACAAGA
819 +
820 IIIIIII.IIIIII1@44
821 @-7.%<&+/$/%4(++(%
822
823 This is four PHRED encoded FASTQ entries originally from an NCBI source
824 (given the read length of 36, these are probably Solexa Illumna reads where
825 the quality has been mapped onto the PHRED values).
826
827 This example has been edited to illustrate some of the nasty things allowed
828 in the FASTQ format. Firstly, on the "+" lines most but not all of the
829 (redundant) identifiers are ommited. In real files it is likely that all or
830 none of these extra identifiers will be present.
831
832 Secondly, while the first three sequences have been shown without line
833 breaks, the last has been split over multiple lines. In real files any line
834 breaks are likely to be consistent.
835
836 Thirdly, some of the quality string lines start with an "@" character. For
837 the second record this is unavoidable. However for the fourth sequence this
838 only happens because its quality string is split over two lines. A naive
839 parser could wrongly treat any line starting with an "@" as the beginning of
840 a new sequence! This code copes with this possible ambiguity by keeping
841 track of the length of the sequence which gives the expected length of the
842 quality string.
843
844 Using this tricky example file as input, this short bit of code demonstrates
845 what this parsing function would return:
846
847 >>> handle = open("Quality/tricky.fastq", "rU")
848 >>> for (title, sequence, quality) in FastqGeneralIterator(handle):
849 ... print title
850 ... print sequence, quality
851 071113_EAS56_0053:1:1:998:236
852 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
853 071113_EAS56_0053:1:1:182:712
854 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
855 071113_EAS56_0053:1:1:153:10
856 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
857 071113_EAS56_0053:1:3:990:501
858 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
859 >>> handle.close()
860
861 Finally we note that some sources state that the quality string should
862 start with "!" (which using the PHRED mapping means the first letter always
863 has a quality score of zero). This rather restrictive rule is not widely
864 observed, so is therefore ignored here. One plus point about this "!" rule
865 is that (provided there are no line breaks in the quality sequence) it
866 would prevent the above problem with the "@" character.
867 """
868
869
870 handle_readline = handle.readline
871
872
873 while True:
874 line = handle_readline()
875 if line == "" : return
876 if line[0] == "@":
877 break
878
879 while True:
880 if line[0] != "@":
881 raise ValueError("Records in Fastq files should start with '@' character")
882 title_line = line[1:].rstrip()
883
884
885
886 seq_string = handle_readline().rstrip()
887
888 while True:
889 line = handle_readline()
890 if not line:
891 raise ValueError("End of file without quality information.")
892 if line[0] == "+":
893
894 second_title = line[1:].rstrip()
895 if second_title and second_title != title_line:
896 raise ValueError("Sequence and quality captions differ.")
897 break
898 seq_string += line.rstrip()
899
900
901 if " " in seq_string or "\t" in seq_string:
902 raise ValueError("Whitespace is not allowed in the sequence.")
903 seq_len = len(seq_string)
904
905
906 quality_string = handle_readline().rstrip()
907
908 while True:
909 line = handle_readline()
910 if not line : break
911 if line[0] == "@":
912
913
914
915
916 if len(quality_string) >= seq_len:
917
918
919 break
920
921 quality_string += line.rstrip()
922
923 if seq_len != len(quality_string):
924 raise ValueError("Lengths of sequence and quality values differs "
925 " for %s (%i and %i)." \
926 % (title_line, seq_len, len(quality_string)))
927
928
929 yield (title_line, seq_string, quality_string)
930 if not line : return
931 assert False, "Should not reach this line"
932
933
935 """Generator function to iterate over FASTQ records (as SeqRecord objects).
936
937 - handle - input file
938 - alphabet - optional alphabet
939 - title2ids - A function that, when given the title line from the FASTQ
940 file (without the beginning >), will return the id, name and
941 description (in that order) for the record as a tuple of
942 strings. If this is not given, then the entire title line
943 will be used as the description, and the first word as the
944 id and name.
945
946 Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
947
948 For each sequence in a (Sanger style) FASTQ file there is a matching string
949 encoding the PHRED qualities (integers between 0 and about 90) using ASCII
950 values with an offset of 33.
951
952 For example, consider a file containing three short reads::
953
954 @EAS54_6_R1_2_1_413_324
955 CCCTTCTTGTCTTCAGCGTTTCTCC
956 +
957 ;;3;;;;;;;;;;;;7;;;;;;;88
958 @EAS54_6_R1_2_1_540_792
959 TTGGCAGGCCAAGGCCGATGGATCA
960 +
961 ;;;;;;;;;;;7;;;;;-;;;3;83
962 @EAS54_6_R1_2_1_443_348
963 GTTGCTTCTGGCGTGGGTGGGGGGG
964 +
965 ;;;;;;;;;;;9;7;;.7;393333
966
967 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
968 string encoding the PHRED qualities using a ASCI values with an offset of
969 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
970
971 Using this module directly you might run:
972
973 >>> handle = open("Quality/example.fastq", "rU")
974 >>> for record in FastqPhredIterator(handle):
975 ... print record.id, record.seq
976 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
977 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
978 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
979 >>> handle.close()
980
981 Typically however, you would call this via Bio.SeqIO instead with "fastq"
982 (or "fastq-sanger") as the format:
983
984 >>> from Bio import SeqIO
985 >>> handle = open("Quality/example.fastq", "rU")
986 >>> for record in SeqIO.parse(handle, "fastq"):
987 ... print record.id, record.seq
988 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
989 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
990 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
991 >>> handle.close()
992
993 If you want to look at the qualities, they are record in each record's
994 per-letter-annotation dictionary as a simple list of integers:
995
996 >>> print record.letter_annotations["phred_quality"]
997 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
998 """
999 assert SANGER_SCORE_OFFSET == ord("!")
1000
1001
1002
1003
1004
1005 q_mapping = dict()
1006 for letter in range(0, 255):
1007 q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET
1008 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1009 if title2ids:
1010 id, name, descr = title2ids(title_line)
1011 else:
1012 descr = title_line
1013 id = descr.split()[0]
1014 name = id
1015 record = SeqRecord(Seq(seq_string, alphabet),
1016 id=id, name=name, description=descr)
1017 qualities = [q_mapping[letter] for letter in quality_string]
1018 if qualities and (min(qualities) < 0 or max(qualities) > 93):
1019 raise ValueError("Invalid character in quality string")
1020
1021
1022
1023
1024
1025 dict.__setitem__(record._per_letter_annotations,
1026 "phred_quality", qualities)
1027 yield record
1028
1029
1031 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
1032
1033 The optional arguments are the same as those for the FastqPhredIterator.
1034
1035 For each sequence in Solexa/Illumina FASTQ files there is a matching string
1036 encoding the Solexa integer qualities using ASCII values with an offset
1037 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython
1038 will NOT perform any automatic conversion when loading.
1039
1040 NOTE - This file format is used by the OLD versions of the Solexa/Illumina
1041 pipeline. See also the FastqIlluminaIterator function for the NEW version.
1042
1043 For example, consider a file containing these five records::
1044
1045 @SLXA-B3_649_FC8437_R1_1_1_610_79
1046 GATGTGCAATACCTTTGTAGAGGAA
1047 +SLXA-B3_649_FC8437_R1_1_1_610_79
1048 YYYYYYYYYYYYYYYYYYWYWYYSU
1049 @SLXA-B3_649_FC8437_R1_1_1_397_389
1050 GGTTTGAGAAAGAGAAATGAGATAA
1051 +SLXA-B3_649_FC8437_R1_1_1_397_389
1052 YYYYYYYYYWYYYYWWYYYWYWYWW
1053 @SLXA-B3_649_FC8437_R1_1_1_850_123
1054 GAGGGTGTTGATCATGATGATGGCG
1055 +SLXA-B3_649_FC8437_R1_1_1_850_123
1056 YYYYYYYYYYYYYWYYWYYSYYYSY
1057 @SLXA-B3_649_FC8437_R1_1_1_362_549
1058 GGAAACAAAGTTTTTCTCAACATAG
1059 +SLXA-B3_649_FC8437_R1_1_1_362_549
1060 YYYYYYYYYYYYYYYYYYWWWWYWY
1061 @SLXA-B3_649_FC8437_R1_1_1_183_714
1062 GTATTATTTAATGGCATACACTCAA
1063 +SLXA-B3_649_FC8437_R1_1_1_183_714
1064 YYYYYYYYYYWYYYYWYWWUWWWQQ
1065
1066 Using this module directly you might run:
1067
1068 >>> handle = open("Quality/solexa_example.fastq", "rU")
1069 >>> for record in FastqSolexaIterator(handle):
1070 ... print record.id, record.seq
1071 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1072 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1073 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1074 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1075 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1076 >>> handle.close()
1077
1078 Typically however, you would call this via Bio.SeqIO instead with
1079 "fastq-solexa" as the format:
1080
1081 >>> from Bio import SeqIO
1082 >>> handle = open("Quality/solexa_example.fastq", "rU")
1083 >>> for record in SeqIO.parse(handle, "fastq-solexa"):
1084 ... print record.id, record.seq
1085 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1086 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1087 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1088 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1089 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1090 >>> handle.close()
1091
1092 If you want to look at the qualities, they are recorded in each record's
1093 per-letter-annotation dictionary as a simple list of integers:
1094
1095 >>> print record.letter_annotations["solexa_quality"]
1096 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
1097
1098 These scores aren't very good, but they are high enough that they map
1099 almost exactly onto PHRED scores:
1100
1101 >>> print "%0.2f" % phred_quality_from_solexa(25)
1102 25.01
1103
1104 Let's look at faked example read which is even worse, where there are
1105 more noticeable differences between the Solexa and PHRED scores::
1106
1107 @slxa_0001_1_0001_01
1108 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1109 +slxa_0001_1_0001_01
1110 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1111
1112 Again, you would typically use Bio.SeqIO to read this file in (rather than
1113 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will
1114 contain thousands of reads, so you would normally use Bio.SeqIO.parse()
1115 as shown above. This example has only as one entry, so instead we can
1116 use the Bio.SeqIO.read() function:
1117
1118 >>> from Bio import SeqIO
1119 >>> handle = open("Quality/solexa_faked.fastq", "rU")
1120 >>> record = SeqIO.read(handle, "fastq-solexa")
1121 >>> handle.close()
1122 >>> print record.id, record.seq
1123 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1124 >>> print record.letter_annotations["solexa_quality"]
1125 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
1126
1127 These quality scores are so low that when converted from the Solexa scheme
1128 into PHRED scores they look quite different:
1129
1130 >>> print "%0.2f" % phred_quality_from_solexa(-1)
1131 2.54
1132 >>> print "%0.2f" % phred_quality_from_solexa(-5)
1133 1.19
1134
1135 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
1136 method to output the record(s):
1137
1138 >>> print record.format("fastq-solexa")
1139 @slxa_0001_1_0001_01
1140 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1141 +
1142 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1143 <BLANKLINE>
1144
1145 Note this output is slightly different from the input file as Biopython
1146 has left out the optional repetition of the sequence identifier on the "+"
1147 line. If you want the to use PHRED scores, use "fastq" or "qual" as the
1148 output format instead, and Biopython will do the conversion for you:
1149
1150 >>> print record.format("fastq")
1151 @slxa_0001_1_0001_01
1152 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1153 +
1154 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
1155 <BLANKLINE>
1156
1157 >>> print record.format("qual")
1158 >slxa_0001_1_0001_01
1159 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
1160 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1161 1 1
1162 <BLANKLINE>
1163
1164 As shown above, the poor quality Solexa reads have been mapped to the
1165 equivalent PHRED score (e.g. -5 to 1 as shown earlier).
1166 """
1167 q_mapping = dict()
1168 for letter in range(0, 255):
1169 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
1170 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1171 if title2ids:
1172 id, name, descr = title_line
1173 else:
1174 descr = title_line
1175 id = descr.split()[0]
1176 name = id
1177 record = SeqRecord(Seq(seq_string, alphabet),
1178 id=id, name=name, description=descr)
1179 qualities = [q_mapping[letter] for letter in quality_string]
1180
1181 if qualities and (min(qualities) < -5 or max(qualities)>62):
1182 raise ValueError("Invalid character in quality string")
1183
1184
1185 dict.__setitem__(record._per_letter_annotations,
1186 "solexa_quality", qualities)
1187 yield record
1188
1189
1191 """Parse new Illumina 1.3+ FASTQ like files (which differ in the quality mapping).
1192
1193 The optional arguments are the same as those for the FastqPhredIterator.
1194
1195 For each sequence in Illumina 1.3+ FASTQ files there is a matching string
1196 encoding PHRED integer qualities using ASCII values with an offset of 64.
1197
1198 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
1199 with an ASCII offset of 64. They are approximately equal but only for high
1200 qaulity reads. If you have an old Solexa/Illumina file with negative
1201 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
1202
1203 >>> from Bio import SeqIO
1204 >>> record = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-solexa")
1205 >>> print record.id, record.seq
1206 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1207 >>> print record.letter_annotations["solexa_quality"]
1208 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
1209 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina")
1210 Traceback (most recent call last):
1211 ...
1212 ValueError: Invalid character in quality string
1213
1214 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
1215 """
1216 q_mapping = dict()
1217 for letter in range(0, 255):
1218 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
1219 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1220 if title2ids:
1221 id, name, descr = title2ids(title_line)
1222 else:
1223 descr = title_line
1224 id = descr.split()[0]
1225 name = id
1226 record = SeqRecord(Seq(seq_string, alphabet),
1227 id=id, name=name, description=descr)
1228 qualities = [q_mapping[letter] for letter in quality_string]
1229 if qualities and (min(qualities) < 0 or max(qualities) > 62):
1230 raise ValueError("Invalid character in quality string")
1231
1232
1233 dict.__setitem__(record._per_letter_annotations,
1234 "phred_quality", qualities)
1235 yield record
1236
1238 """For QUAL files which include PHRED quality scores, but no sequence.
1239
1240 For example, consider this short QUAL file::
1241
1242 >EAS54_6_R1_2_1_413_324
1243 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1244 26 26 26 23 23
1245 >EAS54_6_R1_2_1_540_792
1246 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1247 26 18 26 23 18
1248 >EAS54_6_R1_2_1_443_348
1249 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1250 24 18 18 18 18
1251
1252 Using this module directly you might run:
1253
1254 >>> handle = open("Quality/example.qual", "rU")
1255 >>> for record in QualPhredIterator(handle):
1256 ... print record.id, record.seq
1257 EAS54_6_R1_2_1_413_324 ?????????????????????????
1258 EAS54_6_R1_2_1_540_792 ?????????????????????????
1259 EAS54_6_R1_2_1_443_348 ?????????????????????????
1260 >>> handle.close()
1261
1262 Typically however, you would call this via Bio.SeqIO instead with "qual"
1263 as the format:
1264
1265 >>> from Bio import SeqIO
1266 >>> handle = open("Quality/example.qual", "rU")
1267 >>> for record in SeqIO.parse(handle, "qual"):
1268 ... print record.id, record.seq
1269 EAS54_6_R1_2_1_413_324 ?????????????????????????
1270 EAS54_6_R1_2_1_540_792 ?????????????????????????
1271 EAS54_6_R1_2_1_443_348 ?????????????????????????
1272 >>> handle.close()
1273
1274 Becase QUAL files don't contain the sequence string itself, the seq
1275 property is set to an UnknownSeq object. As no alphabet was given, this
1276 has defaulted to a generic single letter alphabet and the character "?"
1277 used.
1278
1279 By specifying a nucleotide alphabet, "N" is used instead:
1280
1281 >>> from Bio import SeqIO
1282 >>> from Bio.Alphabet import generic_dna
1283 >>> handle = open("Quality/example.qual", "rU")
1284 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
1285 ... print record.id, record.seq
1286 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
1287 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
1288 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
1289 >>> handle.close()
1290
1291 However, the quality scores themselves are available as a list of integers
1292 in each record's per-letter-annotation:
1293
1294 >>> print record.letter_annotations["phred_quality"]
1295 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1296
1297 You can still slice one of these SeqRecord objects with an UnknownSeq:
1298
1299 >>> sub_record = record[5:10]
1300 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
1301 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
1302 """
1303
1304 while True:
1305 line = handle.readline()
1306 if line == "" : return
1307 if line[0] == ">":
1308 break
1309
1310 while True:
1311 if line[0] != ">":
1312 raise ValueError("Records in Fasta files should start with '>' character")
1313 if title2ids:
1314 id, name, descr = title2ids(line[1:].rstrip())
1315 else:
1316 descr = line[1:].rstrip()
1317 id = descr.split()[0]
1318 name = id
1319
1320 qualities = []
1321 line = handle.readline()
1322 while True:
1323 if not line : break
1324 if line[0] == ">": break
1325 qualities.extend([int(word) for word in line.split()])
1326 line = handle.readline()
1327
1328 if qualities and min(qualities) < 0:
1329 raise ValueError(("Negative quality score %i found in %s. " + \
1330 "Are these Solexa scores, not PHRED scores?") \
1331 % (min(qualities), id))
1332
1333
1334 record = SeqRecord(UnknownSeq(len(qualities), alphabet),
1335 id = id, name = name, description = descr)
1336
1337
1338 dict.__setitem__(record._per_letter_annotations,
1339 "phred_quality", qualities)
1340 yield record
1341
1342 if not line : return
1343 assert False, "Should not reach this line"
1344
1346 """Class to write standard FASTQ format files (using PHRED quality scores).
1347
1348 Although you can use this class directly, you are strongly encouraged
1349 to use the Bio.SeqIO.write() function instead via the format name "fastq"
1350 or the alias "fastq-sanger". For example, this code reads in a standard
1351 Sanger style FASTQ file (using PHRED scores) and re-saves it as another
1352 Sanger style FASTQ file:
1353
1354 >>> from Bio import SeqIO
1355 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1356 >>> out_handle = open("Quality/temp.fastq", "w")
1357 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1358 3
1359 >>> out_handle.close()
1360
1361 You might want to do this if the original file included extra line breaks,
1362 which while valid may not be supported by all tools. The output file from
1363 Biopython will have each sequence on a single line, and each quality
1364 string on a single line (which is considered desirable for maximum
1365 compatibility).
1366
1367 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
1368 quality scores) is converted into a standard Sanger style FASTQ file using
1369 PHRED qualities:
1370
1371 >>> from Bio import SeqIO
1372 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1373 >>> out_handle = open("Quality/temp.fastq", "w")
1374 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1375 5
1376 >>> out_handle.close()
1377
1378 This code is also called if you use the .format("fastq") method of a
1379 SeqRecord, or .format("fastq-sanger") if you prefer that alias.
1380
1381 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
1382 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1383 warning is issued.
1384
1385 P.S. To avoid cluttering up your working directory, you can delete this
1386 temporary file now:
1387
1388 >>> import os
1389 >>> os.remove("Quality/temp.fastq")
1390 """
1391 assert SANGER_SCORE_OFFSET == ord("!")
1392
1420
1422 """Class to write QUAL format files (using PHRED quality scores).
1423
1424 Although you can use this class directly, you are strongly encouraged
1425 to use the Bio.SeqIO.write() function instead. For example, this code
1426 reads in a FASTQ file and saves the quality scores into a QUAL file:
1427
1428 >>> from Bio import SeqIO
1429 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1430 >>> out_handle = open("Quality/temp.qual", "w")
1431 >>> SeqIO.write(record_iterator, out_handle, "qual")
1432 3
1433 >>> out_handle.close()
1434
1435 This code is also called if you use the .format("qual") method of a
1436 SeqRecord.
1437
1438 P.S. Don't forget to clean up the temp file if you don't need it anymore:
1439
1440 >>> import os
1441 >>> os.remove("Quality/temp.qual")
1442 """
1443 - def __init__(self, handle, wrap=60, record2title=None):
1444 """Create a QUAL writer.
1445
1446 Arguments:
1447 - handle - Handle to an output file, e.g. as returned
1448 by open(filename, "w")
1449 - wrap - Optional line length used to wrap sequence lines.
1450 Defaults to wrapping the sequence at 60 characters
1451 Use zero (or None) for no wrapping, giving a single
1452 long line for the sequence.
1453 - record2title - Optional function to return the text to be
1454 used for the title line of each record. By default
1455 a combination of the record.id and record.description
1456 is used. If the record.description starts with the
1457 record.id, then just the record.description is used.
1458
1459 The record2title argument is present for consistency with the
1460 Bio.SeqIO.FastaIO writer class.
1461 """
1462 SequentialSequenceWriter.__init__(self, handle)
1463
1464 self.wrap = None
1465 if wrap:
1466 if wrap < 1:
1467 raise ValueError
1468 self.wrap = wrap
1469 self.record2title = record2title
1470
1472 """Write a single QUAL record to the file."""
1473 assert self._header_written
1474 assert not self._footer_written
1475 self._record_written = True
1476
1477 if self.record2title:
1478 title = self.clean(self.record2title(record))
1479 else:
1480 id = self.clean(record.id)
1481 description = self.clean(record.description)
1482 if description and description.split(None, 1)[0]==id:
1483
1484 title = description
1485 elif description:
1486 title = "%s %s" % (id, description)
1487 else:
1488 title = id
1489 self.handle.write(">%s\n" % title)
1490
1491 qualities = _get_phred_quality(record)
1492 try:
1493
1494
1495 qualities_strs = [("%i" % round(q, 0)) for q in qualities]
1496 except TypeError, e:
1497 if None in qualities:
1498 raise TypeError("A quality value of None was found")
1499 else:
1500 raise e
1501
1502 if self.wrap:
1503 while qualities_strs:
1504 line = qualities_strs.pop(0)
1505 while qualities_strs \
1506 and len(line) + 1 + len(qualities_strs[0]) < self.wrap:
1507 line += " " + qualities_strs.pop(0)
1508 self.handle.write(line + "\n")
1509 else:
1510 data = " ".join(qualities_strs)
1511 self.handle.write(data + "\n")
1512
1514 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).
1515
1516 This outputs FASTQ files like those from the early Solexa/Illumina
1517 pipeline, using Solexa scores and an ASCII offset of 64. These are
1518 NOT compatible with the standard Sanger style PHRED FASTQ files.
1519
1520 If your records contain a "solexa_quality" entry under letter_annotations,
1521 this is used, otherwise any "phred_quality" entry will be used after
1522 conversion using the solexa_quality_from_phred function. If neither style
1523 of quality scores are present, an exception is raised.
1524
1525 Although you can use this class directly, you are strongly encouraged
1526 to use the Bio.SeqIO.write() function instead. For example, this code
1527 reads in a FASTQ file and re-saves it as another FASTQ file:
1528
1529 >>> from Bio import SeqIO
1530 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1531 >>> out_handle = open("Quality/temp.fastq", "w")
1532 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa")
1533 5
1534 >>> out_handle.close()
1535
1536 You might want to do this if the original file included extra line breaks,
1537 which (while valid) may not be supported by all tools. The output file
1538 from Biopython will have each sequence on a single line, and each quality
1539 string on a single line (which is considered desirable for maximum
1540 compatibility).
1541
1542 This code is also called if you use the .format("fastq-solexa") method of
1543 a SeqRecord. For example,
1544
1545 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1546 >>> print record.format("fastq-solexa")
1547 @Test PHRED qualities from 40 to 0 inclusive
1548 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1549 +
1550 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
1551 <BLANKLINE>
1552
1553 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
1554 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit,
1555 a warning is issued.
1556
1557 P.S. Don't forget to delete the temp file if you don't need it anymore:
1558
1559 >>> import os
1560 >>> os.remove("Quality/temp.fastq")
1561 """
1590
1592 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).
1593
1594 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
1595 using PHRED scores and an ASCII offset of 64. Note these files are NOT
1596 compatible with the standard Sanger style PHRED FASTQ files which use an
1597 ASCII offset of 32.
1598
1599 Although you can use this class directly, you are strongly encouraged to
1600 use the Bio.SeqIO.write() function with format name "fastq-illumina"
1601 instead. This code is also called if you use the .format("fastq-illumina")
1602 method of a SeqRecord. For example,
1603
1604 >>> from Bio import SeqIO
1605 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1606 >>> print record.format("fastq-illumina")
1607 @Test PHRED qualities from 40 to 0 inclusive
1608 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1609 +
1610 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
1611 <BLANKLINE>
1612
1613 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
1614 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1615 warning is issued.
1616 """
1645
1647 """Iterate over matched FASTA and QUAL files as SeqRecord objects.
1648
1649 For example, consider this short QUAL file with PHRED quality scores::
1650
1651 >EAS54_6_R1_2_1_413_324
1652 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1653 26 26 26 23 23
1654 >EAS54_6_R1_2_1_540_792
1655 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1656 26 18 26 23 18
1657 >EAS54_6_R1_2_1_443_348
1658 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1659 24 18 18 18 18
1660
1661 And a matching FASTA file::
1662
1663 >EAS54_6_R1_2_1_413_324
1664 CCCTTCTTGTCTTCAGCGTTTCTCC
1665 >EAS54_6_R1_2_1_540_792
1666 TTGGCAGGCCAAGGCCGATGGATCA
1667 >EAS54_6_R1_2_1_443_348
1668 GTTGCTTCTGGCGTGGGTGGGGGGG
1669
1670 You can parse these separately using Bio.SeqIO with the "qual" and
1671 "fasta" formats, but then you'll get a group of SeqRecord objects with
1672 no sequence, and a matching group with the sequence but not the
1673 qualities. Because it only deals with one input file handle, Bio.SeqIO
1674 can't be used to read the two files together - but this function can!
1675 For example,
1676
1677 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1678 ... open("Quality/example.qual", "rU"))
1679 >>> for record in rec_iter:
1680 ... print record.id, record.seq
1681 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1682 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1683 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1684
1685 As with the FASTQ or QUAL parsers, if you want to look at the qualities,
1686 they are in each record's per-letter-annotation dictionary as a simple
1687 list of integers:
1688
1689 >>> print record.letter_annotations["phred_quality"]
1690 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1691
1692 If you have access to data as a FASTQ format file, using that directly
1693 would be simpler and more straight forward. Note that you can easily use
1694 this function to convert paired FASTA and QUAL files into FASTQ files:
1695
1696 >>> from Bio import SeqIO
1697 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1698 ... open("Quality/example.qual", "rU"))
1699 >>> out_handle = open("Quality/temp.fastq", "w")
1700 >>> SeqIO.write(rec_iter, out_handle, "fastq")
1701 3
1702 >>> out_handle.close()
1703
1704 And don't forget to clean up the temp file if you don't need it anymore:
1705
1706 >>> import os
1707 >>> os.remove("Quality/temp.fastq")
1708 """
1709 from Bio.SeqIO.FastaIO import FastaIterator
1710 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \
1711 title2ids=title2ids)
1712 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \
1713 title2ids=title2ids)
1714
1715
1716
1717 while True:
1718 try:
1719 f_rec = fasta_iter.next()
1720 except StopIteration:
1721 f_rec = None
1722 try:
1723 q_rec = qual_iter.next()
1724 except StopIteration:
1725 q_rec = None
1726 if f_rec is None and q_rec is None:
1727
1728 break
1729 if f_rec is None:
1730 raise ValueError("FASTA file has more entries than the QUAL file.")
1731 if q_rec is None:
1732 raise ValueError("QUAL file has more entries than the FASTA file.")
1733 if f_rec.id != q_rec.id:
1734 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \
1735 % (f_rec.id, q_rec.id))
1736 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
1737 raise ValueError("Sequence length and number of quality scores disagree for %s" \
1738 % f_rec.id)
1739
1740 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"]
1741 yield f_rec
1742
1743
1744
1746 """Run the Bio.SeqIO module's doctests.
1747
1748 This will try and locate the unit tests directory, and run the doctests
1749 from there in order that the relative paths used in the examples work.
1750 """
1751 import doctest
1752 import os
1753 if os.path.isdir(os.path.join("..", "..", "Tests")):
1754 print "Runing doctests..."
1755 cur_dir = os.path.abspath(os.curdir)
1756 os.chdir(os.path.join("..", "..", "Tests"))
1757 assert os.path.isfile("Quality/example.fastq")
1758 assert os.path.isfile("Quality/example.fasta")
1759 assert os.path.isfile("Quality/example.qual")
1760 assert os.path.isfile("Quality/tricky.fastq")
1761 assert os.path.isfile("Quality/solexa_faked.fastq")
1762 doctest.testmod(verbose=0)
1763 os.chdir(cur_dir)
1764 del cur_dir
1765 print "Done"
1766
1767 if __name__ == "__main__":
1768 _test()
1769