2 Jul 09:20
Re: Next-gen modules
Peter Cock <p.j.a.cock <at> googlemail.com>
2009-07-02 07:20:07 GMT
2009-07-02 07:20:07 GMT
On 7/1/09, Giles Weaver wrote: > Peter, the trimming algorithm I use employs a sliding window, as follows: > > - For each sequence position calculate the mean phred quality score for a > window around that position. > - Record whether the mean score is above or below a threshold as an array > of zeros and ones. > - Use a regular expression on the joined array to find the start and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the quality > threshold with N. > - Trim any Ns from the ends. > > A refinement would be to weight the scores from positions in the window, but > this could give a performance hit, and the method seems to work well enough > as is. Thanks for the details - that is a bit more complex that what I had been thinking. Do you have any favoured window size and quality threshold, or does this really depend on the data itself? Also, if you find a sequence read that goes "good - poor - good" for example, do you extract the two good regions as two sub reads (presumably with a minimum length)? This may be silly for Illumina where the reads are very short, but might make sense for Roche 454. > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Even use cases are useful - so thank you. > Jonathan, some of the Illumina sequencing adapters are listed at > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > http://seqanswers.com/forums/showthread.php?t=198 > Adapter sequence typically appears towards the end of the read, though the > latter part of it is often misread as the sequencing quality drops off. > I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > real alignments and trim the sequence. This is not the ideal way of doing > things, but it's fast enough, and does seem to work. The adapter sequence > shouldn't be gapped, so I'm sure there is a lot of scope for optimising the > adapter removal. > > I'll happily share some code once I've got it to the stage where I'm not > embarrassed by it! > > Giles Cheers, Peter
RSS Feed