This will likely be the approach for a more NGS-friendly Bio::Seq class.
Calculation of the PHRED scores could also be deferred until needed.
seqtk has some C-based methods that we could possibly take advantage of,
but will have to look into it.
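
To make the deferral idea concrete, here is a minimal Python sketch of a record that keeps the raw quality string and only decodes PHRED scores on first access. The class name `LazyFastqRecord`, its attributes, and the default offset of 33 (Sanger / Illumina 1.8+ encoding) are assumptions made for illustration; this is not the BioPerl or seqtk implementation.

```python
from functools import cached_property


class LazyFastqRecord:
    """Hypothetical record: stores the raw quality string, decodes on demand."""

    def __init__(self, name, seq, qual, offset=33):
        self.name = name
        self.seq = seq
        self.raw_qual = qual  # kept as-is; nothing is decoded here
        self.offset = offset  # 33 is an assumption (Sanger / Illumina 1.8+)

    @cached_property
    def phred(self):
        # Runs only on first access; the result is then cached on the instance.
        return [ord(ch) - self.offset for ch in self.raw_qual]


rec = LazyFastqRecord("read1", "ACGT", "II5!")
print(rec.phred)  # decoding happens here, not in __init__
```

Code that only touches `name` and `seq` never pays for the decoding, and a record with an odd quality string does not fail until its scores are actually requested.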
On Feb 7, 2013, at 9:25 AM, Aaron Mackey wrote:
> You might also want to consider a lazy/pull-based parser to defer
> parsing/object-building for pieces of the object that don't get used. This
> also usually provides some error tolerance.
> Aaron J. Mackey, PhD
> Assistant Professor
> Center for Public Health Genomics
> University of Virginia
> [email protected]
> On Wed, Feb 6, 2013 at 5:53 PM, Fields, Christopher J wrote:
> On Feb 6, 2013, at 4:43 PM, Peter Cock wrote:
> > On Wed, Feb 6, 2013 at 10:11 PM, Fields, Christopher J
> > wrote:
> >> I see no problem in stating any generic parsing and low-level
> >> are just as much a part of what BioPerl encompasses as the
> >> Bio::* classes themselves. Steve and Jason were on to something with
> >> SearchIO; it's maybe not as performant as we would like, but it
> >> is more flexible in terms of what can be done, b/c it separates out
> >> low-level parsing from object creation. That's the general model we
> >> should look at. There is a good reason Biopython is following this
> >> model with their SearchIO implementation (Peter C, are you reading
> > Actually I don't think we did end up with that kind of separation in
> > Biopython SearchIO - which is not to say it isn't an excellent model
> > to follow. Rather the Biopython SearchIO (like the BioPerl one) had
> > as the first goal a consistent object model across assorted file
> > formats.
> > The idea of low-level, minimal-overhead parsers (which are very
> > format specific), on which a heavier but consistent object model
> > can be built, might be a good balance - the high-level API has the
> > convenience, but if you give that up you can have more speed.
> > That's what I recommend with FASTQ and Biopython, e.g.
> > http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
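
The two-layer split described above might look roughly like the following Python sketch: a low-level generator that yields plain string tuples, and an optional object layer built on top of it. The names `iter_fastq_raw`, `Read`, and `iter_reads` are hypothetical, and the parser assumes strict four-line records with no line wrapping; it is not the Biopython or BioPerl code.

```python
import io


def iter_fastq_raw(handle):
    """Low-level layer: yield (title, seq, qual) string tuples, no objects.

    Assumes strict four-line records (no wrapped sequence or quality lines).
    """
    while True:
        title = handle.readline()
        if not title:
            return  # end of input
        if not title.strip():
            continue  # skip blank lines between or after records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {title!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        yield title[1:].rstrip("\n"), seq, qual


class Read:
    """High-level layer (hypothetical): a richer object built from a raw tuple."""

    def __init__(self, title, seq, qual):
        self.id = title.split(" ", 1)[0]
        self.description = title
        self.seq = seq
        self.qual = qual


def iter_reads(handle):
    # The convenient API is a thin wrapper over the fast tuple parser.
    for title, seq, qual in iter_fastq_raw(handle):
        yield Read(title, seq, qual)


data = "@r1 first\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n"
print(list(iter_fastq_raw(io.StringIO(data))))
print([r.id for r in iter_reads(io.StringIO(data))])
```

Callers who only need to filter or count reads can stay on `iter_fastq_raw` and skip object construction entirely; the object layer costs extra only for those who ask for it.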
> >> I have started a wrapper around Heng's FASTQ/FASTA parsing
> >> code (kseq), it seems to work quite well (~20M FASTQ in 30 sec
> >> last I recall?).
> > I'd have to dig through my emails, but I think the BioRuby guys
> > looked at that too - as I recall while it was fast, the error handling
> > left something to be desired. Email me directly or on the BioRuby
> > list if you want to follow up on that.
> > Regards,
> > Peter
> I did a little on this, worth following up on, but I pulled the FASTQ
> test examples you created from the paper to test it out. IIRC it parsed
> where it needed to, but I'm not sure how it handled bad sequences, so yes,
> worth looking into. Maybe worth moving to open-bio-l for broader
> Bioperl-l mailing list
> [email protected]