Fields, Christopher J | 7 Feb 16:56 2013

Re: FASTQ, was Re: BioPerl long-term, was Re: dependencies on perl version

This will likely be the approach for a more NGS-friendly Bio::Seq class.  Calculation of the PHRED scores
could also be deferred until needed.
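
To make the idea concrete, here is a minimal sketch of deferred PHRED calculation layered on a low-level parser. It is written in Python for brevity rather than Perl, and it is not BioPerl, Biopython, or kseq code: the names iter_fastq_raw, LazyFastqRecord, and iter_fastq are hypothetical, and it assumes strict four-line records with Sanger-style quality encoding (ASCII offset 33).

```python
import io
from functools import cached_property


def iter_fastq_raw(handle):
    """Low-level pass: yield (title, sequence, quality_string) tuples.

    Assumes strict four-line records; no record objects are built here.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in %r" % header)
        yield header[1:].rstrip("\n"), seq, qual


class LazyFastqRecord:
    """Hypothetical record that keeps the raw quality string and decodes
    PHRED scores only on first access (assumes an ASCII offset of 33)."""

    def __init__(self, title, seq, qual, offset=33):
        self.title = title
        self.seq = seq
        self.raw_qual = qual
        self.offset = offset

    @cached_property
    def phred(self):
        # Decoded once, on demand, then cached on the instance.
        return [ord(c) - self.offset for c in self.raw_qual]


def iter_fastq(handle):
    """Higher-level pass: wrap each raw tuple in a lazy record object."""
    for title, seq, qual in iter_fastq_raw(handle):
        yield LazyFastqRecord(title, seq, qual)


if __name__ == "__main__":
    example = io.StringIO("@r1 desc\nACGT\n+\nII5!\n")
    for rec in iter_fastq(example):
        print(rec.title, rec.seq, rec.phred)
```

Code that only needs titles or sequences can iterate over iter_fastq_raw and never pay for object creation or score decoding; code that wants the object model pays for the PHRED list only when it first touches rec.phred.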

seqtk has some C-based methods that we could possibly take advantage of, but I will have to look into it.

chris

On Feb 7, 2013, at 9:25 AM, Aaron Mackey <amackey <at> virginia.edu> wrote:

> You might also want to consider a lazy/pull-based parser to defer parsing/object-building for pieces of
> the object that don't get used.  This also usually provides some error tolerance.
> 
> -Aaron
> 
> --
> Aaron J. Mackey, PhD
> Assistant Professor
> Center for Public Health Genomics
> University of Virginia
> amackey <at> virginia.edu
> http://www.cphg.virginia.edu/mackey
> 
> 
> On Wed, Feb 6, 2013 at 5:53 PM, Fields, Christopher J <cjfields <at> illinois.edu> wrote:
> On Feb 6, 2013, at 4:43 PM, Peter Cock <p.j.a.cock <at> googlemail.com> wrote:
> 
> > On Wed, Feb 6, 2013 at 10:11 PM, Fields, Christopher J
> > <cjfields <at> illinois.edu> wrote:
> >>
> >> I see no problem in stating any generic parsing and low-level interfaces
> >> are just as much a part of what BioPerl encompasses as the higher-level
> >> Bio::* classes themselves.  Steve and Jason were on to something with
> >> SearchIO; it's maybe not as performant as we would like, but it certainly
> >> is more flexible in terms of what can be done, b/c it separates out
> >> low-level parsing from object creation.  That's the general model we
> >> should look at.  There is a good reason Biopython is following this
> >> model with their SearchIO implementation (Peter C, are you reading this?)
> >
> > Actually I don't think we did end up with that kind of separation in the
> > Biopython SearchIO - which is not to say it isn't an excellent model
> > to follow. Rather the Biopython SearchIO (like the BioPerl one) had
> > as the first goal a consistent object model across assorted file
> > formats.
> >
> > The idea of low-level, minimal-overhead parsers (which are very
> > format specific), on which a heavier but consistent object model
> > can be built, might be a good balance - the high level API has the
> > convenience, but if you give that up you can have more speed.
> > That's what I recommend with FASTQ and Biopython, e.g.
> > http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
> >
> >>
> >> I have started a wrapper around Heng's FASTQ/FASTA parsing
> >> code (kseq), it seems to work quite well (~20M FASTQ in 30 sec
> >> last I recall?).
> >>
> >
> > I'd have to dig through my emails, but I think the BioRuby guys
> > looked at that too - as I recall while it was fast, the error handling
> > left something to be desired. Email me directly or on the BioRuby
> > list if you want to follow up on that.
> >
> > Regards,
> >
> > Peter
> 
> I did a little on this, and it is worth following up on.  I pulled the FASTQ test examples you created from the paper
> to test it out.  IIRC it parsed where it needed to, but I'm not sure how it handled bad sequences, so yes, worth
> looking into.  Maybe worth moving to open-bio-l for broader discussion.
> 
> chris