7 Feb 2013 17:55
Re: FASTQ, was Re:BioPerl long-term, was Re: dependencies on perl version
Fields, Christopher J <cjfields <at> illinois.edu>
2013-02-07 16:55:53 GMT
I think we will want to allow for a multitude of implementations. SeqIO already allows for that to a degree, but multiple backend implementations (say, different ways of parsing/processing FASTQ and others) aren't supported yet.

chris

On Feb 7, 2013, at 10:38 AM, Siddhartha Basu <sidd.basu <at> gmail.com> wrote:

> Another approach might be to use map-reduce (Hadoop) if possible. I have
> seen one implementation in Biopython's GFF3 parser.
> http://bcbio.wordpress.com/2009/03/22/mapreduce-implementation-of-gff-parsing-for-biopython/
>
> -siddhartha
>
>
> On Thu, 07 Feb 2013, Aaron Mackey wrote:
>
>> e.g., a pull-based FASTQ parser that did nothing else at the top level but
>> "chunk" the file into as-yet-unparsed four-line blobs could appear to work
>> very fast, if the user code did nothing but count the number of entries:
>>
>> while (my $seq = $seqio->next_seq) { $ct++ };
>>
>> in other words, you defer *everything* except the minimal amount of
>> parsing/logic required to detect object boundaries.
>>
>> This is, in fact, the exact opposite of the event-based SearchIO "push"
>> parsers, which always perform the most parsing possible, despite the user
>> never accessing most of the material.
>>
>> Lastly, with respect to performance, if the parsing/object building
>> operation is not simply IO bound, then parallel parser/object-building CPU
>> threads could be considered, which could then dynamically adapt to
>> pre-parse attributes (e.g. quality scores) that the calling code was
>> actually using. What's the state of thread-safe Perl these days?
>>
>> -Aaron
>>
>>
>> On Thu, Feb 7, 2013 at 10:56 AM, Fields, Christopher J <
>> cjfields <at> illinois.edu> wrote:
>>
>>> This will likely be the approach for a more NGS-friendly Bio::Seq class.
>>> Calculation of the PHRED scores could also be deferred until needed.
>>>
>>> seqtk has some C-based methods that we could possibly take advantage of,
>>> but will have to look into it.
>>>
>>> chris
>>>
>>> On Feb 7, 2013, at 9:25 AM, Aaron Mackey <amackey <at> virginia.edu> wrote:
>>>
>>>> You might also want to consider a lazy/pull-based parser to defer
>>>> parsing/object-building for pieces of the object that don't get used. This
>>>> also usually provides some error tolerance.
>>>>
>>>> -Aaron
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l <at> lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
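
For illustration, the lazy, pull-based approach described in the thread might look roughly like the sketch below. It is not BioPerl code: the LazyFastqReader and LazyFastqRecord packages are hypothetical names invented for this example, and it assumes strict four-line (unwrapped) FASTQ records and a Phred+33 quality encoding. The reader only chunks four lines per record; splitting into fields and decoding PHRED scores happens the first time an accessor is called.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical record class (not part of BioPerl): it stores the raw
# four-line chunk and parses it only when an accessor is first called.
package LazyFastqRecord;

sub new {
    my ($class, $chunk) = @_;
    return bless { raw => $chunk }, $class;
}

# Split the raw chunk into fields on first use and cache the result.
sub _fields {
    my ($self) = @_;
    return $self->{fields} if $self->{fields};
    my ($head, $seq, $plus, $qual) = split /\n/, $self->{raw};
    die "malformed FASTQ record\n"
        unless defined $qual && $head =~ /^\@(\S+)/ && $plus =~ /^\+/;
    my ($id) = $head =~ /^\@(\S+)/;
    return $self->{fields} = { id => $id, seq => $seq, qual => $qual };
}

sub id  { return $_[0]->_fields->{id} }
sub seq { return $_[0]->_fields->{seq} }

# Decode PHRED scores only when asked for; assumes Phred+33 encoding.
sub phred {
    my ($self) = @_;
    $self->{phred} ||= [ map { ord($_) - 33 } split //, $self->_fields->{qual} ];
    return $self->{phred};
}

# Hypothetical pull-based reader: detects record boundaries, nothing more.
package LazyFastqReader;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh }, $class;
}

# Read the next four lines and wrap them unparsed in a record object.
sub next_seq {
    my ($self) = @_;
    my $fh    = $self->{fh};
    my $chunk = '';
    for (1 .. 4) {
        my $line = <$fh>;
        return unless defined $line;    # end of input or truncated record
        $chunk .= $line;
    }
    return LazyFastqRecord->new($chunk);
}

package main;

# Two made-up records, read through an in-memory filehandle.
my $data = "\@r1\nACGT\n+\nIIII\n\@r2\nGGCC\n+\n!!5I\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my $reader = LazyFastqReader->new($fh);

my $ct = 0;
my @records;
while (my $rec = $reader->next_seq) {
    $ct++;                  # counting alone triggers no field parsing
    push @records, $rec;
}
print "records: $ct\n";
print join(',', @{ $records[1]->phred }), "\n";    # parsing happens here
```

Counting records never touches `_fields`, so a malformed record only raises an error when its fields are actually requested; that is one way to read the "error tolerance" mentioned in the thread.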