Fields, Christopher J | 7 Feb 17:55 2013

Re: FASTQ, was Re:BioPerl long-term, was Re: dependencies on perl version

I think we will want to allow for a multitude of implementations.  SeqIO already allows for that to a degree,
but multiple backend implementations for a single format (say, different ways of parsing/processing FASTQ and others)
aren't supported yet.


On Feb 7, 2013, at 10:38 AM, Siddhartha Basu <sidd.basu <at>> wrote:

> Another approach might be to use map-reduce (Hadoop) if possible. I have
> seen one implementation in Biopython's GFF3 parser.
> -siddhartha
> On Thu, 07 Feb 2013, Aaron Mackey wrote:
>> e.g., a pull-based FASTQ parser that did nothing else at the top level but
>> "chunk" the file into as-yet-unparsed four-line blobs could appear to work
>> very fast, if the user code did nothing but count the number of entries:
>>  while (my $seq = $seqio->next_seq) { $ct++ };
>> in other words, you defer *everything* except the minimal amount of
>> parsing/logic required to detect object boundaries.
>> This is, in fact, the exact opposite of the event-based SearchIO "push"
>> parsers, which always perform the most parsing possible, despite the user
>> never accessing most of the material.
>> Lastly, with respect to performance, if the parsing/object building
>> operation is not simply IO bound, then parallel parser/object-building CPU
>> threads could be considered, which could then dynamically adapt to
>> pre-parse attributes (e.g. quality scores) that the calling code was
>> actually using.  What's the state of thread-safe Perl these days?
>> -Aaron
>> On Thu, Feb 7, 2013 at 10:56 AM, Fields, Christopher J <
>> cjfields <at>> wrote:
>>> This will likely be the approach for a more NGS-friendly Bio::Seq class.
>>> Calculation of the PHRED scores could also be deferred until needed.
>>> seqtk has some C-based methods that we could possibly take advantage of,
>>> but will have to look into it.
>>> chris
>>> On Feb 7, 2013, at 9:25 AM, Aaron Mackey <amackey <at>> wrote:
>>>> You might also want to consider a lazy/pull-based parser to defer
>>>> parsing/object-building for pieces of the object that don't get used.  This
>>>> also usually provides some error tolerance.
>>>> -Aaron
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l <at>
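
A minimal sketch of the lazy, pull-based idea Aaron describes in the thread, in plain Perl. The class names LazyFastqReader and LazyFastqRecord are hypothetical and are not BioPerl classes; the sketch assumes strict four-line FASTQ records and Phred+33 quality encoding. The reader only chunks the input into unparsed four-line blobs, and fields and quality scores are computed on first access.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

{
    # Hypothetical record class (not part of BioPerl): holds the raw blob.
    package LazyFastqRecord;

    sub new {
        my ($class, $raw) = @_;
        return bless { raw => $raw }, $class;
    }

    # Split the blob into fields only on first access, then cache them.
    sub _fields {
        my ($self) = @_;
        return $self->{fields} if $self->{fields};
        my ($head, $seq, $plus, $qual) = split /\n/, $self->{raw};
        die "malformed FASTQ record\n"
            unless defined $qual && $head =~ /^@(\S+)/
                && length($seq) == length($qual);
        my %fields = (id => $1, seq => $seq, qual => $qual);
        return $self->{fields} = \%fields;
    }

    sub id  { return $_[0]->_fields->{id} }
    sub seq { return $_[0]->_fields->{seq} }

    # Deferred until requested; assumes Phred+33 encoding.
    sub phred_scores {
        my ($self) = @_;
        $self->{phred} //= [ map { ord($_) - 33 } split //, $self->_fields->{qual} ];
        return $self->{phred};
    }
}

{
    # Hypothetical reader class: detects record boundaries and nothing else.
    package LazyFastqReader;

    sub new {
        my ($class, $fh) = @_;
        return bless { fh => $fh }, $class;
    }

    # Pull the next four lines and hand them back unparsed.
    sub next_seq {
        my ($self) = @_;
        my $fh = $self->{fh};
        my @lines;
        while (@lines < 4) {
            my $line = <$fh>;
            last unless defined $line;
            push @lines, $line;
        }
        return unless @lines;    # end of input
        die "truncated FASTQ record\n" unless @lines == 4;
        return LazyFastqRecord->new(join '', @lines);
    }
}

# Usage: counting records never triggers field or quality parsing.
my $fastq = "\@read1\nACGT\n+\nIIII\n\@read2\nGGCC\n+\n!!!!\n";
open my $fh, '<', \$fastq or die "cannot open in-memory file: $!";
my $reader = LazyFastqReader->new($fh);

my $ct = 0;
my $first;
while (my $rec = $reader->next_seq) {
    $first //= $rec;
    $ct++;
}
print "$ct records\n";
print join(',', @{ $first->phred_scores }), "\n";
```

Because validation happens in _fields rather than in next_seq, a malformed record only raises an error if calling code actually touches it, which is one way to get the error tolerance mentioned above.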