Features Download
From: Florent Angly <florent.angly <at> gmail.com>
Subject: Re: Bio::PrimarySeq speedup
Newsgroups: gmane.comp.lang.perl.bio.general
Date: Tuesday 6th November 2012 11:06:56 UTC (over 4 years ago)
Yes, good idea, Chris.

Actually, thinking about it, most of these warnings were redundant. So, 
I changed the behaviour of Bio::PrimarySeq::validate_seq() so that it 
issues exceptions if requested.


On 05/11/12 12:43, Fields, Christopher J wrote:
> Florent,
> Ran tests on it, they pass but I am seeing this (if these are expected,
you can catch the warnings using Test::Warn):
> [cjfields@pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr
> t/Seq/PrimarySeq.t .. 1/167
> --------------------- WARNING ---------------------
> MSG: Got a sequence without letters. Could not guess alphabet
> ---------------------------------------------------
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
> ---------------------------------------------------
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
> ---------------------------------------------------
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
> ---------------------------------------------------
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is
> ---------------------------------------------------
> --------------------- WARNING ---------------------
> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
> ---------------------------------------------------
> t/Seq/PrimarySeq.t .. ok
> All tests successful.
> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr 
0.01 csys =  0.23 CPU)
> Result: PASS
> chris
> On Nov 4, 2012, at 6:46 PM, Florent Angly 
>   wrote:
>> I am planning on merging the branch with master this week.
>> Best,
>> Florent
>> On 01/11/12 15:49, Florent Angly wrote:
>>> Hi all,
>>> I was working with Ben Woodcroft on identifying ways to speed up
Grinder, which relies heavily on Bioperl. Ben did some profiling with
NYTProf and we realized that a lot of computation time was spent in
Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we
used for the profiling were microbial genomes, i.e. several Mbp long
sequences, which is quite long. A lot of the performance cost was
associated with passing full genomes between functions. For example, when
doing a call to length(), length() requests the full sequence from seq(),
which returns it back to length() (it makes a copy!). So, every call to
length is very expensive for long sequences. And there is a lot of code
that calls length(), for error checking.
>>> I know that there are a few Bioperl modules that are more adapted to
handling very long sequences, e.g. Bio::DB::Fasta or
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at
Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5.
But in fact, there were more things that I wanted to try to improve, which
led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>>> I wrote quite a few tests for functionalities that were not previously
covered by tests, and tried to improve the documentation. In addition, to
address the speed issue, I did some changes to Bio::PrimarySeq and
Bio::PrimarySeqI :
>>> • The length of a sequence is now computed as soon as the sequence is
set, not after. This way, there is no extra call to seq() (which would
incur the cost of copying the entire sequence between functions).
>>> • The length is saved as an object attribute. So, calling length() is
very cheap since it only needs to retrieve the stored value for the length.
>>> • There is a constructor called -direct, which skips sequence
validation. However, it was only active in conjunction with the -ref_to_seq
constructor. To make -direct conform better to its documented purpose, I
made it -direct work when a sequence is set through -seq as well.
>>> • This brings us to trunc(), revcom() and other methods of
Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq
object from an existing (already validated!) Bio::PrimarySeq object, the
new object can be constructed with the -direct constructor, to save some
>>> • Finally, I noticed that subseq() used calls to eval() to do its
work. eval() is notoriously slow and these calls were easily replaced by
simple calls to substr() to save some time.
>>> A real-world test I performed with Grinder took 3m28s before the
changes (and ~1 min is spent doing something unrelated). After the changes,
the same test took only 2min28s. So, it's quite a significant improvement
and on more specific test cases, performance gains can obviously be much
bigger. Also, I anticipate that the gains would be bigger for even longer
>>> All the changes I made are meant to be backward compatible and all the
tests in the Bioperl test suite passed. So, there _should_ not be any
issues. However, I know that Bio::PrimarySeq is a central module of
Bioperl, so please, have a look at it and let me know if there are any
glaring errors.
>>> Thanks,
>>> Florent
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
CD: 11ms