Fields, Christopher J | 5 Nov 03:43 2012

Re: Bio::PrimarySeq speedup

Florent,

Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):

[cjfields <at> pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t 
t/Seq/PrimarySeq.t .. 1/167 
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is  <at> /
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok       
All tests successful.
Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
Result: PASS

chris

On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly <at> gmail.com>
 wrote:

> I am planning on merging the branch with master this week.
> Best,
> Florent
> 
> 
> On 01/11/12 15:49, Florent Angly wrote:
>> Hi all,
>> 
>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on
Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in
Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were
microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was
associated with passing full genomes between functions. For example, when doing a call to length(),
length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So,
every call to length is very expensive for long sequences. And there is a lot of code that calls length(),
for error checking.
>> 
>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g.
Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at
Bio::PrimarySeq with Ben and I released this commit:
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5.
But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>> 
>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to
improve the documentation. In addition, to address the speed issue, I did some changes to
Bio::PrimarySeq and Bio::PrimarySeqI :
>> • The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no
extra call to seq() (which would incur the cost of copying the entire sequence between functions).
>> • The length is saved as an object attribute. So, calling length() is very cheap since it only needs to
retrieve the stored value for the length.
>> • There is a constructor called -direct, which skips sequence validation. However, it was only active
in conjunction with the -ref_to_seq constructor. To make -direct conform better to its documented
purpose, I made it -direct work when a sequence is set through -seq as well.
>> • This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods
create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the
new object can be constructed with the -direct constructor, to save some time.
>> • Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and
these calls were easily replaced by simple calls to substr() to save some time.
>> 
>> A real-world test I performed with Grinder took 3m28s before the changes (and ~1 min is spent doing
something unrelated). After the changes, the same test took only 2min28s. So, it's quite a significant
improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I
anticipate that the gains would be bigger for even longer sequences.
>> 
>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite
passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of
Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>> 
>> Thanks,
>> 
>> Florent
>> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l <at> lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Gmane