5 Nov 2012 01:46
Re: Bio::PrimarySeq speedup
Florent Angly <florent.angly <at> gmail.com>
2012-11-05 00:46:44 GMT
2012-11-05 00:46:44 GMT
I am planning on merging the branch with master this week. Best, Florent On 01/11/12 15:49, Florent Angly wrote: > Hi all, > > I was working with Ben Woodcroft on identifying ways to speed up > Grinder, which relies heavily on Bioperl. Ben did some profiling with > NYTProf and we realized that a lot of computation time was spent in > Bio::PrimarySeq, doing calls to subseq() and length(). The sequences > we used for the profiling were microbial genomes, i.e. several Mbp > long sequences, which is quite long. A lot of the performance cost was > associated with passing full genomes between functions. For example, > when doing a call to length(), length() requests the full sequence > from seq(), which returns it back to length() (it makes a copy!). So, > every call to length is very expensive for long sequences. And there > is a lot of code that calls length(), for error checking. > > I know that there are a few Bioperl modules that are more adapted to > handling very long sequences, e.g. Bio::DB::Fasta or > Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at > Bio::PrimarySeq with Ben and I released this commit: > https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. > But in fact, there were more things that I wanted to try to improve, > which led me to start this new branch: > https://github.com/bioperl/bioperl-live/tree/seqlength > > I wrote quite a few tests for functionalities that were not previously > covered by tests, and tried to improve the documentation. In addition, > to address the speed issue, I did some changes to Bio::PrimarySeq and > Bio::PrimarySeqI : > • The length of a sequence is now computed as soon as the sequence is > set, not after. This way, there is no extra call to seq() (which would > incur the cost of copying the entire sequence between functions). > • The length is saved as an object attribute. So, calling length() is > very cheap since it only needs to retrieve the stored value for the > length. > • There is a constructor called -direct, which skips sequence > validation. However, it was only active in conjunction with the > -ref_to_seq constructor. To make -direct conform better to its > documented purpose, I made it -direct work when a sequence is set > through -seq as well. > • This brings us to trunc(), revcom() and other methods of > Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq > object from an existing (already validated!) Bio::PrimarySeq object, > the new object can be constructed with the -direct constructor, to > save some time. > • Finally, I noticed that subseq() used calls to eval() to do its > work. eval() is notoriously slow and these calls were easily replaced > by simple calls to substr() to save some time. > > A real-world test I performed with Grinder took 3m28s before the > changes (and ~1 min is spent doing something unrelated). After the > changes, the same test took only 2min28s. So, it's quite a significant > improvement and on more specific test cases, performance gains can > obviously be much bigger. Also, I anticipate that the gains would be > bigger for even longer sequences. > > All the changes I made are meant to be backward compatible and all the > tests in the Bioperl test suite passed. So, there _should_ not be any > issues. However, I know that Bio::PrimarySeq is a central module of > Bioperl, so please, have a look at it and let me know if there are any > glaring errors. > > Thanks, > > Florent >
RSS Feed