15 Nov 2012 17:29
Re: Bio::PrimarySeq speedup
Florent Angly <florent.angly <at> gmail.com>
2012-11-15 16:29:30 GMT
2012-11-15 16:29:30 GMT
I now merged the branch with master. Best, Florent On 06/11/12 21:06, Florent Angly wrote: > Yes, good idea, Chris. > > Actually, thinking about it, most of these warnings were redundant. > So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that > it issues exceptions if requested. > > Florent > > > On 05/11/12 12:43, Fields, Christopher J wrote: >> Florent, >> >> Ran tests on it, they pass but I am seeing this (if these are >> expected, you can catch the warnings using Test::Warn): >> >> [cjfields <at> pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr >> t/Seq/PrimarySeq.t >> t/Seq/PrimarySeq.t .. 1/167 >> --------------------- WARNING --------------------- >> MSG: Got a sequence without letters. Could not guess alphabet >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is ! >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $ >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is & >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is >> \,$,+ >> --------------------------------------------------- >> >> --------------------- WARNING --------------------- >> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is <at> / >> --------------------------------------------------- >> t/Seq/PrimarySeq.t .. ok >> All tests successful. >> Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18 >> cusr 0.01 csys = 0.23 CPU) >> Result: PASS >> >> chris >> >> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly <at> gmail.com> >> wrote: >> >>> I am planning on merging the branch with master this week. >>> Best, >>> Florent >>> >>> >>> On 01/11/12 15:49, Florent Angly wrote: >>>> Hi all, >>>> >>>> I was working with Ben Woodcroft on identifying ways to speed up >>>> Grinder, which relies heavily on Bioperl. Ben did some profiling >>>> with NYTProf and we realized that a lot of computation time was >>>> spent in Bio::PrimarySeq, doing calls to subseq() and length(). The >>>> sequences we used for the profiling were microbial genomes, i.e. >>>> several Mbp long sequences, which is quite long. A lot of the >>>> performance cost was associated with passing full genomes between >>>> functions. For example, when doing a call to length(), length() >>>> requests the full sequence from seq(), which returns it back to >>>> length() (it makes a copy!). So, every call to length is very >>>> expensive for long sequences. And there is a lot of code that calls >>>> length(), for error checking. >>>> >>>> I know that there are a few Bioperl modules that are more adapted >>>> to handling very long sequences, e.g. Bio::DB::Fasta or >>>> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look >>>> at Bio::PrimarySeq with Ben and I released this commit: >>>> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. >>>> But in fact, there were more things that I wanted to try to >>>> improve, which led me to start this new branch: >>>> https://github.com/bioperl/bioperl-live/tree/seqlength >>>> >>>> I wrote quite a few tests for functionalities that were not >>>> previously covered by tests, and tried to improve the >>>> documentation. In addition, to address the speed issue, I did some >>>> changes to Bio::PrimarySeq and Bio::PrimarySeqI : >>>> • The length of a sequence is now computed as soon as the sequence >>>> is set, not after. This way, there is no extra call to seq() (which >>>> would incur the cost of copying the entire sequence between >>>> functions). >>>> • The length is saved as an object attribute. So, calling length() >>>> is very cheap since it only needs to retrieve the stored value for >>>> the length. >>>> • There is a constructor called -direct, which skips sequence >>>> validation. However, it was only active in conjunction with the >>>> -ref_to_seq constructor. To make -direct conform better to its >>>> documented purpose, I made it -direct work when a sequence is set >>>> through -seq as well. >>>> • This brings us to trunc(), revcom() and other methods of >>>> Bio::PrimarySeqI. Since all these methods create a new >>>> Bio::PrimarySeq object from an existing (already validated!) >>>> Bio::PrimarySeq object, the new object can be constructed with the >>>> -direct constructor, to save some time. >>>> • Finally, I noticed that subseq() used calls to eval() to do its >>>> work. eval() is notoriously slow and these calls were easily >>>> replaced by simple calls to substr() to save some time. >>>> >>>> A real-world test I performed with Grinder took 3m28s before the >>>> changes (and ~1 min is spent doing something unrelated). After the >>>> changes, the same test took only 2min28s. So, it's quite a >>>> significant improvement and on more specific test cases, >>>> performance gains can obviously be much bigger. Also, I anticipate >>>> that the gains would be bigger for even longer sequences. >>>> >>>> All the changes I made are meant to be backward compatible and all >>>> the tests in the Bioperl test suite passed. So, there _should_ not >>>> be any issues. However, I know that Bio::PrimarySeq is a central >>>> module of Bioperl, so please, have a look at it and let me know if >>>> there are any glaring errors. >>>> >>>> Thanks, >>>> >>>> Florent >>>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l <at> lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >
RSS Feed