10 Nov 2008 23:29
Re: Problems with Bio::SearchIO
Dan Bolser <dan.bolser <at> gmail.com>
2008-11-10 22:29:08 GMT
2008-11-10 22:29:08 GMT
2008/11/7 Chris Fields <cjfields <at> illinois.edu>: > On Nov 7, 2008, at 8:27 AM, Dan Bolser wrote: > >> Hi, >> >> I'm new to BioPerl, so apologies if I'm doing something wrong. Please >> just let me know what I'm doing wrong or what you need. >> >> >> I downloaded the latest BioPerl via SVN (revision 14980), and did the, >> >> perl Build.PL >> ./Build test >> ./Build install >> >> (note that I have lib::local installed, pointing things at ~/perl5) >> >> I ignored the failed tests, as-per the "everyone who is l88t does >> this" instructions> > SearchIO (specifically, HSP methods used to calculate seq start/end > correctly) are being revised in svn, so they are failing at the moment. My > local (not-yet-committed) revisions are failing much more, but that's b/c > the tests are wrong; these should be added in relatively soon once I work > out more kinks in the code. > >> I am trying to parse the output of a "blastall -p blastn" job >> (blastall 2.2.18) using either the default format or tabular format. I >> copied the "search_overview.PLS" script found here: >> >> ... >> >> Looking closer I found that $parser->result_count() only gets set >> after calling $parser->next_result. Any way to force this? In some >> Perl objects I've seen a 'parse' method that kicks the object into >> (silently) calling all its get methods. Is there an equivalent (but >> apparently undocumented) method? Actually, I think it should kick >> itself when called... or not? Certainly the docs do not suggest that >> is won't return a the number of results ("Function: Gets the number of >> Blast results that have been parsed.") So I think this is a bug. > > We could make it so that the result_count() is eager (parses the results and > reports the total back). Not sure, but we could optionally cache the > already-parsed Result objects (that could run into memory issues if one is > parsing a ton of reports, so it needs to be off by default). I see (I think). Anyone first calling result_count() and *then* iterating over the results is getting a performance hit by effectively parsing the results twice? I would suggest that you make this function eager, but document the potential performance issue so that people can choose not to call it first. However, I don't think I can have understood correctly. How can its value be set correctly after calling next() only once? >> Actually I just checked against version 1.4 and "result_count()" fails >> with the default, the XML or the tabular format results (all three >> representing the same query against the same database with the same >> blast version). Note that trying with the XML format in the latest >> version of BioPerl results in the outright failure: >> >> ------------- EXCEPTION: Bio::Root::NotImplemented ------------- >> MSG: Abstract method "Bio::SearchIO::result_count" is not implemented >> by package Bio::SearchIO::blastxml. >> This is not your fault - author of Bio::SearchIO::blastxml should be >> blamed! >> >> ... > > That's a bug. Could you file this so we can track it? > > http://bugzilla.open-bio.org/ http://bugzilla.open-bio.org/show_bug.cgi?id=2650 >> While parsing and comparing the different formats (and versions) I >> noticed a couple of other problems that I could not find reported >> anywhere, so I'll report them here. >> >> The appropriate parts of the code are: >> >> while( my $r = $parser->next_result ) { >> print $r->query_length, "\n"; >> while(my $h = $r->next_hit ) { >> print $h->length, "\n"; >> my $p = $h->hsp('best'); >> print $h->significance, "\t", $h->score, "\t", $h->bits, "\n"; >> } >> } >> >> It seems that both the "$r->query_length" and the "$h->length" are >> missing when using the tabular format blast results (using either >> BioPerl version). I get confused here because there is also some >> undocumented behavior used in the above script that means a 'hit' >> (apparently) returns the start and end point of the two most extreme >> HSPs (and the length of that range). > > The HSPs are tiled prior to returning values for these methods. The tiling > algorithm works for the most part but still has a few odd issues (one is > reported in bugzilla). I am not familiar with that bit of code so can't > comment, but Sendu or Steve might answer. I'll also log a bug. > I try to stay away from using the various hit methods unless necessary and > usually go with simple HSP coordinates. OK. >> Secondly, I found in all but the default format (using either BioPerl >> version), the "$p->score" is missing (set to an empty string). It >> shows up fine using the default format. The significance and the >> bit-score show up fine... or at least they show up... The values look >> wrong now I come to check. e.g. the score is equal to the bit-score >> when using the default format with version 1.4, and is often smaller >> than the bit-score using the latest version (or is $h->bits not the >> bit score?) > > There is some discussion about this in the mail list archives: > > http://thread.gmane.org/gmane.comp.lang.perl.bio.general/16273/focus=16296 > > NCBI BLAST has the hit summary table reporting the raw score(), whereas in > WU_BLAST the hit table reports bit_score(); in the HSPs I think everything > is defined. If you have a minimal hit constructed (data only reported in > the hit table, no HSP data is reported) some may be undefined. The hit > should be calculating the best overall score values from the contained HSPs > and falling back to directly-set hit table data (set as above). It is > possible this may not be happening when parsing other formats so it may be a > bug (and would be worth filing so we can look into it). OK. Also I'll have to double check the actual query report! It could be the problem is there. >> The closest hits in the mailing list that I could find to these probemes >> were: >> >> http://lists.open-bio.org/pipermail/bioperl-l/2002-May/007936.html >> http://lists.open-bio.org/pipermail/bioperl-l/2002-September/009586.html >> >> but I don't think that they are relevant. >> >> Since it comes up here, how is the 'best' HSP defined? it isn't >> documented as far as I can tell. > > 'best' - when comparing HSP data to the summary hit table (in text output > only), the highest scoring HSP represent the hit (highest score/raw_score, > lowest evalue). Which? >> About the documentation... looking here: >> >> http://search.cpan.org/~birney/bioperl-1.4/Bio/SearchIO.pm >> >> >> Several of the structured methods 'blocks' are followed by a "See >> Bio::..." link to other pages in CPAN. However the 'next_result' >> method is followed by a link to >> http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/RootI.pm - I think >> it should be a link to >> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Result/ResultI.pm >> >> Also, it would be nice (especially for noobs) if the full list of >> accepted format codes were given on that page. The current text "# >> format can be 'fasta', 'blast', 'exonerate', ..." is extremely >> frustrating for a beginner "... what?!". I now realize that each >> format code is matched by a "Bio::SearchIO::formatcode" module, but I >> didn't know that from reading the above. >> >> While I'm at it, on page >> http://search.cpan.org/~birney/bioperl-1.4/Bio/Search/Hit/HitI.pm - >> the phrase "Equivalent to raw_score()" appearing under the heading >> "score" is a broken link. In fact every "See also : $this->method()" >> type link on that page is broken (there are about 25). Also the link >> to "See also : BUGS" is broken. > > The pdoc documentation is better and more up-to-date (unfortunately the > bioperl-1.4 CPAN docs are out-of-date but always come up first, I think b/c > of the stable release status). > >>> User feedback is an integral part of the evolution of this and other >>> Bioperl modules. Send your comments and suggestions preferably to one of the >>> Bioperl mailing lists. Your participation is much appreciated. >> >> Thank you for your participation! I hope the above can help in some >> way, and I hope its not down to me making trivial mistakes! If these >> look like genuine bugs, should I report them on RT? > > No, use the bugzilla set up. We do not use CPAN's RT and generally redirect > any bugs to bugzilla. > >> Out of interest, I did get some fail while testing, specifically (or >> perhaps coincidentally) some related to SearchIO... >> >> ./Build test verbose=1 > test.results.dump &> test.results.dump >> >> ... >> Dan. > > Those are due to the changes I have been making (using svn code is bleeding > edge!). > >> P.S. I've also been attacking the wiki, so please undo any mess that I >> may have made there. Thanks very much for the detailed reply. Overall, would you recommend that I use SVN or 1.4 or 1.5.2? All the best, Dan. >> -- >> http://network.nature.com/profile/dan >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l <at> lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Marie-Claude Hofmann > College of Veterinary Medicine > University of Illinois Urbana-Champaign > > > > > -- -- http://network.nature.com/profile/dan
>
> SearchIO (specifically, HSP methods used to calculate seq start/end
> correctly) are being revised in svn, so they are failing at the moment. My
> local (not-yet-committed) revisions are failing much more, but that's b/c
> the tests are wrong; these should be added in relatively soon once I work
> out more kinks in the code.
>
>> I am trying to parse the output of a "blastall -p blastn" job
>> (blastall 2.2.18) using either the default format or tabular format. I
>> copied the "search_overview.PLS" script found here:
>>
>> ...
>>
>> Looking closer I found that $parser->result_count() only gets set
>> after calling $parser->next_result. Any way to force this? In some
>> Perl objects I've seen a 'parse' method that kicks the object into
>> (silently) calling all its get methods. Is there an equivalent (but
>> apparently undocumented) method? Actually, I think it should kick
>> itself when called... or not? Certainly the docs do not suggest that
>> is won't return a the number of results ("Function: Gets the number of
>> Blast results that have been parsed.") So I think this is a bug.
>
> We could make it so that the result_count() is eager (parses the results and
> reports the total back). Not sure, but we could optionally cache the
> already-parsed Result objects (that could run into memory issues if one is
> parsing a ton of reports, so it needs to be off by default).
I see (I think). Anyone first calling result_count() and *then*
iterating over the results is getting a performance hit by effectively
parsing the results twice? I would suggest that you make this function
eager, but document the potential performance issue so that people can
choose not to call it first. However, I don't think I can have
understood correctly. How can its value be set correctly after calling
next() only once?
>> Actually I just checked against version 1.4 and "result_count()" fails
>> with the default, the XML or the tabular format results (all three
>> representing the same query against the same database with the same
>> blast version). Note that trying with the XML format in the latest
>> version of BioPerl results in the outright failure:
>>
>> ------------- EXCEPTION: Bio::Root::NotImplemented -------------
>> MSG: Abstract method "Bio::SearchIO::result_count" is not implemented
>> by package Bio::SearchIO::blastxml.
>> This is not your fault - author of Bio::SearchIO::blastxml should be
>> blamed!
>>
>> ...
>
> That's a bug. Could you file this so we can track it?
>
>
RSS Feed