28 Feb 2013 16:36
Fix for Bug #3376 broke somewhere else
Francisco J. Ossandón <fossandonc <at> hotmail.com>
2013-02-28 15:36:34 GMT
Hi,

I was re-checking Bug #3302 using the Bio::SearchIO modules from the repository and found that they can no longer parse a Hmmer2 file that previously parsed fine. After tracking down the problem, I discovered that a regular expression changed to fix another bug broke the parsing.

The fix for Bug #3376 consisted of adding an extra condition to skip lines where the end-of-domain indicator is split across lines (https://redmine.open-bio.org/issues/3376):

    TEST: domain 1 of 1, from 8 to 97: score 184.7, E = 2.5e-56
    *->svfqqqqssksttgstvtAiAiAigYRYRYRAvtWnsGsLssGvnDn
    sv+qqqq+ + +vtAiAiAigYRYRYRAv Wn GsLs G nDn
    Test 8 SVYQQQQGGSA----MVTAIAIAIGYRYRYRAVVWNKGSLSTGTNDN 50

    DnDqqsdgLYtiYYsvtvpssslpsqtviHHHaHkasstkiiikiePr<-
    DnDq +d LYtiYYsvtv +ss+p q+v+HHHaH+asstkiiiki P
    Test 51 DNDQAAD-LYTIYYSVTVSASSWPGQSVTHHHAHPASSTKIIIKIAPS 97

    *
    Test - -

This case is characterized by the two dashes in the final line. So this is the expression added to next_result in hmmer2.pm (https://github.com/bioperl/bioperl-live/commit/142e5d79e3a6593db32bf0af99048f47d01bd3f2):

    elsif (CORE::length($_) == 0
        || ( $count != 1 && /^\s+$/o )
        || /^\s+\-?\*\s*$/
        || /^.+\-\s+\-\s*$/ )   ### <--- This regex was designed for bug 3376
    {
        next;
    }

But the expression is too broad: because it uses "^.+" just before the two dashes, it also matches alignment lines that consist entirely of dashes, and it broke the parsing of lines like the first sequence line below:

    KyACrqCdtiVQAPaPakpIErGiptaGLLArvlVSKyaEHlPLYRQsEI
    lcl|gi|340 - -------------------------------------------------- -

    yaRqGVeiaRstLadWVgrtgarLaPLvdALaeyVLkeGklHADeTPVqV
    +i s L V++ + r
    lcl|gi|340 60938 ------AIMISGLIHGVSARCLRF-------------------------- 60955

I think a reasonable fix, which still fixes the original bug and restores parsing for this case, is to add an extra \s+ to the regex just before the first dash. That way the expression makes sure that the first dash is the one that comes AFTER the description (where it replaces the usual coordinate number) and is not the last dash of an alignment or of a series of dashes like the one above:

    elsif (CORE::length($_) == 0
        || ( $count != 1 && /^\s+$/o )
        || /^\s+\-?\*\s*$/
        || /^.+\s+\-\s+\-\s*$/ )   ### <--- Tweaked regex
    {
        next;
    }

I tested it and it works fine. I hope you find the fix acceptable.

Cheers,

--
Francisco J. Ossandon
Bioinformatician. Ph.D. Candidate, University Andres Bello.
Center for Bioinformatics and Genome Biology, Fundacion Ciencia para la Vida.
Santiago, Chile.
www.cienciavida.cl/CBGB.htm
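
P.S. For anyone who wants to compare the two patterns outside of hmmer2.pm, here is a minimal standalone sketch. It only exercises the last alternative of the condition, not the whole parser. The sample lines are rebuilt from the excerpts in this message, so their exact column spacing is an assumption, and the variable names are purely illustrative (they do not exist in BioPerl).

```perl
use strict;
use warnings;

# Last alternative of the skip condition, before and after the proposed tweak.
my $old_re = qr/^.+\-\s+\-\s*$/;       # pattern added for bug 3376
my $new_re = qr/^.+\s+\-\s+\-\s*$/;    # proposed pattern with the extra \s+

# Sample lines (spacing is assumed; the archive collapsed the original columns).
my $leftover = '  Test     -     -';                               # bug 3376 case
my $all_gaps = '  lcl|gi|340     - ' . ( '-' x 50 ) . '     -';    # all-gap row
my $partial  = '  lcl|gi|340 60938 ------AIMISGLIHGVSARCLRF'
             . ( '-' x 26 ) . ' 60955';                            # normal row

for my $case (
    [ 'bug 3376 leftover', $leftover ],
    [ 'all-gap row',       $all_gaps ],
    [ 'partial-gap row',   $partial ],
) {
    my ( $label, $line ) = @$case;
    # "skip" means the parser would discard the line via `next`.
    printf "%-18s old: %-4s  new: %s\n",
        $label,
        ( $line =~ $old_re ? 'skip' : 'keep' ),
        ( $line =~ $new_re ? 'skip' : 'keep' );
}
```

With these sample lines, both patterns skip the "Test - -" leftover, only the old pattern skips the all-gap alignment row, and neither touches the row that ends in a coordinate.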