Jim Hu | 4 Jan 22:57 2013

Re: Converting blast+ output to gff (with gaps)

Malcolm,

Thanks, I should have reread the GFF3 spec before posting!

The sections on the Gap attribute and, further down, on alignments discuss two ways to represent an
alignment. I was originally thinking of something like the later example shown for cDNA vs. genome, but the
Gap attribute representation would be fine too. So I can see how the final output could be done in
different ways, but I'm still stuck on how to get there.

I don't have a specific application in mind; I'm mostly just trying to understand how to get from
standalone blast+ output to things that look like the examples in the GFF spec and the GBrowse
documentation: a really basic display of gapped alignments. For my teaching, we do EST vs. genomic
blast and want gapped cDNA alignments to show where the introns go. My other work is with bacteria, where
introns are rare, but there are times when I'd like to show an alignment that is interrupted by a
transposable element, for example.
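
To make the multi-line version concrete, here is a minimal Python sketch of the output I'm after. It is
not an existing script. It assumes the HSPs for one EST/genome pair have already been parsed into
coordinates; the Hsp tuple, hsps_to_gff3, the match ID, and the "blastn" source tag are all names made up
for illustration. It writes one cDNA_match line per HSP with a shared ID, in the style of the spec's
alignment example.

```python
from typing import List, NamedTuple


class Hsp(NamedTuple):
    """One HSP: query (EST) and subject (genome) coordinates, 1-based inclusive."""
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float


def hsps_to_gff3(subject_id: str, query_id: str, hsps: List[Hsp],
                 match_id: str = "match00001", source: str = "blastn") -> List[str]:
    """Emit one cDNA_match line per HSP; all lines share one ID (hypothetical helper)."""
    lines = []
    # Order the pieces by their position on the genome.
    for h in sorted(hsps, key=lambda h: min(h.sstart, h.send)):
        # GFF3 requires start <= end; orientation goes in the strand column.
        start, end = sorted((h.sstart, h.send))
        strand = "+" if h.sstart <= h.send else "-"
        attrs = f"ID={match_id};Target={query_id} {h.qstart} {h.qend}"
        lines.append("\t".join([subject_id, source, "cDNA_match", str(start), str(end),
                                f"{h.evalue:g}", strand, ".", attrs]))
    return lines


if __name__ == "__main__":
    demo = [Hsp(463, 963, 5000, 5500, 8.1e-43), Hsp(12, 462, 1050, 1500, 5.8e-42)]
    print("\n".join(hsps_to_gff3("ctg123", "cdna0123", demo)))
```

The unaligned stretch between consecutive lines (here 1501..4999 on the genome) is where the intron, or
the transposable element, would show up in the display.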

Excerpting from blastp -help:

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1) 

Several of these are "lossy" about where the gaps actually fall (e.g. 6, whose default columns give only
the number of gap openings). Others seem more suited to human reading than to parsing. So I was hoping to
get pointed to an existing script that would generate either the single feature with a Gap attribute OR
the multi-line match features OR a combination from one of these output formats.
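
For the single-feature version, the piece I think is needed is a conversion from the two aligned strings
to a Gap string. Here is a minimal Python sketch, assuming the aligned query and subject strings are
available (I believe blast+ tabular output can include them via the qseq and sseq column specifiers) and
that the genome is the reference and the EST the target. gap_attribute is a hypothetical helper, and it
handles nucleotide alignments only (no F/R frameshift codes).

```python
from itertools import groupby


def gap_attribute(ref_aln: str, target_aln: str) -> str:
    """Build a GFF3 Gap value from two equal-length aligned strings ('-' marks a gap)."""
    if len(ref_aln) != len(target_aln):
        raise ValueError("aligned strings must have the same length")

    def op(r: str, t: str) -> str:
        if r == "-" and t == "-":
            raise ValueError("alignment column with a gap in both sequences")
        if t == "-":
            return "D"  # gap in the target: reference bases absent from the target
        if r == "-":
            return "I"  # gap in the reference: extra bases present in the target
        return "M"      # aligned column (match or mismatch)

    ops = [op(r, t) for r, t in zip(ref_aln, target_aln)]
    # Collapse runs of the same operation into code+length, e.g. M8.
    return " ".join(f"{code}{len(list(run))}" for code, run in groupby(ops))


if __name__ == "__main__":
    reference = "CAAGACCTAAACTGGAT-TCCAAT"  # genome (blast subject)
    target = "CAAGACCT---CTGGATATCCAAT"     # EST (blast query)
    print("Gap=" + gap_attribute(reference, target))
```

By my reading of the spec, D marks a gap in the target and I a gap in the reference, so the test
alignment above should come out as Gap=M8 D3 M6 I1 M6. Minus-strand hits would need the strings
reoriented first, which this sketch does not attempt.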

I'm probably missing something very, very obvious.

Best,

Jim

On Jan 4, 2013, at 2:20 PM, Cook, Malcolm wrote:

> Jim,
> 
> Getting to your original question:
> 
> >> I'm looking for a script that will take one of the blast+ outformats that includes the positions of gaps
> >> and mismatches, and create gff with appropriate subfeatures.
> 
> Exactly how do you want/expect the blast output to be encoded as GFF{1,2,2.5,3}?
> 
> If GFF3 per http://www.sequenceontology.org/gff3.shtml, then are you hoping to get GFF3 marked up as
> described in section 'THE GAP ATTRIBUTE' or as in 'ALIGNMENTS'?
> 
> I would guess not, because neither of them has 'subfeatures'.
> 
> If you could explain more fully with examples (hand cobbled or borrowed from someone else) of what you
> expect then I might have a better idea of what options might suit your needs.
> 
> 
> ~Malcolm
> 
> 
> .-----Original Message-----
> .From: bioperl-l-bounces <at> lists.open-bio.org [mailto:bioperl-l-bounces <at> lists.open-bio.org] On Behalf Of Jim Hu
> .Sent: Friday, January 04, 2013 1:50 PM
> .To: Brian Osborne
> .Cc: Fields, Christopher J; Scott Cain; bioperl-l <at> bioperl.org
> .Subject: Re: [Bioperl-l] Converting blast+ output to gff (with gaps)
> .
> .Thanks for the replies, but...
> .
> .I can't tell what input formats for the blast results file are supported.  Format 11 and format 6 give no
> .output and no feedback. Putting some diagnostic print statements in the code suggests that I'm not getting
> .any result objects from Bio::SearchIO.
> .
> .The script uses Bio::SearchIO, but does not seem to call the submodules for blast.  Documentation links
> .on the wiki seem to be broken, at least on this page:
> .
> .	http://www.bioperl.org/wiki/Module:Bio::SearchIO
> .
> .Jim
> .
> .
> .On Jan 2, 2013, at 4:53 PM, Brian Osborne wrote:
> .
> .> Scott and Chris,
> .>
> .> I'll test it and see...
> .>
> .> Brian O.
> .>
> .>
> .> On Jan 2, 2013, at 5:26 PM, "Fields, Christopher J" <cjfields <at> illinois.edu> wrote:
> .>
> .>> It should (I recall using it at one point).  If it doesn't we should fix it so it does.
> .>>
> .>> How does MAKER deal with this?  IIRC it uses (a modified) SearchIO-based method...
> .>>
> .>> chris
> .>>
> .>> On Jan 2, 2013, at 3:32 PM, Scott Cain <scott <at> scottcain.net> wrote:
> .>>
> .>>> Hi Brian,
> .>>>
> .>>> I was going to suggest the same thing--though that script is fairly
> .>>> old, it's not as old as the blast2gff script in the GBrowse
> .>>> distribution (which probably should be retired).  I believe it
> .>>> supports GFF3, though I don't have any sample data with which to test
> .>>> it to be sure.  I also don't know if it supports BLAST+ input--I
> .>>> haven't kept up with SearchIO (on which search2gff.pl depends); will
> .>>> it accept it?
> .>>>
> .>>> Scott
> .>>>
> .>>>
> .>>> On Wed, Jan 2, 2013 at 3:26 PM, Brian Osborne <bosborne11 <at> verizon.net> wrote:
> .>>>> Here's one:
> .>>>>
> .>>>> https://github.com/GMOD/GBrowse/blob/master/contrib/blast2gff.pl
> .>>>>
> .>>>> Another one:
> .>>>>
> .>>>> ~/git/bioperl-live>head scripts/utilities/bp_search2gff.pl
> .>>>> #!perl
> .>>>>
> .>>>> # Author:      Jason Stajich <jason-at-bioperl-dot-org>
> .>>>> # Description: Turn SearchIO parseable report(s) into a GFF report
> .>>>> #
> .>>>> =head1 NAME
> .>>>>
> .>>>> bp_search2gff - Turn SearchIO parseable reports(s) into a GFF report
> .>>>>
> .>>>>
> .>>>>
> .>>>> Brian O.
> .>>>>
> .>>>> On Jan 2, 2013, at 2:44 PM, Jim Hu <jimhu <at> tamu.edu> wrote:
> .>>>>
> .>>>>> I assume this has already been done many times, but I can't seem to find it on bioperl.org or via google.
> .>>>>>
> .>>>>> I'm looking for a script that will take one of the blast+ outformats that includes the positions of
> .>>>>> gaps and mismatches, and create gff with appropriate subfeatures.
> .>>>>>
> .>>>>> Thanks,
> .>>>>>
> .>>>>> Jim
> .>>>>> =====================================
> .>>>>> Jim Hu
> .>>>>> Professor
> .>>>>> Dept. of Biochemistry and Biophysics
> .>>>>> 2128 TAMU
> .>>>>> Texas A&M Univ.
> .>>>>> College Station, TX 77843-2128
> .>>>>> 979-862-4054
> .>>>>>
> .>>>>>
> .>>>>>
> .>>>>> _______________________________________________
> .>>>>> Bioperl-l mailing list
> .>>>>> Bioperl-l <at> lists.open-bio.org
> .>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> .>>>>
> .>>>>
> .>>>
> .>>>
> .>>>
> .>>> --
> .>>> ------------------------------------------------------------------------
> .>>> Scott Cain, Ph. D.                                   scott at scottcain dot net
> .>>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> .>>> Ontario Institute for Cancer Research
> .>>
> .>
> .
> .=====================================
> .Jim Hu
> .Professor
> .Dept. of Biochemistry and Biophysics
> .2128 TAMU
> .Texas A&M Univ.
> .College Station, TX 77843-2128
> .979-862-4054
> .
> .
> .

=====================================
Jim Hu
Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
