jlklassen | 16 Jul 16:36 2013
Picon

Location having both "before" and "after" (<x..>y)

Hello all,

I am attempting to output a CDS annotation where the CDS spans both the 
beginning and ending of my contig, according to the NCBI standard: 
http://www.ncbi.nlm.nih.gov/genbank/examples.wgs#partialcds.  According to 
http://www.bioperl.org/wiki/HOWTO:Feature-Annotation, my understanding is 
that BioPerl supports both "BEFORE" and "AFTER" annotations, and indeed I 
can output an appropriate gbk with either the "BEFORE" or "AFTER" 
individually.  However, when I have them both together I get something like 
this:

LOCUS       Contig_381.1             220 bp    dna     linear   UNK 
ACCESSION   unknown
FEATURES             Location/Qualifiers
     CDS             <3
                     /locus_id="M32_06772"
                     /gene="Contig_381.1_1"

/translation="AYGARAKGGVVAWEPLPVQYADYALWQQELLGGGGDAGGVVSQG
                     LAYWEGVLEGLPQELELPVDRPRPVRPS"
ORIGIN      
        1 gggcgtacgg ggcgcgtgcc aagggtgggg tggtggcgtg ggagccgttg ccggtgcagt
       61 acgcggatta cgcgttgtgg cagcaggagc tgctcggggg tgggggtgat gcgggcggtg
      121 tggtctcgca agggttggcg tattgggagg gggtgctgga ggggttgccg caggagctgg
      181 agttgccggt ggaccggccg cgtccggtgc gtccgtccta
//

You can see that the the CDS location annotation has the appropriate 5' 
annotation, but not the 3' (should be >218).  Note that the translation is 
parsed from the Prodigal output and not generated directly in BioPerl.  I'm 
a somewhat new to BioPerl, so am hoping that I'm missing something 
relatively simple, but a search of the mailing list did not reveal an 
obvious answer.  I'm using BioPerl v1.6.901 on Ubuntu LINUX 12.04 (Biolinux 
7).  My code is to create a multiple gbk file from a set of bacterial gene 
models, as predicted using Prodigal using multiple contigs as a template.

Many thanks in advance for the help

Jonathan Klassen
Postdoctoral Fellow
University of Wisconsin-Madison

abbreviated code:

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SeqUtils;
use Bio::SeqFeature::Generic;

my $ffn_obj = Bio::SeqIO->new(-file => $ffn_infile, -format => 'fasta', 
-alphabet => 'dna');

while (my $each_gene = $ffn_obj->next_seq){

    # parse prodigal ffn fasta headers

    my $gene_id = $each_gene->display_id;
    $gene_id =~ /^(.*)_\d+$/ or die "Cannot match gene_id: $gene_id";
    my $contig_id = $1;
    my $gene_desc = $each_gene->desc;
    $gene_desc =~ /\# (\d+) \# (\d+) # (-1|1) # .*;partial=(0|1)(0|1);/ or 
die "Cannot match gene_desc: $gene_desc";
    my $start         = $1;
    my $end           = $2;
    my $orientation   = $3;
    my $five_partial  = $4;
    my $three_partial = $5;

    # accommodates partial genes

##############################################
# comment out these next 6 lines to reproduce the error
##############################################

    if ($five_partial == 1){
        $start = "<".$start;
    }
    if ($three_partial == 1){
        $end = ">".$end;
    }

    # creates annotation feature for this gene

    my $feature = new Bio::SeqFeature::Generic(
        -start       => $start,
        -end         => $end,
        -strand      => $orientation,
        -primary_tag => 'CDS',
    );

    # hash of arrays, keyed by contig_id

    push  <at> {$genes{$contig_id}{gene_obj}}, $each_gene;
    push  <at> {$genes{$contig_id}{feature_obj}}, $feature;
    push  <at> {$genes{$contig_id}{gene_id}}, $gene_id;
    push  <at> {$genes{$contig_id}{locus_id}}, $locus_id;
}

# read through fna, annotating and outputting genes and prots

my $genome_gbk  = Bio::SeqIO->new(-file => ">$output_prefix.gbk", -format 
=> 'genbank', alphabet => 'dna'    );
my $fna_obj = Bio::SeqIO->new(-file => $fna_infile, -format => 'fasta', 
-alphabet => 'dna');
while (my $contig = $fna_obj->next_seq){
    my $contig_id = $contig->display_id;

    for my $a (0..$#{$genes{$contig_id}{gene_obj}}){
        $contig->add_SeqFeature($genes{$contig_id}{feature_obj}[$a]);
    }
    $genome_gbk->write_seq($contig);
}

Gmane