Jason Stajich | 26 Jul 06:04 2013
Picon

Re: counting gaps in a column of alignment

This is pretty inefficient - the multiple calls to slice - if you don't need the random access aspect of
slice, but instead are going to process every column, I would just operate on the strings themselves.

Look at how gap_col_matrix function is implemented  - it takes all the strings from the seqs and does the
calculation in one pass.

you just want to walk through each sequence and then each position in each sequence and update an array which
has a hash in each slot. The hash contains the nucleotide counts e.g.

 $counts[0]->{A}  would tell you the number of A's in Column 0.

you would build up this datastructure with 

for my $seq ( $aln->each_seq ) {
 my  <at> seqarray = split(//,$seq->seq);
  for my $col ( 0..$aln->length ) {
   my $base = $seqarray[$col]
    $counts[$col]->{$base}++;
  }
}

Or you can use substr instead of split to operate on the string itsself.

On Jul 24, 2013, at 9:37 PM, subha kalyanamoorthy <sksweety24 <at> gmail.com> wrote:

> Hi there,
> 
> I am a new bioperl user.  I made a script to count the nucleotide
> composition in each column of the alignment. I am able to get the count for
> the nucleotides, but not for the gap characters from the following program.
> I would greatly appreciate any suggestions to correct this program.
> 
> Thanks.
> 
> #***************************My Program**********************************
> 
> #!/bin/perl -w
> use strict;
> use warnings;
> 
> use List::Util 'max';
> use Bio::SimpleAlign;
> use Bio::Align::AlignI;
> use Bio::AlignIO;
> use Bio::SeqIO;
> 
> my $in= Bio::AlignIO->new( -file => "seq.fst", -format => "fasta");
> 
> my $align = $in->next_aln();
> 
> print "column\tA's\tT's\C's\G's\n";
> 
> for (my $i = 1; $i <= $align->length; $i++) {
> 
> my %count;
> 
> my $seqs = $align->slice($i,$i);
> 
> my $gap_char = $seqs->gap_char();
> 
>    my $count_A=0;
>    my $count_C=0;
>    my $count_T=0;
>    my $count_G=0;
>    my $count_N=0;
>    my $count_gap=0;
> 
>    foreach my $seq ($seqs->each_seq) {
> 
>    my $col=$seq->seq;
> 
>        if ($col eq 'A'){
>        $count_A++;
>        }elsif ($col eq 'C'){
>        $count_C++;
>        }elsif ($col eq 'T'){
>        $count_T++;
>        }elsif ($col eq 'G'){
>        $count_G++;
>        }elsif ( $col eq 'N'){
>        $count_N++;
>        }elsif ($col =~ m/^\Q$gap_char\E$/){
>        $count_gap++;
> }
>      $count{$seq->seq} += 1;
> }
> 
> print"$i\t$count_A\t$count_T\t$count_C\t$count_G\n";
> 
> }
> 
> #***********************************************************************`
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l <at> lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason.stajich <at> gmail.com
jason <at> bioperl.org

Gmane