Jason Stajich | 17 Feb 18:42 2012

Fwd: [Bioperl-guts-l] [BioPerl - Bug #3328] (New) segregating sites calculation fails on gapped sequences


This should be an easy bug for someone to fix. I am pretty sure the solution is to ignore gapped columns, but I
haven't looked deeper and I don't have time right now to work on BioPerl fixes, so it would be great if someone
wanted to help out here.
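
To make "ignore gapped columns" concrete, here is a minimal sketch of the idea. It is plain Python, not BioPerl, and it is not the Bio::PopGen code: the function name count_sites, the GAP_CHARS set, and the working definition of a singleton site (a segregating column in which some allele is carried by exactly one sequence) are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: these characters mark alignment gaps.
GAP_CHARS = {"-", "."}


def count_sites(seqs, gap_chars=GAP_CHARS):
    """Return (segregating_sites, singleton_sites), skipping gapped columns.

    Hypothetical helper for illustration only; not part of BioPerl.
    """
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")

    segregating = 0
    singletons = 0
    for column in zip(*seqs):
        # Complete deletion: drop any column in which at least one
        # sequence has a gap, so gaps never count as an allele.
        if any(base in gap_chars for base in column):
            continue
        counts = Counter(base.upper() for base in column)
        if len(counts) < 2:
            continue  # monomorphic column
        segregating += 1
        # Singleton site: some allele is carried by exactly one sequence.
        if 1 in counts.values():
            singletons += 1
    return segregating, singletons


# The alignment from the bug report quoted below.
alignment = [
    "ATGATCGTAGCTGATGCTGTGATCGATCGCTAGCTAGCTCGA",
    "------------GATGCTGTGATCGATCGCTAGCTAGCTCGA",
    "------------GATGCTGTGATCGATCGCTAGCTAGCTCGA",
    "------------GATGCTGTGATCGATCGCTAGCTAGCTCGA",
]
print(count_sites(alignment))  # (0, 0)
```

With the 12 leading gapped columns dropped, the remaining 30 columns are identical in all four sequences, so both counts are zero. Whether the real fix belongs in the alignment-reading step or in the statistics code is still open.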

The redmine bug info is appended below.

Jason

Begin forwarded message:

> From: redmine <at> redmine.open-bio.org
> Subject: [Bioperl-guts-l] [BioPerl - Bug #3328] (New) segregating sites calculation fails on gapped sequences
> Date: February 17, 2012 9:39:42 AM PST
> To: bioperl-guts-l <at> lists.open-bio.org
> 
> 
> Issue #3328 has been reported by Jason Stajich.
> 
> ----------------------------------------
> Bug #3328: segregating sites calculation fails on gapped sequences
> https://redmine.open-bio.org/issues/3328
> 
> Author: Jason Stajich
> Status: New
> Priority: Normal
> Assignee: Bioperl Guts
> Category: Bio::PopGen
> Target version: 
> URL: 
> 
> 
> 
>   I am Cheng-Ruei Lee, a graduate student in Duke Biology. I'm analyzing many DNA alignments of a plant species.
>   I first used Bio::PopGen::Utilities->aln_to_population() to read in the FASTA-format alignments,
> and then used Bio::PopGen::Statistics to calculate some statistics without an outgroup. Most genes work
> fine, but I think a bug occurs with alignments like this:
> 
> >Genotype1
> ATGATCGTAGCTGATGCTGTGATCGATCGCTAGCTAGCTCGA
> >Genotype2
> ------------GATGCTGTGATCGATCGCTAGCTAGCTCGA
> >Genotype3
> ------------GATGCTGTGATCGATCGCTAGCTAGCTCGA
> >Genotype4
> ------------GATGCTGTGATCGATCGCTAGCTAGCTCGA
> 
>   I got this data set from other people. I guess that, because of the annotation program they used, the annotated
> coding sequence is much longer in genotype 1 than in the other genotypes. This creates a long stretch of gaps at
> the very beginning. Whenever Bio::PopGen encounters this kind of gene, the singleton count is greatly inflated;
> it seems that the long stretch of gapped sites is also counted as singletons. Some Fu & Li
> statistics are inflated as well. The number of segregating sites does not seem to be affected, so there
> are genes with hundreds of singleton sites but only a few segregating sites in total.
>   Could this be a bug in Bio::PopGen::Utilities when reading in the data, or in the singleton calculation?
> 
> Sincerely,
> Cheng-Ruei Lee <cl134 <at> duke.edu>
> 
> 

Jason Stajich
jason.stajich <at> gmail.com
jason <at> bioperl.org
