Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane
From: Jim Hu <jimhu <at> tamu.edu>
Subject: Re: codon usage
Newsgroups: gmane.comp.lang.perl.bio.general
Date: Monday 7th May 2012 23:08:36 UTC (over 4 years ago)
Actually, I was looking at the wrong method for the topic of codon usage. 
Count codons doesn't use the regex; it just peels off triplets and uses a
hash to track the counts.

I was looking at count_monomers, which uses this:
...
#  For each letter, count the number of times it appears in
	#  the sequence
 LETTER:
	foreach $element (@$alphabet) {
		# skip terminator symbol which may confuse regex
		next LETTER if $element eq '*';
		$count{$element} = ( $seqstring =~ s/$element/$element/g);
	}

	if ($_is_instance) {
		$self->{'_monomer_count'} =\% count;  # Save in case called again later
	}

	return\% count;


In this case the theoretical problem is that the regex engine is calling
the $seqstring repeatedly, so that even if the engine is using a state
machine, its repeating the scan of the sequence for each symbol in the
alphabet.  This is still likely to be way faster than an interpreted state
machine, since the elements are just single characters and there are not
many of them.

As I said in the first email, a state machine written for the interpreter
running a more efficient algorthm probably would still lose to beat a
built-in engine running a less efficient algorithm.  And being able to
stream instead of holding the whole sequence in memory is not a concern
now, as it was when we had 4K of RAM.

JH

On May 7, 2012, at 2:15 PM, Joel Martin wrote:

> I think you're under-appreciating the perl regex engine, it operates
> as a state machine not by matching all occurrences every time.  Here
> is a nice excerpt from pro perl parsing
> http://www.devshed.com/c/a/Perl/Parsing-and-Regular-Expression-Basics/2/
> 
> Joel
> 
> On Mon, May 7, 2012 at 8:28 AM, Jim Hu <[email protected]> wrote:
>> I was looking at Bio::Tools::SeqStats.  It reminds me of the very first
bioinformatics program I worked on back when I was in grad school,
sequencing was done by Maxam-Gilbert chemistry on gels with radioactive DNA
(we were doing short reads, but we didn't call it that and the reads per
run was something like 8).  We were writing a program in Apple II basic to
find restriction sites.  Everyone in the group was doing this by putting
the target sequence in a string variable and looking for the site as a
substring match by whatever it was BASIC used.  Loop over all the sites you
are looking for.  This strikes me as the equivalent of how
Bio::Tools::SeqStats works, only with regexes.
>> 
>> My roommate at the time, who was a math PhD student doing an MS in
CompSci, pointed out to me that this would be more efficient using a
discrete finite automaton algorithm, where each site we were looking for
would be a state automaton. This has the advantage of being able to process
the sequence as a stream.  Back when we were working with computers with
RAM measured in Kbytes, this was a big help.  I'm not sure if it would be
worth it today.  The slow interpreted implementation of the state machines
would likely lose to the fast internal implementation of the regex routines
for sequence lengths we are looking at these days.
>> 
>> But it might be interesting to compare.
>> 
>> Jim
>> 
>> On May 6, 2012, at 3:52 PM, Smithies, Russell wrote:
>> 
>>> Or use Bio::Tools::SeqStats
>>> (this is straight from the perldocs http://search.cpan.org/~cjfields/BioPerl-1.6.901/Bio/Tools/SeqStats.pm
)
>>> 
>>>         $seqobj = Bio::PrimarySeq->new(-seq      =>
'ACTGTGGCGTCAACTG',-alphabet => 'dna',-id       => 'test');
>>>         $seq_stats  =  Bio::Tools::SeqStats->new(-seq => $seqobj);
>>> 
>>>         $hash_ref = $seq_stats-> count_codons();  # for nucleic acid
sequence
>>>         foreach $base (sort keys %$hash_ref) {
>>>             print "Number of codons of type ", $base, "= ", 
%$hash_ref->{$base},"\n";
>>>         }
>>> 
>>> 
>>> --Russell
>>> 
>>> 
>>> -----Original Message-----
>>> From: [email protected]
[mailto:[email protected]] On Behalf Of subarna thakur
>>> Sent: Saturday, 21 April 2012 3:00 p.m.
>>> To: [email protected]
>>> Subject: [Bioperl-l] codon usage
>>> 
>>> I am writing a script for determining number of genes containing a
particular codon. The codons are mentioned in a separate file. The output
is coming all right for the first codon mentioned in the file but for the
other codons , the script is not working. Please suggest the error in the
script. The script is as follows ----
>>> 
>>> #!/usr/bin/perl -w
>>> 
>>> use Bio::SeqIO;
>>> 
>>> $file2="table.txt";
>>> 
>>> $codon=0;
>>> 
>>> open OUT, ">out-test.txt" or die $!;
>>> 
>>> $seqio_obj = Bio::SeqIO->new( -file => "gopi2.txt" , '-format'
=> 'Fasta');
>>> 
>>> open( my $fh2, $file2 ) or die "$!";
>>> 
>>> while( my $line = <$fh2> ){
>>> 
>>> $acc=$line;
>>> 
>>> chomp $acc;
>>> 
>>> while ($seq1= $seqio_obj->next_seq){
>>> 
>>> my @output = $seq1->id;
>>> 
>>> my $string = $seq1->seq;
>>> 
>>> $v=0;
>>> 
>>> $l= length($string);
>>> 
>>> $t=$l/3;
>>> 
>>> $k=0;
>>> 
>>> for ($i=1; $i <= $t; $i++){
>>> 
>>> @array2 = substr($string, $k, 3);
>>> 
>>> $k=$k+3;
>>> 
>>> foreach $value (@array2)
>>> 
>>> {
>>> 
>>> if ($value eq "$acc")
>>> 
>>> {
>>> 
>>> print OUT " The sequence id is @output\n";
>>> 
>>> print OUT "$acc codon found in position $i\n\n";
>>> 
>>> $v=$v+1;
>>> 
>>> }
>>> 
>>> }
>>> 
>>> }
>>> 
>>> if ($v==0)
>>> 
>>> {
>>> 
>>> $h=0;
>>> 
>>> }
>>> 
>>> else
>>> 
>>> {
>>> 
>>> $h=1;
>>> 
>>> }
>>> 
>>> $codon=$codon+$h;
>>> 
>>> }
>>> 
>>> print OUT "Total number of sequences with $acc codon";
>>> 
>>> print OUT "\t";
>>> 
>>> print OUT $codon;
>>> 
>>> }
>>> 
>>> exit;
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> [email protected]
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> =======================================================================
>>> Attention: The information contained in this message and/or attachments
>>> from AgResearch Limited is intended only for the persons or entities
>>> to which it is addressed and may contain confidential and/or privileged
>>> material. Any review, retransmission, dissemination or other use of, or
>>> taking of any action in reliance upon, this information by persons or
>>> entities other than the intended recipients is prohibited by AgResearch
>>> Limited. If you have received this message in error, please notify the
>>> sender immediately.
>>> =======================================================================
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> [email protected]
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
>> =====================================
>> Jim Hu
>> Professor
>> Dept. of Biochemistry and Biophysics
>> 2128 TAMU
>> Texas A&M Univ.
>> College Station, TX 77843-2128
>> 979-862-4054
>> 
>> 
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> [email protected]
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l

=====================================
Jim Hu
Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
 
CD: 3ms