Features Download
From: Fields, Christopher J <cjfields <at> illinois.edu>
Subject: Re: get CDS start site for entry in NCBI
Newsgroups: gmane.comp.lang.perl.bio.general
Date: Saturday 20th April 2013 05:24:16 UTC (over 5 years ago)
On Apr 17, 2013, at 6:08 PM, Matthew McCormack

> I am not much of a Perl coder and I have a few questions.
>     First, I would like to write a script that will go to NCBI genebank
and get the base number for the start of the CDS region, e.g. 235 (given a
particular accession number). I have looked at HOWTO's and documentation
for Bio::SeqIO and Bio::DB::GenBank and I can cut and paste the examples
and they work, but I can not figure out how to get what I want; the CDS
start site. I have difficulty knowing what all the methods and their
options are for the seqio object and seq_object. Most of the examples seem
to be using a file to get information and not a website.
>   Actually, what I have to start with is a TAIR locus number such as
AT4g08500, but I can not search on this at NCBI and come up with a unique
entry. I may have to have a table of conversions from TAIR locus number to
accession numbers.

Saying this as a Bioperl developer, there are easier ways than using NCBI. 
I would possibly use Gramene's Biomart for this:


You can select TAIR10, then (under 'Filters') select ID list limit and
paste in your list. Under 'Attributes' there are several options under the
'Features' button; you can change that button to 'Structure' and 'Exon'
will pop up, where you can select CDS start/end.  Using your ID, when you
click results you get (sorry about the tabs):

Ensembl Gene ID	Ensembl Transcript ID	CDS end (within cDNA)	CDS start
(within cDNA)
AT4G08500	AT4G08500.1	1058	1
AT4G08500	AT4G08500.1	1143	1059
AT4G08500	AT4G08500.1	1206	1144
AT4G08500	AT4G08500.1	1364	1207
AT4G08500	AT4G08500.1	1437	1365
AT4G08500	AT4G08500.1	1497	1438
AT4G08500	AT4G08500.1	1611	1498
AT4G08500	AT4G08500.1	1827	1612

Also appears you can also retrieve the protein sequences in bulk from TAIR:


The sequence descriptors contain coords (start-end and strand), this might
suffice if you merely want the translational start/end.

>  Also, I was looking for a bit of advice. What I am doing is getting data
off another web site. I have a script using the WWW::Mechanize module in
which I can input a link and go to that webpage, and then go down a line of
links (over 100) getting information from each link. As part of that
information that I am getting is the number base of a binding site, but I
want to know if that binding site is in the CDS. The start number is the
start of the gene, so say if the binding site is 235, then I want to know
if this is in the CDS. This data is not provided by the website, that is
why I want to go to NCBI and get the start of the CDS. The data at NCBI for
'gene' has the same length as the first webpage, but also contains the
beginning of the CDS, say 299, so with this information I can tel
 l if the binding site is in the CDS. Do you think the best way to do this
is extract the info from the link on the first web page, then go to NCBI
and extract the CDS, then back to the orig!
 inal web page and the next link, and so on, for a couple of hundred links
? Or is there a better way ? I am concerned about a script that will keep
going back to NCBI.

First, I would normally suggest if there is any way to get your hands on
the raw data then do it; however, if you've gone to the extent of using
WWW::Mechanize you've probably gone that route already w/o success.  If you
can get the initial data in batches (e.g. limit the number of requests)
then it might help speed things up.

Also, re: NCBI and other website queries: try to limit these to either a
small # per day (hundred); depending on the site, if they see an inordinate
# of requests from the same IP they may block it to prevent overloading the
site or even potential DDoS attacks (NCBI and UCSC do this for example). 
You can also use EUtilities to retrieve the raw data in bulk.

What you are trying to do is actually very common, namely finding
intersections/unions of features.  BioPerl can do that; many other tools
are also available (BEDtools being the most prominently used).  The key
part is getting it into a format that is acceptable (BED, GFF, etc), but
once that is in place such comparisons aren't terribly hard to do.


> Matthew
> The information in this e-mail is intended only for the person to whom it
> addressed. If you believe this e-mail was sent to you in error and the
> contains patient information, please contact the Partners Compliance
HelpLine at
> http://www.partners.org/complianceline
. If the e-mail was sent to you in error
> but does not contain patient information, please contact the sender and
> dispose of the e-mail.
> _______________________________________________
> Bioperl-l mailing list
> [email protected]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
CD: 4ms