From: Hamish McWilliam <hamish.mcwilliam <at> bioinfo-user.org.uk>
Subject: Re: Protein Records without Sequence
Newsgroups: gmane.comp.lang.perl.bio.general
Date: Saturday 8th June 2013 06:30:20 UTC
No idea how NCBI are planning to handle this. My guess is that the
announcement mentioned in the RefSeq release notes will detail how to
deal with this. Since I cannot find this announcement, I suggest you
contact the RefSeq folks to see what their plan is.

FWIW, from the phrasing of the information in the release notes, it looks
like this may have gone live a bit earlier than originally planned, so
some things may not be in place yet...
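
In the meantime, the workaround described in the quoted message below
(refetch as FASTA when the GenPept record comes back without a sequence)
could be sketched roughly as follows. This is only a sketch under some
assumptions: it uses plain Python and the NCBI E-utilities efetch endpoint
rather than BioPerl, the function names (efetch_url, genpept_sequence,
fetch_protein and so on) are made up for illustration, and it assumes a
sequence-less flat file simply has no ORIGIN section.

```python
import re
import urllib.parse
import urllib.request

# NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(seq_id, rettype):
    """Build an efetch URL for one protein record (rettype 'gp' or 'fasta')."""
    query = urllib.parse.urlencode(
        {"db": "protein", "id": seq_id, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def http_fetch(url):
    """Default fetcher (hypothetical helper): plain HTTP GET, decoded as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def genpept_sequence(record):
    """Return the residues after the ORIGIN line, or '' if there is none."""
    residues = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Sequence lines carry a position number and spaced residue blocks.
            residues.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(residues).upper()


def fasta_sequence(record):
    """Join the sequence lines of a single-entry FASTA text, skipping the header."""
    lines = record.strip().splitlines()
    return "".join(l.strip() for l in lines if not l.startswith(">")).upper()


def fetch_protein(seq_id, fetch=http_fetch):
    """Fetch GenPept text; if it carries no sequence, refetch as FASTA."""
    record = fetch(efetch_url(seq_id, "gp"))
    sequence = genpept_sequence(record)
    if not sequence:
        # Assumed case: the flat file has no ORIGIN section, so ask for FASTA.
        sequence = fasta_sequence(fetch(efetch_url(seq_id, "fasta")))
    return record, sequence
```

Something like fetch_protein("23099847") would then return the flat-file
text for the annotation plus the sequence from whichever format supplied
it. The fetch argument is only there so the logic can be tried without
network access.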

All the best,


On 5 June 2013 23:10, Warren Gallin wrote:
> OK, I see where this is coming from.  If I get a record without the
> protein sequence I can evaluate it and then retrieve it again as a FASTA
> file and put it into the Bio::Seq object.
> Do you have any idea how they are going to handle this one-to-many
> mapping?  Given that a single protein sequence may be linked to multiple
> different nucleotide sequences, even from different species, a single
> protein sequence record may be tied to several different nucleotide
> sequence records.
> However, when I look up a couple of WP_XXXXXX records, I get the protein
> sequence but there is no DBSOURCE field, and there is no "coded_by" tag
> in the feature table, so there is no direct way to find the underlying
> nucleotide sequence(s).
> See WP_004062662.1  GI:490164010 as an example.
> Is there any current way of retrieving the coding sequence starting from
> a protein record like this?
> Warren Gallin
> I am hoping that all of these fields that will now have multiple entries
> are going to be easily
> On 2013-06-05, at 1:58 PM, Hamish McWilliam wrote:
>> Hi Warren,
>> This is due to a change to the way some entries are handled in the
>> latest RefSeq (protein) data. From the RefSeq release notes
>> (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release59.txt):
>> ---
>> [1] ATTENTION: anticipated change in release 60 or 61
>> The RefSeq project is planning a significant expansion of the
>> prokaryotic dataset.
>> Specifically, NCBI's prokaryotic genome annotation pipeline is
>> generating annotation for all
>> submitted prokaryotic genomes representing strains, disease outbreak
>> sequences, population
>> sequencing, and diversity studies. These genomes include both complete
>> genomes and WGS (draft)
>> genomes submitted as 500 or fewer contigs. There will be a significant
>> increase in the number
>> of prokaryotic genomes and proteins provided in the RefSeq release.
>> We will avoid providing redundant protein records by providing a
>> single protein record when
>> identical proteins can be annotated on more than one genome.
>> A new RefSeq protein accession prefix, WP_, will be used for these records.
>> The accession has the format:
>>  WP_ + 9 numerals + version number, e.g., WP_000000001.1
>> WP_ records will be 'protein-only' records.  When an identical protein
>> is annotated on more
>> than one bacterial genome record, the annotated CDS will point to the
>> same WP_ accession.
>> Thus, a given WP_ record may represent a protein found in more than
>> one strain or in more
>> than one bacterial 'species'.
>> A separate announcement with more details will be provided in the next
>> few weeks.
>> ---
>> This is implemented using the GenBank 'CONTIG' field to point to the
>> actual sequence entry from the various annotation entries. This means
>> that the flat-file data no longer contains the sequence, but NCBI
>> Entrez can construct the sequence when it is requested (e.g. when you
>> ask for fasta), in a similar way to the handling of contig entries in
>> GenBank and RefSeq (nucleotide).
>> All the best,
>> Hamish
>> On 5 June 2013 19:16, Warren Gallin wrote:
>>> Hi,
>>> I am encountering a problem with a number of protein records.
>>> A HMMer search of the nr database returns a gi number and an associated
>>> When I use that gi number to try to retrieve the full GENBANK record,
>>> however, there is no sequence returned with the record.
>>> When I use the NCBI web interface and use that gi number the GENPEPT
>>> record returns with no sequence, but when I select FASTA format the
>>> sequence is returned.
>>> Examples of gi numbers for which this occurs are:
>>> 23099847
>>> 21224301
>>> 68536697
>>> 46580017
>>> 77359109
>>> Is this a flaw with the individual GENPEPT records?  In which case
>>> should I report it to NCBI?
>>> Or are these some kind of "special" record that needs different
>>> parameters passed on the eutils search?
>>> There is a workaround, I guess, where if the sequence comes back empty
>>> then a new retrieval of FASTA formatted records can be run and the empty
>>> field in the GENPEPT record repopulated, but this seems inelegant.
>>> All advice and/or commentary appreciated.
>>> Warren Gallin
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> --
>> ----
>> "Saying the internet has changed dramatically over the last five years
>> is cliché – the internet is always changing dramatically" - Craig
>> Labovitz, Arbor Networks.