Hamish McWilliam | 8 Jun 08:30 2013

Re: Protein Records without Sequence

No idea how NCBI are planning to handle this. My guess is that the
announcement mentioned in the RefSeq release notes will detail how to
deal with this. Since I cannot find this announcement, I suggest you
contact the RefSeq folks
(http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi) to see what
their plan is.

FWIW, from the phrasing of the information in the release notes, it looks
like this may have gone live a bit earlier than originally planned, so
some things may not be in place yet...

All the best,


On 5 June 2013 23:10, Warren Gallin <wgallin <at> ualberta.ca> wrote:
> OK, I see where this is coming from.  If I get a record without the protein sequence I can detect that and then
> retrieve it again as a fasta file and put the sequence into the Bio::Seq object.
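A rough sketch of that re-fetch, in Python rather than BioPerl so it needs nothing beyond the standard library. It assumes the NCBI E-utilities efetch endpoint with its db, id, rettype and retmode parameters; the function names are my own and purely illustrative, and the network call itself is not exercised here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(identifier: str) -> str:
    """Build an efetch URL asking for a protein record in fasta format.

    `identifier` is a GI number or accession.version string.
    """
    params = {
        "db": "protein",
        "id": identifier,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(identifier: str) -> str:
    """Download the fasta text for one protein record (needs network access)."""
    with urlopen(efetch_fasta_url(identifier)) as response:
        return response.read().decode("utf-8")
```

The fasta text that comes back could then be parsed and its sequence copied into the record that arrived empty.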
> Do you have any idea how they are going to handle this one-to-many mapping?  Given that a single protein
> sequence may be linked to multiple different nucleotide sequences, even from different species, that
> means that a single protein sequence record may be tied to several different nucleotide sequence records.
> However, when I look up a couple of WP_XXXXXX records, I get the protein sequence but there is no DBSOURCE
> field, and there is no "coded_by" tag in the feature table, so there is no direct way to find the underlying
> nucleotide sequence(s).
> See WP_004062662.1  GI:490164010 as an example.
> Is there any current way of retrieving the coding sequence starting from a protein record like this?
> Warren Gallin
> I am hoping that all of these fields that will now have multiple entries are going to be easily
> On 2013-06-05, at 1:58 PM, Hamish McWilliam <hamish.mcwilliam <at> bioinfo-user.org.uk> wrote:
>> Hi Warren,
>> This is due to a change to the way some entries are handled in the
>> latest RefSeq (protein) data. From the RefSeq release notes
>> (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release59.txt):
>> ---
>> [1] ATTENTION: anticipated change in release 60 or 61
>> The RefSeq project is planning a significant expansion of the
>> prokaryotic dataset.
>> Specifically, NCBI's prokaryotic genome annotation pipeline is
>> generating annotation for all
>> submitted prokaryotic genomes representing strains, disease outbreak
>> sequences, population
>> sequencing, and diversity studies. These genomes include both complete
>> genomes and WGS (draft)
>> genomes submitted as 500 or fewer contigs. There will be a significant
>> increase in the number
>> of prokaryotic genomes and proteins provided in the RefSeq release.
>> We will avoid providing redundant protein records by providing a
>> single protein record when
>> identical proteins can be annotated on more than one genome.
>> A new RefSeq protein accession prefix, WP_, will be used for these proteins.
>> The accession has the format:
>>  WP_ + 9 numerals + version number, e.g., WP_000000001.1
>> WP_ records will be 'protein-only' records.  When an identical protein
>> is annotated on more
>> than one bacterial genome record, the annotated CDS will point to the
>> same WP_ accession.
>> Thus, a given WP_ record may represent a protein found in more than
>> one strain or in more
>> than one bacterial 'species'.
>> A separate announcement with more details will be provided in the next
>> few weeks.
>> ---
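To pick these records out of a result list, the accession format quoted above (WP_ + 9 numerals + version number) can be checked with a regular expression. This is only a sketch in Python; the helper name is made up for illustration, and it assumes the format stays exactly as the release notes describe it.

```python
import re

# "WP_" + 9 digits + "." + version number, e.g. WP_000000001.1
WP_ACCESSION = re.compile(r"^WP_\d{9}\.\d+$")


def is_wp_accession(accession: str) -> bool:
    """Return True if the string looks like a versioned WP_ accession."""
    return WP_ACCESSION.match(accession) is not None
```

Anything that matches could be routed to whatever special handling these protein-only records turn out to need.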
>> This is implemented using the GenBank 'CONTIG' field to point to the
>> actual sequence entry from the various annotation entries. This means
>> that the flat-file data no longer contains the sequence, but NCBI
>> Entrez can construct the sequence when it is requested (e.g. when you
>> ask for fasta), in a similar way to the handling of contig entries in
>> GenBank and RefSeq (nucleotide).
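One way to spot such an entry before handing it to a parser is to check whether the flat file carries any residues after an ORIGIN line, and whether it has a CONTIG line instead. The following is a minimal Python sketch that assumes the usual GenBank flat-file layout (ORIGIN and CONTIG keywords at the start of a line, "//" ending the record); the function names are hypothetical.

```python
def has_inline_sequence(record: str) -> bool:
    """Return True if the flat file has sequence lines after ORIGIN."""
    in_origin = False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            in_origin = False  # end of record
            continue
        # Sequence lines look like "        1 mkvl..." (number, then residues).
        if in_origin and any(ch.isalpha() for ch in line):
            return True
    return False


def has_contig_line(record: str) -> bool:
    """Return True if the flat file points elsewhere via a CONTIG line."""
    return any(line.startswith("CONTIG") for line in record.splitlines())


def needs_fasta_refetch(record: str) -> bool:
    """Return True if the sequence has to be fetched separately."""
    return not has_inline_sequence(record)
```

A record for which `needs_fasta_refetch` is true would then be requested again in fasta format to obtain the sequence.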
>> All the best,
>> Hamish
>> On 5 June 2013 19:16, Warren Gallin <wgallin <at> ualberta.ca> wrote:
>>> Hi,
>>> I am encountering a problem with a number of protein records.
>>> A HMMer search of the nr database returns a gi number and an associated sequence.
>>> When I use that gi number to try to retrieve the full GENBANK record, however, there is no sequence
>>> returned with the record.
>>> When I use the NCBI web interface with that gi number, the GENPEPT record is returned with no sequence, but
>>> when I select fasta format the sequence is returned.
>>> Examples of gi numbers for which this occurs are:
>>> 23099847
>>> 21224301
>>> 68536697
>>> 46580017
>>> 77359109
>>> Is this a flaw with the individual GENPEPT records?  In which case should I report it to NCBI?
>>> Or are these some kind of "special" record that needs different parameters passed in the eutils search?
>>> There is a workaround, I guess, where if the sequence comes back empty then a new retrieval of fasta
>>> formatted records can be run and the empty field in the GENPEPT record repopulated, but this seems inelegant.
>>> All advice and/or commentary appreciated.
>>> Warren Gallin
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l <at> lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> --
>> ----
>> "Saying the internet has changed dramatically over the last five years
>> is cliché – the internet is always changing dramatically" - Craig
>> Labovitz, Arbor Networks.
