From: Hamish McWilliam <hamish.mcwilliam <at> bioinfo-user.org.uk>
Subject: Re: Protein Records without Sequence
Newsgroups: gmane.comp.lang.perl.bio.general
Date: Wednesday 5th June 2013 19:58:51 UTC
Hi Warren,

This is due to a change to the way some entries are handled in the
latest RefSeq (protein) data. From the RefSeq release notes:


[1] ATTENTION: anticipated change in release 60 or 61
The RefSeq project is planning a significant expansion of the
prokaryotic dataset. Specifically, NCBI's prokaryotic genome annotation
pipeline is generating annotation for all submitted prokaryotic genomes
representing strains, disease outbreak sequences, population sequencing,
and diversity studies. These genomes include both complete genomes and
WGS (draft) genomes submitted as 500 or fewer contigs. There will be a
significant increase in the number of prokaryotic genomes and proteins
provided in the RefSeq release.

We will avoid providing redundant protein records by providing a single
protein record when identical proteins can be annotated on more than
one genome.

A new RefSeq protein accession prefix, WP_, will be used for these
The accession has the format:

  WP_ + 9 numerals + version number, e.g., WP_000000001.1

WP_ records will be 'protein-only' records.  When an identical protein
is annotated on more than one bacterial genome record, the annotated
CDS will point to the same WP_ accession.

Thus, a given WP_ record may represent a protein found in more than one
strain or in more than one bacterial 'species'.

A separate announcement with more details will be provided in the next
few weeks.
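
As an aside, the accession format quoted above (WP_ + 9 numerals +
version number) is easy to test for in a script. A minimal sketch in
Python; the function name is my own, and requiring the ".version"
suffix is an assumption, since unversioned accessions may also turn up
in practice:

```python
import re

# Pattern taken from the release-note description:
# "WP_" + 9 numerals + "." + version number, e.g. WP_000000001.1
WP_ACCESSION = re.compile(r"WP_\d{9}\.\d+")


def is_wp_accession(text):
    """Return True if text is a versioned WP_ accession (hypothetical helper)."""
    return WP_ACCESSION.fullmatch(text) is not None
```

For example, is_wp_accession("WP_000000001.1") is True, while an
NP_ accession or a WP_ string with only 8 digits is rejected.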


This is implemented using the GenBank 'CONTIG' field to point to the
actual sequence entry from the various annotation entries. This means
that the flat-file data no longer contains the sequence, but NCBI
Entrez can construct the sequence when it is requested (e.g. when you
ask for fasta), in a similar way to the handling of contig entries in
GenBank and RefSeq (nucleotide).
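
If you need to handle this in a script, something along these lines
might help. It is only a sketch in Python rather than BioPerl: the
helper names are mine, and the check assumes that a sequence-less
flat-file record shows a CONTIG line and no ORIGIN block, which you
should verify against your own records. The URL is the standard NCBI
EFetch endpoint with rettype=fasta.

```python
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def lacks_inline_sequence(genpept_text):
    """Heuristic (assumption): CONTIG line present and no ORIGIN block."""
    has_contig = False
    has_origin = False
    for line in genpept_text.splitlines():
        if line.startswith("CONTIG"):
            has_contig = True
        elif line.startswith("ORIGIN"):
            has_origin = True
    return has_contig and not has_origin


def fasta_url(identifier):
    """Build an EFetch URL that asks Entrez for the record as FASTA."""
    params = {
        "db": "protein",
        "id": identifier,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)
```

When lacks_inline_sequence() returns True for a record, the text behind
fasta_url(gi) can be fetched (e.g. with urllib.request.urlopen) to get
the sequence that Entrez constructs on request.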

All the best,


On 5 June 2013 19:16, Warren Gallin <[email protected]> wrote:
> Hi,
> I am encountering a problem with a number of protein records.
> A HMMer search of the nr database returns a gi number and an associated
> When I use that gi number to try to retrieve the full GENBANK record,
> however, there is no sequence returned with the record.
> When I use the NCBI web interface and use that gi number the GENPEPT
> record returns with no sequence, but when I select FASTA format the
> sequence is returned.
> Examples of gi numbers for which this occurs are:
> 23099847
> 21224301
> 68536697
> 46580017
> 77359109
> Is this a flaw with the individual GENPEPT records?  In which case should
> I report it to NCBI?
> Or are these some kind of "special" record that needs different
> parameters passed on the eutils search?
> There is a workaround, I guess, where if the sequence comes back empty
> then a new retrieval of FASTA-formatted records can be run and the empty
> field in the GENPEPT record repopulated, but this seems inelegant.
> All advice and/or commentary appreciated.
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> [email protected]
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

"Saying the internet has changed dramatically over the last five years
is cliché – the internet is always changing dramatically" - Craig
Labovitz, Arbor Networks.