No idea how NCBI are planning to handle this. My guess is that the
announcement mentioned in the RefSeq release notes will detail how to
deal with this. Since I cannot find this announcement, I suggest you
contact the RefSeq folks to see what their plan is.
FWIW, from the phrasing of the information in the release notes, it looks
like this may have gone live a bit earlier than originally planned, so
some things may not be in place yet...
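
In the meantime, the workaround discussed further down the thread (if the GenPept record comes back without a sequence, fetch the same ID again as FASTA) could be sketched roughly as follows. This is only an illustration in plain Python rather than BioPerl: the function names are made up for the example, the accession pattern is taken from the release notes quoted below, and the `rettype` values `gp` and `fasta` are the usual E-utilities efetch settings for the protein database. Treat it as a starting point under those assumptions, not as the way NCBI intends these records to be handled.

```python
import re
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession format as quoted from the release notes: WP_ + 9 numerals + version.
WP_ACCESSION = re.compile(r"^WP_\d{9}\.\d+$")


def efetch_url(seq_id, rettype):
    """Build an efetch URL for a protein record ('gp' = GenPept, 'fasta')."""
    query = urllib.parse.urlencode(
        {"db": "protein", "id": seq_id, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def fetch_text(seq_id, rettype):
    """Download one record as text (needs network access)."""
    with urllib.request.urlopen(efetch_url(seq_id, rettype)) as handle:
        return handle.read().decode("utf-8")


def genpept_sequence(record):
    """Return the residues from the ORIGIN block, or '' if there is none."""
    residues = []
    in_origin = False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines look like '        1 mkvlaa gtrls ...'
            residues.extend(re.findall(r"[A-Za-z]+", line))
    return "".join(residues).upper()


def fasta_sequence(fasta):
    """Return the sequence of the first record in FASTA text."""
    chunks = []
    seen_header = False
    for line in fasta.splitlines():
        if line.startswith(">"):
            if seen_header:
                break  # stop at the second record
            seen_header = True
        elif seen_header and line.strip():
            chunks.append(line.strip())
    return "".join(chunks).upper()


def sequence_with_fallback(seq_id, fetch=fetch_text):
    """Use the GenPept sequence if present, otherwise re-fetch as FASTA."""
    sequence = genpept_sequence(fetch(seq_id, "gp"))
    if not sequence:
        sequence = fasta_sequence(fetch(seq_id, "fasta"))
    return sequence
```

Calling `sequence_with_fallback("490164010")` would try the GenPept record first and only make the second request when the flat file carries no ORIGIN block. The `fetch` argument is there so the download step can be swapped out (for a cache, or for testing without network access).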
All the best,
On 5 June 2013 23:10, Warren Gallin wrote:
> OK, I see where this is coming from. If I get a record without the
protein sequence I can evaluate it and then retrieve again as a fasta file
and put it into the Bio::Seq object.
> Do you have any idea how they are going to handle this one-to-many
mapping? Given that a single protein sequence may be linked to multiple
different nucleotide sequences, even from different species, that means
that a single protein sequence record may be tied to several different
nucleotide sequence records.
> However, when I look up a couple of WP_XXXXXX records, I get the protein
sequence but there is no DBSOURCE field, and there is no "coded_by" tag in
the feature table so there is no direct way to find the underlying
> See WP_004062662.1 GI:490164010 as an example.
> Is there any current way of retrieving the coding sequence starting from
a protein record like this?
> Warren Gallin
> I am hoping that all of these fields that will now have multiple entries
are going to be easily
> On 2013-06-05, at 1:58 PM, Hamish McWilliam
>> Hi Warren,
>> This is due to a change to the way some entries are handled in the
>> latest RefSeq (protein) data. From the RefSeq release notes
>>  ATTENTION: anticipated change in release 60 or 61
>> The RefSeq project is planning a significant expansion of the
>> prokaryotic dataset.
>> Specifically, NCBI's prokaryotic genome annotation pipeline is
>> generating annotation for all
>> submitted prokaryotic genomes representing strains, disease outbreak
>> sequences, population
>> sequencing, and diversity studies. These genomes include both complete
>> genomes and WGS (draft)
>> genomes submitted as 500 or fewer contigs. There will be a significant
>> increase in the number
>> of prokaryotic genomes and proteins provided in the RefSeq release.
>> We will avoid providing redundant protein records by providing a
>> single protein record when
>> identical proteins can be annotated on more than one genome.
>> A new RefSeq protein accession prefix, WP_, will be used for these
>> The accession has the format:
>> WP_ + 9 numerals + version number, e.g., WP_000000001.1
>> WP_ records will be 'protein-only' records. When an identical protein
>> is annotated on more
>> than one bacterial genome record, the annotated CDS will point to the
>> same WP_ accession.
>> Thus, a given WP_ record may represent a protein found in more than
>> one strain or in more
>> than one bacterial 'species'.
>> A separate announcement with more details will be provided in the next
>> few weeks.
>> This is implemented using the GenBank 'CONTIG' field to point to the
>> actual sequence entry from the various annotation entries. This means
>> that the flat-file data no longer contains the sequence, but NCBI
>> Entrez can construct the sequence when it is requested (e.g. when you
>> ask for fasta), in a similar way to the handling of contig entries in
>> GenBank and RefSeq (nucleotide).
>> All the best,
>> On 5 June 2013 19:16, Warren Gallin wrote:
>>> I am encountering a problem with a number of protein records.
>>> A HMMer search of the nr database returns a gi number and an associated
>>> When I use that gi number to try to retrieve the full GENBANK record,
however, there is no sequence returned with the record.
>>> When I use the NCBI web interface and use that gi number the GENPEPT
record returns with no sequence, but when I select fasta format the sequence
>>> Examples of gi numbers for which this occurs are:
>>> Is this a flaw with the individual GENPEPT records? In which case
should I report it to NCBI?
>>> Or are these some kind of "special" record that needs different
parameters passed on the eutils search?
>>> There is a workaround, I guess, where if the sequence comes back empty
then a new retrieval of fasta-formatted records can be run and the empty
field in the GENPEPT record repopulated, but this seems inelegant.
>>> All advice and/or commentary appreciated.
>>> Warren Gallin
"Saying the internet has changed dramatically over the last five years
is cliché – the internet is always changing dramatically" - Craig
Labovitz, Arbor Networks.