OK, I see where this is coming from. If I get a record without the protein
sequence, I can detect that, retrieve the record again as a FASTA file, and
put the sequence into the Bio::Seq object.
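A minimal sketch of that fallback, written in Python rather than BioPerl only to keep it short and free of dependencies. The function names are hypothetical; the efetch parameters (db, id, rettype, retmode) are the standard E-utilities ones. It assumes the GenPept record arrives as flat-file text whose residues, if present, sit in an ORIGIN block. It only builds the follow-up URL and parses a FASTA reply; the actual HTTP request is left out.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def has_inline_sequence(genpept_text):
    """Return True if the flat file has an ORIGIN block containing residues."""
    in_origin = False
    for line in genpept_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if in_origin:
            if line.startswith("//"):
                break
            # Sequence lines are a position number followed by residue blocks.
            if any(ch.isalpha() for ch in line):
                return True
    return False


def fasta_fetch_url(identifier):
    """Build an efetch URL asking for the same protein record as FASTA."""
    params = {
        "db": "protein",
        "id": identifier,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string."""
    lines = [ln.strip() for ln in fasta_text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))
```

If has_inline_sequence() returns False, the record can be requested again via fasta_fetch_url() and the result of parse_fasta_sequence() used to fill in the empty sequence on the object.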
Do you have any idea how they are going to handle this one-to-many mapping?
Given that a single protein sequence may be linked to multiple different
nucleotide sequences, even from different species, that means that a single
protein sequence record may be tied to several different nucleotide sequences.
However, when I look up a couple of WP_XXXXXX records, I get the protein
sequence but there is no DBSOURCE field, and there is no "coded_by" tag in
the feature table so there is no direct way to find the underlying
See WP_004062662.1 GI:490164010 as an example.
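The check described above can be sketched roughly as follows, again in Python and using only the standard library. The function names are hypothetical. The accession pattern follows the format given in the release notes quoted below (WP_ + 9 numerals + version number); the record is assumed to be GenPept flat-file text, where a link back to the nucleotide record would normally show up as a DBSOURCE line or a /coded_by qualifier.

```python
import re

# WP_ + 9 numerals + "." + version number, e.g. WP_000000001.1
WP_ACCESSION = re.compile(r"^WP_(\d{9})\.(\d+)$")


def is_wp_accession(accession):
    """Return True if the string looks like a versioned WP_ accession."""
    return WP_ACCESSION.match(accession) is not None


def nucleotide_links(genpept_text):
    """Collect DBSOURCE lines and /coded_by values from a GenPept flat file."""
    dbsource = [
        line[len("DBSOURCE"):].strip()
        for line in genpept_text.splitlines()
        if line.startswith("DBSOURCE")
    ]
    # A qualifier value may wrap over several lines, so strip the whitespace.
    coded_by = [
        re.sub(r"\s+", "", value)
        for value in re.findall(r'/coded_by="([^"]*)"', genpept_text)
    ]
    return {"dbsource": dbsource, "coded_by": coded_by}
```

For a record like the one above, both lists would come back empty, which is the situation I am asking about.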
Is there any current way of retrieving the coding sequence starting from a
protein record like this?
I am hoping that all of these fields that will now have multiple entries
are going to be easily
On 2013-06-05, at 1:58 PM, Hamish McWilliam
> Hi Warren,
> This is due to a change to the way some entries are handled in the
> latest RefSeq (protein) data. From the RefSeq release notes:
>  ATTENTION: anticipated change in release 60 or 61
> The RefSeq project is planning a significant expansion of the prokaryotic
> dataset. Specifically, NCBI's prokaryotic genome annotation pipeline is
> generating annotation for all submitted prokaryotic genomes representing
> strains, disease outbreak sequences, population sequencing, and diversity
> studies. These genomes include both complete genomes and WGS (draft)
> genomes submitted as 500 or fewer contigs. There will be a significant
> increase in the number of prokaryotic genomes and proteins provided in the
> RefSeq release.
> We will avoid providing redundant protein records by providing a single
> protein record when identical proteins can be annotated on more than one
> genome.
> A new RefSeq protein accession prefix, WP_, will be used for these
> The accession has the format:
> WP_ + 9 numerals + version number, e.g., WP_000000001.1
> WP_ records will be 'protein-only' records. When an identical protein is
> annotated on more than one bacterial genome record, the annotated CDS will
> point to the same WP_ accession. Thus, a given WP_ record may represent a
> protein found in more than one strain or in more than one bacterial
> 'species'.
> A separate announcement with more details will be provided in the next
> few weeks.
> This is implemented using the GenBank 'CONTIG' field to point to the
> actual sequence entry from the various annotation entries. This means
> that the flat-file data no longer contains the sequence, but NCBI
> Entrez can construct the sequence when it is requested (e.g. when you
> ask for fasta), in a similar way to the handling of contig entries in
> GenBank and RefSeq (nucleotide).
> All the best,
> On 5 June 2013 19:16, Warren Gallin <[email protected]> wrote:
>> I am encountering a problem with a number of protein records.
>> A HMMer search of the nr database returns a gi number and an associated
>> When I use that gi number to try to retrieve the full GENBANK record,
>> however, there is no sequence returned with the record.
>> When I use the NCBI web interface and use that gi number the GENPEPT
>> record returns with no sequence, but when I select FASTA format the sequence
>> Examples of gi numbers for which this occurs are:
>> Is this a flaw with the individual GENPEPT records? If so, should I
>> report it to NCBI?
>> Or are these some kind of "special" record that needs different
>> parameters passed in the eutils search?
>> There is a workaround, I guess, where if the sequence comes back empty,
>> a new retrieval of FASTA-formatted records can be run and the empty
>> field in the GENPEPT record repopulated, but this seems inelegant.
>> All advice and/or commentary appreciated.
>> Warren Gallin
> "Saying the internet has changed dramatically over the last five years
> is cliché – the internet is always changing dramatically" - Craig
> Labovitz, Arbor Networks.