5 Jun 2013 21:58
Re: Protein Records without Sequence
Hamish McWilliam <hamish.mcwilliam <at> bioinfo-user.org.uk>
2013-06-05 19:58:51 GMT
Hi Warren,

This is due to a change to the way some entries are handled in the latest RefSeq (protein) data. From the RefSeq release notes (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release59.txt):

---
ATTENTION: anticipated change in release 60 or 61

The RefSeq project is planning a significant expansion of the prokaryotic dataset. Specifically, NCBI's prokaryotic genome annotation pipeline is generating annotation for all submitted prokaryotic genomes representing strains, disease outbreak sequences, population sequencing, and diversity studies. These genomes include both complete genomes and WGS (draft) genomes submitted as 500 or fewer contigs.

There will be a significant increase in the number of prokaryotic genomes and proteins provided in the RefSeq release. We will avoid providing redundant protein records by providing a single protein record when identical proteins can be annotated on more than one genome.

A new RefSeq protein accession prefix, WP_, will be used for these proteins. The accession has the format:

  WP_ + 9 numerals + version number, e.g., WP_000000001.1

WP_ records will be 'protein-only' records. When an identical protein is annotated on more than one bacterial genome record, the annotated CDS will point to the same WP_ accession. Thus, a given WP_ record may represent a protein found in more than one strain or in more than one bacterial 'species'.

A separate announcement with more details will be provided in the next few weeks.
---

This is implemented using the GenBank 'CONTIG' field to point to the actual sequence entry from the various annotation entries. This means that the flat-file data no longer contains the sequence, but NCBI Entrez can construct the sequence when it is requested (e.g. when you ask for fasta), in a similar way to the handling of contig entries in GenBank and RefSeq (nucleotide).
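As a rough illustration of the above, here is a minimal Python sketch (not BioPerl, and not an official NCBI client) that recognises the WP_ accession format, detects a flat-file record that carries a CONTIG line but no residues under ORIGIN, and builds the E-utilities EFetch URL to request the same record as fasta. The function names and the sample record are made up for illustration. The sketch assumes it is handed the text of a single record, and it only constructs the URL rather than contacting NCBI.

```python
import re
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# "WP_ + 9 numerals + version number", as described in the release notes.
WP_ACCESSION = re.compile(r"^WP_\d{9}\.\d+$")


def is_wp_accession(accession):
    """Return True if the string looks like a versioned WP_ accession."""
    return WP_ACCESSION.match(accession) is not None


def is_contig_only(flatfile_text):
    """Return True if a single flat-file record has a CONTIG line
    and no sequence lines under ORIGIN (hypothetical helper)."""
    has_contig = False
    in_origin = False
    for line in flatfile_text.splitlines():
        if line.startswith("CONTIG"):
            has_contig = True
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin and line.strip():
            return False  # residues are present in the record itself
    return has_contig


def fasta_url(record_id):
    """Build an EFetch URL asking for the protein record as fasta text."""
    query = urlencode({
        "db": "protein",
        "id": record_id,
        "rettype": "fasta",
        "retmode": "text",
    })
    return EFETCH + "?" + query


if __name__ == "__main__":
    # Made-up, heavily truncated record used only to exercise the helper.
    sample = (
        "LOCUS       WP_000000001\n"
        "CONTIG      join(WP_000000001.1:1..120)\n"
        "//\n"
    )
    if is_contig_only(sample):
        print(fasta_url("23099847"))
```

This is essentially the workaround described in the original question: check whether the flat-file record came back without a sequence, and if so request the fasta form, which Entrez assembles on demand.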
All the best,

Hamish

On 5 June 2013 19:16, Warren Gallin <wgallin <at> ualberta.ca> wrote:
> Hi,
>
> I am encountering a problem with a number of protein records.
>
> A HMMer search of the nr database returns a gi number and an associated sequence.
>
> When I use that gi number to try to retrieve the full GENBANK record, however, there is no sequence returned with the record.
>
> When I use the NCBI web interface and use that gi number the GENPEPT record returns with no sequence, but when I select fasta format the sequence is returned.
>
> Examples of gi numbers for which this occurs are:
>
> 23099847
> 21224301
> 68536697
> 46580017
> 77359109
>
> Is this a flaw with the individual GENPEPT records? In which case should I report it to NCBI?
>
> Or are these some kind of "special" record that needs different parameters passed on the eutils search?
>
> There is a workaround, I guess, where if the sequence comes back empty then a new retrieval of fasta formatted records can be run and the empty field in the GENPEPT record repopulated, but this seems inelegant.
>
> All advice and/or commentary appreciated.
>
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l <at> lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
"Saying the internet has changed dramatically over the last five years is cliché – the internet is always changing dramatically" - Craig Labovitz, Arbor Networks.