8 Jun 08:30 2013
Re: Protein Records without Sequence
Hamish McWilliam <hamish.mcwilliam <at> bioinfo-user.org.uk>
2013-06-08 06:30:20 GMT
No idea how NCBI are planning to handle this. My guess is that the
announcement mentioned in the RefSeq release notes will detail how to
deal with this. Since I cannot find this announcement, I suggest you
contact the RefSeq folks
(http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi) to see what
their plan is.

FWIW, from the phrasing of the information in the release notes, it
looks like this may have gone live a bit earlier than originally
planned, so some things may not be in place yet...

All the best,

Hamish

On 5 June 2013 23:10, Warren Gallin <wgallin <at> ualberta.ca> wrote:
> OK, I see where this is coming from. If I get a record without the protein sequence I can evaluate it and then retrieve again as a FASTA file and put it into the Bio::Seq object.
>
> Do you have any idea how they are going to handle this one-to-many mapping? Given that a single protein sequence may be linked to multiple different nucleotide sequences, even from different species, that means that a single protein sequence record may be tied to several different nucleotide sequence records.
>
> However, when I look up a couple of WP_XXXXXX records, I get the protein sequence but there is no DBSOURCE field, and there is no "coded_by" tag in the feature table, so there is no direct way to find the underlying nucleotide sequence(s).
>
> See WP_004062662.1 GI:490164010 as an example.
>
> Is there any current way of retrieving the coding sequence starting from a protein record like this?
>
> Warren Gallin
>
> I am hoping that all of these fields that will now have multiple entries are going to be easily
> On 2013-06-05, at 1:58 PM, Hamish McWilliam <hamish.mcwilliam <at> bioinfo-user.org.uk> wrote:
>
>> Hi Warren,
>>
>> This is due to a change to the way some entries are handled in the
>> latest RefSeq (protein) data.
>> From the RefSeq release notes
>> (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release59.txt):
>>
>> ---
>>
>> ATTENTION: anticipated change in release 60 or 61
>> The RefSeq project is planning a significant expansion of the
>> prokaryotic dataset. Specifically, NCBI's prokaryotic genome
>> annotation pipeline is generating annotation for all submitted
>> prokaryotic genomes representing strains, disease outbreak sequences,
>> population sequencing, and diversity studies. These genomes include
>> both complete genomes and WGS (draft) genomes submitted as 500 or
>> fewer contigs. There will be a significant increase in the number of
>> prokaryotic genomes and proteins provided in the RefSeq release.
>>
>> We will avoid providing redundant protein records by providing a
>> single protein record when identical proteins can be annotated on
>> more than one genome.
>>
>> A new RefSeq protein accession prefix, WP_, will be used for these
>> proteins. The accession has the format:
>>
>> WP_ + 9 numerals + version number, e.g., WP_000000001.1
>>
>> WP_ records will be 'protein-only' records. When an identical protein
>> is annotated on more than one bacterial genome record, the annotated
>> CDS will point to the same WP_ accession.
>>
>> Thus, a given WP_ record may represent a protein found in more than
>> one strain or in more than one bacterial 'species'.
>>
>> A separate announcement with more details will be provided in the
>> next few weeks.
>>
>> ---
>>
>> This is implemented using the GenBank 'CONTIG' field to point to the
>> actual sequence entry from the various annotation entries. This means
>> that the flat-file data no longer contains the sequence, but NCBI
>> Entrez can construct the sequence when it is requested (e.g. when you
>> ask for fasta), in a similar way to the handling of contig entries in
>> GenBank and RefSeq (nucleotide).
>>
>> All the best,
>>
>> Hamish
>>
>> On 5 June 2013 19:16, Warren Gallin <wgallin <at> ualberta.ca> wrote:
>>> Hi,
>>>
>>> I am encountering a problem with a number of protein records.
>>>
>>> A HMMer search of the nr database returns a gi number and an associated sequence.
>>>
>>> When I use that gi number to try to retrieve the full GENBANK record, however, there is no sequence returned with the record.
>>>
>>> When I use the NCBI web interface and use that gi number the GENPEPT record returns with no sequence, but when I select FASTA format the sequence is returned.
>>>
>>> Examples of gi numbers for which this occurs are:
>>>
>>> 23099847
>>> 21224301
>>> 68536697
>>> 46580017
>>> 77359109
>>>
>>> Is this a flaw with the individual GENPEPT records? In which case should I report it to NCBI?
>>>
>>> Or are these some kind of "special" record that needs different parameters passed on the eutils search?
>>>
>>> There is a workaround, I guess, where if the sequence comes back empty then a new retrieval of FASTA-formatted records can be run and the empty field in the GENPEPT record repopulated, but this seems inelegant.
>>>
>>> All advice and/or commentary appreciated.
>>>
>>> Warren Gallin
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l <at> lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>
>> --
>> ----
>> "Saying the internet has changed dramatically over the last five years
>> is cliché – the internet is always changing dramatically" - Craig
>> Labovitz, Arbor Networks.
>>
>

--
----
"Saying the internet has changed dramatically over the last five years
is cliché – the internet is always changing dramatically" - Craig
Labovitz, Arbor Networks.
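
The workaround discussed in the thread (notice that a GenPept record came back without sequence, then request the same record again as FASTA) can be sketched roughly as follows. This is only an illustration in Python with the standard library, not BioPerl code and not anything NCBI or the posters provided. The function names are hypothetical, the record check assumes that sequence data sits between an ORIGIN line and the closing // of the flat file, and the accession pattern is taken from the format quoted in the release notes. The URL uses the NCBI E-utilities EFetch endpoint with db=protein, rettype=fasta and retmode=text. The code only builds the URL and does not perform the request.

```python
import re
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Format quoted in the release notes: WP_ + 9 numerals + version number.
WP_ACCESSION = re.compile(r"WP_\d{9}\.\d+")


def is_wp_accession(accession):
    """Return True if the string matches the WP_ accession format."""
    return WP_ACCESSION.fullmatch(accession) is not None


def has_sequence(genpept_text):
    """Return True if the flat-file text has residues after an ORIGIN line.

    Assumption: a record delivered without sequence either lacks the
    ORIGIN section or has nothing between ORIGIN and the closing //.
    """
    in_origin = False
    for line in genpept_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin and re.search(r"[A-Za-z]", line):
            return True
    return False


def fasta_fallback_url(identifier):
    """Build an EFetch URL asking for the same protein record as FASTA."""
    query = urlencode(
        {"db": "protein", "id": identifier, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH_URL}?{query}"
```

A caller would run has_sequence() on each retrieved record and, only when it returns False, fetch fasta_fallback_url() for that gi number or accession and use the result to fill the empty sequence. This does not address the separate question of finding the coding nucleotide sequence(s) behind a WP_ record.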