From: Peter Cock <p.j.a.cock <at> googlemail.com>
Subject: Re: RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed
Newsgroups: gmane.comp.lang.perl.bio.general
Date: Tuesday 30th July 2013 22:21:47 UTC
Thanks Scott,

I fell foul of this change recently while testing Biopython
against GenPept format records from NCBI Entrez. I'd assumed
it was a short-term glitch:

https://github.com/biopython/biopython/commit/b6ccd0f05804c944f23e4e38df877e30761f491e

https://github.com/biopython/biopython/commit/f3a9e33e428a5ecfb490d4f6f0ede7695fcde0d2

CC'ing the cross project list in case BioRuby or BioJava
are also impacted.

Regards,

Peter

On Tue, Jul 30, 2013 at 10:18 PM, Scott Markel wrote:
> According to today's "[Refseq-announce] Post-release 60: human
> supplemental files & bacterial record format" both CONTIG and ORIGIN are
> now allowed in a GenBank-formatted entry.  See below (*) or the second
> bullet of http://www.ncbi.nlm.nih.gov/mailman/pipermail/refseq-announce/2013q3/000110.html
> for details.
>
> This change breaks Bio::SeqIO::genbank in the sense that the existence of
> the CONTIG line means that the sequence data following ORIGIN will not be
> read and $seq->seq() will not return a sequence string.  See lines 713-741
> of Bio::SeqIO::genbank.
>
> Note that this is related to the "Protein Records without Sequence"
> thread (http://article.gmane.org/gmane.comp.lang.perl.bio.general/26708).
>
> Scott
>
> (*) Details on the change
>
> [3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.
>
> Under the new data model for bacterial proteins, a subset of records
> continue to provide an organism-oriented package of protein records. These
> records use traditional RefSeq accession prefixes (NP, YP) and include a
> pointer to the identical non-redundant WP protein record.  Those NP and YP
> records that have been updated to refer to a non-redundant WP protein
> record, such as YP_008335932.1, include the following flat file display
> details:
>
> . Genome Annotation Data structured comment is also displayed on protein
>   records for the subset of bacterial genomes that have gone through the
>   updated NCBI prokaryotic annotation pipeline.
> . Records include both a CONTIG line, which refers to the non-redundant
>   WP protein accession, and also an ORIGIN with the sequence residues
>   following. The sequence shown is from the WP protein record.
>
> CONTIG      join(WP_015644991.1:1..273)
> ORIGIN
>         1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
>        61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
>       121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
>       181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
>       241 elslkndeif ykgavryigm svlgmgvfdr yfl
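
To illustrate the format change quoted above, here is a minimal Python sketch of a reader that keeps both the CONTIG value and the ORIGIN residues instead of treating them as mutually exclusive. It is not the Bio::SeqIO::genbank or Biopython code; the function name parse_contig_and_origin and the SAMPLE string are hypothetical, and the sketch assumes continuation lines are indented and that the record ends with "//" as in a standard flat file.

```python
import re

# Hypothetical sample: the CONTIG/ORIGIN fragment from the announcement,
# with an assumed "//" record terminator appended.
SAMPLE = """\
CONTIG      join(WP_015644991.1:1..273)
ORIGIN
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
//
"""


def parse_contig_and_origin(text):
    """Return (contig, sequence) from a GenBank/GenPept flat-file fragment.

    Either value is None when the corresponding section is absent, so a
    record may carry CONTIG only, ORIGIN only, or both.
    """
    contig_parts = []
    seq_parts = []
    section = None  # which section the current indented lines belong to
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            break  # end of record
        if line.startswith("CONTIG"):
            section = "contig"
            contig_parts.append(line[len("CONTIG"):].strip())
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif line[0].isspace():
            # Indented line: continuation of whichever section is open.
            if section == "contig":
                contig_parts.append(line.strip())
            elif section == "origin":
                # Drop the position numbers and the spaces between blocks.
                seq_parts.append(re.sub(r"[\s\d]", "", line))
        else:
            section = None  # some other top-level keyword
    contig = "".join(contig_parts) or None
    sequence = "".join(seq_parts) or None
    return contig, sequence


if __name__ == "__main__":
    contig, sequence = parse_contig_and_origin(SAMPLE)
    print(contig)
    print(len(sequence), sequence[:10])
```

The point of the design is only that seeing CONTIG does not stop the reader from collecting ORIGIN; how a real parser should expose both values is a separate question.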
>
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect  email:  [email protected]
> Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
> 5005 Wateridge Vista Drive          voice:  +1 858 799 5603
> San Diego, CA 92121                 fax:    +1 858 799 5222
> USA                                 web:    http://www.accelrys.com
>
> http://www.linkedin.com/in/smarkel
> Secretary, Board of Directors:
>     International Society for Computational Biology
> Chair: ISCB Publications and Communications Committee
> Associate Editor: PLOS Computational Biology
> Editorial Board: Briefings in Bioinformatics