From: Scott Markel <Scott.Markel <at> accelrys.com>
Subject: RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed
Newsgroups: gmane.comp.lang.perl.bio.general
Date: Tuesday 30th July 2013 21:18:52 UTC
According to today's "[Refseq-announce] Post-release 60: human supplemental
files & bacterial record format" both CONTIG and ORIGIN are now allowed in
a GenBank-formatted entry.  See below (*) or the second bullet of http://www.ncbi.nlm.nih.gov/mailman/pipermail/refseq-announce/2013q3/000110.html
for details.

This change breaks Bio::SeqIO::genbank: when a CONTIG line is present, the
sequence data following ORIGIN is not read, so $seq->seq() does not return
a sequence string.  See lines 713-741 of Bio::SeqIO::genbank.
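
To make the expected behaviour concrete, here is a minimal sketch in Python
(not BioPerl code, and not a proposed patch) of a reader that keeps the CONTIG
value and still collects the residues after ORIGIN.  The function name
parse_flatfile_tail and the simplified line handling are my own assumptions
for illustration; a real parser has to deal with many more keywords.

```python
import re

def parse_flatfile_tail(record_text):
    """Return (contig, sequence) from a GenBank-style flat file record.

    contig is the raw CONTIG value (or None if absent); sequence is the
    concatenated ORIGIN residues (or None if there is no ORIGIN block).
    Hypothetical helper for illustration only.
    """
    contig_parts = []
    residues = []
    section = None       # block we are currently inside, if any
    saw_origin = False
    for line in record_text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("CONTIG"):
            section = "contig"
            contig_parts.append(line[len("CONTIG"):].strip())
        elif line.startswith("ORIGIN"):
            # Keep going even if a CONTIG line was already seen.
            section = "origin"
            saw_origin = True
        elif line[:1].isspace() and section == "contig":
            # Continuation line of a long CONTIG value.
            contig_parts.append(line.strip())
        elif line[:1].isspace() and section == "origin":
            # Drop position numbers and spacing, keep only the letters.
            residues.append(re.sub(r"[^A-Za-z]", "", line))
        else:
            section = None                 # some other keyword or blank line
    contig = "".join(contig_parts) or None
    sequence = "".join(residues) if saw_origin else None
    return contig, sequence
```

With a record that has both lines, the function returns both the CONTIG
string and the sequence, rather than treating CONTIG as a reason to skip
ORIGIN.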

Note that this is related to the "Protein Records without Sequence" thread
(http://article.gmane.org/gmane.comp.lang.perl.bio.general/26708).

Scott

(*) Details on the change

[3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.

Under the new data model for bacterial proteins, a subset of records
continue to provide an organism-oriented package of protein records. These
records use traditional RefSeq accession prefixes (NP, YP) and include a
pointer to the identical non-redundant WP protein record.  Those NP and YP
records that have been updated to refer to a non-redundant WP protein
record, such as YP_008335932.1, include the following flat file display
details:

. Genome Annotation Data structured comment is also displayed on protein
records for the subset of bacterial genomes that have gone through the
updated NCBI prokaryotic annotation pipeline.
. Records include both a CONTIG line, which refers to the non-redundant WP
protein accession, and also an ORIGIN with the sequence residues following.
The sequence shown is from the WP protein record.

CONTIG      join(WP_015644991.1:1..273)
ORIGIN      
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
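
As a sanity check on records like the one above, the span named in the CONTIG
line can be compared with the number of residues under ORIGIN.  The following
is a rough Python sketch; the helper names are hypothetical, and it only
understands plain ACC:start..end segments (no complement() or gap() operators).

```python
import re

# One "accession:start..end" segment inside a join(...) expression.
_SEGMENT = re.compile(r"([A-Za-z0-9_.]+):(\d+)\.\.(\d+)")

def contig_segments(contig):
    """Parse 'join(ACC:start..end,...)' into (accession, start, end) tuples."""
    return [(acc, int(start), int(end))
            for acc, start, end in _SEGMENT.findall(contig)]

def contig_span_length(contig):
    """Total number of positions covered by all segments (inclusive ranges)."""
    return sum(end - start + 1 for _, start, end in contig_segments(contig))

print(contig_span_length("join(WP_015644991.1:1..273)"))  # prints 273
```

This agrees with the ORIGIN block shown, whose last line starts at position
241 and carries 33 residues.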

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  [email protected]
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
5005 Wateridge Vista Drive          voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLOS Computational Biology
Editorial Board: Briefings in Bioinformatics
 