Scott Markel | 30 Jul 23:18 2013

RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed

According to today's "[Refseq-announce] Post-release 60: human supplemental files & bacterial record
format" announcement, a GenBank-formatted entry may now contain both CONTIG and ORIGIN.  See below (*) or
the second bullet of http://www.ncbi.nlm.nih.gov/mailman/pipermail/refseq-announce/2013q3/000110.html
for details.

This change breaks Bio::SeqIO::genbank: when a CONTIG line is present, the sequence data following
ORIGIN is not read, so $seq->seq() does not return a sequence string.  See lines 713-741 of
Bio::SeqIO::genbank.
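
To make the expected behaviour concrete, here is a minimal standalone sketch in Python. It is not BioPerl code and does not reproduce the logic at the lines cited above. The helper name read_origin_sequence is hypothetical, and the sketch assumes a simplified flat file in which only the CONTIG and ORIGIN sections matter. The point is only that a reader can record the CONTIG value and still collect the residues that follow ORIGIN.

```python
import re


def read_origin_sequence(record_text):
    """Return (contig, sequence) from a GenBank-style record.

    Hypothetical helper for illustration: it looks only at the CONTIG
    and ORIGIN sections and ignores everything else in the flat file.
    """
    contig_parts = []
    residues = []
    section = None
    for line in record_text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("CONTIG"):
            section = "contig"
            contig_parts.append(line[len("CONTIG"):].strip())
        elif line.startswith("ORIGIN"):
            # Keep reading residues even if a CONTIG line was seen earlier.
            section = "origin"
        elif line[:1].strip():
            # Any other keyword starting in column 1 opens another section.
            section = None
        elif section == "contig":
            contig_parts.append(line.strip())  # wrapped CONTIG continuation
        elif section == "origin":
            # Drop the position counter and the spaces between blocks.
            residues.append(re.sub(r"[\s\d]", "", line))
    contig = "".join(contig_parts) or None
    return contig, "".join(residues)


record = "\n".join([
    "CONTIG      join(WP_015644991.1:1..273)",
    "ORIGIN",
    "        1 mvfykysgsg ndflivqsfk",
    "//",
])
contig, seq = read_origin_sequence(record)
print(contig)  # join(WP_015644991.1:1..273)
print(seq)     # mvfykysgsgndflivqsfk
```

With this approach a record that has only a CONTIG line yields an empty sequence string, and a record that has both yields the CONTIG value and the residues.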

Note that this is related to the "Protein Records without Sequence" thread (http://article.gmane.org/gmane.comp.lang.perl.bio.general/26708).

Scott

(*) Details on the change

[3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.

Under the new data model for bacterial proteins, a subset of records continue to provide an
organism-oriented package of protein records. These records use traditional RefSeq accession
prefixes (NP, YP) and include a pointer to the identical non-redundant WP protein record.  Those NP and YP
records that have been updated to refer to a non-redundant WP protein record, such as YP_008335932.1,
include the following flat file display details:

- Genome Annotation Data structured comment is also displayed on protein records for the subset of
  bacterial genomes that have gone through the updated NCBI prokaryotic annotation pipeline.
- Records include both a CONTIG line, which refers to the non-redundant WP protein accession, and also an
  ORIGIN with the sequence residues following. The sequence shown is from the WP protein record.

CONTIG      join(WP_015644991.1:1..273)
ORIGIN      
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
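
As a rough consistency check on a record like the one above, the CONTIG value can be parsed and its span compared with the number of residues under ORIGIN. The sketch below is standalone Python under stated assumptions: the helper names parse_contig_join and contig_length are hypothetical, and the regular expression handles only simple accession:start..end segments, not complement() or gap() elements.

```python
import re

# Matches segments such as WP_015644991.1:1..273 (assumed simple form only).
_SEGMENT = re.compile(r"([A-Za-z]{1,4}_?\d+\.\d+):(\d+)\.\.(\d+)")


def parse_contig_join(contig):
    """Hypothetical helper: list (accession, start, end) for each segment."""
    return [(acc, int(start), int(end))
            for acc, start, end in _SEGMENT.findall(contig)]


def contig_length(contig):
    """Total number of positions covered by the segments in the join."""
    return sum(end - start + 1 for _, start, end in parse_contig_join(contig))


contig = "join(WP_015644991.1:1..273)"
print(parse_contig_join(contig))  # [('WP_015644991.1', 1, 273)]
print(contig_length(contig))      # 273
```

For the example record, the span of 273 agrees with the residue count shown under ORIGIN (four lines of 60 plus a final line of 33).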

Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel <at> accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
5005 Wateridge Vista Drive          voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLOS Computational Biology
Editorial Board: Briefings in Bioinformatics
