Peter Cock | 31 Jul 00:21 2013

Re: RefSeq announces a problematic format change - both CONTIG and ORIGIN allowed

Thanks Scott,

I fell foul of this change recently while testing Biopython
with GenPept format records from NCBI Entrez; I'd assumed
it might have been a short-term glitch:

CC'ing the cross project list in case BioRuby or BioJava
are also impacted.
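
For anyone wanting to see the failure mode concretely, here is a minimal
sketch in plain Python. It is not the actual Biopython or BioPerl parser,
and read_flatfile_sequence is a made-up name for illustration. It assumes
CONTIG and ORIGIN are independent sections, so the residues after ORIGIN
are collected whether or not a CONTIG line came first. The sample record
reuses the CONTIG line and residues quoted below; the bare ORIGIN line and
the closing // are assumed from the usual flat file layout.

```python
import re

# Sample built from the quoted announcement. The ORIGIN line and the
# "//" terminator are assumptions based on the usual flat file layout.
SAMPLE = """\
CONTIG      join(WP_015644991.1:1..273)
ORIGIN
        1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
       61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
      121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
      181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
      241 elslkndeif ykgavryigm svlgmgvfdr yfl
//
"""


def read_flatfile_sequence(text):
    """Return (contig, sequence) for one GenBank/GenPept-style record.

    Hypothetical helper: a CONTIG line does not stop the ORIGIN
    residues from being read. Either value is None if absent.
    """
    contig = None
    residues = []
    section = None
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of record
        if line and not line[0].isspace():
            # A keyword in column 1 starts a new top-level section.
            keyword, _, rest = line.partition(" ")
            section = keyword
            if keyword == "CONTIG":
                contig = rest.strip()
            continue
        if section == "CONTIG":
            contig += line.strip()  # wrapped CONTIG continuation line
        elif section == "ORIGIN":
            # Drop the position numbers and spacing, keep the residues.
            residues.append(re.sub(r"[\s\d]", "", line))
    return contig, "".join(residues) or None


if __name__ == "__main__":
    contig, seq = read_flatfile_sequence(SAMPLE)
    print(contig)
    print(len(seq), seq[:10])
```

A parser that instead treats CONTIG as meaning "no sequence in this record"
would return None for the sequence here, which matches the behaviour Scott
describes.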



On Tue, Jul 30, 2013 at 10:18 PM, Scott Markel
<Scott.Markel <at>> wrote:
> According to today's "[Refseq-announce] Post-release 60: human supplemental
> files & bacterial record format" both CONTIG and ORIGIN are now allowed in a
> GenBank-formatted entry.  See below (*) or the second bullet of
> for details.
> This change breaks Bio::SeqIO::genbank in the sense that the existence of the
> CONTIG line means that the sequence data following ORIGIN will not be read and
> $seq->seq() will not return a sequence string.  See lines 713-741 of
> Bio::SeqIO::genbank.
> Note that this is related to the "Protein Records without Sequence" thread (
> Scott
> (*) Details on the change
> [3] Bacterial NP/YP proteins with CONTIG and ORIGIN lines.
> Under the new data model for bacterial proteins, a subset of records continue
> to provide an organism-oriented package of protein records. These records use
> traditional RefSeq accession prefixes (NP, YP) and include a pointer to the
> identical non-redundant WP protein record.  Those NP and YP records that have
> been updated to refer to a non-redundant WP protein record, such as
> YP_008335932.1, include the following flat file display details:
> . Genome Annotation Data structured comment is also displayed on protein
>   records for the subset of bacterial genomes that have gone through the
>   updated NCBI prokaryotic annotation pipeline.
> . Records include both a CONTIG line, which refers to the non-redundant WP
>   protein accession, and also an ORIGIN with the sequence residues following.
>   The sequence shown is from the WP protein record.
> CONTIG      join(WP_015644991.1:1..273)
>         1 mvfykysgsg ndflivqsfk kkdfsnlakq vchrhegfga dglvvvlpsk dydyewdfyn
>        61 sdgskagmcg nasrcvglfa yqhaiasknh vflagkreis icieepniie snlgnykild
>       121 vipalrcekf ftnnsvleni ptfylidtgv phlvgfvenk ewlnslntle lralrhafna
>       181 niniafienk etiflqtyer gvedftlacg tgmaavfiaa rifyntpkka alipksnesl
>       241 elslkndeif ykgavryigm svlgmgvfdr yfl
> Scott Markel, Ph.D.
> Principal Bioinformatics Architect  email:  smarkel <at>
> Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
> 5005 Wateridge Vista Drive          voice:  +1 858 799 5603
> San Diego, CA 92121                 fax:    +1 858 799 5222
> USA                                 web:
> Secretary, Board of Directors:
>     International Society for Computational Biology
> Chair: ISCB Publications and Communications Committee
> Associate Editor: PLOS Computational Biology
> Editorial Board: Briefings in Bioinformatics