Scott Markel | 18 Jun 17:54 2013

problems parsing XML results from BLAST+ version of psiblast running in batch mode

Short version -

How do I use the Bio::Search::* modules to parse the XML results from the BLAST+ version of psiblast running in
batch mode?  The iteration numbers form a single sequence across all queries, so I can't tell which iteration goes with which query sequence.

Long version -

I'm running NCBI BLAST+ psiblast (version 2.2.27+) in batch mode with XML output.  Unlike the older
(non-plus) BLAST version, which creates a <BlastOutput>...</BlastOutput> tag pair for each query sequence,
the BLAST+ version creates a single <BlastOutput>...</BlastOutput> tag pair containing all iterations for
all query sequences.  The iteration numbers run continuously across the query sequences, i.e., they
don't restart at 1 for a new query sequence.

So how can I tell which iteration goes with which query sequence?

Each iteration does carry <Iteration_query-ID>...</Iteration_query-ID> and
<Iteration_query-def>...</Iteration_query-def> tag pairs (the per-iteration counterparts of
<BlastOutput_query-ID> and <BlastOutput_query-def>) that could be used to match iterations to
queries, but there are no subroutines in Bio::Search::Iteration::GenericIteration providing
access to these values.

An XML output file fragment showing the tag pairs is pasted below.

Any suggestions on workarounds or a pointer to something obvious that I'm missing would be greatly appreciated.

Scott

#########################

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>psiblast</BlastOutput_program>
...
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>lcl|1 no description available</BlastOutput_query-def>
  <BlastOutput_query-len>100</BlastOutput_query-len>
  <BlastOutput_param>
...
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>lcl|1 no description available</Iteration_query-def>
      <Iteration_query-len>100</Iteration_query-len>
      <Iteration_hits>
...
      </Iteration_hits>
      <Iteration_stat>
...
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>Query_2</Iteration_query-ID>
      <Iteration_query-def>lcl|2 no description available</Iteration_query-def>
      <Iteration_query-len>100</Iteration_query-len>
      <Iteration_hits>
...
      </Iteration_hits>
      <Iteration_stat>
...
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
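
One possible workaround, sketched here outside BioPerl and not a Bio::Search::* solution: read the XML directly and group iterations by their <Iteration_query-ID> value. The snippet below uses only Python's standard xml.etree.ElementTree module. The function name group_iterations_by_query and the trimmed SAMPLE_XML string are hypothetical and exist only for illustration; the tag names are the ones in the fragment above.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative stand-in for the psiblast XML fragment shown above.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_program>psiblast</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>lcl|1 no description available</Iteration_query-def>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>Query_2</Iteration_query-ID>
      <Iteration_query-def>lcl|2 no description available</Iteration_query-def>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def group_iterations_by_query(xml_text):
    """Map each Iteration_query-ID to its description and iteration numbers.

    Hypothetical helper: it relies only on the per-iteration tags visible
    in the fragment, not on any BioPerl accessor.
    """
    root = ET.fromstring(xml_text)
    groups = {}
    for iteration in root.iter("Iteration"):
        query_id = iteration.findtext("Iteration_query-ID")
        query_def = iteration.findtext("Iteration_query-def")
        iter_num = int(iteration.findtext("Iteration_iter-num"))
        # First time a query ID is seen, record its description.
        entry = groups.setdefault(
            query_id, {"query_def": query_def, "iter_nums": []}
        )
        entry["iter_nums"].append(iter_num)
    return groups


if __name__ == "__main__":
    for query_id, info in group_iterations_by_query(SAMPLE_XML).items():
        print(query_id, info["query_def"], info["iter_nums"])
```

Because the global iteration numbers are collected in document order, the position of a number within a query's list gives that query's own round number (first entry is round 1, and so on), assuming psiblast writes each query's iterations in order.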

Scott Markel, Ph.D.
