Thanks for the detailed feedback. The real reason I had to write my own
parser is that even with close, repeated support from NCBI we couldn't get
XML output with short_web_blast.pl because the parameter that turns on XML
output was not functioning (they've probably fixed it by now), and I had to
crank out a parser asap to support a job talk.
I don't think the upstream and downstream feature reports are particularly
useful, because in mammals they tend to be so far away that they are not
likely to be biologically relevant. But the internal motif reports are
useful, maybe especially if you are blasting short reads. A 16-mer
preserved domain hit is really good if you're blasting 18-mer Illumina
short reads, like I was.
As far as my involvement goes, I got diagnosed with cancer on Wednesday, so
I'll be taking a step back until next week's surgery and taking a lot of
deep breaths. On the other hand, this just makes me more motivated: I've
been thinking a lot about time, and timely contributions, the last two days.
From: Jason Stajich
To: Dan kilburn
Cc: "[email protected]"
Sent: Friday, February 1, 2013 1:58 AM
Subject: Re: [Bioperl-l] Bioperl-l Digest, Vol 117, Issue 13
I think the answer is yes if others are doing it - I am not in a position
to be much of a main coder.
I don't know which format you speak of here or if you had to write
something for the text blast changes or something else. Specific bug
reports on formats that aren't working are always helpful. The XML format
has been pretty stable, so I would suggest using it if you are simply
parsing reports rather than reading them.
Chris posted instructions on how to contribute and the move to github
simplifies this. That you had to write a whole new parser seems probably
a bit severe - I hope that in the future people can speak to the problems
sooner. If I hit a wall with something I can't do I usually write the code
to fix it and contribute it back but I don't play follow-the-format-changes
with the tools anymore, but hopefully others like yourself can make the
If you are referring to the response I made to the question below, I don't
think anyone will try to support NCBI's additional markups that refer
to the upstream and downstream features as they are laid out in the text
files without some serious effort. Perhaps in the future that information
will be reported in the XML format and thus be more parseable.
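
For anyone who does need those flanking-feature lines from a text report in the meantime, here is a minimal sketch in Python (not BioPerl). It assumes the block looks like the excerpt quoted further down in this thread: a "Features flanking this part of subject sequence:" header followed directly by lines of the form "<N> bp at 5' side: <description>". The function name parse_flanking_features and the sample text are hypothetical, made up for illustration; this is not a SearchIO feature, and the text layout may well change between BLAST releases.

```python
import re
from typing import Dict, List, Union

# Hypothetical helper, not part of BioPerl or any NCBI tool.
# Assumed line shape: "<distance> bp at <5|3>' side: <description>"
FLANK_RE = re.compile(r"^\s*(\d+)\s+bp at ([35])' side:\s*(.+?)\s*$")
HEADER = "Features flanking this part of subject sequence"


def parse_flanking_features(report_text: str) -> List[Dict[str, Union[int, str]]]:
    """Collect flanking-feature lines that directly follow the header line."""
    features: List[Dict[str, Union[int, str]]] = []
    in_block = False
    for line in report_text.splitlines():
        if line.strip().startswith(HEADER):
            in_block = True
            continue
        if in_block:
            match = FLANK_RE.match(line)
            if match:
                features.append(
                    {
                        "distance_bp": int(match.group(1)),
                        "side": match.group(2) + "'",
                        "description": match.group(3),
                    }
                )
            else:
                # First non-matching line ends the block (assumption).
                in_block = False
    return features


# Sample text modelled on the report excerpt quoted in this thread.
sample = """gb|CP002686.1| Arabidopsis thaliana chromosome 3, complete sequence
Features flanking this part of subject sequence:
   8872 bp at 5' side: cytochrome P450 90B1
   402 bp at 3' side: U1 small nuclear ribonucleoprotein-70K
"""

for feature in parse_flanking_features(sample):
    print(feature["side"], feature["distance_bp"], feature["description"])
```

This only recovers the distance, side and description text; it does not tie each block back to its hit, which would need the surrounding report context.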
On Jan 30, 2013, at 1:40 PM, Dan kilburn wrote:
>Are there any plans to keep SearchIO up to date with ncbi blast? I know
they change formats ridiculously often, but I had to write my own parser to
get sequence identity, which I would rather not have done. I realize that
this job would be a big load on anyone who takes it, but it's so
fundamental. Maybe I can help.
>On Jan 30, 2013, at 12:00 PM, [email protected] wrote:
>> 1. Re: Parsing Blast-Report extracting "Features flanking .."
>> (Jason Stajich)
>>Date: Tue, 29 Jan 2013 11:00:16 -0800
>>From: Jason Stajich
>>Subject: Re: [Bioperl-l] Parsing Blast-Report extracting "Features
>> flanking .."
>>To: [email protected]
>>Cc: [email protected]
>>We don't parse the NCBI feature info from the BLAST reports per your
query. To look up a specific feature you can use Bio::DB::GenBank to query
for sequence from a specific feature by accession number - see the HOWTOs
>>However, most people use tools that generate SAM/BAM files with short
reads - then you can use a tool like bedtools to find overlaps of reads
with the locations of features.
>>- download the genome and GFF for Arabidopsis
>>- align your sRNA to the genome with a short read aligner - bowtie, bwa,
>>- convert your SAM file to BAM with SAMtools or Picard
>>- compare the location of features with the reads to get expression
summaries or individual reads with BEDTools
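
To illustrate what the last step computes, here is a toy sketch in Python of counting reads per feature by interval overlap. It is not how BEDTools is implemented and is no substitute for it on real data: it is a plain nested loop, and the Interval type, the function names and all coordinates below are made up for illustration. It assumes 0-based, half-open coordinates as in BED files.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """A named region on a chromosome (hypothetical type for this sketch)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def count_reads_per_feature(
    reads: List[Interval], features: List[Interval]
) -> Dict[str, int]:
    """Count, for each feature, how many reads overlap it."""
    counts = {feature.name: 0 for feature in features}
    for read in reads:
        for feature in features:
            if overlaps(read, feature):
                counts[feature.name] += 1
    return counts


# Made-up coordinates, for illustration only.
features = [
    Interval("chr3", 90, 160, "geneA"),
    Interval("chr3", 500, 600, "geneB"),
]
reads = [
    Interval("chr3", 100, 118, "read1"),
    Interval("chr3", 150, 168, "read2"),
    Interval("chr1", 100, 118, "read3"),
]

counts = count_reads_per_feature(reads, features)
print(counts)
```

With real alignments the read intervals would come from the BAM file and the features from the GFF, and a dedicated tool handles strand, sorting and scale far better than this loop.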
>>On Jan 25, 2013, at 2:20 AM, jobu wrote:
>>On 22.01.2013 at 19:03, Mgavi Brathwaite wrote:
>>>What upstream and downstream elements are you interested in?
>>>I've got a huge pile of short RNA reads.
>>>Part of the question now is whether those RNA fragments originate from
>>>or may represent miRNAs / parts of pre-miRNAs.
>>>So I did an online blast search against database nt.
>>>The resulting report quite often just gives subject information like
>>>gb|CP002686.1| Arabidopsis thaliana chromosome 3, complete sequence
>>>Now I would like to get the hit's neighbouring regions for further
>>>Preferably I would like to do that in an automated way, but the only
>>>possible action with this kind of subject gi | description would be to
>>>fetch the entire chromosomal sequence I guess ?
>>>right below the line above, the report states more precisely:
>>>Features flanking this part of subject sequence:
>>>8872 bp at 5' side: cytochrome P450 90B1
>>>402 bp at 3' side: U1 small nuclear ribonucleoprotein-70K
>>>Still I would like to have the possibility to automatically fetch the
>>>as of now I think parsing the report with SearchIO won't let me acquire
>>>that information, because SearchIO does not recognize report sections
>>>I hope I did not miss any of SearchIOs capabilities, but I could not
>>>find any method covering my wish?!
>>>Right now maybe the only way to get the information I want is to
>>>construct my own parser and write it out into a separate file, which in
>>>turn again I could read into a hash before processing the Blast-Report
>>>with SearchIO to combine both data for further automated work.
>>>I am aware though that even successfully getting the flanking features
>>>would leave me with the more or less wide intergenic gap my hsp is
>>>However I'm in need of a way to get the flanking features including
>>>their annotation and the region spanning between them.
>>>But I hope I do not have to get complete sequences to accomplish that,
>>>as this would be kind of an overkill.
>>>with kind regards
>>>Bioperl-l mailing list
>>End of Bioperl-l Digest, Vol 117, Issue 13