dimitark | 12 Dec 05:06 2013

update: question temp files in blast

Hi Francisco and Chris,
thank you both for your replies. I have read quite a bit about
File::Temp and its problems with forking, and then about rand() and
srand(). I tried some options and even installed and tried
Math::Random::Secure, but with no result: my threads were still
breaking. Then I modified StandAloneBlastPlus and BlastMethods further
so that I can specify the name of my temp.fas as well. Now I
can specify my own tempdir and tempfas. However, my threads are still
breaking. I'm frustrated.
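
A minimal sketch of the "name the temp locations explicitly" idea, for illustration only. It is not the actual patch to StandAloneBlastPlus or BlastMethods; the helper name chunk_workspace and the file naming scheme are assumptions. The chunk index is baked into the directory template, so two workers get different directories even if their random suffixes were identical:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: one private temp directory per FASTA chunk.
# The chunk index is part of the directory name, so uniqueness does
# not depend on the random suffix that File::Temp appends.
sub chunk_workspace {
    my ($base_dir, $chunk_idx) = @_;
    my $dir = tempdir(
        sprintf('blast_chunk%03d_XXXXXX', $chunk_idx),
        DIR     => $base_dir,    # must already exist
        CLEANUP => 0,            # leave removal to the caller, not to an END block
    );
    return {
        dir   => $dir,
        fasta => File::Spec->catfile($dir, "chunk$chunk_idx.fas"),
        out   => File::Spec->catfile($dir, "chunk$chunk_idx.blast_out.txt"),
    };
}
```

Each worker would then call something like chunk_workspace($base, $i) once, with its own $i, and pass the returned paths on to whatever runs BLAST.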

I attach my Perl script as a ZIP here in case someone can see whether
there is something wrong. I fail to see it.

Basically, what I am trying to do is this: say I have one DB, 100K seqs
which I want to blast, and 60 available CPU threads. BLAST+ can run
multi-threaded, so I split my FASTA into, say, 10 parts of 10K seqs
each. Then I try to run 10 instances of BLAST against the same DB, with
each instance using 60/10 = 6 threads.
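
The arithmetic of that plan can be sketched roughly as follows. This is only an illustration under assumptions: the helper names plan_jobs and blast_command are made up, and blastn is assumed as the program (the message does not say which BLAST+ program is used). The -query, -db, -out and -num_threads options are standard BLAST+ command-line options.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: divide $n_seqs sequences and $n_cpus threads
# evenly over $n_jobs BLAST+ instances. Sequence indices are 0-based.
sub plan_jobs {
    my ($n_seqs, $n_cpus, $n_jobs) = @_;
    my $per_job = ceil($n_seqs / $n_jobs);
    my $threads = int($n_cpus / $n_jobs) || 1;    # at least one thread per job
    my @jobs;
    for my $i (0 .. $n_jobs - 1) {
        my $first = $i * $per_job;
        last if $first >= $n_seqs;                # no sequences left for this job
        my $last = $first + $per_job - 1;
        $last = $n_seqs - 1 if $last > $n_seqs - 1;
        push @jobs, { first => $first, last => $last, threads => $threads };
    }
    return @jobs;
}

# Build (but do not run) the argument list for one chunk.
# 'blastn' is an assumption; substitute the program actually used.
sub blast_command {
    my ($query, $db, $out, $threads) = @_;
    return ('blastn', '-query', $query, '-db', $db,
            '-out', $out, '-num_threads', $threads);
}
```

With plan_jobs(100_000, 60, 10) this gives 10 jobs of 10,000 sequences each at 6 threads per job, matching the numbers above. Returning the command as a list keeps it suitable for the list form of system(), which avoids shell quoting problems with file names.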

But somewhere something goes wrong and I can't see where. My next try is
to write my own wrapper for BLAST+ and see whether my threads still
break. But I am really not that eager to do it :)

If someone can help me see where the problem is in my logic or program,
it would be greatly appreciated.
Thank you!

Cheers
Dimitar

>
> Today's Topics:
>
>    1. question temp files in blast (dimitark <at> bii.a-star.edu.sg)
>    2. Re: question temp files in blast (Francisco J. Ossandón)
>    3. Re: question temp files in blast (Fields, Christopher J)
>    4. Re: Possible bug in Bio::Restriction::Analysis (Mark Nadel)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 11 Dec 2013 09:53:52 +0800
> From: dimitark <at> bii.a-star.edu.sg
> Subject: [Bioperl-l] question temp files in blast
> To: bioperl-l <at> lists.open-bio.org
> Message-ID:
> 	<20131211095352.Horde.TgGqeF6XPvFSp8WwWJQXU3A <at> webmailintern.bii.a-star.edu.sg>
>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed; DelSp=Yes
>
> Hi guys,
> I have a question about StandAloneBlastPlus and File::Temp.
>
> I encountered a problem which arises from File::Temp in my particular
> script. In a previous email I said I forced StandAloneBlastPlus to
> accept a TEMP_DIR which I supply by modifying BlastMethods.pm and
> StandAloneBlastPlus.pm. This works, but not always, and that is because
> File::Temp uses the built-in Perl function rand(), which uses
> srand().
>
> Now in brief: my script splits a large FASTA into smaller ones
> and for each of the smaller ones starts a new thread of BLAST
> with as many threads as desired. It also creates a special TEMP_DIR
> for each thread, in which the temp BLAST files are stored: file.fas and
> the blast_result. However, because of rand(), some file names clash
> because there is not enough randomness, and some of my
> threads die, not always but very often.
>
> So my question is the following. Should I try to modify
> BlastMethods.pm and StandAloneBlastPlus.pm further so that I can
> manually specify the names of the temp files, or use another
> module like Math::Random::Secure to produce a really random
> number which I can then pass to srand() after I create my threads, so
> that no temp file names clash?
>
> The easiest is to just use an additional module, but that means more
> dependencies just for one random number. On the other hand, if I modify
> the current modules I will be sure that there won't be any chance of
> temp file names clashing at all, and no further dependencies.
>
> I am sorry if my email seems too messy, but I tried to keep it really brief.
>
> Any advice is welcomed!
>
> Thank you for your time
>
> Cheers
> Dimitar
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Wed, 11 Dec 2013 09:57:19 -0300
> From: "Francisco J. Ossandón" <fossandonc <at> hotmail.com>
> Subject: Re: [Bioperl-l] question temp files in blast
> To: <dimitark <at> bii.a-star.edu.sg>, <bioperl-l <at> lists.open-bio.org>
> Message-ID: <SNT147-DS22760B25620727617DCCBFCFDD0 <at> phx.gbl>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hello Dimitar,
> You expect to have several instances of the script running at the same
> time??
>
> If there is only 1 instance for the script, it could be easier to assign an
> increasing counter for the smaller fastas (seq1.fa, seq2.fa... seqX.fa), and
> then use the fasta filename as base for the blast output filename
> (seq1.blastout.txt, seq2.blastout.txt... seqX.blastout.txt).
>
> If there are multiple instances, you could add to the filename the original
> fasta name and the 'time' function return value (I think it would be
> unlikely to process 2 files with the same name and starting at the same
> time). Something like:
>
> my $in_file = 'original.fa';
> my $time = time;
> my $counter = 0;
> foreach my $fasta_piece (@fasta_pieces) {
> 	$counter++;
> 	# Strip the '.fa' extension from the input name to get the base name
> 	my ($file_out) = ($in_file =~ m/^(.+)\.fa$/i);
> 	$file_out .= ".$time.seq$counter.fa"; # Resulting in 'original.1386766006.seq1.fa'
>
> 	my ($blast_result) = ($file_out =~ m/^(.+)\.fa$/i);
> 	$blast_result .= '.blast_out.txt'; # Resulting in 'original.1386766006.seq1.blast_out.txt'
> }
>
> That would add some specificity (temporary files with the same base name) and
> some randomness (counter and execution time). The filenames can be a little
> long but I like it because all files are grouped by their base name, so I
> can list/copy/move/delete them together.
>
> Or maybe that's not enough for your needs??
>
> Cheers,
>
> Francisco J. Ossandon
>
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 11 Dec 2013 14:00:09 +0000
> From: "Fields, Christopher J" <cjfields <at> illinois.edu>
> Subject: Re: [Bioperl-l] question temp files in blast
> To: Francisco J. Ossandón <fossandonc <at> hotmail.com>
> Cc: BioPerl List <bioperl-l <at> lists.open-bio.org>,
> 	"dimitark <at> bii.a-star.edu.sg" <dimitark <at> bii.a-star.edu.sg>
> Message-ID: <B9F48E7C-B6A3-41E5-8014-BCCD28ADB981 <at> illinois.edu>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I think File::Temp generates the random file string based on the  
> time stamp (common practice in UNIX), which rounds to the second.   
> Might be wrong, but that could be causing the problem, as files  
> could be created at the same time in threads/forks. See this link,  
> which also discusses solutions:
>
> https://metacpan.org/pod/File::Temp#Forking
>
> chris
>
>
>
>
>
> ------------------------------
>
> Message: 4
> Date: Wed, 11 Dec 2013 11:11:02 -0500
> From: Mark Nadel <nadel <at> nabsys.com>
> Subject: Re: [Bioperl-l] Possible bug in Bio::Restriction::Analysis
> To: bioperl-l <at> lists.open-bio.org
> Message-ID:
> 	<CAG=4UpjdfoQE-do_2eZjbjjBFxmum+8oPAogP3BRGv3bd=ec7Q <at> mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Chris,
>
> Thanks for your interest. Here is some code that will generate the data to
> which I refer in my earlier post:
>
> use Bio::DB::GenBank;
> use Bio::Restriction::IO;
> use Bio::Restriction::Analysis;
> use Bio::Restriction::EnzymeCollection;
>
> # Fetch the sequence by accession
> my $db = Bio::DB::GenBank->new();
> my $seq = $db->get_Seq_by_acc('U00096');
>
> # Read the custom enzyme definitions (REBASE 'withrefm' format)
> my $rebase = Bio::Restriction::IO->new(
>       -file   =>  '/Users/marknadel/Documents/adhoc_withrefm.txt',
>       -format => 'withrefm' );
> my $rebase_collection = $rebase->read();
>
> my $ra = Bio::Restriction::Analysis->new(-seq=>$seq,-enzymes=>$rebase_collection);
> my $all_cutters = $ra->cutters;
>
> # Print each cutting enzyme, followed by its cut positions
> foreach my $enz ($all_cutters->each_enzyme()){
>     print("\n");
>     print($enz->name());
>     print("\n");
>     my @z = $ra->positions($enz->name());
>     my $k = $#z;
>     for (my $j=0;$j<=$k;$j++){
>         print "\t$z[$j]";
>     }
> }
>
> print "\nDONE";
>
>
> Unfortunately, the enzymes that I mentioned in the post are not included in
> the base distribution. Here is a very brief file to use:
>
> <1>Nt.Bpu10I
>
> <2>
>
> <3>CCTNAGC(-5/?)
>
>
> <1>Bpu10I
>
> <2>BpuDI
>
> <3>CCTNAGC(-5/-2)
>
> <4>?(5)
>
> <5>Bacillus pumilus 10
>
> <6>NEB 1777
>
> <7>FINV
>
> <8>Degtyarev, S.K., Zilkin, P.A., Prihodko, G.G., Repin, V.E., Rechkunova,
> N.I., (1989) Mol. Biol. (Mosk), vol. 23, pp. 1051-1056.
>
> Stankevicius, K., Lubys, A., Timinskas, A., Vaitkevicius, D., Janulaitis,
> A., (1998) Nucleic Acids Res., vol. 26, pp. 1084-1091.
>
> This is the file /Users/marknadel/Documents/adhoc_withrefm.txt used in the
> snippet above.
>
> Thanks again,
>
> Mark
>
> --
> *Mark Nadel*
>
> *Principal Scientist*
> Nabsys Inc.
> 60 Clifford Street
> Providence, RI  02903
>
> Phone   401-276-9100 x204
> Fax 401-276-9122
>
>
> ------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l <at> lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> End of Bioperl-l Digest, Vol 128, Issue 6
> *****************************************


Attachment (blast_script_Dimitar.zip): application/zip, 3732 bytes