Florent Angly | 3 Dec 03:36 2012

Bio::DB::Fasta and threads

Hi all,

This is in response to Carson Holt's report that Bio::DB::Fasta does not 
play well with threads: https://redmine.open-bio.org/issues/3397

The first issue is the serialization of objects that inherit from 
Bio::DB::IndexedBase (e.g. Bio::DB::Fasta and Bio::DB::Qual), which is 
needed for threading (for example when using Thread::Queue::Any). I 
implemented hooks that make it transparent to serialize these objects 
using Storable freeze() and thaw().
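
To illustrate the idea, here is a minimal sketch of Storable hooks, using 
only core modules. My::IndexedFile is a hypothetical stand-in class, not 
the actual Bio::DB::IndexedBase code, and the choice of serializing only 
the file path is an assumption made for the example. The object holds an 
open filehandle, which Storable cannot serialize by itself, so the hooks 
store the path and reopen the file when thawing:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

package My::IndexedFile;    # hypothetical class, not part of BioPerl

sub new {
    my ($class, $path) = @_;
    my $self = bless { path => $path }, $class;
    $self->_open;
    return $self;
}

# (Re)open the underlying file and keep the handle in the object.
sub _open {
    my ($self) = @_;
    open my $fh, '<', $self->{path}
        or die "Cannot open $self->{path}: $!";
    $self->{fh} = $fh;
    return;
}

sub first_line {
    my ($self) = @_;
    seek $self->{fh}, 0, 0;
    my $line = readline $self->{fh};
    chomp $line if defined $line;
    return $line;
}

# Called by Storable instead of walking the object: keep only the path.
sub STORABLE_freeze {
    my ($self, $cloning) = @_;
    return $self->{path};
}

# Called by Storable on an empty blessed hash: restore the path, reopen.
sub STORABLE_thaw {
    my ($self, $cloning, $serialized) = @_;
    $self->{path} = $serialized;
    $self->_open;
    return;
}

package main;
use File::Temp qw(tempfile);

# Write a tiny FASTA file to have something to open.
my ($out, $fasta) = tempfile(SUFFIX => '.fa', UNLINK => 1);
print {$out} ">seq1\nACGT\n";
close $out;

my $db   = My::IndexedFile->new($fasta);
my $copy = thaw( freeze($db) );    # round trip goes through the hooks
print $copy->first_line, "\n";
```

With hooks like these, the caller does not need to know that the object 
contains something that cannot be serialized directly.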

Another issue was the lack of communication between different 
Bio::DB::IndexedBase instances, which means that an instance could 
easily be writing or deleting the database that another instance is 
working on. To fix this, I needed some form of locking.

Some of the database backends of Bio::DB::IndexedBase (DB_File) have 
some support for locking, but Bio::DB::IndexedBase also supports other 
backends for which there is no native locking mechanism. So, I had to 
come up with a more general solution: a lock file. I noticed that 
Bio::DB::SeqFeature::Store::berkeleydb has a locking mechanism, based on 
flock(), which means that it does not work with NFS-mounted filesystems. 
All the Bioperl-based scripts I (and most likely many others) write run 
on servers that use NFS, so this support is important. I have found only 
one way to do the NFS locking safely, using File::SharedNFSLock. It has 
a few downsides though:
     1/ it is an external dependency,
     2/ it does not work on FAT filesystems (which should be mostly 
restricted to USB sticks nowadays): the lock is never acquired, and
     3/ at the moment, it requires a patch to work in threaded context 
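
For readers unfamiliar with lock files on NFS, here is a minimal sketch 
of the classic hard-link technique, using only core modules. It is not 
the File::SharedNFSLock code, and try_lock() and unlock() are 
hypothetical helper names. Each process creates a private file, tries to 
hard-link it to the shared lock name, and then checks the link count of 
its private file instead of trusting the return value of link(). The 
sketch does not handle stale locks or waiting:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Sys::Hostname qw(hostname);

# Try to take the lock once. Returns the path of our private file on
# success, or nothing if somebody else holds the lock.
sub try_lock {
    my ($lock_path) = @_;
    # Private file name that should be unique across hosts and processes.
    my $private = join '.', $lock_path, hostname(), $$, int(rand(1e6));
    open my $fh, '>', $private or die "Cannot create $private: $!";
    close $fh;
    # The result of link() is deliberately ignored; the link count of
    # the private file tells us whether the link was really created.
    link $private, $lock_path;
    my $nlink = (stat $private)[3];
    return $private if defined $nlink && $nlink == 2;
    unlink $private;    # lock is busy: clean up our private file
    return;
}

# Release a lock previously obtained with try_lock().
sub unlock {
    my ($lock_path, $private) = @_;
    unlink $lock_path;
    unlink $private;
    return;
}

# Small demonstration in a temporary directory.
my $dir  = tempdir(CLEANUP => 1);
my $lock = "$dir/db.index.lock";

my $first  = try_lock($lock);    # nobody holds the lock yet
my $second = try_lock($lock);    # lock is already taken
print 'first attempt:  ', (defined $first  ? 'acquired' : 'busy'), "\n";
print 'second attempt: ', (defined $second ? 'acquired' : 'busy'), "\n";

unlock($lock, $first);
my $third = try_lock($lock);     # lock was released in between
print 'third attempt:  ', (defined $third ? 'acquired' : 'busy'), "\n";
```

On a filesystem without hard links, link() always fails and the link 
count never reaches 2, so this kind of lock is never acquired.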

Note that while I have now added basic support for threads in 
Bio::DB::IndexedBase, I still get segfaults in specific cases, 
for example when returning a database or sequence object. This might be 
related to this issue: 
https://rt.perl.org/rt3/Ticket/Display.html?id=115972. Beyond this, the 
new code seems to work nicely. See the branch 
https://github.com/bioperl/bioperl-live/tree/storable_db if you want to 
test yourself. For example, one can now run multiple threads, each of 
them creating a Bio::DB::Fasta database from the same FASTA file: the 
first thread performs the indexing while the others wait for the 
indexing to finish before querying the database.

Comments welcome. Regards,