Lapointe, David | 6 Mar 14:46 2013

Re: [Rd] enabling reproducible research & R package management & install.package.version & BiocLite

There are utilities (e.g. dotkit and modules) that facilitate version management, basically creating
PATH and environment setups on the fly, if you are comfortable keeping all of that around.
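
As a rough illustration of what such tools do, here is a minimal Python sketch (not dotkit or modules themselves) that builds a per-version environment on the fly. The function name and the install paths are hypothetical; R_LIBS_SITE is the environment variable R consults for site libraries.

```python
import os

def build_r_env(base_env, r_home, site_library):
    """Return a copy of base_env that selects one R installation.

    r_home and site_library are hypothetical paths chosen by the admin.
    """
    env = dict(base_env)                      # never mutate the caller's env
    bin_dir = os.path.join(r_home, "bin")
    old_path = env.get("PATH", "")
    # Prepend so this R version wins over any other R on the PATH.
    env["PATH"] = bin_dir + (os.pathsep + old_path if old_path else "")
    env["R_LIBS_SITE"] = site_library
    return env

# Hypothetical layout: one directory per installed R version.
env = build_r_env({"PATH": "/usr/bin"}, "/opt/R/2.15.3",
                  "/opt/R/2.15.3/site-library")
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run` to launch R with that setup.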


-----Original Message-----
From: bioconductor-bounces@...
[mailto:bioconductor-bounces@...] On Behalf Of Cook, Malcolm
Sent: Tuesday, March 05, 2013 6:08 PM
To: 'Paul Gilbert'
Cc: 'r-devel@...';
'bioconductor@...'; 'r-discussion@...'
Subject: Re: [BioC] [Rd] enabling reproducible research & R package management &
install.package.version & BiocLite


I think your balanced and reasoned approach addresses all my current concerns.  Nice!  I will likely adopt
your methods.  Let me ruminate.  Thanks for this.

~ Malcolm

 .-----Original Message-----
 .From: Paul Gilbert [mailto:pgilbert902@...]
 .Sent: Tuesday, March 05, 2013 4:34 PM
 .To: Cook, Malcolm
 .Cc: 'r-devel@...'; 'bioconductor@...'; 'r-discussion@...'
 .Subject: Re: [Rd] [BioC] enabling reproducible research & R package management &
 .install.package.version & BiocLite
 .
 .(More on the original question further below.)
 .
 .On 13-03-05 09:48 AM, Cook, Malcolm wrote:
 .> All,
 .>
 .> What got me started on this line of inquiry was my attempt at
 .> balancing the advantages of performing a periodic (daily or weekly)
 .> update to the 'release' version of locally installed R/Bioconductor
 .> packages on our institute-wide installation of R with the
 .> disadvantages of potentially changing the result of an analyst's
 .> workflow in mid-project.
 .
 .I have implemented a strategy to try to address this as follows:
 .
 .1/ Install a new version of R when it is released, and packages in the R
 .version's site-library with package versions as available at the time
 .the R version is installed. Only upgrade these package versions in the
 .case they are severely broken.
 .
 .2/ Install the same packages in site-library-fresh and upgrade these
 .package versions on a regular basis (e.g. daily).
 .
 .3/ When a new version of R is released, freeze but do not remove the old
 .R version, at least not for a fairly long time, and freeze
 .site-library-fresh for the old version. Begin with the new version as in
 .1/ and 2/. The old version remains available, so "reverting" is trivial.
 .
 .The analysts are then responsible for choosing the R version they use,
 .and the library they use. This means they do not have to change R and
 .package version mid-project, but they can if they wish. I think the
 .above two libraries will cover most cases, but it is possible that a few
 .projects will need their own special library with a combination of
 .package versions. In this case the user could create their own library,
 .or you might prefer some more official mechanism.
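
The choice between the frozen library, the fresh library, and an optional project-specific library can be sketched as follows. This is a minimal Python illustration, not RoboAdmin's actual layout: the root directory, the "R-<version>" directory naming, and both function names are assumptions; only the names site-library and site-library-fresh come from the description above.

```python
from pathlib import PurePosixPath

def library_path(root, r_version, fresh=False):
    """Path of the frozen or regularly upgraded library for one R version.

    The "<root>/R-<version>/<library>" layout is a hypothetical convention.
    """
    name = "site-library-fresh" if fresh else "site-library"
    return str(PurePosixPath(root) / ("R-" + r_version) / name)

def library_search_order(root, r_version, fresh=False, project_library=None):
    """Libraries in the order an analyst's session would search them."""
    order = []
    if project_library is not None:
        # A project's own special library takes precedence when present.
        order.append(project_library)
    order.append(library_path(root, r_version, fresh))
    return order

# An analyst who stays on the frozen library for the whole project:
stable = library_search_order("/opt", "2.15.3")
# An analyst who needs an upgraded package plus a private library:
mixed = library_search_order("/opt", "2.15.3", fresh=True,
                             project_library="/home/me/proj/lib")
```

Because the old R version's directories are frozen rather than removed, "reverting" amounts to asking for the earlier `r_version` again.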
 .
 .The idea of the above strategy is to provide the stability one might
 .want for an ongoing project, and the possibility of an upgraded package
 .if necessary, but not encourage analysts to remain indefinitely with old
 .versions (by say, putting new packages in an old R version library).
 .
 .This strategy has been implemented in a set of make files in the project
 .RoboAdmin available at ... It can be done entirely automatically with a
 .cron job. Constructive comments are always appreciated.
 .
 .(IT departments sometimes think that there should be only one version of
 .everything available, which they test and approve. So the initial
 .reaction to this approach could be negative. I think they have not
 .really thought about the advantages. They usually cannot test/approve an
 .upgrade without user input, and timing is often extremely complicated
 .because of ongoing user needs. This strategy is simply shifting
 .responsibility and timing to the users, or user departments, that can
 .actually do the testing and approving.)
 .
 .Regarding NFS mounts, it is relatively robust. There can be occasional
 .problems, especially for users that have a habit of keeping an R session
 .open for days at a time and using site-library-fresh packages. In my
 .experience this did not happen often enough to worry about a "blackout
 .period".
 .
 .Regarding the original question, I would like to think it could be
 .possible to keep enough information to reproduce the exact environment,
 .but I think for potentially sensitive numerical problems that is
 .optimistic. As others have pointed out, results can depend not only on R
 .and package versions, configuration, OS versions, and library and
 .compiler versions, but also on the underlying hardware. You might have
 .some hope using something like an Amazon core instance. (BTW, this
 .problem is not specific to R.)
 .
 .It is true that restricting to a fixed computing environment at your
 .institution may ease things somewhat, but if you occasionally upgrade
 .hardware or the OS then you will probably lose reproducibility.
 .
 .An alternative that I recommend is that you produce a set of tests that
 .confirm the results of any important project. These can be conveniently
 .put in the tests/ directory of an R package, which is then maintained
 .locally, not on CRAN, and built/tested whenever a new R version and
 .packages are installed. (Tools for this are also available at the above
 .indicated web site.) This approach means that you continue to reproduce
 .the old results, or if not, discover differences/problems in the old or
 .new version of R and/or packages that may be important to you. I have
 .been successfully using a variant of this since about 1993, using R and
 .package tests/ since they became available.
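
The idea of tests that confirm a project's results can be sketched in a language-neutral way. The Python below is only an illustration of the comparison step, not the tooling referred to above; the function name, the result keys, and the tolerance values are all assumptions. In R itself, the analogous check would live in a package's tests/ directory.

```python
import math

def compare_results(reference, current, rel_tol=1e-8, abs_tol=0.0):
    """Return (key, expected, actual) for every stored result that is
    missing or differs from the reference beyond the tolerance."""
    problems = []
    for key, expected in reference.items():
        if key not in current:
            problems.append((key, expected, None))
        elif not math.isclose(expected, current[key],
                              rel_tol=rel_tol, abs_tol=abs_tol):
            problems.append((key, expected, current[key]))
    return problems

# Hypothetical numbers recorded when the project was first completed.
reference = {"slope": 1.25, "intercept": 0.5}

# Re-run under a new R/package installation: tiny numeric drift passes ...
ok = compare_results(reference, {"slope": 1.25 + 1e-12, "intercept": 0.5})
# ... while a substantive change is reported for investigation.
bad = compare_results(reference, {"slope": 1.26, "intercept": 0.5})
```

Comparing within a tolerance rather than bit-for-bit reflects the point made above that exact reproduction across hardware and OS upgrades is optimistic.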
 .
 .> I just got the "green light" to institute such periodic updates that
 .> I have been arguing are in our collective best interest.  In return,
 .> I promised my best effort to provide a means for preserving or
 .> reverting to a working R library configuration.
 .>
 .> Please note that the reproducibility I am most eager to provide is
 .> limited to reproducibility within the computing environment of our
 .> institute, which perhaps takes away some of the dragon's nests,
 .> though certainly not all.
 .>
 .> There are technical issues of updating package installations on an
 .> NFS mount that might have files/libraries open on it from running R
 .> sessions.  I am interested in learning of approaches for
 .> minimizing/eliminating exposure to these issues as well.  The
 .> first/best approach seems to be to institute a 'black out' period
 .> when users should expect the installed library to change.   Perhaps
 .> there are improvements to this????
 .>
 .> Best,
 .>
 .> Malcolm
 .>
 .> .-----Original Message-----
 .> .From: Mike Marchywka [mailto:marchywka@...]
 .> .Sent: Tuesday, March 05, 2013 5:24 AM
 .> .To: amackey@...; Cook, Malcolm
 .> .Cc: r-devel@...; bioconductor@...; r-discussion@...
 .> .Subject: RE: [Rd] [BioC] enabling reproducible research & R package management &
 .> .install.package.version & BiocLite
 .> .
 .> .I hate to ask what got this thread started but it sounds like someone was counting on
 .> .exact numeric reproducibility or was there a bug in a specific release? In actual
 .> .fact, the best way to determine reproducibility is to run the code in a variety of
 .> .packages. Alternatively, you can do everything in java and not assume
 .> .that calculations commute or associate as the code is modified but it seems
 .> .pointless. Sensitivity determination would seem to lead to more reproducible results
 .> .than trying to keep a specific set of code quirks.
 .> .
 .> .I also seem to recall that FPU may have random lower order bits in some cases,
 .> .same code/data give different results. Always assume FP is stochastic and plan
 .> .on analyzing the "noise."
 .> .
 .> .----------------------------------------
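
The remark about not assuming that calculations associate can be illustrated with ordinary double-precision arithmetic. The following is a small Python sketch added for illustration (the helper name is hypothetical); it shows that regrouping an addition changes the low-order bits, which is the kind of "noise" a sensitivity analysis has to tolerate.

```python
import math

def summation_spread(values):
    """Absolute difference between summing left-to-right and right-to-left."""
    forward = 0.0
    for v in values:
        forward += v
    backward = 0.0
    for v in reversed(values):
        backward += v
    return abs(forward - backward)

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one grouping
right = a + (b + c)   # the other grouping

print(left == right)                              # prints False
print(math.isclose(left, right, rel_tol=1e-12))   # prints True
spread = summation_spread([a, b, c])
```

The two groupings disagree exactly but agree to within a small relative tolerance, so a comparison that expects bit-identical output will flag changes that carry no scientific meaning.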
 .> .> From: amackey@...
 .> .> Date: Mon, 4 Mar 2013 16:28:48 -0500
 .> .> To: MEC@...
 .> .> CC: r-devel@...; r-discussion@...
 .> .> Subject: Re: [Rd] [BioC] enabling reproducible research & R package
 .> .> management & install.package.version & BiocLite
 .> .>
 .> .> On Mon, Mar 4, 2013 at 4:13 PM, Cook, Malcolm <MEC@...> wrote:
 .> .>
 .> .> > * where do the dragons lurk
 .> .> >
 .> .>
 .> .> webs of interconnected dynamically loaded libraries, identical versions of
 .> .> R compiled with different BLAS/LAPACK options, etc. Go with the VM if you
 .> .> really, truly, want this level of exact reproducibility.
 .> .>
 .> .> An alternative (and arguably more useful) strategy would be to cache
 .> .> results of each computational step, and report when results differ upon
 .> .> re-execution with identical inputs; if you cache sessionInfo along with
 .> .> each result, you can identify which package(s) changed, and begin to hunt
 .> .> down why the change occurred (possibly for the better); couple this with
 .> .> the concept of keeping both code *and* results in version control, then you
 .> .> can move forward with a (re)analysis without being crippled by out-of-date
 .> .> software.
 .> .>
 .> .> -Aaron
 .> .>
 .> .> --
 .> .> Aaron J. Mackey, PhD
 .> .> Assistant Professor
 .> .> Center for Public Health Genomics
 .> .> University of Virginia
 .> .> amackey@...
 .> .>
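
The caching strategy described in this message can be sketched as below. This is a hedged Python illustration, not code from the thread: the function names, the in-memory dictionary used as the cache, and the `session` mapping of package names to versions (standing in for R's sessionInfo output) are all assumptions.

```python
import hashlib
import json

def input_key(inputs):
    """Stable hash of a step's inputs (must be JSON-serializable)."""
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def run_cached(cache, step_name, inputs, compute, session):
    """Run one step, compare with the cached result for identical inputs,
    and report which session entries changed if the result differs."""
    key = step_name + ":" + input_key(inputs)
    result = compute(inputs)
    previous = cache.get(key)
    report = None
    if previous is not None and previous["result"] != result:
        old_session = previous["session"]
        changed = sorted(
            name for name in set(old_session) | set(session)
            if old_session.get(name) != session.get(name)
        )
        report = {"step": step_name, "old": previous["result"],
                  "new": result, "changed": changed}
    # Store the latest result together with the environment that produced it.
    cache[key] = {"result": result, "session": dict(session)}
    return result, report

cache = {}
inputs = {"x": [1, 2, 3]}
first, report1 = run_cached(cache, "mean", inputs,
                            lambda i: sum(i["x"]) / len(i["x"]),
                            {"pkgA": "1.0", "pkgB": "2.0"})
# Same inputs, but a (hypothetical) upgraded pkgA now yields another value.
second, report2 = run_cached(cache, "mean", inputs,
                             lambda i: 2.5,
                             {"pkgA": "1.1", "pkgB": "2.0"})
```

Here `report2["changed"]` names the package whose version differs between the two runs, which is the starting point for hunting down why the result changed. Keeping the cache in version control alongside the code, as suggested above, would make that history durable.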
 .> .> [[alternative HTML version deleted]]
 .> .>
 .> .> ______________________________________________
 .> .> R-devel@... mailing list

Bioconductor mailing list
Search the archives:
