Fong Chun Chan | 18 Jan 22:50 2013
Picon
Picon

Re: Efficiently running DEXSeq for Large Cohorts

Hi Alejandro,

Thanks for the reply. You are right in that the warning messages do not
stop the execution of the program.

What kind of speed-up should I be expecting by using the variant functions?
Is the progress report of one '.' representing 100 processed genes still
accurate in this case? Just observing the run-time, it is still taking an
incredibly long time.

Thanks,

Fong

On Thu, Jan 17, 2013 at 5:11 AM, Alejandro Reyes <alejandro.reyes@...>wrote:

>  Dear Fong Chun Chan,
>
> I would definitely recommend to install the newest versions rather than
> sourcing the files!
> However I think this is not the cause of the errors. These warnings come
> from exons where the glm function fails, e.g. when an exon has very high
> variance and an NAs are generated somewhere inside the function. If I am
> correct, these errors do not stop the function from estimating the rest of
> the exon dispersions? If so it should not stop the overall workflow, but
> you would loose that exon from the analysis.
>
> I think my intention was to print warnings instead. I will do it!
>
> Best wishes
> Alejandro
>
>
>  Hi Alejandro,
>
>  I am getting these errors when running the TRT functions:
>
>  [1] "Using the TRT functions ..."
> ..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
 ..............................................................................................................Error
> in qr.default(mm * sqrt(w)) :
>   NA/NaN/Inf in foreign function call (arg 1)
> In addition: There were 50 or more warnings (use warnings() to see the
> first 50)
> .Error in qr.default(mm * sqrt(w)) :
>   NA/NaN/Inf in foreign function call (arg 1)
> In addition: There were 50 or more warnings (use warnings() to see the
> first 50)
>
> ...............................................................................................................................................................................................................................................................................................................................................................................
>
>  Perhaps I am not using the svn version fo DEXseq properly. I am actually
> still calling library(DEXSeq) and then sourcing the following files from
> the svn repository:
>
>  library('statmod')
> source('scripts/TRT.R')
> source('~/bioconductor.svn/DEXSeq/R/core_functions.R')
> source('~/bioconductor.svn/DEXSeq/R/coef_handling.R')
> source('~/bioconductor.svn/DEXSeq/R/visual_funct.R')
>
>  The reason why I was doing this was because it seemed that the order of
> which files to source mattered as it was complaining. So I worked out the
> most parsimonious set of files to source that still work with the current
> DEXSeq library.  Should I just be not bothering to use the current DEXSeq
> library and depend on sourcing all the svn .R files?
>
>  Thanks,
>
>  Fong
>
>
> On Tue, Jan 15, 2013 at 4:05 PM, Fong Chun Chan <fongchun@...>wrote:
>
>> Great. Seems to work now.  Thanks,
>>
>>  Fong
>>
>>
>> On Tue, Jan 15, 2013 at 2:25 PM, Alejandro Reyes <alejandro.reyes@...
>> > wrote:
>>
>>> Dear Fong Chun Chan,
>>>
>>> If you want to source the code, you need to attach the libraries
>>> required directly to your environment. Before calling the TRT functions,
>>> just do:
>>>
>>> library(statmod)
>>>
>>> I think that should work,
>>>
>>> Best regards,
>>> Alejandro
>>>
>>>
>>>
>>>   Hi Alejandro,
>>>>
>>>> Thanks for the reply. I am trying to use the TRT functions from the
>>>> DEXSeq svn repository in Bioconductor. When I tried to run:
>>>>
>>>> source('/bioconductor.svn/DEXSeq/R/TRT.R')
>>>> estimateDispersionsTRT( exonCounSet, nCores = 48 )
>>>>
>>>> I got a series of errors complaining about "glmnb.fit":
>>>>
>>>> In addition: Warning message:
>>>> In FUN(c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L, 100L, 110L,  :
>>>>   Unable to estimate dispersions for ENSG00000187634:ENSE00001631320
>>>> Error : could not find function "glmnb.fit"
>>>> In addition: Warning message:
>>>> In FUN(c(7L, 17L, 27L, 37L, 47L, 57L, 67L, 77L, 87L, 97L, 107L,  :
>>>>   Unable to estimate dispersions for ENSG00000187634:ENSE00001884257
>>>> Error : could not find function "glmnb.fit"
>>>> ....
>>>>
>>>> According to this post,
>>>> http://seqanswers.com/forums/showthread.php?t=16040, glmnb.fit is
>>>> supposed to come with statmod. According to my sessionInfo(), I should have
>>>> the latest versions:
>>>>
>>>> > sessionInfo()
>>>> R version 2.15.2 (2012-10-26)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=en_US.UTF-8  LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] optparse_1.0.0     DEXSeq_1.4.0 Biobase_2.18.0
>>>> BiocGenerics_0.4.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] biomaRt_2.14.0 getopt_1.19.0  hwriter_1.3  RCurl_1.95-3
>>>> statmod_1.4.16
>>>> [6] stringr_0.6.2  tools_2.15.2   XML_3.95-0.1
>>>> >
>>>>
>>>> Am I doing something wrong here? Thanks for your help,
>>>>
>>>> Fong
>>>>
>>>>
>>>>   On Mon, Jan 14, 2013 at 1:44 AM, Alejandro Reyes <
>>>> alejandro.reyes@...
<mailto:alejandro.reyes@...>> wrote:
>>>>
>>>>     Sorry, there was an error in part of my code:
>>>>
>>>>     The "TRT" would be like this,
>>>>
>>>>     data(pasillaExons, package="pasilla")
>>>>
>>>>     pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
>>>>     pasillaExonsTRT <- estimateDispersionsTRT( pasillaExonsTRT )
>>>>     pasillaExonsTRT <- fitDispersionFunction( pasillaExonsTRT )
>>>>     pasillaExonsTRT <- testForDEUTRT( pasillaExonsTRT )
>>>>
>>>>     Bests,
>>>>     Alejandro
>>>>
>>>>
>>>>         Dear Fong Chun Chan,
>>>>
>>>>         Thank you for your interest in DEXSeq and sorry in advance for
>>>>         the long e-mail. We have also noticed that the computing time
>>>>         increases considerably when you have a large number of
>>>>         samples, conditions or number of exons of a gene. For users in
>>>>         these situations, we have implemented a variant of this
>>>>         functions (estimateDispersionsTRT and testForDEUTRT) in the
>>>>         most recent versions of DEXSeq in the svn.
>>>>
>>>>         The difference relies on how the model matrix is prepared, in
>>>>         the "normal" functions, the model matrices used to fit the
>>>>         glms are prepared for each exon, such that each exon bin is
>>>>         treated individually, independently of which exon you are
>>>>         testing. For example, if you have a gene with 5 exons, when
>>>>         testing for exon E001, you would consider independently E002,
>>>>         E003, ... , E005 in the model.
>>>>
>>>>         In the "TRT" implementation the same model matrix is used for
>>>>         all the exons. In the same example as before, you would
>>>>         consider E001 and the sum of all the rest exons of the same
>>>>         gene. This reduces the model and allows to use DEXSeq with a
>>>>         large number of samples. For more clarity, you could try to
>>>>         compare the normal model frame of a gene with the TRT model
>>>> frame:
>>>>
>>>>         data(pasillaExons, package="pasilla")
>>>>         modelFrameForGene(pasillaExons, "FBgn0000256")
>>>>         # vs
>>>>         modelFrameForTRT( pasillaExons )
>>>>
>>>>         Using the same example, in the last model frame, "this" would
>>>>         be the "E001" and "others" would be the sum of E002 + E003 +
>>>>         ... + E005.
>>>>
>>>>         This would be the "normal" DEXSeq analysis:
>>>>
>>>>         pasillaExons <- estimateSizeFactors( pasillaExons )
>>>>         pasillaExons <- estimateDispersions( pasillaExons )
>>>>         pasillaExons <- fitDispersionFunction( pasillaExons )
>>>>         pasillaExons <- testForDEU( pasillaExons )
>>>>
>>>>         This would be the "TRT",
>>>>
>>>>         pasillaExonsTRT <- estimateSizeFactors( pasillaExons )
>>>>         pasillaExonsTRT <- estimateDispersionsTRT( pasillaExons )
>>>>         pasillaExonsTRT <- fitDispersionFunction( pasillaExons )
>>>>         pasillaExonsTRT <- testForDEUTRT( pasillaExons )
>>>>
>>>>         And you can see that you get the same results:
>>>>
>>>>         plot(fData(pasillaExons)$pvalue,
>>>>         fData(pasillaExonsTRT)$pvalue, log="xy")
>>>>
>>>>         I have the "TRT" tried this for large cohorts with complex
>>>>         models and it works nicely and in reasonable computing times.
>>>>
>>>>         Best regards,
>>>>         Alejandro Reyes
>>>>
>>>>         ps. this changes need to be added to the vignette.
>>>>
>>>>
>>>>             Hi all,
>>>>
>>>>             I've been trying to get DEXSeq to run on a fairly large
>>>>             RNA-seq cohort that
>>>>             I have. To be specific, I have 89 samples and I am attempt
>>>>             to generate DE
>>>>             exon usage results on > 500,000 exons.
>>>>
>>>>             I've followed the latest tutorial (1.5.6) on Bioconductor
>>>>             and it so far
>>>>             I've had relatively no problems. It just the two steps
>>>>             that are mentioned,
>>>>             estimateDispersions and testForDEU, are taking a fairly
>>>>             long time. I've
>>>>             already attempted to parallelize this on a 48-core 256GB
>>>>             machine, but I get
>>>>             very little progress on the run-time of these functions.
>>>>
>>>>             I was just wondering if anyone has a good way of running
>>>>             DEXSeq on such a
>>>>             large cohort. Tips on how to reduce run time? Are there
>>>>             way to parallelize
>>>>             these jobs across a cluster rather than rely on a single
>>>>             machine with
>>>>             multi-cores? Any help would be greatly appreciated.
>>>>
>>>>             Thanks,
>>>>
>>>>             Fong
>>>>
>>>>                 [[alternative HTML version deleted]]
>>>>
>>>>             _______________________________________________
>>>>             Bioconductor mailing list
>>>>              Bioconductor@... <mailto:
>>>> Bioconductor@...>
>>>>
>>>>             https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>             Search the archives:
>>>>
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>>         _______________________________________________
>>>>         Bioconductor mailing list
>>>>          Bioconductor@... <mailto:Bioconductor@...>
>>>>
>>>>
>>>>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>         Search the archives:
>>>>
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>

	[[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor@...
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


Gmane