5 Feb 2013 07:01
Re: Log transformation and left censoring
On Tue, Feb 5, 2013 at 11:23 AM, Gordon K Smyth <smyth@...> wrote: > Dear Paul, > > The transformation that you propose is the same transformation that is done > by predFC(y) in the edgeR package, or by cpm(y,log=TRUE) in the > developmental version of the edgeR package. The argument prior.count > controls the moderation amount. > > This is the same transformation that we recommend and use ourselves for > heatmaps. See Section 2.10 of the edgeR User's Guide: > > http://bioconductor.org/packages/2.12/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf > > There is an example of its use on page 58 of the User's Guide. > > Belinda Phipson has shown as part of her PhD work that, under some > assumptions, this transformation comes close to minimizing the mean square > error when predicting the true log fold changes. > > Simply putting these logCPM values into limma will perform comparably to > voom if the library sizes are not very different, provided that you use > eBayes(fit,trend=TRUE). When the library sizes are different, however, voom > is the clear winner. > It's reassuring to see better statisticians than me thinking along similar lines, including using the mean library size. A difference is that the default prior.count is much smaller than I have been using, 0.25. But then in the example in section 2.10, you're a using prior.count of 2. The ideal prior.count for estimating fold changes might be different than the ideal for visualization and clustering, and the ideal for detecting differential expression might be different from the first or both of these. With my large prior.count I'm detecting extra differentially expressed genes, and in particular these are genes with low average expression. Also, an artifact in my heatmaps, which seems to be due to the different sample sizes, is removed. It might be that my data is weird, but these improvements seem to be to do with having a spread of library sizes. > There is no censoring. A major reason for adding an offset (aka > prior.count) to the counts is to avoid the need to censor, truncate or > remove observations. Rather a mononotic transformation of the counts is > performed for each library. > Apologies for being unclear. Yes, from the perspective of untransformed counts, there is no censoring. I was viewing the problem from the perspective of log transformed expression values, which is an odd viewpoint. From this perspective, there's a (soft) line left of which it's impossible estimate the log expression, just because we're using count data (microarrays have a similar line, but for different reasons). Further, where the number of reads is different in different samples, they have different (soft) cutoff lines. Statistical methods that work with count data directly, such as edgeR, shouldn't be affected by this, but when feeding log transformed values into limma it is a potential problem. I definitely wouldn't want to remove low counts, because I'm then likely to miss those genes with such a large fold change that in one group the reads are almost entirely absent, and these are likely to be exactly the genes I am looking for. So I want to (softly) truncate low counts at a consistent point before log transforming them. One downside: even if this works for straightforward differential expression, it will tend to see interactions that don't actually exist. Voom's weighting system might mitigate this, and it would be an interesting demonstration of voom's power. Hmm. What I actually want is a test that works on counts directly, with empirical priors on coefficients and the dispersion. Um. Assuming we have model0 and model1, and some reasonable priors on the parameters in these models. Say for the count data for a gene, x, we have a way to compute P(x|model0) and P(x|model1). Use P(x|model1) as a test statistic, the significance is the likelihood that P(x2|model1) >= P(x|model1) for x2 sampled from model0. We can generate a few million samples from model0 to give a distribution to test against, only need to do this once, so it's computationally feasible. For actually computing P(x|model): P(x | coefficients, dispersion) is straightforward, and assume we have priors P(coefficients | dispersion), P(dispersion). Would need to numerically integrate over dispersion, I think. For a range of dispersion values find the MAP for the coefficients then use an exp(2nd order Taylor expansion) approximation around that to integrate P(x | coefficients, dispersion) P(coefficients | dispersion) down to P(x | dispersion). Then sum P(x | dispersion) P(dispersion) to get P(x). Tweak emprical priors and repeat a few times. It wouldn't exactly be fast, but it wouldn't be unworkably slow. -- -- Paul Harrison Victorian Bioinformatics Consortium / Monash University _______________________________________________ Bioconductor mailing list Bioconductor@... https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
RSS Feed