23 Nov 04:27 2012

## How to design matrix on edgeR to study genotype x environmental interaction

Dear Daniela,

I think you would be very well advised to seek out a statistical bioinformatician with whom you can collaborate on an ongoing basis. A GxE anova analysis would be statistically sophisticated even if you were analysing a simple univariate phenotypic trait. Attempting that sort of analysis in the context of an RNA-Seq experiment on miRNAs is far more difficult again.

The design matrices you have created may be correct, but that's just the start of the analysis, and there are many layers of possible complexity.

The BCV in your experiment is so large that I feel there must be quality issues with your data that you have not successfully dealt with. It seems very likely, for example, that there are batch effects that you have not yet described.

To answer some specific questions:

You might be better off with prior.df=10 instead of the default, but this has little to do with the size of the BCV.

You ask why one variety and one stage are disappearing from your design matrix. If you omit the "0+" in the first formula (and you should), you will find that one vineyard will disappear as well. This is because the number of contrasts for any factor must be one less than the number of levels. This is a very fundamental feature of factors and model formulas that you need to become familiar with before you can make sense of any model formula.

Your email makes no mention of library sizes or sequencing depths, but obviously that has a fundamental effect on what is significantly different from what.

I think you know now how to use edgeR in principle. However, as you probably already appreciate, deciding what is the right analysis for your data is beyond the scope of the mailing list.

Best wishes
Gordon

On Thu, 22 Nov 2012, bioconductor-request@... wrote:

> Date: Thu, 22 Nov 2012 10:07:19 +0100
> From: Daniela Lopes Paim Pinto <d.lopespaimpinto@...>
> To: bioconductor@...
> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x environmental interaction
>
> Dear Gordon,
>
> Thank you so much for your valuable input. I took some time to study a bit more so that I could consider all the aspects you pointed out. I have now reconsidered the analysis and started again, with the data exploration of all 48 samples.
>
> First I filtered out the low-count reads, keeping just the ones with more than 1 cpm in at least 2 libraries (I have two replicates of each library). The MDS plot clearly separates one of the locations from the other two (dimension 1) and, with less distinction, the two varieties (dimension 2). The stages also seem to be separated into two groups (the first two together and apart from the last two) but, like the varieties, not so distinctly. The two replicates are also consistent.
>
> With the BCV plot I could observe that reads with lower logCPM have bigger BCV (the BCV value was equal to 0.5941), and then comes my first question:
>
> Should I choose a prior.df different from the default, due to this behavior, when estimating genewise dispersion?
>
> To proceed with the DE analysis, I tried two approaches, this time with all the 48 samples, as suggested.
> For both approaches, I have the following data frame:
>
> > target
>    Sample Vineyard Variety Stage
> 1       1     mont      CS    ps
> 2       2     mont      CS    ps
> 3       4     mont      CS    bc
> 4       5     mont      CS    bc
> 5       7     mont      CS   19b
> 6       8     mont      CS   19b
> 7      10     mont      CS    hv
> 8      11     mont      CS    hv
> 9      13     mont      SG    ps
> 10     14     mont      SG    ps
> 11     16     mont      SG    bc
> 12     17     mont      SG    bc
> 13     19     mont      SG   19b
> 14     20     mont      SG   19b
> 15     22     mont      SG    hv
> 16     23     mont      SG    hv
> 17     25      Bol      CS    ps
> 18     26      Bol      CS    ps
> 19     28      Bol      CS    bc
> 20     29      Bol      CS    bc
> 21     31      Bol      CS   19b
> 22     32      Bol      CS   19b
> 23     34      Bol      CS    hv
> 24     35      Bol      CS    hv
> 25     37      Bol      SG    ps
> 26     38      Bol      SG    ps
> 27     40      Bol      SG    bc
> 28     41      Bol      SG    bc
> 29     43      Bol      SG   19b
> 30     44      Bol      SG   19b
> 31     46      Bol      SG    hv
> 32     47      Bol      SG    hv
> 33     49      Ric      CS    ps
> 34     50      Ric      CS    ps
> 35     52      Ric      CS    bc
> 36     53      Ric      CS    bc
> 37     55      Ric      CS   19b
> 38     56      Ric      CS   19b
> 39     58      Ric      CS    hv
> 40     59      Ric      CS    hv
> 41     61      Ric      SG    ps
> 42     62      Ric      SG    ps
> 43     64      Ric      SG    bc
> 44     65      Ric      SG    bc
> 45     67      Ric      SG   19b
> 46     68      Ric      SG   19b
> 47     70      Ric      SG    hv
> 48     71      Ric      SG    hv
>
> At first, I used the full interaction formula, as in the following code:
>
> > DGElist <- DGEList(counts=file)
> > keep <- rowSums(cpm(DGElist) > 1) >= 2
> > DGElist <- DGElist[keep,]
> > DGElist$samples$lib.size <- colSums(DGElist$counts)
> > DGElist_norm <- calcNormFactors(DGElist)
> > design <- model.matrix(~0 + Vineyard + Variety + Stage + Vineyard:Variety + Vineyard:Stage + Variety:Stage + Vineyard:Variety:Stage, data=target)
>
> [or even
> > design <- model.matrix(~0 + Vineyard*Variety*Stage, data=target)
> which gives the same result]
>
> > rownames(design) <- colnames(DGElist_norm)
>
> However, when I print the design I see that one Variety (i.e., CS) and one Stage (i.e., 19b) are not present in the design matrix, as individual effects or even in the interactions.
>
> Then I moved on to the second approach, in which I create groups:
>
> > group <- factor(paste(target$Vineyard,target$Variety,target$Stage,sep="_"))
> > cbind(target,Group=group)
> > DGElist <- DGEList(counts=file,group=group)
> > keep <- rowSums(cpm(DGElist) > 1) >= 2
> > DGElist <- DGElist[keep,]
> > DGElist$samples$lib.size <- colSums(DGElist$counts)
> > DGElist_norm <- calcNormFactors(DGElist)
> > design <- model.matrix(~0+group, data=DGElist_norm$samples)
> > colnames(design) <- levels(group)
>
> The design matrix in this case includes all the groups, and then I proceed with:
>
> > commondisp <- estimateGLMCommonDisp(DGElist_norm, design, verbose=TRUE)
> Disp = 0.35294 , BCV = 0.5941
> > trenddisp <- estimateGLMTrendedDisp(commondisp, design)
> > tagwisedisp <- estimateGLMTagwiseDisp(trenddisp, design)
> > fit <- glmFit(tagwisedisp, design)
> > my.contrasts <- makeContrasts(CS_ps_BolvsMont = Bol_CS_ps-mont_CS_ps, CS_ps_BolvsRic = Bol_CS_ps-Ric_CS_ps, Bol_ps_CSvsSG = Bol_CS_ps-Bol_SG_ps, levels=design) # Just some examples of the contrasts I am interested in.
> > lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_ps_BolvsMont"])
>
> With this code I got results, but I am afraid that they are not very consistent with the data. To give one example, the DE results tell me that a given miRNA which has 0 and 1 reads respectively in the two replicates of one sample is significantly different when compared with another sample in which this miRNA has 5 and 10 reads in the two replicates respectively, but in the same set of results another miRNA which has 4259 and 2198 reads respectively in the two replicates of one sample is not significantly different when compared with the other sample in which this miRNA has 352 and 599 reads respectively in the two replicates. In other words, 0 and 1 are significantly different from 5 and 10 but 4259 and 2198 are not significantly different from 352 and 599.
> With these comparisons, I am just trying to interpret my data based on these results.
>
> I know that the test for differential expression is not made on the raw reads, but I do not know exactly how it is made. In any case, I expect that if I used the correct model to describe my data, the results will describe the differences consistently.
>
> Could you make any suggestions about my analysis? Is creating the groups as I showed above correct for testing all the interactions? Is there any explanation for the fact that one variety and one stage "disappear" from the design matrix when using the full interaction formula?
>
> Sorry for the long email and thank you for all the advice,
>
> Best wishes
>
> Daniela Lopes Paim Pinto
> PhD student - Agrobiosciences
> Scuola Superiore Sant'Anna, Italy
>
> > sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] edgeR_3.0.3 limma_3.14.1
>
> loaded via a namespace (and not attached):
> [1] tools_2.15.2
>
> 2012/11/11 Gordon K Smyth <smyth@...>
>
>> Dear Daniela,
>>
>> What version of edgeR are you using? The posting guide asks you to give sessionInfo() output so we can see package versions.
>>
>> Your code looks correct for testing an interaction, although you could estimate the same interaction more directly using an interaction formula as in Section 3.3.4 of the edgeR User's Guide.
>>
>> However the model you have used is correct only if all 12 samples correspond to the same physiological stage. I wonder why you are not analysing all the 48 samples together.
>> I would start with data exploration of all 48 samples, including exploration measures like transcript filtering, library sizes, normalization factors, an MDS plot, a BCV plot, and so on. The first step is to check the data quality before going on to test for differential expression.
>>
>> edgeR has very high statistical power, even giving p-values smaller than I would like in some cases. So if you're not getting any differential expression, it is because there is none or because you have data quality problems.
>>
>> Best wishes
>> Gordon
>>
>> Date: Fri, 9 Nov 2012 14:44:28 +0100
>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto@...>
>>> To: bioconductor@...
>>> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x environmental interaction
>>>
>>> Dear Gordon,
>>>
>>> Thank you so much for the reference. I read the whole chapter on models and I tried to set up the following code, considering a data frame like this:
>>>
>>> > target
>>>    Sample Variety Location
>>> 1       1      CS     Mont
>>> 2       2      CS     Mont
>>> 3      25      CS      Bol
>>> 4      26      CS      Bol
>>> 5      49      CS      Ric
>>> 6      50      CS      Ric
>>> 7      13      SG     Mont
>>> 8      14      SG     Mont
>>> 9      37      SG      Bol
>>> 10     38      SG      Bol
>>> 11     61      SG      Ric
>>> 12     62      SG      Ric
>>>
>>> > group <- factor(paste(target$Variety,target$Location,sep="_"))
>>> > cbind(target,Group=group)
>>> > d <- DGEList(counts=file,group=group)
>>> > DGEnorm <- calcNormFactors(d)
>>> > design <- model.matrix(~0+group, data=DGEnorm$samples)
>>> > colnames(design) <- levels(group)
>>>
>>> Which gave me the design matrix:
>>>
>>> > design
>>>           CS_Bol CS_Mont CS_Ric SG_Bol SG_Mont SG_Ric
>>> CS_Mont        0       1      0      0       0      0
>>> CS_Mont.1      0       1      0      0       0      0
>>> CS_Bol         1       0      0      0       0      0
>>> CS_Bol.1       1       0      0      0       0      0
>>> CS_Ric         0       0      1      0       0      0
>>> CS_Ric.1       0       0      1      0       0      0
>>> SG_Mont        0       0      0      0       1      0
>>> SG_Mont.1      0       0      0      0       1      0
>>> SG_Bol         0       0      0      1       0      0
>>> SG_Bol.1       0       0      0      1       0      0
>>> SG_Ric         0       0      0      0       0      1
>>> SG_Ric.1       0       0      0      0       0      1
>>> attr(,"assign")
>>> [1] 1 1 1 1 1 1
>>> attr(,"contrasts")
>>> attr(,"contrasts")$group
>>> [1] "contr.treatment"
>>>
>>> And then I estimated the trended and tagwise dispersions and fitted the model:
>>>
>>> > disp.tren <- estimateGLMTrendedDisp(DGEnorm,design)
>>> > disp.tag <- estimateGLMTagwiseDisp(disp.tren,design)
>>> > fit <- glmFit(disp.tag,design)
>>>
>>> When I made some contrasts to find DE miRNAs, for example:
>>>
>>> > my.contrasts <- makeContrasts(CS_BolvsMont = CS_Bol-CS_Mont, CSvsSG_BolvsMont = (CS_Bol-CS_Mont)-(SG_Bol-SG_Mont), levels=design)
>>> > lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_BolvsMont"])
>>>
>>> I expected to find DE miRNAs due to the environment effect (CS_BolvsMont) and, for example, DE miRNAs due to the genotype x environment interaction (CSvsSG_BolvsMont).
>>>
>>> However, the results do not seem to reflect this, since I did not get even a single DE miRNA with significant FDR (not even below 20%!) and, going back to the counts in the raw data, I find reasonable differences in their expression, which was expected. I forgot to mention that I decided to consider each stage separately and not add one more factor to the model, since I am not interested, for the moment, in the time course (as I wrote in the previous email - see below).
>>>
>>> Could you (or anybody else on the list) give me some advice regarding the code? Is this matrix appropriate for the kind of comparisons I am interested in?
>>>
>>> Thank you in advance for any input.
>>>
>>> Daniela
>>>
>>> 2012/10/30 Gordon K Smyth <smyth@...>
>>>
>>>> Dear Daniela,
>>>>
>>>> edgeR can work with any design matrix. Just set up your interaction model using a standard R model formula.
>>>> See for example Chapter 11 of:
>>>>
>>>> http://cran.r-project.org/doc/manuals/R-intro.pdf
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>> Date: Mon, 29 Oct 2012 16:24:31 +0100
>>>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto@...>
>>>>> To: bioconductor@...
>>>>> Subject: [BioC] How to design matrix on edgeR to study genotype x environmental interaction
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I'm currently working with data coming from deep sequencing of 48 small RNA libraries and using edgeR to identify DE miRNAs. I could not figure out how to design my matrix for the following experimental design:
>>>>>
>>>>> I have 2 varieties (genotypes), cultivated in 3 different locations (environments) and collected at 4 physiological stages. None of them represents a control treatment. I'm particularly interested in identifying those miRNAs which modulate their expression depending on genotype (G), environment (E) and G x E interaction. For instance, the same variety in the 3 different locations, both varieties in the same location, and both varieties in the 3 different locations.
>>>>>
>>>>> I was wondering if I could use section 3.3 of the edgeR user guide as a reference, or if someone could suggest any other alternative method.
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Daniela
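
The point in the most recent reply about factor levels "disappearing" can be checked with base R alone, without edgeR. The following is a minimal sketch, assuming the 2 x 3 x 4 layout with two replicates described in the thread; the object names (`design_noint`, `design_int`, `design_group`) and the explicit level order are illustrative assumptions, not taken from the original posts. It builds the three codings discussed above and compares their columns.

```r
# Rebuild the layout from the thread: 3 vineyards x 2 varieties x 4 stages,
# two replicates per cell, 48 samples in total.
target <- expand.grid(
  Replicate = 1:2,
  Stage     = c("ps", "bc", "19b", "hv"),
  Variety   = c("CS", "SG"),
  Vineyard  = c("mont", "Bol", "Ric"),
  stringsAsFactors = FALSE
)

# Assumption: set the level order explicitly so the baseline (first) level
# of each factor does not depend on the locale's sort order.
target$Vineyard <- factor(target$Vineyard, levels = c("Bol", "mont", "Ric"))
target$Variety  <- factor(target$Variety,  levels = c("CS", "SG"))
target$Stage    <- factor(target$Stage,    levels = c("19b", "bc", "hv", "ps"))

# Full factorial without an intercept: only the first factor in the formula
# (Vineyard) keeps one column per level; the others use treatment contrasts.
design_noint <- model.matrix(~ 0 + Vineyard * Variety * Stage, data = target)

# Full factorial with an intercept: every factor loses its baseline level.
design_int <- model.matrix(~ Vineyard * Variety * Stage, data = target)

# One column per Vineyard_Variety_Stage group, as in the second approach.
group <- factor(paste(target$Vineyard, target$Variety, target$Stage, sep = "_"))
design_group <- model.matrix(~ 0 + group)
colnames(design_group) <- levels(group)

# Number of columns in each coding.
print(sapply(list(noint = design_noint, int = design_int, group = design_group), ncol))

# Rank of the two codings side by side; if it equals the column count of
# either one, they span the same space and describe the same fitted model.
print(qr(cbind(design_int, design_group))$rank)

# Column names that mention the baseline variety or stage (none expected).
print(grep("CS|19b", colnames(design_noint), value = TRUE))
```

Under these assumptions all three matrices have one column per group (24), so they are reparametrisations of the same model. What changes between them is the meaning of each coefficient, and therefore which contrasts are needed to test a given comparison. The baseline levels (CS and 19b here) are absorbed into the remaining columns rather than lost.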