14 Mar 2013 22:49
Re: yieldSize and paired end data in SummarizeOverlaps error?
Valerie Obenchain <vobencha@...>
2013-03-14 21:49:15 GMT
2013-03-14 21:49:15 GMT
Hi Tom, Currently readBamGappedAlignmentPairs() reads all data from a Bam file into memory to sort and determine proper pairs. This is a problem for large files and is the reason for the error you see below. We've been working on alternative approach that involves first sorting the Bam file by qname. Once sorted, the file can be read in chunks specified by 'yieldSize' in the BamFile object. This functionality is available in GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work in progress and I have not yet implemented a check to ensure the file is sorted by qname. If you try this out, please be sure to sort your Bam file by qname first. Examples of this approach are on the summarizeOverlaps man page for the BamFile method, ?'summarizeOverlaps,GRanges,BamFileList-method' Here is a piece of the example: library(pasillaBamSubset) library("TxDb.Dmelanogaster.UCSC.dm3.ensGene") exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene") ## paired-end sorted by qname: ## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname ## can be read in chunks with 'yieldSize'. fl <- untreated3_chr4() sortfl <- sortBam(fl, tempfile(), byQname=TRUE) bf2 <- BamFileList(sortfl, index=character(0), yieldSize=50000, obeyQname=TRUE) se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE) counts2 <- assays(se2)$counts ## paired-end not sorted: ## If the file is not sorted by qname, all records are read ## into memory for sorting and to determine proper pairs. ## Any 'yieldSize' set on the BamFile will be ignored. fl <- untreated3_chr4() bf3 <- BamFileList(fl) se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE) counts3 <- assays(se3)$counts Another recent feature addition is the GALignmentsList class. The goal here is to provide a container that holds any type of read, single-end, paired-end, singletons, multiple fragments etc. Again, the methods are based on the user providing a Bam file sorted by qname. Once read in, the records are split on read id (QNAME in SAM/BAM). The associated read functions are in GenomicRanges and Rsamtools. ?readGAlignmentsList (GenomicRanges) ?readBamGAlignmentsList (Rsamtools) Let me know if you run into problems. Valerie On 03/13/2013 11:07 AM, Thomas Whisenant wrote: > Hello all, > I am attempting to use summarizeOverlaps to assign counts to exons using paired end Tophat alignments. As an example, I have created a bamfilelist containing one bamfile and a feature list of exons by gene. This is followed by using summarizeOverlaps with the single end parameter set to false. > > bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000) > exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene") > all.exons <- unlist(exons.by.gene) > sumexp <- summarizeOverlaps( > features=all.exons, reads=bfl, > mode=IntersectionNotEmpty, > singleEnd=FALSE, > ignore.strand=TRUE, parallel=TRUE) > Error: cannot allocate vector of size 277.9 Mb > In addition: Warning message: > In readBamGappedAlignmentPairs(bf, param = param) : 'yieldSize' set to 'NA' > > Even with the parallel option, this process uses all available memory which presumably leads to the vector allocation error. While I have set the yieldSize in the BamFileList, this parameter does not appear to be used. Running summarizeOverlaps with "singleEnd=TRUE" does not issue a yieldSize warning. Any ideas on how to get set the yieldSize with "singleEnd=FALSE"? > Thanks, > Tom > > > R version 2.15.2 (2012-10-26) > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > > > locale: > >  LC_COLLATE=English_United States.1252 > >  LC_CTYPE=English_United States.1252 > >  LC_MONETARY=English_United States.1252 > >  LC_NUMERIC=C > >  LC_TIME=English_United States.1252 > > > > attached base packages: > >  parallel stats graphics grDevices utils datasets methods > >  base > > > > other attached packages: > >  DEXSeq_1.4.0 > >  Rsamtools_1.10.2 > >  org.Hs.eg.db_2.8.0 > >  RSQLite_0.11.2 > >  DBI_0.2-5 > >  TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0 > >  GenomicFeatures_1.10.1 > >  BSgenome.Hsapiens.UCSC.hg19_1.3.19 > >  BSgenome_1.26.1 > >  Biostrings_2.26.3 > >  GenomicRanges_1.10.6 > >  IRanges_1.16.5 > >  annotate_1.36.0 > >  AnnotationDbi_1.20.3 > >  Biobase_2.18.0 > >  BiocGenerics_0.4.0 > > > > loaded via a namespace (and not attached): > >  biomaRt_2.14.0 bitops_1.0-5 hwriter_1.3 > >  RCurl_1.95-3 rtracklayer_1.18.2 statmod_1.4.17 > >  stats4_2.15.2 stringr_0.6.2 tools_2.15.2 > >  XML_3.95-0.1 xtable_1.7-0 zlibbioc_1.4.0 > > > > Thomas Whisenant, Ph.D. > Salomon Lab, MEM-241 > Department of Molecular and Experimental Medicine > The Scripps Research Institute > 10550 North Torrey Pines Rd. > La Jolla, CA 92037 > thomasw@...<mailto:thomasw@...> > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@... > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ Bioconductor mailing list Bioconductor@... https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor