14 Mar 2013 23:30
Re: yieldSize and paired end data in SummarizeOverlaps error?
Michael Lawrence <lawrence.michael@...>
2013-03-14 22:30:06 GMT
2013-03-14 22:30:06 GMT
This sounds interesting. Just want to make sure: does this ensure that all alignments for a given QNAME fall within one chunk? What happens if the yieldsize is too small to achieve that? And of course the drawback here is that a QNAME-sorted BAM is not indexable. So in practice one would need two BAMs, one sorted by POS, the other by QNAME. What about this: preprocess a POS-sorted (and thus indexable) BAM and add a flag indicating whether an alignment is the last alignment for a given QNAME. The scanner then yields all complete QNAMEs after it has scanned 'yieldSize' such flags. I realize this is more work on your part, but maybe it saves the hassle of multiple BAMs? Just an idea. Michael On Thu, Mar 14, 2013 at 2:49 PM, Valerie Obenchain <vobencha@...>wrote: > Hi Tom, > > Currently readBamGappedAlignmentPairs() reads all data from a Bam file > into memory to sort and determine proper pairs. This is a problem for large > files and is the reason for the error you see below. > > We've been working on alternative approach that involves first sorting the > Bam file by qname. Once sorted, the file can be read in chunks specified by > 'yieldSize' in the BamFile object. This functionality is available in > GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work in progress and I > have not yet implemented a check to ensure the file is sorted by qname. If > you try this out, please be sure to sort your Bam file by qname first. > > Examples of this approach are on the summarizeOverlaps man page for the > BamFile method, > > ?'summarizeOverlaps,GRanges,**BamFileList-method' > > Here is a piece of the example: > > library(pasillaBamSubset) > library("TxDb.Dmelanogaster.**UCSC.dm3.ensGene") > exbygene <- exonsBy(TxDb.Dmelanogaster.**UCSC.dm3.ensGene, "gene") > > ## paired-end sorted by qname: > ## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname > ## can be read in chunks with 'yieldSize'. > fl <- untreated3_chr4() > sortfl <- sortBam(fl, tempfile(), byQname=TRUE) > bf2 <- BamFileList(sortfl, index=character(0), > yieldSize=50000, obeyQname=TRUE) > se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE) > counts2 <- assays(se2)$counts > > ## paired-end not sorted: > ## If the file is not sorted by qname, all records are read > ## into memory for sorting and to determine proper pairs. > ## Any 'yieldSize' set on the BamFile will be ignored. > fl <- untreated3_chr4() > bf3 <- BamFileList(fl) > se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE) > counts3 <- assays(se3)$counts > > > Another recent feature addition is the GALignmentsList class. The goal > here is to provide a container that holds any type of read, single-end, > paired-end, singletons, multiple fragments etc. Again, the methods are > based on the user providing a Bam file sorted by qname. Once read in, the > records are split on read id (QNAME in SAM/BAM). The associated read > functions are in GenomicRanges and Rsamtools. > > ?readGAlignmentsList (GenomicRanges) > ?readBamGAlignmentsList (Rsamtools) > > Let me know if you run into problems. > > Valerie > > > > > > > On 03/13/2013 11:07 AM, Thomas Whisenant wrote: > >> Hello all, >> I am attempting to use summarizeOverlaps to assign counts to exons using >> paired end Tophat alignments. As an example, I have created a bamfilelist >> containing one bamfile and a feature list of exons by gene. This is >> followed by using summarizeOverlaps with the single end parameter set to >> false. >> >> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000) >> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene") >> all.exons <- unlist(exons.by.gene) >> sumexp <- summarizeOverlaps( >> features=all.exons, reads=bfl, >> mode=IntersectionNotEmpty, >> singleEnd=FALSE, >> ignore.strand=TRUE, parallel=TRUE) >> Error: cannot allocate vector of size 277.9 Mb >> In addition: Warning message: >> In readBamGappedAlignmentPairs(**bf, param = param) : 'yieldSize' set to >> 'NA' >> >> Even with the parallel option, this process uses all available memory >> which presumably leads to the vector allocation error. While I have set the >> yieldSize in the BamFileList, this parameter does not appear to be used. >> Running summarizeOverlaps with "singleEnd=TRUE" does not issue a yieldSize >> warning. Any ideas on how to get set the yieldSize with "singleEnd=FALSE"? >> Thanks, >> Tom >> >> >> R version 2.15.2 (2012-10-26) >> >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> >> >> locale: >> >>  LC_COLLATE=English_United States.1252 >> >>  LC_CTYPE=English_United States.1252 >> >>  LC_MONETARY=English_United States.1252 >> >>  LC_NUMERIC=C >> >>  LC_TIME=English_United States.1252 >> >> >> >> attached base packages: >> >>  parallel stats graphics grDevices utils datasets methods >> >>  base >> >> >> >> other attached packages: >> >>  DEXSeq_1.4.0 >> >>  Rsamtools_1.10.2 >> >>  org.Hs.eg.db_2.8.0 >> >>  RSQLite_0.11.2 >> >>  DBI_0.2-5 >> >>  TxDb.Hsapiens.UCSC.hg19.**knownGene_2.8.0 >> >>  GenomicFeatures_1.10.1 >> >>  BSgenome.Hsapiens.UCSC.hg19_1.**3.19 >> >>  BSgenome_1.26.1 >> >>  Biostrings_2.26.3 >> >>  GenomicRanges_1.10.6 >> >>  IRanges_1.16.5 >> >>  annotate_1.36.0 >> >>  AnnotationDbi_1.20.3 >> >>  Biobase_2.18.0 >> >>  BiocGenerics_0.4.0 >> >> >> >> loaded via a namespace (and not attached): >> >>  biomaRt_2.14.0 bitops_1.0-5 hwriter_1.3 >> >>  RCurl_1.95-3 rtracklayer_1.18.2 statmod_1.4.17 >> >>  stats4_2.15.2 stringr_0.6.2 tools_2.15.2 >> >>  XML_3.95-0.1 xtable_1.7-0 zlibbioc_1.4.0 >> >> >> >> Thomas Whisenant, Ph.D. >> Salomon Lab, MEM-241 >> Department of Molecular and Experimental Medicine >> The Scripps Research Institute >> 10550 North Torrey Pines Rd. >> La Jolla, CA 92037 >> thomasw@...<mailto:tho**masw@... <thomasw@...>> >> >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@... >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor> >> >> > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@... > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor> > [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@... https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor