Michael Lawrence | 14 Mar 23:30 2013

Re: yieldSize and paired end data in SummarizeOverlaps error?

This sounds interesting. Just want to make sure: does this ensure that all
alignments for a given QNAME fall within one chunk? What happens if the
yieldsize is too small to achieve that?

And of course the drawback here is that a QNAME-sorted BAM is not
indexable. So in practice one would need two BAMs, one sorted by POS, the
other by QNAME.

What about this: preprocess a POS-sorted (and thus indexable) BAM and add a
flag indicating whether an alignment is the last alignment for a given
QNAME. The scanner then yields all complete QNAMEs after it has scanned
'yieldSize' such flags. I realize this is more work on your part, but maybe
it saves the hassle of multiple BAMs? Just an idea.

Michael

On Thu, Mar 14, 2013 at 2:49 PM, Valerie Obenchain <vobencha@...>wrote:

> Hi Tom,
>
> Currently readBamGappedAlignmentPairs() reads all data from a Bam file
> into memory to sort and determine proper pairs. This is a problem for large
> files and is the reason for the error you see below.
>
> We've been working on alternative approach that involves first sorting the
> Bam file by qname. Once sorted, the file can be read in chunks specified by
> 'yieldSize' in the BamFile object. This functionality is available in
> GenomicRanges 1.11.38 and Rsamtools 1.11.25. It is a work in progress and I
> have not yet implemented a check to ensure the file is sorted by qname. If
> you try this out, please be sure to sort your Bam file by qname first.
>
> Examples of this approach are on the summarizeOverlaps man page for the
> BamFile method,
>
> ?'summarizeOverlaps,GRanges,**BamFileList-method'
>
> Here is a piece of the example:
>
> library(pasillaBamSubset)
> library("TxDb.Dmelanogaster.**UCSC.dm3.ensGene")
> exbygene <- exonsBy(TxDb.Dmelanogaster.**UCSC.dm3.ensGene, "gene")
>
> ## paired-end sorted by qname:
> ## Set 'singleEnd' to 'FALSE'. A BAM file sorted by qname
> ## can be read in chunks with 'yieldSize'.
> fl <- untreated3_chr4()
> sortfl <- sortBam(fl, tempfile(), byQname=TRUE)
> bf2 <- BamFileList(sortfl, index=character(0),
>                    yieldSize=50000, obeyQname=TRUE)
> se2 <- summarizeOverlaps(exbygene, bf2, singleEnd=FALSE)
> counts2 <- assays(se2)$counts
>
> ## paired-end not sorted:
> ## If the file is not sorted by qname, all records are read
> ## into memory for sorting and to determine proper pairs.
> ## Any 'yieldSize' set on the BamFile will be ignored.
> fl <- untreated3_chr4()
> bf3 <- BamFileList(fl)
> se3 <- summarizeOverlaps(exbygene, bf3, singleEnd=FALSE)
> counts3 <- assays(se3)$counts
>
>
> Another recent feature addition is the GALignmentsList class. The goal
> here is to provide a container that holds any type of read, single-end,
> paired-end, singletons, multiple fragments etc. Again, the methods are
> based on the user providing a Bam file sorted by qname. Once read in, the
> records are split on read id (QNAME in SAM/BAM). The associated read
> functions are in GenomicRanges and Rsamtools.
>
> ?readGAlignmentsList (GenomicRanges)
> ?readBamGAlignmentsList (Rsamtools)
>
> Let me know if you run into problems.
>
> Valerie
>
>
>
>
>
>
> On 03/13/2013 11:07 AM, Thomas Whisenant wrote:
>
>> Hello all,
>> I am attempting to use summarizeOverlaps to assign counts to exons using
>> paired end Tophat alignments. As an example, I have created a bamfilelist
>> containing one bamfile and a feature list of exons by gene. This is
>> followed by using summarizeOverlaps with the single end parameter set to
>> false.
>>
>> bfl <- BamFileList("EP04.bam", index="EP04.bam", yieldSize=1000000)
>> exons.by.gene <- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene")
>> all.exons <- unlist(exons.by.gene)
>> sumexp <- summarizeOverlaps(
>>    features=all.exons, reads=bfl,
>>   mode=IntersectionNotEmpty,
>>    singleEnd=FALSE,
>>    ignore.strand=TRUE, parallel=TRUE)
>> Error: cannot allocate vector of size 277.9 Mb
>> In addition: Warning message:
>> In readBamGappedAlignmentPairs(**bf, param = param) : 'yieldSize' set to
>> 'NA'
>>
>> Even with the parallel option, this process uses all available memory
>> which presumably leads to the vector allocation error. While I have set the
>> yieldSize in the BamFileList, this parameter does not appear to be used.
>> Running summarizeOverlaps  with "singleEnd=TRUE" does not issue a yieldSize
>> warning. Any ideas on how to get set the yieldSize with "singleEnd=FALSE"?
>> Thanks,
>> Tom
>>
>>
>> R version 2.15.2 (2012-10-26)
>>
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>
>>
>>
>> locale:
>>
>> [1] LC_COLLATE=English_United States.1252
>>
>> [2] LC_CTYPE=English_United States.1252
>>
>> [3] LC_MONETARY=English_United States.1252
>>
>> [4] LC_NUMERIC=C
>>
>> [5] LC_TIME=English_United States.1252
>>
>>
>>
>> attached base packages:
>>
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>
>> [8] base
>>
>>
>>
>> other attached packages:
>>
>>   [1] DEXSeq_1.4.0
>>
>>   [2] Rsamtools_1.10.2
>>
>>   [3] org.Hs.eg.db_2.8.0
>>
>>   [4] RSQLite_0.11.2
>>
>>   [5] DBI_0.2-5
>>
>>   [6] TxDb.Hsapiens.UCSC.hg19.**knownGene_2.8.0
>>
>>   [7] GenomicFeatures_1.10.1
>>
>>   [8] BSgenome.Hsapiens.UCSC.hg19_1.**3.19
>>
>>   [9] BSgenome_1.26.1
>>
>> [10] Biostrings_2.26.3
>>
>> [11] GenomicRanges_1.10.6
>>
>> [12] IRanges_1.16.5
>>
>> [13] annotate_1.36.0
>>
>> [14] AnnotationDbi_1.20.3
>>
>> [15] Biobase_2.18.0
>>
>> [16] BiocGenerics_0.4.0
>>
>>
>>
>> loaded via a namespace (and not attached):
>>
>>   [1] biomaRt_2.14.0     bitops_1.0-5       hwriter_1.3
>>
>>   [4] RCurl_1.95-3       rtracklayer_1.18.2 statmod_1.4.17
>>
>>   [7] stats4_2.15.2      stringr_0.6.2      tools_2.15.2
>>
>> [10] XML_3.95-0.1       xtable_1.7-0       zlibbioc_1.4.0
>>
>>
>>
>> Thomas Whisenant, Ph.D.
>> Salomon Lab, MEM-241
>> Department of Molecular and Experimental Medicine
>> The Scripps Research Institute
>> 10550 North Torrey Pines Rd.
>> La Jolla, CA 92037
>>
thomasw@...<mailto:tho**masw@... <thomasw@...>>
>>
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@...
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives: http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
> ______________________________**_________________
> Bioconductor mailing list
> Bioconductor@...
> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives: http://news.gmane.org/gmane.**
> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>

	[[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor@...
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


Gmane