24 Jan 01:45 2013
Re: Using summarizeOverlaps with multiple samples/readgroups in a single bam file?
Martin Morgan <mtmorgan@...>
2013-01-24 00:45:57 GMT
On 01/23/2013 04:00 PM, Ryan C. Thompson wrote:
> I've been thinking about this some more, and I don't think there's any inherent
> reason that one cannot parallelize access to multiple read groups in a single
> bam file, because I have previously successfully sped up bam file reading by
> parallelizing across chromosomes. I think it would be convenient to have all the
> data for all the samples in an experiment in a single file. If Rsamtools
> supported filtering by read groups using some kind of option to ScanBamParam
> (does it?), I think it would be sufficient to take a vectorized param argument
> to summarizeOverlaps. Then one could pass a list with one ScanBamParam for each
> read group and get parallel counting of multiple read groups from a single bam
> file.

If someone can point me to a reasonable publicly available BAM file with read
groups I'd be happy to explore this a bit. Rsamtools doesn't (yet?) support
filtering by read group.

Martin

>
> What do you think?
>
> On Sat 12 Jan 2013 12:53:36 PM PST, Martin Morgan wrote:
>> On 1/12/2013 12:29 PM, Ryan C. Thompson wrote:
>>> Hi all,
>>>
>>> I'm looking at simplifying my differential expression pipeline a
>>> little bit by merging all my input bam files into one bam file with
>>> multiple samples/read groups and then using that bam file as input to
>>> summarizeOverlaps. Is this supported in any way? I've never worked
>>> with sam read groups before (I always just did one sample per file),
>>> so I don't really know anything about them.
>>>
>>> So is it supported to take a single bam file and use
>>> summarizeOverlaps or some other mechanism to get a
>>> SummarizedExperiment object with one column for each sample in the
>>> bam file, rather than one column per file?
>>
>> Rsamtools doesn't do anything special with read groups (e.g., no
>> pre-filtering) and summarizeOverlaps doesn't do per-read-group
>> counting (one can provide one's own counting function to
>> summarizeOverlaps, though...) Also, parallelizing over bam files is a
>> simple way to get better throughput (providing a BamFileList as the
>> second argument to summarizeOverlaps, and with 'parallel' on the
>> search path, currently uses mclapply and memory-efficient iteration to
>> populate the SummarizedExperiment), so in some ways one large bam file
>> is a step in a counter-productive direction.
>>
>> Martin
>>
>>>
>>> -Ryan Thompson
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor@...
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
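
For anyone following the thread, the per-read-group counting being asked about can be sketched independently of Rsamtools. The following is only a minimal, language-neutral illustration in plain Python over SAM-format text lines. It is not how summarizeOverlaps works and uses no Bioconductor API; the function names, the feature table, and the toy records are all hypothetical. It assumes each read carries an RG:Z: tag, and it simplifies overlap to "leftmost position falls inside the feature", ignoring CIGAR and strand.

```python
from collections import defaultdict


def read_group(fields):
    """Return the RG tag value from a split SAM record, or None if absent."""
    # Optional tags start at column 12 (index 11) of a SAM alignment line.
    for tag in fields[11:]:
        if tag.startswith("RG:Z:"):
            return tag[5:]
    return None


def count_by_read_group(sam_lines, features):
    """Count reads per feature and per read group.

    features maps a name to (chrom, start, end), 1-based and inclusive.
    Simplification (assumption): a read hits a feature when its leftmost
    mapped position (POS) lies inside it; CIGAR and strand are ignored.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines such as @RG
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[2], int(fields[3])
        rg = read_group(fields)
        if rg is None:  # reads without a read group are not counted
            continue
        for name, (fchrom, start, end) in features.items():
            if chrom == fchrom and start <= pos <= end:
                counts[name][rg] += 1
    return {name: dict(per_rg) for name, per_rg in counts.items()}


# Toy input: two read groups standing in for two samples in one file.
sam = [
    "@RG\tID:sampleA",
    "@RG\tID:sampleB",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:sampleA",
    "r2\t0\tchr1\t150\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:sampleB",
    "r3\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:sampleA",
    "r4\t0\tchr2\t120\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:sampleA",
]
features = {"geneX": ("chr1", 90, 200), "geneY": ("chr1", 450, 600)}
result = count_by_read_group(sam, features)
```

The nested result (feature, then read group) corresponds to the one-column-per-sample layout asked about in the original question. Note that this makes a single pass over the file, so it does not by itself address the throughput point about parallelizing over separate bam files.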