24 Jan 2013 01:00
Re: Using summarizeOverlaps with multiple samples/readgroups in a single bam file?
Ryan C. Thompson <rct@...>
2013-01-24 00:00:23 GMT
I've been thinking about this some more, and I don't think there's any inherent reason that one cannot parallelize access to multiple read groups in a single bam file, because I have previously sped up bam file reading by parallelizing across chromosomes. I think it would be convenient to have all the data for all the samples in an experiment in a single file.

If Rsamtools supported filtering by read group using some kind of option to ScanBamParam (does it?), I think it would be sufficient for summarizeOverlaps to take a vectorized param argument. Then one could pass a list with one ScanBamParam for each read group and get parallel counting of multiple read groups from a single bam file.

What do you think?

On Sat 12 Jan 2013 12:53:36 PM PST, Martin Morgan wrote:
> On 1/12/2013 12:29 PM, Ryan C. Thompson wrote:
>> Hi all,
>>
>> I'm looking at simplifying my differential expression pipeline a little bit by
>> merging all my input bam files into one bam file with multiple samples/read
>> groups and then using that bam file as input to summarizeOverlaps. Is this
>> supported in any way? I've never worked with sam read groups before (I always
>> just did one sample per file), so I don't really know anything about them.
>>
>> So is it supported to take a single bam file and use summarizeOverlaps or some
>> other mechanism to get a SummarizedExperiment object with one column for each
>> sample in the bam file, rather than one column per file?
>
> Rsamtools doesn't do anything special with read groups (e.g., no
> pre-filtering) and summarizeOverlaps doesn't do per-read-group
> counting (one can provide one's own counting function to
> summarizeOverlaps, though...)
>
> Also, parallelizing over bam files is a simple way to get better
> throughput (providing a BamFileList as the second argument to
> summarizeOverlaps, and with 'parallel' on the search path, currently
> uses mclapply and memory-efficient iteration to populate the
> SummarizedExperiment), so in some ways one large bam file is a step
> in a counter-productive direction.
>
> Martin
>
>>
>> -Ryan Thompson
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
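
The "one filter per read group, counted in parallel" idea proposed above can be illustrated with a small, language-neutral sketch. This is only a toy under stated assumptions: it does not use Rsamtools, summarizeOverlaps, or any real bam reader. Reads are in-memory tuples, and the names Read, overlaps, count_read_group and count_by_read_group are hypothetical. The counting rule (a read counts only if it overlaps exactly one feature) is a simplification chosen for the example, not a statement of what summarizeOverlaps does.

```python
from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one alignment record: read group (RG tag),
# chromosome, and 1-based inclusive start/end coordinates.
Read = namedtuple("Read", ["rg", "chrom", "start", "end"])


def overlaps(read, feature):
    """Return True if the read shares at least one base with the feature."""
    chrom, start, end = feature
    return read.chrom == chrom and read.start <= end and start <= read.end


def count_read_group(reads, features, rg):
    """Count reads of a single read group per feature.

    Plays the role of "one param object per read group": reads from other
    read groups are filtered out first. Reads hitting zero features or more
    than one feature are not counted (a simplifying assumption).
    """
    counts = {name: 0 for name in features}
    for read in reads:
        if read.rg != rg:
            continue
        hits = [name for name, feat in features.items() if overlaps(read, feat)]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


def count_by_read_group(reads, features, read_groups):
    """Run one counting task per read group and collect one column per group."""
    with ThreadPoolExecutor() as pool:
        columns = pool.map(
            lambda rg: count_read_group(reads, features, rg), read_groups
        )
        return dict(zip(read_groups, columns))


# Made-up example data: two features and two read groups in one "file".
features = {"geneX": ("chr1", 100, 200), "geneY": ("chr1", 250, 500)}
reads = [
    Read("sampleA", "chr1", 100, 150),  # geneX only
    Read("sampleA", "chr1", 400, 450),  # geneY only
    Read("sampleA", "chr1", 190, 260),  # spans both features, not counted
    Read("sampleB", "chr1", 120, 170),  # geneX only
    Read("sampleB", "chr2", 10, 60),    # no feature on chr2
]

table = count_by_read_group(reads, features, ["sampleA", "sampleB"])
print(table)
```

The result has one column of counts per read group rather than one per file, which is the shape asked for in the original question. Threads are used here only to show the structure of the per-read-group tasks; whether this actually beats one-file-per-sample parallelism depends on how the real reader handles concurrent access to a single file.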