Martin Morgan | 24 Jan 01:45 2013

Re: Using summarizeOverlaps with multiple samples/readgroups in a single bam file?

On 01/23/2013 04:00 PM, Ryan C. Thompson wrote:
> I've been thinking about this some more, and I don't think there's any inherent
> reason that one cannot parallelize access to multiple read groups in a single
> bam file, because I have previously successfully sped up bam file reading by
> parallelizing across chromosomes. I think it would be convenient to have all the
> data for all the samples in an experiment in a single file. If Rsamtools
> supported filtering by read groups using some kind of option to ScanBamParam
> (does it?), I think it would be sufficient to take a vectorized param argument
> to summarizeOverlaps. Then one could pass a list with one ScanBamParam for each
> read group and get parallel counting of multiple read groups from a single bam
> file.

If someone can point me to a reasonable publicly available BAM file with read 
groups I'd be happy to explore this a bit. Rsamtools doesn't (yet?) support 
filtering by read group.

Martin
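
For reference, the per-read-group counting discussed in this thread can be illustrated with a small, library-independent sketch in Python. It is not Rsamtools or summarizeOverlaps code: the function name `count_by_read_group`, the tuple layout of the reads, and the rule of dropping reads that hit more than one feature are all assumptions made for illustration.

```python
from collections import defaultdict


def count_by_read_group(reads, features):
    """Count reads per feature, split by read group.

    reads    -- iterable of (chrom, start, end, read_group) tuples
                (hypothetical layout; coordinates 1-based, inclusive)
    features -- dict mapping feature name -> (chrom, start, end)

    Returns {read_group: {feature_name: count}}.
    """
    # Every read group starts with a zero count for each feature.
    counts = defaultdict(lambda: dict.fromkeys(features, 0))
    for chrom, start, end, read_group in reads:
        per_group = counts[read_group]  # registers the group even if nothing is counted
        hits = [
            name
            for name, (f_chrom, f_start, f_end) in features.items()
            if f_chrom == chrom and start <= f_end and end >= f_start
        ]
        # Assumption of this sketch: a read hitting zero or several features is dropped.
        if len(hits) == 1:
            per_group[hits[0]] += 1
    return dict(counts)


features = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 300, 400)}
reads = [
    ("chr1", 150, 180, "sample1"),  # geneA only
    ("chr1", 190, 310, "sample1"),  # overlaps geneA and geneB, so dropped
    ("chr1", 350, 380, "sample2"),  # geneB only
    ("chr1", 120, 140, "sample2"),  # geneA only
    ("chr2", 150, 180, "sample2"),  # no feature on chr2
]
result = count_by_read_group(reads, features)
print(result)
# {'sample1': {'geneA': 1, 'geneB': 0}, 'sample2': {'geneA': 1, 'geneB': 1}}
```

Each key of the result corresponds to one column of the count table, so a single input yields one column per read group rather than one column per file.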

> What do you think?
> On Sat 12 Jan 2013 12:53:36 PM PST, Martin Morgan wrote:
>> On 1/12/2013 12:29 PM, Ryan C. Thompson wrote:
>>> Hi all,
>>> I'm looking at simplifying my differential expression pipeline a little bit by
>>> merging all my input bam files into one bam file with multiple samples/read
>>> groups and then using that bam file as input to summarizeOverlaps. Is this
>>> supported in any way? I've never worked with sam read groups before (I always
>>> just did one sample per file), so I don't really know anything about them.
>>> So is it supported to take a single bam file and use summarizeOverlaps or some
>>> other mechanism to get a SummarizedExperiment object with one column for each
>>> sample in the bam file, rather than one column per file?
>> Rsamtools doesn't do anything special with read groups (e.g., no
>> pre-filtering) and summarizeOverlaps doesn't do per-read-group
>> counting (one can provide one's own counting function to
>> summarizeOverlaps, though...) Also, parallelizing over bam files is a
>> simple way to get better throughput (providing a BamFileList as the
>> second argument to summarizeOverlaps, and with 'parallel' on the
>> search path, currently uses mclapply and memory-efficient iteration to
>> populate the SummarizedExperiment), so in some ways one large bam file
>> is a step in a counter-productive direction.
>> Martin
>>> -Ryan Thompson
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor@...
>>> Search the archives:


Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
