Hervé Pagès | 21 Feb 07:54 2013

Re: speed of runmed on SimpleRleList

Hi Janet,

On 02/19/2013 06:10 PM, Janet Young wrote:
> Hi Herve (and Valerie),
>
> Interesting - thanks for looking into this.  Funny that the ends make
> such a large difference, but I think I understand.

It's because we are re-using stats::smoothEnds implementation, which
contains things like 'for (i in 3:k) {...}', and therefore is
inefficient when 'k' is not very small.

>  Hmm - I guess that
> issue would go away if I use "drop" - that should be fine, as I don't
> care about medians in the partial windows at the start and end of each
> chromosome (although I do want to make sure I get the windows lined up
> correctly, but I can probably just add a windowsize/2 constant to each
> position).

If you don't care about the truncated windows at the start and end of
each chromosome, but do want to preserve the windows lined up correctly,
you can use endrule="keep" or endrule="constant". I don't know what they
do exactly with the truncated windows (it doesn't matter), but, unlike
endrule="drop", they preserve the length of the input. And they don't
have the cost of endrule="median".

>
> I found today that the timing issues are actually made less relevant for
> me because of memory issues - I'm finding that runmean and runmed give
> me way more information than I need, and they create objects that are
> huge in memory, because they're reporting at 1bp resolution.

An operation like runmean() almost systematically returns an Rle that
is bigger in memory than its input, because it increases the nb of runs.
One important number to keep an eye on is the mean of the run lengths
(i.e. mean(runLength(x))). If it falls below the threshold of 2, then
the integer-Rle representation is not worth it: an ordinary integer
vector will use less space in memory. For a numeric-Rle, the threshold
is 3/2.

Unfortunately, runmean() tends to make that number drop badly:

   x <- Rle(sample(10L, 150, replace=TRUE),
            sample(50L, 150, replace=TRUE))

   > mean(runLength(x))
   [1] 27.58462
   > mean(runLength(runmean(x, k=5)))
   [1] 5.843393
   > mean(runLength(runmean(x, k=15)))
   [1] 2.228322
   > mean(runLength(runmean(x, k=25)))
   [1] 1.537333
   > mean(runLength(runmean(x, k=35)))
   [1] 1.28976

With k=35, keeping the result in a numeric-Rle is no more worth it.

> For my
> analysis, I don't need windows spaced by 1bp - e.g. I can look at 5kb
> windows sliding along the genome in 500bp increments.
>
> I've cooked up some code that seems to work (using Views and aggregate),
> but I wonder whether in the long run there could be some kind of
> built-in solution in IRanges. I know it would probably have to return
> different object types, to be able to keep track of window
> positions/sizes, but it seems like something a lot of folks will want to
> do (e.g. look at coverage over the genome, smoothing in some kind of
> running window, but not necessarily every 1bp).  Does that sound doable?
>   (maybe I'm missing some existing solution?)

I don't think we have a built-in solution for this at the moment but
it would certainly be doable. There are probably many ways to go but
here is one that I find appealing.

The starting point would be to have an easy way to switch between the
2 following representations of a numeric function F defined along a
genome (e.g. coverage):

   (1) RleList representation

   (2) A GRanges made of fixed-width bins (that cover the entire genome),
       and with a metadata column that contains the average value of F
       for each bin.

Because of the averaging, going from (1) to (2) would of course lose
some information but would potentially reduce memory usage a lot. More
precisely, (2) offers the advantage of using an amount of memory that
is under control because it only depends on the width of the bins.
For example, for a function F defined along the Human genome, if the
bins are of width 500 (like in your case), the GRanges representation
would have about 6 million ranges and occupy less than 100Mb. With the
RleList representation, it can vary a lot: from a few hundreds of bytes
(in the best case i.e. when coverage is constant), to 24Gb if all the
runs are of length 1 (worst case).

Once you have (2), you could do all the operations you want (e.g.
runmean, runmed, etc...,) by operating directly on the metadata col.
Unlike with the RleList, the size of the GRanges would not grow while
you do this. Then, when you are done, you could always come back to
(1) if you really miss the RleList representation. This RleList
would actually be quite small: at most 72Mb in your use case (i.e.
bins of width 500 on the Human genome).

Of course this approach is only suitable if the user accepts the loss
of information introduced by the first step i.e. by the conversion
from (1) to (2).

The only things we would need to add to support this approach are the
functions to switch between (1) and (2). Would that be of any interest?

Cheers,
H.

>
> thanks,
>
> Janet
>
>
>
>
> On Feb 19, 2013, at 3:51 PM, Hervé Pagès wrote:
>
>> Hi Janet,
>>
>> The culprit seems to be the call to smoothEnds() at the end of the
>> "runmed" method for Rle objects:
>>
>>  > selectMethod("runmed", "Rle")
>>  Method Definition:
>>
>>  function (x, k, endrule = c("median", "keep", "drop", "constant"),
>>      algorithm = NULL, print.level = 0)
>>  {
>>      if (!all(is.finite(as.vector(x))))
>>          stop("NA/NaN/Inf not supported in runmed,Rle-method")
>>      endrule <- match.arg(endrule)
>>      n <- length(x)
>>      k <- normargRunK(k = k, n = n, endrule = endrule)
>>      i <- (k + 1L)%/%2L
>>      ans <- runq(x, k = k, i = i)
>>      if (endrule == "constant") {
>>          runLength(ans)[1L] <- runLength(ans)[1L] + (i - 1L)
>>          runLength(ans)[nrun(ans)] <- runLength(ans)[nrun(ans)] +
>>              (i - 1L)
>>      }
>>      else if (endrule != "drop") {
>>          ans <- c(head(x, i - 1L), ans, tail(x, i - 1L))
>>          if (endrule == "median") {
>>              ans <- smoothEnds(ans, k = k)
>>          }
>>      }
>>      ans
>>  }
>>
>> Not too surprisingly, the complexity of smoothEnds() doesn't depend on
>> the length of its first arg:
>>
>>  > system.time(smoothEnds(Rle(0L, 10), k=5))
>>     user  system elapsed
>>    0.028   0.000   0.026
>>  > system.time(smoothEnds(Rle(0L, 10000), k=5))
>>     user  system elapsed
>>    0.028   0.000   0.026
>>
>> but is linear w.r.t to the value of its 2nd ag:
>>
>>  > system.time(smoothEnds(Rle(0L, 10000), k=51))
>>     user  system elapsed
>>    0.280   0.000   0.282
>>  > system.time(smoothEnds(Rle(0L, 10000), k=511))
>>     user  system elapsed
>>    2.836   0.000   2.841
>>
>> In other words, runmed() will be pretty fast on a full genome
>> coverage, but you will pay an additional fixed price for each
>> chromosome in the genome, and this fixed priced only depends
>> on the length of the median window.
>>
>> Note that the "smoothEnds" method for Rle objects is just
>> re-using the exact same code as stats::smoothEnds(), which is not
>> very efficient on Rle's. So we'll look into ways to improve this.
>>
>> Cheers,
>> H.
>>
>>
>> On 02/15/2013 12:27 PM, Janet Young wrote:
>>> Hi there,
>>>
>>> I've been using runmean on some coverage objects - nice.   Runmed
>>> also seems useful, but also seems very slow (oddly so) - I'm
>>> wondering whether there's some easy improvement could be made there?
>>>  Same issue with devel version and an older version.  All should be
>>> explained in the code below (I hope).
>>>
>>> thanks very much,
>>>
>>> Janet
>>>
>>> -------------------------------------------------------------------
>>>
>>> Dr. Janet Young
>>>
>>> Malik lab
>>>
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Avenue N., C3-168,
>>> P.O. Box 19024, Seattle, WA 98109-1024, USA.
>>>
>>> tel: (206) 667 1471 fax: (206) 667 6524
>>> email: jayoung  ...at... fhcrc.org <http://fhcrc.org>
>>>
>>> -------------------------------------------------------------------
>>>
>>>
>>>
>>> library(GenomicRanges)
>>>
>>> ### a small GRanges object (example from ?GRanges)
>>> seqinfo <- Seqinfo(paste0("chr", 1:3), c(1000, 2000, 1500), NA, "mock1")
>>> gr2 <-GRanges(seqnames =
>>>           Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
>>>           ranges = IRanges(1:10, width = 10:1, names = head(letters,10)),
>>>           strand = Rle(strand(c("-", "+", "*", "+", "-")),c(1, 2, 2,
>>> 3, 2)),
>>>           score = 1:10,
>>>           GC = seq(1, 0, length=10),
>>>           seqinfo=seqinfo)
>>> gr2
>>>
>>> cov <- coverage(gr2)
>>>
>>> ### runmed is slow! (for you Hutchies: this is on a rhino machine)
>>> ### I'm really trying to run this on some much bigger objects (whole
>>> genome coverage), where the slowness is more of an issue.
>>> system.time(runmed(cov, 51))
>>> #   user  system elapsed
>>> #  1.120   0.016   1.518
>>>
>>> ### runmean is fast
>>> system.time(runmean(cov, 51))
>>> #    user  system elapsed
>>> #   0.008   0.000   0.005
>>>
>>> sessionInfo()
>>>
>>>
>>> R Under development (unstable) (2012-10-03 r60868)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.11.29 IRanges_1.17.32       BiocGenerics_0.5.6
>>>
>>> loaded via a namespace (and not attached):
>>> [1] stats4_2.16.0
>>>
>>>
>>> ## see a similar issue on my Mac, using older R
>>>
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.10.1 IRanges_1.16.2       BiocGenerics_0.4.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] parallel_2.15.1 stats4_2.15.1
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor@... <mailto:Bioconductor@...>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages@... <mailto:hpages@...>
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages@...
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioconductor mailing list
Bioconductor@...
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


Gmane