Simon Urbanek | 22 Mar 02:23 2012

Re: R's copying of arguments (Re: Julia)


On Mar 20, 2012, at 3:08 PM, Hervé Pagès wrote:

> Hi Oliver,
> 
> On 03/17/2012 08:35 AM, oliver wrote:
>> Hello,
>> 
>> regarding the copying issue,
>> I would like to point to the
>> 
>> "Writing R-Extensions" documentation.
>> 
>> There it is mentioned that functions of extensions
>> that use the .C interface normally get their arguments
>> pre-copied...
>> 
>> 
>> In section 5.2:
>> 
>>   "There can be up to 65 further arguments giving R objects to be
>>   passed to compiled code. Normally these are copied before being
>>   passed in, and copied again to an R list object when the compiled
>>   code returns."
>> 
>> But for the .Call and .External interfaces this is NOT the case.
>> 
>> 
>> 
>> In section 5.9:
>>   "The .Call and .External interfaces allow much more control, but
>>   they also impose much greater responsibilities so need to be used
>>   with care. Neither .Call nor .External copy their arguments. You
>>   should treat arguments you receive through these interfaces as
>>   read-only."
>> 
>> 
>> Why is read-only preferred?
>> 
>> Please, see the discussion in section 5.9.10.
>> 
>> It's mentioned there that copying an object in the R language
>> does not necessarily make a real copy of that object; instead,
>> just a "reference" to the real data is created (two names
>> referring to one block of data). That's typical functional
>> programming: not a variable, but a name (and possibly more than one
>> name) bound to an object.
>> 
>> 
>> Of course, if you change the original named value without
>> copying it first, then both names
>> would refer to the changed data.
>> Of course that is not what is wanted.
>> 
>> But what you can also see in section 5.9.10 is that
>> there already is a mechanism (reference counting) that allows
>> distinguishing between unnamed and named objects.
>> 
>> So, this directly addresses the points you have mentioned in your
>> examples.
>> 
>> So, at least in principle, R allows in-place modification
>> of objects with the .Call interface.
>> 
>> You seem to refer to the .C interface, and I had explored the .Call
>> interface. That's the reason why you may insist on "it's always
>> copied" and I wondered what you were talking about, because the
>> .Call interface allowed me a rather C-like raw style of programming
>> (and lets its user decide whether copying will be done or not).
>> 
>> The mechanism to decide whether copying should be done or not
>> is also mentioned in section 5.9.10: the NAMED and SET_NAMED macros.
>> With NAMED you can get the number of references.
>> 
>> But later in that section it is mentioned, that - at least for now -
>> NAMED always returns the value 2.
>> 
>> 
>>   "Currently all arguments to a .Call call will have NAMED set to 2,
>>   and so users must assume that they need to be duplicated before
>>   alteration."
>>                (section 5.9.10, last sentence)
>> 
>> 
>> So, in-place modification can already be done with the .Call
>> interface, for example. But deciding whether it is safe or not
>> is not supported at the moment.
>> 
>> So the situation is somewhere between "it is possible" and
>> "R does not support a safe decision on whether what is possible
>> can also be recommended".
>> At the moment R rather discourages in-place modification by default
>> (the safe way, and I agree with this default).
>> 
>> But it's not true, that R in general copies arguments.
>> 
>> But this seems to be true for the .C interface.
>> 
>> Maybe a lot of performance-/memory-problems can be solved
>> by rewriting already existing packages, by providing them
>> via .Call instead of .C.
> 
> My understanding is that most packages use the .C interface
> because it's simpler to deal with and because they don't need
> to pass complicated objects at the C level, just atomic vectors.
> My guess is that it's probably rarely the case that the cost
> of copying the arguments passed to .C is significant, but,
> if that was the case, then they could always call .C() with
> DUP=FALSE. However, using DUP=FALSE is dangerous (see Warning
> section in the man page).
> 
> No need to switch to .Call
> 

I strongly disagree. I'm appalled to see that sentence here. The overhead is significant for any large
vector and it is in particular unnecessary since in .C you have to allocate *and copy* space even for
results (twice!). Also it is very error-prone, because you have no information about the length of
vectors so it's easy to run out of bounds and there is no way to check. IMHO .C should not be used for any code
written in this century (the only exception may be if you are passing no data, e.g. if all you do is to pass a
flag and expect no result, you can get away with it even if it is more dangerous). It is a legacy interface
that dates way back and is essentially just a renamed .Fortran interface. Again, I would strongly
recommend the use of .Call in any recent code because it is safer and more efficient (if you don't care about
either attribute, well, feel free ;)).
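
To make the two points concrete (the object knows its own length, and copying can be decided from a NAMED-like counter as discussed in the quoted section 5.9.10), here is a minimal sketch in plain C. It is a toy model only: toy_vec and all toy_* functions are hypothetical names invented for this illustration, not R's SEXP or any part of R's API. In real .Call code the corresponding pieces would be LENGTH(), NAMED() and duplicate() from Rinternals.h.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an R vector. NOT R's SEXP: the struct and every toy_*
   name below are hypothetical and exist only for this illustration. */
typedef struct {
    int named;      /* 0: temporary value; >0: reachable through a name */
    size_t length;  /* the object carries its own length */
    double *data;
} toy_vec;

/* Allocate an unnamed, zero-filled vector; returns NULL on failure. */
toy_vec *toy_alloc(size_t n) {
    toy_vec *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->data = calloc(n ? n : 1, sizeof *v->data);
    if (v->data == NULL) { free(v); return NULL; }
    v->named = 0;
    v->length = n;
    return v;
}

void toy_free(toy_vec *v) {
    if (v != NULL) { free(v->data); free(v); }
}

/* Deep copy; the copy starts out unnamed. */
toy_vec *toy_duplicate(const toy_vec *src) {
    toy_vec *dst = toy_alloc(src->length);
    if (dst != NULL && src->length > 0)
        memcpy(dst->data, src->data, src->length * sizeof *src->data);
    return dst;
}

/* Bounds-checked write, possible only because the length is known.
   Returns 0 on success, -1 if i is out of range. */
int toy_set(toy_vec *v, size_t i, double value) {
    if (i >= v->length) return -1;
    v->data[i] = value;
    return 0;
}

/* Double every element. Overwrite the input only when no name can observe
   it; otherwise work on a duplicate. Returns NULL if duplication fails. */
toy_vec *toy_double(toy_vec *x) {
    toy_vec *out = (x->named == 0) ? x : toy_duplicate(x);
    if (out == NULL) return NULL;
    for (size_t i = 0; i < out->length; i++)
        out->data[i] *= 2.0;
    return out;
}
```

With named set to a positive value, toy_double returns a fresh vector and leaves its input untouched; on an unnamed temporary it returns the same pointer and allocates nothing. Keep in mind that, per the sentence quoted from section 5.9.10, arguments to .Call currently all have NAMED set to 2, so real code has to take the duplicate branch before altering an argument.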

Cheers,
Simon

> Cheers,
> H.
> 
>> 
>> 
>> Ciao,
>>    Oliver
>> 
>> 
>> 
>> 
>> On Tue, Mar 06, 2012 at 04:44:49PM +0000, William Dunlap wrote:
>>> S (and its derivatives and successors) promises that functions
>>> will not change their arguments, so in an expression like
>>>    val <- func(arg)
>>> you know that arg will not be changed.  You can
>>> do that by having func copy arg before doing anything,
>>> but that uses space and time that you want to conserve.
>>> If arg is not a named item in any environment then it
>>> should be fine to write over the original because there
>>> is no way the caller can detect that shortcut.  E.g., in
>>>     cx <- cos(runif(n))
>>> the cos function does not need to allocate new space for
>>> its output, it can just write over its input because, without
>>> a name attached to it, the caller has no way of looking
>>> at what runif(n) returned.  If you did
>>>     x <- runif(n)
>>>     cx <- cos(x)
>>> then cos would have to allocate new space for its output
>>> because overwriting its input would affect a subsequent
>>>     sum(x)
>>> I suppose that end-users and function-writers could learn
>>> to live with having to decide when to copy, but not having
>>> to make that decision makes S more pleasant (and safer) to use.
>>> I think that is a major reason that people are able to
>>> share S code so easily.
>>> 
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>> 
>>>> -----Original Message-----
>>>> From: oliver [mailto:oliver <at> first.in-berlin.de]
>>>> Sent: Tuesday, March 06, 2012 1:12 AM
>>>> To: William Dunlap
>>>> Cc: Hervé Pagès; R-devel
>>>> Subject: Re: [Rd] Julia
>>>> 
>>>> On Tue, Mar 06, 2012 at 12:35:32AM +0000, William Dunlap wrote:
>>>> [...]
>>>>> I find R's (& S+'s & S's) copy-on-write-if-not-copying-would-be-discoverable-
>>>>> by-the-user mechanism for giving the illusion of pass-by-value a good way
>>>>> to structure the contract between the function writer and the function user.
>>>> [...]
>>>> 
>>>> 
>>>> Can you elaborate more on this,
>>>> especially on the ...-...-...-if-not-copying-would-be-discoverable-by-the-user
>>>> stuff?
>>>> 
>>>> What do you mean by discoverability of not-copying?
>>>> 
>>>> Ciao,
>>>>    Oliver
>> 
>> ______________________________________________
>> R-devel <at> r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages <at> fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 

