Tejun Heo | 22 Apr 06:24 2013

Re: memcg: softlimit on internal nodes

Hey, Michel.

> I don't remember exactly when you left - during the session I
> expressed to Michal that while I think his proposal is an improvement
> over the current situation, I think his handling of internal nodes is
> confus(ed/ing).

I think I stayed until near the end of the hierarchy discussion and
yeap I heard you saying that.

> I completely agree with you here. This is important to ensure
> composability - someone that was using cgroups within a 4GB system can
> be moved to use cgroups within a hierarchy with a 4GB soft limit on
> the root, and still have its performance isolated from tasks running
> in other cgroups in the system.

And for basic sanity.  As you look down through the hierarchy of
nested cgroups, the pressure exerted by a limit can only be increased
(IOW, the specificity of the control increases) as the level deepens,
regardless of the direction of such pressure, which is the only
logical thing to do for nested limits.

>> > Now, let's consider the following hierarchy just to be sure.  Let's
>> > assume that A itself doesn't have any tasks for simplicity.
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> >       A (h:16G s:4G)
>> >      /            \
>> >     /              \
>> >  B (h:7G s:5G)    C (h:7G s:5G)
>> >
>> > For hardlimit, it is clear that A's limit won't do anything.
>
> Now the above is a very interesting case.

It shouldn't be interesting at all.  It should be exactly the same.
If "softlimit" means actual soft limit prioritizing reclaim down to
that point under pressure, it works in the same direction as hardlimit
and the limits should behave the same.

If "softlimit" means allocation guarantee where a cgroup is exempt
from reclaim while under the limit, a knob defining allowance rather
than limit, the direction of specificity is flipped.  While the
direction is flipped, how it behaves should be the same.  Otherwise,
it ends up breaking the very basics of nesting.  Not a particularly
bright idea.

> One thing some people worry about is that B and C's configuration
> might be under a different administrator's control than A's. That is,
> we could have a situation where the machine's sysadmin set up A for
> someone else to play with, and that other person set up B and C within
> his cgroup. In this scenario, one of the issues has to be how do we
> prevent B and C's configuration settings from reserving (or protecting
> from reclaim) more memory than the machine's admin intended when he
> configured A.

Cgroup doesn't and will not support delegation of subtrees to
different security domains.  Please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel.cgroups/6638

In fact, I'm planning to disallow changing ownership of cgroup files
when "sane_behavior" is specified.  We're having difficult time
identifying our own asses as it is and I have no intention of adding
the huge extra burden of security policing on top.  Delegation, if
necessary, will happen from userland.

> Michal's proposal resolves this by saying that A,B and C all become
> reclaimable as soon as A goes over its soft limit.

This makes me doubly upset and reminds me strongly of the
.use_hierarchy mess.  It's so myopic in coming up with a solution for
the problem immediately at hand, it ends up ignoring basic rules and
implementing something which is fundamentally broken and confused.
Don't twist basic nesting rules to accomodate half-assed delegation
mechanism.  It's never gonna work properly and we'll need
"really_sane_behavior" flag eventually to clean up the mess again, and
we'll probably have to clarify that for memcg the 'c' stands for
"confused" instead of "control".

And I don't even get the delegation argument.  Isn't that already
covered by hardlimit?  Sure, reclaimer won't look at it but if you
don't trust a cgroup it of course will be put under certain hardlimit
from parent and smacked when it misbehaves.  Hardlimit of course
should have priority over allocation guarantee and the system wouldn't
be in jeopardy due to a delegated cgroup misbehaving.  If each knob is
given a clear meaning, these things should come naturally.  You just
need a sane pecking order among the controls.  It almost feels surreal
that this is suggested as a rationale for creating this chimera of a
knob.  What the hell is going on here?

> I have a third view, which I talked about during Michal's
> presentation. I think that when A's usage goes over 4G, we should be
> able to reclaim from A's subtree. If B or C's usage are above their
> soft limits, then we should reclaim from these cgroups; however if
> both B and C have usage below their soft limits, then we are in a
> situation where the soft limits can't be obeyed so we should ignore
> them and reclaim from both B and C instead.

No, the config is valid and *exactly* the same as hardlimit case.
It's just in the opposite direction.  Don't twist it.  It's exactly
the same mechanics.  Flipping the direction should not change what
nesting means.  That's what you get and should get when cgroup nesting
is used for something which "guarantees" rather than "limits".

Whatever twsit you think is a good idea for "softlimit", try to flip
the direction and apply it the same to "hardlimit" and see how messed
up it gets.

> Regardless about these differences, I still want to stress out that
> Michal's proposal is a clear improvement over what we have, so I see
> it as a large step in the right direction.

I'm afraid I don't agree with that.  If the current situation is
ambiguous, moving to a definite wrong state makes the situation worse,
so we need to figure out what this thing actually means first, and
it's not like it is a difficult choice to make.  It's either actual
soft limit or allocation guarantee.  It cannot be some random
combination of the two.  Just pick one and stick with it.

>> Now I'm confused.  You're saying softlimit currently doesn't guarantee
>> anything and what it means, even for flat hierarchy, isn't clearly
>> defined?
>
> The largest problem with softlimit today is that global reclaim
> doesn't take it into account at all... So yes, I would say that
> softlimit is very badly defined today (which may be why people have
> such trouble agreeing about what it should mean in the first place).

So, in that case, let's please make "softlimit" an actual soft limit
working in the same direction as hardlimit but works in terms of
reclaim pressure rather than OOM killing, and please don't tell me how
"softlimit" working in the opposite direction of "hardlimit" actually
makes sense in the wonderland of memcg.  Please have at least some
common sense. :(

If people need "don't reclaim under this limit", IOW allocation
guarantee, please introduce another knob with proper name and properly
flipped hierarchy behavior.

Thanks.

--
tejun

Gmane