Nikodemus Siivola | 14 Feb 09:57 2011

Re: heap exhaustion ?

On 14 February 2011 07:04, Jianshi Huang <jianshi.huang <at> gmail.com> wrote:

> Heap exhausted during garbage collection: 448 bytes available, 512 requested.
>  Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB   LUB  !move  Alloc Waste   Trig    WP  GCs Mem-age
>   0:     0     0     0     0     0     0     0     0     0        0     0  2000000    0   0  0.0000
>   1:     0     0     0     0     0     0     0     0     0        0     0  2000000    0   0  0.0000
>   2:     0     0     0     0     0     0     0     0     0        0     0  2000000    0   0  0.0000
>   3:     0     0     0     0     0     0     0     0     0        0     0  2000000    0   0  0.0000
>   4:     0     0     0     0     0     0     0     0     0        0     0  2000000    0   0  0.0000
>   5: 47086 198400     0     0 130093 907945   417   757     0 4113377392 143234960 3988343744    0  19  0.9691
>   6:     0     0     0     0 18880  3778     0     0     0 92807168     0  2000000 18630   0  0.0000
>   Total bytes allocated    = 8305371856
>   Dynamic-space-size bytes = 4294901760
> GC control variables:
>   *GC-INHIBIT* = true
>   *GC-PENDING* = in progress
>   *STOP-FOR-GC-PENDING* = false
> fatal error encountered in SBCL pid 29148(tid 140737209779968):
> Heap exhausted, game over.
>
>
> It seems SBCL tried to allocate more than 8GB?! of memory and died,
> since the max heap size is 8GB. My program usually consumes 700MB~1GB
> for data, so the rest of the memory must be occupied by garbage.
>
> My questions are:
>
> 1) What does "Dynamic-space-size bytes" mean here? It's about half
> the size of the maximum heap.

It is actually the size of the heap, but the fprintf statement in the
runtime appears to have %u where it should have %lu, so the value has
been truncated by accident. Argh. It really does stand for the
heap size, though, so you can disregard that line.

> 2) My program processes data from streams so objects get created and
> collected constantly, is it the case that the youngest generations of
> heap kept expanding and finally exceeded the limit? If so, is there a
> way to compact/resize the generations?(a full GC?)

The youngest generations cannot expand without bounds: objects instead
get migrated to older generations when necessary. From the heap map
printed we can see that all the data has been migrated to the final
untenured generation by the time the heap is exhausted. (The final
generation gets collected too, so that's not a problem.)

Without knowing more about your allocation patterns it's hard to hazard
a guess, but:

(sb-alien:define-alien-routine print-generation-stats sb-alien:void)

gives you a Lisp-callable entry to the runtime's stats printer, and
then you can call (PRINT-GENERATION-STATS) at any time to print the
heap map to the stderr of the process, so you can see what is
happening -- are things slowly accumulating in older generations over
the lifetime of your application, or is it a sudden collapse, etc.
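
For example -- a minimal sketch, where PROCESS-WORKLOAD is a
hypothetical stand-in for one unit of your program's work:

  ;; Assumes the DEFINE-ALIEN-ROUTINE form above has been evaluated.
  (defun run-with-stats (&key (stats-every 100))
    (loop for i from 1
          do (process-workload)            ; hypothetical per-cycle work
          when (zerop (mod i stats-every))
            do (print-generation-stats)))  ; heap map goes to stderr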

I'll start from the assumption that you're not accidentally
accumulating things in your application. :)

There are three major reasons for running out of memory through no fault
of your own:

1. Getting bitten by SBCL's conservatism (stack and registers.)
However, in a long-running application conservatism typically isn't
the issue: if it were the problem, it should not cause slow
accumulation of uncollected garbage, but rather keep a roughly static
amount of garbage uncollected.

So while the following should not apply to you, I'll put it here for
lurkers to read:

If your application produces large multipart structures (long lists,
trees of objects -- especially if those trees contain backpointers), it
can help to break links in those structures when they are no longer
needed, as sketched below. The problem with "large multipart
structures" is that there is a large number of objects in the heap
where conservatively retaining any one of them could cause many others
to be retained as well.
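
A minimal sketch of the link-breaking idea (the NODE structure here is
hypothetical, not from any particular library):

  ;; A tree node with a backpointer to its parent.
  (defstruct node parent children payload)

  (defun release-tree (root)
    ;; Break the links so a single conservatively retained node can no
    ;; longer keep the rest of the tree alive.
    (dolist (child (node-children root))
      (release-tree child))
    (setf (node-parent root) nil
          (node-children root) nil
          (node-payload root) nil))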

Of course large flat structures like big SIMPLE-VECTORs aren't immune
to this either, but since they aren't preserved by interior pointers
they are no more likely to be retained by accident than a single CONS
cell ... whereas if you have a million conses in a list, you have a
million potential false positives, each on average responsible for
retaining 0.5 million other cells.

So in practice it is only large structures composed of a great many
smaller parts that are ever a problem. (Which is not to say they are
automatically a problem -- again, it really depends on the
application.)

2. Doing *something* which causes SBCL to keep accumulating something
in its internals due to a known or unknown bug. For example:

  (defun foo ()
    (let ((name (gensym)))
      ;; Installing the definition records NAME in SBCL's globaldb.
      (setf (fdefinition name) #'foo)
      ;; Removes the function binding, but NAME stays in the globaldb.
      (fmakunbound name)))

  (loop (foo))

will eventually exhaust the heap: even though FMAKUNBOUND removes the
function binding, it leaves the *name* in SBCL's globaldb.
EQL-specializers and EQL-specialized methods in CLOS are another known
leak. I can't off the top of my head think of other known issues, but
maybe there is an unknown one that is biting you?

3. Getting bitten by the generational trap.

Let's say you have a "cyclic" application -- a workload comes in, you
process it, then repeat from the start with another workload. Let's say
that processing a single workload involves on average one minor GC
during which large amounts of the data can be live.

These minor GCs initially only collect the nursery, promoting live
objects to generation 1 -- where uncollected garbage from earlier
cycles keeps accumulating till a collection is triggered for it as
well.

When this happens, *first* the nursery is collected into gen 1, then
gen 1 is collected into gen 2. So now live objects from this cycle
have ended up in gen 2 -- where, in the same way, uncollected garbage
slowly accumulates till a collection is triggered for gen 2.

When this happens, first the nursery is collected into gen 1, then
gen 1 into gen 2, then gen 2 into gen 3... so now live objects from
this cycle end up in gen 3, where they accumulate till an even deeper
collection is triggered.

This keeps going on till the final generation is reached. When that
happens, it is collected but not promoted, breaking the chain of
promotion.

Now, given a "bad" allocation pattern, it may be that you exhaust the
heap due to uncollected garbage in older generations before a
collection deep enough to collect that garbage is triggered.

Based on the description of your application, I suspect this may be
happening to you. In that case forcing a full collection every cycle
(or every few cycles) should help -- watching the
PRINT-GENERATION-STATS output should tell you if this is the case, and
how often you should force a full GC.
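
A minimal sketch of that approach (PROCESS-WORKLOAD is again a
hypothetical stand-in for one cycle of your work):

  (defun main-loop (&key (full-gc-every 10))
    (loop for cycle from 1
          do (process-workload)                  ; hypothetical per-cycle work
          when (zerop (mod cycle full-gc-every))
            do (sb-ext:gc :full t)))             ; collect all generations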

...

As for bad stuff your application (or a library you depend on) could be doing:

A. a cache or memoization that keeps growing without bounds?

B. a HASH-TABLE that should be weak, but has misspecified its
weakness -- :VALUE when it should be :KEY, etc. (see the sketch after
this list)

C. a HASH-TABLE created using a large :REHASH-SIZE -- this is
virtually always a bad idea, especially if the number is a float...
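
For B, a sketch using SBCL's :WEAKNESS extension to MAKE-HASH-TABLE
(the cache itself is hypothetical):

  ;; Entries die together with their keys: once a key object becomes
  ;; otherwise unreachable, the GC can drop the whole entry.
  (defvar *cache* (make-hash-table :test 'eq :weakness :key))

  ;; With :WEAKNESS :VALUE the entry would instead live until the
  ;; *value* dies -- the misspecification described in B.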

> 3) What's the best practice of memory management for long-running
> programs in SBCL? Do a full GC periodically?

See above.

Cheers,

 -- Nikodemus
