Thilo Goetz | 18 May 09:55

Re: UIMA internals memory footprint

Kirk True wrote:
> Hi Adam,
> 
>> Kirk,
>>
>> In this test are you running a CPE or just an AnalysisEngine?  If it
>> is a CPE do you know what your CAS Pool size is?
> 
> It's an AnalysisEngine.
> 
>> When a CAS is created it does allocate a large heap which is then
>> filled as you create annotations.  By default I believe this is
>> 500,000 cells (2MB) per CAS, but this can be overridden (see
>> UIMAFramework.getDefaultPerformanceTuningProperties()).  So this can
>> definitely be one source of memory overhead.  As you saw it does not
>> grow with larger documents, it will only grow if you create enough
>> annotations to fill up the allocated space.
> 
> I noticed that this is tweak-able and set it to something insanely
> small (like 100). But, as you said, it grows as the number of
> annotations grows. Since the parameter is under the umbrella of
> performance, I'd assume that it would actually be better to
> pre-allocate close to what we're going to use.
[...]

Yes.

You can estimate memory use on the CAS heap as follows.  Each FS uses at least
one int for the type information, plus one int per feature.  So a vanilla
annotation is 3 ints: one for the type, and one each for the start and end
features.  If you have two additional features, that's 5 ints, so 20 bytes.
If you use the JCas, you incur an additional overhead of a Java object for
each annotation.  It's small, but I can't say off the top of my head how small
exactly.  Plus, the JCas objects are held in a HashMap (or some such, Marshall
correct me if I'm wrong), which incurs additional memory overhead.
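
The arithmetic above can be sketched in a few lines of Java.  This is only a
back-of-the-envelope helper, not UIMA code: the class and method names are
made up for illustration, it assumes 4-byte ints and one int per feature as
described above, and it ignores the JCas object and map overhead entirely.

```java
// Hypothetical helper, not part of UIMA: estimates CAS heap usage from the
// "one int for the type plus one int per feature" rule of thumb.
class CasHeapEstimate {
    static final int BYTES_PER_INT = 4;

    // Ints used by one feature structure: 1 for the type, 1 per feature.
    static int intsPerFs(int featureCount) {
        return 1 + featureCount;
    }

    // Estimated heap bytes for fsCount feature structures of the same type.
    static long heapBytes(long fsCount, int featureCount) {
        return fsCount * intsPerFs(featureCount) * BYTES_PER_INT;
    }

    public static void main(String[] args) {
        // Vanilla annotation: start and end features only.
        System.out.println("vanilla annotation: " + heapBytes(1, 2) + " bytes");
        // Annotation with two additional features (5 ints).
        System.out.println("with 2 extra features: " + heapBytes(1, 4) + " bytes");
        // Assumed workload: 100,000 annotations with 4 features each.
        System.out.println("100,000 such annotations: " + heapBytes(100_000, 4) + " bytes");
    }
}
```

Plugging in your own annotation counts and feature counts gives a rough lower
bound to compare against the initial heap size you configure.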

In my experience, the CAS can easily reach 10 to 20 times the size of the input
document.  If you have information-rich token annotations, that's not really
surprising.  (And this is without using JCas.)  If you were to manually
create Java objects that carry the same information, you would see roughly
the same kind of overhead.

--Thilo

