Hanno Schlichting | 2 Jan 18:42

Batching improvements

Hi.

Inspired by some work in collective.solr, I made some improvements to
the batching logic in Plone over x-mas. This is a short explanation -
I suspect the p.a.contentlisting and maybe p.a.search code needs to be
adjusted to this as well.

So far we ran a catalog query of some sort, often with a sort_on
argument (score, folder position, publication date, ...), later
wrapped the result in Plone's Batch class in some template, and only
at the very end decided which batch of 10 or 20 items to show.

After my changes, we first decide which batch to show (reading
variables from the request), then do the catalog query, passing the
batch variables into the query, and finally wrap the result in the
Batch class. The catalog can then use the batching hints internally to
optimize things. I made these changes in the generic getFolderContents
skin script and the queryCatalog method of the topic class.

The optimizations are in ZCatalog itself in Zope 2.13 / Plone 4.1 and
backported to Plone 4 via experimental.catalogqueryplan. You can do
the template / code changes in any Plone version without any negative
effect, as extra arguments to the catalog will simply be ignored.

The two query arguments you need to pass on are b_start and b_size.
b_start is usually just read from the request. b_size depends on the
template. The b_size you pass to the catalog needs to be the
template's batch size plus the orphan value - so usually the batch
size + 1.
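
As a rough illustration of the two arguments, here is a minimal
sketch. The helper name batch_query_args, the plain dict standing in
for the request, and the example query keys are assumptions made for
this example and are not part of Plone.

```python
def batch_query_args(request, batch_size, orphan=1):
    """Return the extra catalog query arguments for the requested batch.

    `request` is any mapping of request variables (a plain dict here);
    this helper is hypothetical and only illustrates the arithmetic.
    """
    try:
        b_start = int(request.get("b_start", 0))
    except (TypeError, ValueError):
        # Fall back to the first batch on garbage input.
        b_start = 0
    b_start = max(b_start, 0)
    # The catalog needs the batch size plus the orphan value.
    return {"b_start": b_start, "b_size": batch_size + orphan}


# Example: third batch of 10 items, sorted by publication date.
query = {"portal_type": "Document", "sort_on": "effective"}
query.update(batch_query_args({"b_start": "20"}, batch_size=10))
# query now also carries b_start=20 and b_size=11 and could be handed
# to the catalog, for example as catalog(**query).
```

Catalogs without the optimizations simply ignore the two extra keys,
so the same query dict works either way.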

There are currently two optimizations in the catalog:

1. An implicit sort_limit is calculated as b_start + b_size. So if you
only want to show the first 10 items, the catalog doesn't need to sort
the entire resultset of possibly 10,000 matching items, but can stop
once it has the first 10. Depending on the ratio of the limit to the
size of the resultset, different sorting strategies are used to make
this more performant.

2. In case a sort_limit is specified you only get as many brains back
as the limit says (though maybe a couple more). In addition the result
value has an "actual_result_count" attribute. This attribute states
the number of matches, so the Batch class can still calculate the
correct batch pagination links.
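
The two optimizations can be sketched in plain Python. This is not
the ZCatalog implementation: LimitedResult, limited_sorted_query and
number_of_pages are made-up names, heapq.nsmallest stands in for
whatever sorting strategy the catalog actually picks, and the page
calculation ignores orphans for brevity.

```python
import heapq
import math


class LimitedResult(list):
    """A list of results that also remembers the total number of matches."""

    def __init__(self, items, actual_result_count):
        super().__init__(items)
        self.actual_result_count = actual_result_count


def limited_sorted_query(matches, key, b_start, b_size):
    """Return only the first b_start + b_size matches in sort order."""
    sort_limit = b_start + b_size  # the implicit sort_limit
    # nsmallest avoids fully sorting all matches when the limit is small.
    top = heapq.nsmallest(sort_limit, matches, key=key)
    return LimitedResult(top, actual_result_count=len(matches))


def number_of_pages(result, batch_size):
    """Pagination must use the real match count, not len(result)."""
    total = getattr(result, "actual_result_count", len(result))
    return max(1, math.ceil(total / batch_size))


matches = list(range(10000, 0, -1))  # 10,000 unsorted "matches"
result = limited_sorted_query(matches, key=lambda m: m, b_start=0, b_size=10)
# len(result) is 10, while number_of_pages(result, 10) still reports
# the pages for all 10,000 matches.
```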

The second optimization protects you from some bad code that has
crept in lately. Some people started constructing a list of dicts in a
view method, instead of operating on the brains inside the template.
If you do this, you have so far instantiated every brain in the
resultset and potentially called some expensive methods like
toLocalizedTime on each of them. Template code usually did that work
only for the current batch. With the new optimization you get fewer
brains back, so such code isn't quite as expensive.

There are a number of obvious next steps we can take here to optimize
things further. For example:

1. If b_start is greater than zero, only return the brains from that
starting point on. We still need to sort, but you should only need to
get back b_size+orphan items and not more. This avoids the extra cost
of the dict-from-view-method-pattern.
2. If you request a batch from the second half of the resultset,
invert the sorting order and only sort len(resultset)-b_start items.
This would make the later batches more performant and the last one as
performant as the first batch.
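
Both ideas can be sketched together in plain Python. Again this is
only an illustration under assumptions: batch_from_sorted is a
hypothetical name, heapq stands in for the catalog's sorting, and the
sort keys are assumed to be unique so that reversing does not reorder
ties.

```python
import heapq


def batch_from_sorted(items, key, b_start, b_size):
    """Return only the items of the requested batch, in sort order."""
    n = len(items)
    if b_start >= n:
        return []
    if b_start < n / 2:
        # First half: sort only the first b_start + b_size items and
        # drop everything before the starting point.
        head = heapq.nsmallest(b_start + b_size, items, key=key)
        return head[b_start:]
    # Second half: invert the sort order and sort only the last
    # n - b_start items, then restore the original direction.
    tail = heapq.nlargest(n - b_start, items, key=key)
    tail.reverse()
    return tail[:b_size]


# 1,000 distinct values in scrambled order (7919 is coprime with 1000).
items = [(i * 7919) % 1000 for i in range(1000)]
last_batch = batch_from_sorted(items, key=lambda m: m, b_start=990, b_size=10)
# last_batch holds the ten largest values in ascending order and only
# ten items had to be sorted, the same amount as for the first batch.
```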

Cheers,
Hanno
