John Arbash Meinel | 3 Oct 18:57
Favicon

[MERGE] Make 'bzr co' more memory sensitive


At the moment, we've been bringing in some really nice apis, which allow
us to discuss large amounts of data at once. One case is that now "bzr
co" is able to pass the list of all files to checkout into
"iter_files_bytes()", which can then be spooled out in any order.

This is very nice for CPU time, as it doesn't require us to access the
same things over and over again. However, it turns out that it causes us
to unpack the entire WT into memory before we write it out.

This was the cause of:
https://bugs.edge.launchpad.net/bzr/+bug/277171
and
https://bugs.edge.launchpad.net/bzr/+bug/269456

There are several things to be done to combat this, but I'm starting
with this. In testing on a mysql tree, doing "bzr co" with current
bzr.dev uses around 700MB (the tree is 157MB). When I do the same with
my patch, it uses 200MB right away (loading all the relevant text
indexes), and then it stays stable at under 270MB.

It does slow down the time for "bzr co" by a little bit (at least as
long as you aren't hitting swap). My guess is that asking for
everything-at-once is able to order the reads from the pack repository
in more linear order.

The difference isn't huge, though:
real    1m42.557s
vs
real    2m6.050s

A better fix would be to rewrite _get_content_map functionality so that
it can be an iterator, with the ability to hold on to *some*
intermediate results that it will need to produce later results, but
also able to let go of other intermediate results when they are no
longer needed.

Also, I should mention that the code is buffering all of the
intermediate content as well, not just the final texts. Another small
issue is that we have to return the texts as ''.join(lines) rather than
just lines. However, this isn't a huge overhead, as it only doubles the
consumption for each file, and we don't hold on to those strings after
they are written out.

Just to give an outline of memory consumption...

1) get_parent_map(keys) reads all of the text indexes because it is
doing a very broad request of all files on disk. This ends with a peak
memory of 270MB and a resident of 230MB

2) _get_record_map(keys) reads in the raw bytes from disk. This is the
fulltext records plus the delta records. This is the *parsed* records,
and not the raw bytes from disk. So things have been uncompressed and
parsed into lines. At this point, we are at 565MB of RAM

3) _get_content_map(keys) converts the raw fulltexts and deltas into
in-memory lists of lines. Note, however, that it is good about sharing
the raw strings between the lists it creates. So probably most of the
extra memory consumed at this point is just in "list" objects. I'm not
100% sure, though. 733MB

With this patch, we still get (1), but because we request the content
one file at a time, neither (2) or (3) buffer much at a time.

I'll also note that with btree indexes (1) drops to 130MB instead of 230MB.

Also, because of the text sharing, I don't think that doing
"text_map.pop()" ends up saving as much memory as I hoped. The problem
is that it will usually have several texts in the chain, and we are only
removing the last one/two whatever.

I suppose my ideal structure would get the full list of keys that we
actually need to write to disk. It would then lookup in the indexes to
find the ancestry it needs to build those texts. Then sort by pack
location to get optimal read ordering. It would then start reading in
order, buffer what it needs to get to the next text, and then release
the buffers when the final text is extracted.

My updating "bzr pack" to sort the file texts, it can also help give us
upper bounds on the amount of memory we will consume. (Certainly without
disk ordering, the worst case is O(all_texts)).

John
=:->

# Bazaar merge directive format 2 (Bazaar 0.90)
# revision_id: john <at> arbash-meinel.com-20081003161439-h23zdckp4z78wh3r
# target_branch: http://bazaar-vcs.org/bzr/bzr.dev
# testament_sha1: 6ef81e276482005aadd491701ce9fbd46162f25d
# timestamp: 2008-10-03 11:57:11 -0500
# source_branch: http://bzr.arbash-meinel.com/branches/bzr/1.8-\
#   dev/lighter_iter_files_bytes
# base_revision_id: pqm <at> pqm.ubuntu.com-20081002172844-d6df1l8dzpsqzyup
# 
# Begin patch
=== modified file 'NEWS'
--- NEWS	2008-10-02 17:28:44 +0000
+++ NEWS	2008-10-03 16:14:39 +0000
@@ -83,6 +83,10 @@
       repository now preserves the repository format.
       (Andrew Bennetts, #269214)

+    * ``bzr co`` uses less memory. It used to unpack the entire WT into
+      memory before writing it to disk. This was a little bit faster, but
+      consumed lots of memory. (John Arbash Meinel, #269456)
+
     * ``bzr log`` now accepts a ``--change`` option.
       (Vincent Ladeuil, #248427)

=== modified file 'bzrlib/knit.py'
--- bzrlib/knit.py	2008-10-01 05:40:45 +0000
+++ bzrlib/knit.py	2008-10-03 16:14:39 +0000
@@ -1124,6 +1124,26 @@
             record_map[key] = record, record_details, digest, next
         return record_map

+    def _split_by_prefix(self, keys):
+        """For the given keys, split them up based on their prefix.
+
+        To keep memory pressure somewhat under control, split the
+        requests back into per-file-id requests, otherwise "bzr co"
+        extracts the full tree into memory before writing it to disk.
+        This should be revisited if _get_content_maps() can ever cross
+        file-id boundaries.
+
+        :param keys: An iterable of key tuples
+        :return: A dict of {prefix: [key_list]}
+        """
+        split_by_prefix = {}
+        for key in keys:
+            if len(key) == 1:
+                split_by_prefix.setdefault('', []).append(key)
+            else:
+                split_by_prefix.setdefault(key[0], []).append(key)
+        return split_by_prefix
+
     def get_record_stream(self, keys, ordering, include_delta_closure):
         """Get a stream of records for keys.

@@ -1223,11 +1243,18 @@
         if include_delta_closure:
             # XXX: get_content_maps performs its own index queries; allow state
             # to be passed in.
-            text_map, _ = self._get_content_maps(present_keys,
-                needed_from_fallback - absent_keys)
-            for key in present_keys:
-                yield FulltextContentFactory(key, global_map[key], None,
-                    ''.join(text_map[key]))
+            non_local_keys = needed_from_fallback - absent_keys
+            prefix_split_keys = self._split_by_prefix(present_keys)
+            prefix_split_non_local_keys = self._split_by_prefix(non_local_keys)
+            for prefix, keys in prefix_split_keys.iteritems():
+                non_local = prefix_split_non_local_keys.get(prefix, [])
+                non_local = set(non_local)
+                text_map, _ = self._get_content_maps(keys, non_local)
+                for key in keys:
+                    lines = text_map.pop(key)
+                    text = ''.join(lines)
+                    yield FulltextContentFactory(key, global_map[key], None,
+                                                 text)
         else:
             for source, keys in source_keys:
                 if source is parent_maps[0]:

# Begin bundle
IyBCYXphYXIgcmV2aXNpb24gYnVuZGxlIHY0CiMKQlpoOTFBWSZTWXZw4kYAA4T/gHRUQABZ9///
f/t/qv////pgCE77XV3OANI7mi6WtZBrUsVDCSQQm1TaNE9TTAEyMlA/VH6p6mnlM1PaU8oNAep6
NE0Gqepppqj20TRGkDQBtCNMmgMhoNABgINBw0MmmhpkaGmRkGRkaGQGJoyaAMmRiGEpogkyjNKe
0p+o9SNHpqj1PRp6pvVDQflR6QAAyaABw0MmmhpkaGmRkGRkaGQGJoyaAMmRiGEkgIGhNDTSU/El
PZT0eiNQHqNHqJ6h5TJoeoeo02hQViYEel2+ZsDmyP1VTOHf15H6NNnnbL5z/Lnp91x2J8fbKCK7
aQQkJI6uvN7LtTa2QSVsuzVbqpXob97YZmss5rJPfze0ZGjfEJAogAxty/z7+912Vfvr3r3HIZmB
m08l56I2mLIu42Pry1xzNRXNRQuWmWez+XDm16Ki8eTuXs/0yO9hH2qR0yUG5nhfq7mdNuy/4EHd
lksnmUiDzx9JCNrg0pInqNYoEbRAH/ZKlVgczXCZgS9Og15I3RTKCmyAp2LpKy3V6p42XuK7brbJ
hSkgpJ9pXlKoJTvddXVHGY4u2WoKOVGTwYYldoHbE16oqvHT+KENIQVn1n0AowBAmTUSl4b0wIxX
V3P4hY5W1Um0mPh11pD9SRg8TJnh9iiNz6CJB3XcnjJrtvkzSmy8RabYK+wLxX3LD56CRhANO8+9
GMvQEchi3pEGu671lrTnuKT+bqquYVzHxxnwQwi4Z6unrWRvR3zzWktm/NxD2e0XbUBxYNN7m87w
CKEVkHsjh4kTlAQwceMz5CEVcxBxyPEuLSuYBGc4KoH/nInZ1n3OsPdXj9etdLZDL4S2gDWF+yqZ
zbOA6Y8g8gNrI0qqzJH5E4ErUwh8Is7cIuNwyLfHv4SK7X2uVhmYk8B3dBEfzpfZdRCHlkwyRdcP
IvM8rxEZDLIYccv7DkrFTOQvCBiUC8HpGw5PeTGIOVhhG8RvJKRv1CJCH4UfUjfnSa5G0cIoiD05
7PCI0SGFZmt2IxQ84i1zZmFyoN2h+uweefGt94w/UV2GI2tpECC+95v2FCs36V6WFr26XDPbGUDG
E6lKlESyJtyhAaB2EJ6l0FpimKyo3ZwRqLdpfbmFDcIuvQxiYkS7aVji4AsMUWNEk4Y1DPgTj0Fp
llbTWjM/TERjhfVqu3ynKgxWuBfrMG0GoxNdxpbLf2qtsP5XXg3apT0iWBeHDBiQOtDahqbj0Ndc
7N9qbD2kB3jBWF52RahRgOQjNZ4/cbz54q8KxRIhmI3sV9f1fX6t/BcZD4evTbxRZQJHgREC9REZ
X65+OtfhdEvk+llbgfhMNzIe61Mp0NMSuHmvfBRxXWCsfG1x+OvNKfsqcPJSKmlqZ0zLB4ckFw6c
/j6Fyvq71XSFvHe3TT2ifezdxk0WlG+6fo+mK+ZvBP6tcGS+pjXs3z7vNT3h1SPnEklIa6YEEkEM
yi3vwSRYZuAS9vTBzHBW/USSY6/tO6JwPHaQLJH4nbRwpyMx7KHQj3GEiKCCfs74T5Axk8PydnDK
MdZV70i+4DNwfdEu+7a+fM4rga+FFmAzGMvHQVlWklXrbChRjNbcRuHMZyDnWnP8EcYd/l4sSl6g
MaeRaQOs2H+rrtbSXPiF6B/VM23dfA0NcjQ6bqOZDugRQ3oBvey0G8zSmch5Al8do8+7KeJs2kFg
fex3rSxi2Zgck0Mtsuk5fMMf3T9dxW7oY17mRb4qbIY7dt75P54rJnNCfCkkRNHoe47iJeTOE3kC
BEtr2kw/En1G1m6oUGFw+Y78siiMqhnG+Vucvl4avhNGtxF6SmOwkkVWkLRUD/c0RY6bdC18ItdW
OOEVBLJjF8a36PCpvhEVKYkmxqiVyzjyftK5WJnHo16BqVZj0rEYaMeTV0uh0+8Owh0nEzRxP4Hb
qmcjqLg5mCj7iezDmPLZDmQmwNN6c0SCJm3GWgLLpWwt1B/ldPaznUnx7dKFhqsEewR6R29U70o6
3sFjXGPBakKcm2ftmFO23onre3PeWEWyNZh0zVlnWIl4MTDVs5i8qFWtVKipf47IvpFhFTu5W96R
iXfIxWZpyRZftj1vPTKXRmgHD9n/vWrPEHFvt6RVlsw3i9piD2Gezhkwp2WNWI07jBh3BnMIoj7E
PRVKPPh3G8ZLD6XOkG/4/hDZsqQn33MyziBCw2xPRiuP+8Las3MAfzEN+rXURNayYUO1Ip9KCXN2
VtW1841NzXRlXsK179JkloeIHIQUZhAqBuqsqKhOHyUoAHlzcGXZJ3xp1ngV8fRo5qXlSoQMDcSZ
M5yT0f3BdOpbmWzZyu5kGGsO80gnxQG51T7ibINAGd/f6IPiHgyq8/Jwsg/LIUni8k+oizMmJ9ss
VlGrJjyNM80iO23CFPGU879mx6KFILxNSJ+NC2R4iH5o3Uhzo6JDj5bXs0GLt1MYOIxaEX6145CP
Z0G1G8N1ZAcWt1tEIEHDl6mVIijx5jkiLH6IFTmWx9DM1APTjA/Q/NwJnOmheTTbbyxwNd4SLy+z
ON15qoSYQwseIPyZh+7XOKTzTrD+WXSrCxFE6JppQSEkrD0WeO2/ZrbwW5TlsKjNJujW0zljkbtY
CuQ4wRfUHGK+cYMkbsUzVsYYCk85p+N8vYp2XvkoCX7aS4smz88qE63VZqkDpSHHIgJDjh5AIa39
zHRt9X1XZwQ4YDaZDlUZ0uaAw1BO1sQwhGA99cnQJVuRWYYvRZqeSCaI70eXicFf5PlYtME0KKSG
3IukN62IERvNkasUyFOP8hD/i1wJoCCE9MWDnaaPehw4ID1ck/6OEElDsRx/qIphYfQItr1WsaJl
jRWfDf/4u5IpwoSDs4cSMA==

Gmane