3 Oct 18:57
[MERGE] Make 'bzr co' more memory sensitive
From: John Arbash Meinel <john <at> arbash-meinel.com>
Subject: [MERGE] Make 'bzr co' more memory sensitive
Newsgroups: gmane.comp.version-control.bazaar-ng.general
Date: 2008-10-03 16:57:37 GMT
Subject: [MERGE] Make 'bzr co' more memory sensitive
Newsgroups: gmane.comp.version-control.bazaar-ng.general
Date: 2008-10-03 16:57:37 GMT
At the moment, we've been bringing in some really nice apis, which allow us to discuss large amounts of data at once. One case is that now "bzr co" is able to pass the list of all files to checkout into "iter_files_bytes()", which can then be spooled out in any order. This is very nice for CPU time, as it doesn't require us to access the same things over and over again. However, it turns out that it causes us to unpack the entire WT into memory before we write it out. This was the cause of: https://bugs.edge.launchpad.net/bzr/+bug/277171 and https://bugs.edge.launchpad.net/bzr/+bug/269456 There are several things to be done to combat this, but I'm starting with this. In testing on a mysql tree, doing "bzr co" with current bzr.dev uses around 700MB (the tree is 157MB). When I do the same with my patch, it uses 200MB right away (loading all the relevant text indexes), and then it stays stable at under 270MB. It does slow down the time for "bzr co" by a little bit (at least as long as you aren't hitting swap). My guess is that asking for everything-at-once is able to order the reads from the pack repository in more linear order. The difference isn't huge, though: real 1m42.557s vs real 2m6.050s A better fix would be to rewrite _get_content_map functionality so that it can be an iterator, with the ability to hold on to *some* intermediate results that it will need to produce later results, but also able to let go of other intermediate results when they are no longer needed. Also, I should mention that the code is buffering all of the intermediate content as well, not just the final texts. Another small issue is that we have to return the texts as ''.join(lines) rather than just lines. However, this isn't a huge overhead, as it only doubles the consumption for each file, and we don't hold on to those strings after they are written out. Just to give an outline of memory consumption... 1) get_parent_map(keys) reads all of the text indexes because it is doing a very broad request of all files on disk. This ends with a peak memory of 270MB and a resident of 230MB 2) _get_record_map(keys) reads in the raw bytes from disk. This is the fulltext records plus the delta records. This is the *parsed* records, and not the raw bytes from disk. So things have been uncompressed and parsed into lines. At this point, we are at 565MB of RAM 3) _get_content_map(keys) converts the raw fulltexts and deltas into in-memory lists of lines. Note, however, that it is good about sharing the raw strings between the lists it creates. So probably most of the extra memory consumed at this point is just in "list" objects. I'm not 100% sure, though. 733MB With this patch, we still get (1), but because we request the content one file at a time, neither (2) or (3) buffer much at a time. I'll also note that with btree indexes (1) drops to 130MB instead of 230MB. Also, because of the text sharing, I don't think that doing "text_map.pop()" ends up saving as much memory as I hoped. The problem is that it will usually have several texts in the chain, and we are only removing the last one/two whatever. I suppose my ideal structure would get the full list of keys that we actually need to write to disk. It would then lookup in the indexes to find the ancestry it needs to build those texts. Then sort by pack location to get optimal read ordering. It would then start reading in order, buffer what it needs to get to the next text, and then release the buffers when the final text is extracted. My updating "bzr pack" to sort the file texts, it can also help give us upper bounds on the amount of memory we will consume. (Certainly without disk ordering, the worst case is O(all_texts)). John =:->
# Bazaar merge directive format 2 (Bazaar 0.90) # revision_id: john <at> arbash-meinel.com-20081003161439-h23zdckp4z78wh3r # target_branch: http://bazaar-vcs.org/bzr/bzr.dev # testament_sha1: 6ef81e276482005aadd491701ce9fbd46162f25d # timestamp: 2008-10-03 11:57:11 -0500 # source_branch: http://bzr.arbash-meinel.com/branches/bzr/1.8-\ # dev/lighter_iter_files_bytes # base_revision_id: pqm <at> pqm.ubuntu.com-20081002172844-d6df1l8dzpsqzyup # # Begin patch === modified file 'NEWS' --- NEWS 2008-10-02 17:28:44 +0000 +++ NEWS 2008-10-03 16:14:39 +0000 @@ -83,6 +83,10 @@ repository now preserves the repository format. (Andrew Bennetts, #269214) + * ``bzr co`` uses less memory. It used to unpack the entire WT into + memory before writing it to disk. This was a little bit faster, but + consumed lots of memory. (John Arbash Meinel, #269456) + * ``bzr log`` now accepts a ``--change`` option. (Vincent Ladeuil, #248427) === modified file 'bzrlib/knit.py' --- bzrlib/knit.py 2008-10-01 05:40:45 +0000 +++ bzrlib/knit.py 2008-10-03 16:14:39 +0000 @@ -1124,6 +1124,26 @@ record_map[key] = record, record_details, digest, next return record_map + def _split_by_prefix(self, keys): + """For the given keys, split them up based on their prefix. + + To keep memory pressure somewhat under control, split the + requests back into per-file-id requests, otherwise "bzr co" + extracts the full tree into memory before writing it to disk. + This should be revisited if _get_content_maps() can ever cross + file-id boundaries. + + :param keys: An iterable of key tuples + :return: A dict of {prefix: [key_list]} + """ + split_by_prefix = {} + for key in keys: + if len(key) == 1: + split_by_prefix.setdefault('', []).append(key) + else: + split_by_prefix.setdefault(key[0], []).append(key) + return split_by_prefix + def get_record_stream(self, keys, ordering, include_delta_closure): """Get a stream of records for keys. @@ -1223,11 +1243,18 @@ if include_delta_closure: # XXX: get_content_maps performs its own index queries; allow state # to be passed in. - text_map, _ = self._get_content_maps(present_keys, - needed_from_fallback - absent_keys) - for key in present_keys: - yield FulltextContentFactory(key, global_map[key], None, - ''.join(text_map[key])) + non_local_keys = needed_from_fallback - absent_keys + prefix_split_keys = self._split_by_prefix(present_keys) + prefix_split_non_local_keys = self._split_by_prefix(non_local_keys) + for prefix, keys in prefix_split_keys.iteritems(): + non_local = prefix_split_non_local_keys.get(prefix, []) + non_local = set(non_local) + text_map, _ = self._get_content_maps(keys, non_local) + for key in keys: + lines = text_map.pop(key) + text = ''.join(lines) + yield FulltextContentFactory(key, global_map[key], None, + text) else: for source, keys in source_keys: if source is parent_maps[0]: # Begin bundle IyBCYXphYXIgcmV2aXNpb24gYnVuZGxlIHY0CiMKQlpoOTFBWSZTWXZw4kYAA4T/gHRUQABZ9/// f/t/qv////pgCE77XV3OANI7mi6WtZBrUsVDCSQQm1TaNE9TTAEyMlA/VH6p6mnlM1PaU8oNAep6 NE0Gqepppqj20TRGkDQBtCNMmgMhoNABgINBw0MmmhpkaGmRkGRkaGQGJoyaAMmRiGEpogkyjNKe 0p+o9SNHpqj1PRp6pvVDQflR6QAAyaABw0MmmhpkaGmRkGRkaGQGJoyaAMmRiGEkgIGhNDTSU/El PZT0eiNQHqNHqJ6h5TJoeoeo02hQViYEel2+ZsDmyP1VTOHf15H6NNnnbL5z/Lnp91x2J8fbKCK7 aQQkJI6uvN7LtTa2QSVsuzVbqpXob97YZmss5rJPfze0ZGjfEJAogAxty/z7+912Vfvr3r3HIZmB m08l56I2mLIu42Pry1xzNRXNRQuWmWez+XDm16Ki8eTuXs/0yO9hH2qR0yUG5nhfq7mdNuy/4EHd lksnmUiDzx9JCNrg0pInqNYoEbRAH/ZKlVgczXCZgS9Og15I3RTKCmyAp2LpKy3V6p42XuK7brbJ hSkgpJ9pXlKoJTvddXVHGY4u2WoKOVGTwYYldoHbE16oqvHT+KENIQVn1n0AowBAmTUSl4b0wIxX V3P4hY5W1Um0mPh11pD9SRg8TJnh9iiNz6CJB3XcnjJrtvkzSmy8RabYK+wLxX3LD56CRhANO8+9 GMvQEchi3pEGu671lrTnuKT+bqquYVzHxxnwQwi4Z6unrWRvR3zzWktm/NxD2e0XbUBxYNN7m87w CKEVkHsjh4kTlAQwceMz5CEVcxBxyPEuLSuYBGc4KoH/nInZ1n3OsPdXj9etdLZDL4S2gDWF+yqZ zbOA6Y8g8gNrI0qqzJH5E4ErUwh8Is7cIuNwyLfHv4SK7X2uVhmYk8B3dBEfzpfZdRCHlkwyRdcP IvM8rxEZDLIYccv7DkrFTOQvCBiUC8HpGw5PeTGIOVhhG8RvJKRv1CJCH4UfUjfnSa5G0cIoiD05 7PCI0SGFZmt2IxQ84i1zZmFyoN2h+uweefGt94w/UV2GI2tpECC+95v2FCs36V6WFr26XDPbGUDG E6lKlESyJtyhAaB2EJ6l0FpimKyo3ZwRqLdpfbmFDcIuvQxiYkS7aVji4AsMUWNEk4Y1DPgTj0Fp llbTWjM/TERjhfVqu3ynKgxWuBfrMG0GoxNdxpbLf2qtsP5XXg3apT0iWBeHDBiQOtDahqbj0Ndc 7N9qbD2kB3jBWF52RahRgOQjNZ4/cbz54q8KxRIhmI3sV9f1fX6t/BcZD4evTbxRZQJHgREC9REZ X65+OtfhdEvk+llbgfhMNzIe61Mp0NMSuHmvfBRxXWCsfG1x+OvNKfsqcPJSKmlqZ0zLB4ckFw6c /j6Fyvq71XSFvHe3TT2ifezdxk0WlG+6fo+mK+ZvBP6tcGS+pjXs3z7vNT3h1SPnEklIa6YEEkEM yi3vwSRYZuAS9vTBzHBW/USSY6/tO6JwPHaQLJH4nbRwpyMx7KHQj3GEiKCCfs74T5Axk8PydnDK MdZV70i+4DNwfdEu+7a+fM4rga+FFmAzGMvHQVlWklXrbChRjNbcRuHMZyDnWnP8EcYd/l4sSl6g MaeRaQOs2H+rrtbSXPiF6B/VM23dfA0NcjQ6bqOZDugRQ3oBvey0G8zSmch5Al8do8+7KeJs2kFg fex3rSxi2Zgck0Mtsuk5fMMf3T9dxW7oY17mRb4qbIY7dt75P54rJnNCfCkkRNHoe47iJeTOE3kC BEtr2kw/En1G1m6oUGFw+Y78siiMqhnG+Vucvl4avhNGtxF6SmOwkkVWkLRUD/c0RY6bdC18ItdW OOEVBLJjF8a36PCpvhEVKYkmxqiVyzjyftK5WJnHo16BqVZj0rEYaMeTV0uh0+8Owh0nEzRxP4Hb qmcjqLg5mCj7iezDmPLZDmQmwNN6c0SCJm3GWgLLpWwt1B/ldPaznUnx7dKFhqsEewR6R29U70o6 3sFjXGPBakKcm2ftmFO23onre3PeWEWyNZh0zVlnWIl4MTDVs5i8qFWtVKipf47IvpFhFTu5W96R iXfIxWZpyRZftj1vPTKXRmgHD9n/vWrPEHFvt6RVlsw3i9piD2Gezhkwp2WNWI07jBh3BnMIoj7E PRVKPPh3G8ZLD6XOkG/4/hDZsqQn33MyziBCw2xPRiuP+8Las3MAfzEN+rXURNayYUO1Ip9KCXN2 VtW1841NzXRlXsK179JkloeIHIQUZhAqBuqsqKhOHyUoAHlzcGXZJ3xp1ngV8fRo5qXlSoQMDcSZ M5yT0f3BdOpbmWzZyu5kGGsO80gnxQG51T7ibINAGd/f6IPiHgyq8/Jwsg/LIUni8k+oizMmJ9ss VlGrJjyNM80iO23CFPGU879mx6KFILxNSJ+NC2R4iH5o3Uhzo6JDj5bXs0GLt1MYOIxaEX6145CP Z0G1G8N1ZAcWt1tEIEHDl6mVIijx5jkiLH6IFTmWx9DM1APTjA/Q/NwJnOmheTTbbyxwNd4SLy+z ON15qoSYQwseIPyZh+7XOKTzTrD+WXSrCxFE6JppQSEkrD0WeO2/ZrbwW5TlsKjNJujW0zljkbtY CuQ4wRfUHGK+cYMkbsUzVsYYCk85p+N8vYp2XvkoCX7aS4smz88qE63VZqkDpSHHIgJDjh5AIa39 zHRt9X1XZwQ4YDaZDlUZ0uaAw1BO1sQwhGA99cnQJVuRWYYvRZqeSCaI70eXicFf5PlYtME0KKSG 3IukN62IERvNkasUyFOP8hD/i1wJoCCE9MWDnaaPehw4ID1ck/6OEElDsRx/qIphYfQItr1WsaJl jRWfDf/4u5IpwoSDs4cSMA==
RSS Feed