12 Nov 15:00
Multiple skiplist bugs found, patches attached
Bron Gondwana <brong <at> fastmail.fm>
2007-11-12 14:00:00 GMT
2007-11-12 14:00:00 GMT
On Mon, Nov 12, 2007 at 12:34:34AM +1100, Bron Gondwana wrote: > Anyway - here it is. A "recovery()" that copes if the logstart > parameter in the database header is wrong. No, I don't have a > clue how that happened unless lseek() lied. Maybe it sometimes > lies, I don't know. I'll be writing a test case for that soon > too! I have some more suspicions now, but I wrote it all up in the patch header, so here's the bugfixes only patch, a "robustness" extras patch and the tool I used for testing. Ken, I know you've done some other work on the file changing types. I'd like to be even more agressive and convert just about everything to bit32 and also rename some variables, but I restricted myself in this to only fixing the most ugly case: offset = htonl(offset). These patches are all against 2.3.10 (in this order), and may need some fuzz fixing to apply against your latest CVS thanks to those changes - sorry I haven't done that, but it's getting on 1am for me, and I've just finished doing a lot of testing and paring these down to simple and clear patches that don't touch more than they need to fix the issues. cyrus-skiplist-bugfixes-2.3.10.diff: Ken, please review this patch and consider it for pushing out in the next release, preferably soon. There really are a lot of issues I found reviewing the code, and even more so with the attached tool that can be used to "hammer" a file with all sorts of interesting requests. There are some really nasty skiplist corruptions available if even one process is ever killed or segfaults half way through a transaction and the first operation to touch said file after this is a write. cyrus-skiplist-robustify-2.3.10.diff: If you have been running skiplist on your systems and suspect that you may have corrupted databases, you probably want this. It adds extra robustness and fixupability to recovery(). It's still not going to be crash proof reading line-noise, but it detects and fixes all the corruptions I have personally seen. (I say this having tested it on all the bogus DBs I had, eg:) DBERROR: skiplist recovery /tmp/mb2/mailboxes.db.1194508562: -> 8E6DD8 should be ADD or DELETE became: skiplist recovery /tmp/mb2/mailboxes.db.1194508562: -> skipped 136 bytes of zeros at 8E6DD8 skiplist: recovered /tmp/mb2/mailboxes.db.1194508562 -> (44594 records, 9547108 bytes) in 1 second skiplist: checkpointed /tmp/mb2/mailboxes.db.1194508562 -> (44594 records, 5350556 bytes) in 1 second and: skiplist recovery /tmp/mailboxes.db.fail: -> 288C00 should be ADD or DELETE became: skiplist recovery /tmp/mailboxes.db.fail: -> incorrect logstart 288BD8 changed to 356F94 skiplist: recovered /tmp/mailboxes.db.fail -> (28811 records, 4109664 bytes) in 1 second In both cases a second run of the same command (I used cyr_dbtool 'show') came up clean - no issues remaining and no log entries. cyrus-hammer-skiplist-2.3.10.diff: I used this command to hammer a skiplist database like so: sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & ... I've turned down the "forget about this transaction" option to a lot less common than my original tests. It should still fire a couple of times per hammer, but it creates log entries even on the fully patched code (rolling back incomplete txn), so I didn't want to spam the logs. Enjoy, Bron.
---- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
RSS Feed