[sheepdog] [PATCH 0/4] do not read the node list from end recovery code in farm

Fri Jun 1 17:21:23 CEST 2012

With the current code base I can easily trigger a bug in farm with a
test case that creates a four-node cluster, shuts the cluster down,
restarts two of the original sheep, starts a new sheep, and then
restarts the other two original sheep.

The stack trace looks like:
#0 0x00007f1b526af3a5 in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f1b526b2b0b in __GI_abort () at abort.c:92
#2 0x0000000000425562 in strbuf_grow (sb=0x7ffffc3599b0,
    extra=18446744073709551615) at strbuf.c:54
#3 0x0000000000425934 in strbuf_add (sb=0x7ffffc3599b0, data=0x7ffffc3539b0,
    len=18446744073709551615) at strbuf.c:101
#4 0x000000000041f4be in snap_file_write (epoch=1, trunksha1=0x7ffffc359a60
    "\204\344Úž\307\006\v\234\354\025\233w*m\364\341A\353\367\226\377\177",
    outsha1=0x7ffffc359a40 "", user=0) at farm/snap.c:171
#5 0x0000000000420c42 in farm_end_recover (iocb=0x7ffffc359aa0) at farm/farm.c:543
#6 0x000000000041396a in do_recover_main (work=0x6266340) at recovery.c:415
#7 0x000000000040fd73 in bs_thread_request_done (fd=11, events=1, data=0x0) at work.c:159
#8 0x00000000004219b8 in event_loop (timeout=-1) at event.c:181
#9 0x00000000004049ce in main (argc=10, argv=0x7ffffc35b308) at sheep.c:285

This series removes the need to read the epoch file at all during the
recovery process.