[sheepdog-users] Sheep crash on node failure

Thu Jun 27 11:53:23 CEST 2013

On Thu, Jun 27, 2013 at 11:25:44AM +0200, Ing. Luca Lazzeroni - Trend Servizi Srl wrote:
> Hi,
> I'm testing for internal purposes sheepdog 0.6.0 on a cluster with 3 nodes.
> 2 nodes have identical configuration: 1 ssd disk for sheepdog cache and 2 x 7,2K sata drives for data storage assigned to cluster via MD feature.
> The other machine has 3x 7,2K sata drives, one for cache and the other 2 for datas.
> 
> Capacity of data drives is identical: 1TB each.
> Data drives are assigned to sheepdog via md feature.
> Cluster is formatted in 2 copies mode.
> 
> Everything works fine, except for one think; if a qemu vm is working on a node and I manually kill sheep on one of the others, after a few second the sheep daemon on the node with vm dies with this error:
> 
> Jun 27 11:09:10 [main] recover_object_main(625) done:76 count:14456, oid:33a95f0000623c
> Jun 27 11:09:10 [main] recover_object_main(625) done:77 count:14456, oid:ce73a800000030
> Jun 27 11:09:10 [gway 20930] gateway_forward_request(308) fail to write local e04cc600000001, No object found
> Jun 27 11:09:10 [oc_push 20909] push_cache_object(471) failed to push object No object found
> Jun 27 11:09:10 [oc_push 20909] do_push_object(841) PANIC: push failed but should never fail
> Jun 27 11:09:10 [oc_push 20909] crash_handler(180) sheep exits unexpectedly (Aborted).
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(833) sheep.c:182: crash_handler
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(847) /lib/x86_64-linux-gnu/libpthread.so.0(+0xfbcf) [0x7fc23fd0dbcf]
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(847) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x36) [0x7fc23f25b036]
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(847) /lib/x86_64-linux-gnu/libc.so.6(abort+0x147) [0x7fc23f25e697]
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(833) object_cache.c:841: do_push_object
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(833) work.c:243: worker_routine
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(847) /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8d) [0x7fc23fd05f8d]
> Jun 27 11:09:10 [oc_push 20909] sd_backtrace(847) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6c) [0x7fc23f31de1c]
> Jun 27 11:09:10 [oc_push 20909] __dump_stack_frames(743) cannot find gdb
> Jun 27 11:09:10 [oc_push 20909] __sd_dump_variable(693) cannot find gdb
> Jun 27 11:09:10 [main] crash_handler(487) sheep pid 20658 exited unexpectedly.
> 
> If I manually restart sheep on the crashed node, recovery starts again and I must wait until cluster fully recovered to start vm on that node.
> 
> Crash is repeatable. 
> 
> Note that without cache nothing wrong happens; the vm works without problems.
> 
> Any idea ? Wrong setup ?

The problem is already fixed in the latest git master. Could you please try
master branch and verify the fix?

Thanks,
Yuan