[sheepdog-users] Single Sheepdog Node Failure takes down cluster

Liu Yuan namei.unix at gmail.com
Sat Feb 22 04:39:43 CET 2014


Corosync can't support more than 10 nodes, and many users, including us, have
found it unstable for a long time. Use ZooKeeper for production.

Yuan
On 2014-2-22 at 6:07 AM, "Aydelott, Ryan M." <ryade at mcs.anl.gov> wrote:

> We are running a 20-node sheepdog cluster with ~50 VMs active during the
> test.
>
> 13.10 Ubuntu
> 2.3.3 corosync
> 0.8.0 sheep
>
> Spawning Sheepd: sheep -n -c corosync:172.21.5.0
> /meta,/var/lib/sheepdog/disc0,/var/lib/sheepdog/disc1,/var/lib/sheepdog/disc2,/var/lib/sheepdog/disc3,/var/lib/sheepdog/disc4,/var/lib/sheepdog/disc5,/var/lib/sheepdog/disc6,/var/lib/sheepdog/disc7,/var/lib/sheepdog/disc8,/var/lib/sheepdog/disc9,/var/lib/sheepdog/disc10,/var/lib/sheepdog/disc11,/var/lib/sheepdog/disc12,/var/lib/sheepdog/disc13
>
> The issue we are encountering is that when we power off a single node, a
> large group of sheepd instances (16 of 20 nodes) fail, causing the cluster
> to fail overall. The types of errors received across the cluster are:
>
> root@a1-p:/home/ryade# pdsh -w cs[141-160]-p 'grep EMERG /meta/sheep.log' | dshbak -c
> pdsh@a1-p: cs158-p: ssh exited with exit code 1
> pdsh@a1-p: cs159-p: ssh exited with exit code 1
> pdsh@a1-p: cs160-p: ssh exited with exit code 1
> pdsh@a1-p: cs157-p: ssh exited with exit code 1
> pdsh@a1-p: cs141-p: ssh exited with exit code 1
> ----------------
> cs145-p
> ----------------
> Feb 21 13:16:43  EMERG [main] cdrv_cpg_confchg(573) PANIC: Network
> partition is detected
> Feb 21 13:16:43  EMERG [main] crash_handler(267) sheep exits unexpectedly
> (Aborted).
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
> ----------------
> cs146-p
> ----------------
> Feb 21 13:17:03  EMERG [main] cdrv_cpg_confchg(573) PANIC: Network
> partition is detected
> Feb 21 13:17:03  EMERG [main] crash_handler(267) sheep exits unexpectedly
> (Aborted).
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:03  EMERG [main] sd_backtrace(817) :
> ----------------
> cs153-p
> ----------------
> Feb 21 13:17:13  EMERG [oc_push 9673] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:13  EMERG [oc_push 9673] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> Feb 21 13:17:13  EMERG [oc_push 9673] sd_backtrace(817) :
> ----------------
> cs156-p
> ----------------
> Feb 21 13:17:57  EMERG [oc_push 8616] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:57  EMERG [oc_push 8616] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> Feb 21 13:17:57  EMERG [oc_push 8616] sd_backtrace(817) :
> ----------------
> cs155-p
> ----------------
> Feb 21 13:17:16  EMERG [oc_push 1433] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:16  EMERG [oc_push 1433] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> Feb 21 13:17:16  EMERG [oc_push 1433] sd_backtrace(817) :
> ----------------
> cs149-p
> ----------------
> Feb 21 13:17:19  EMERG [oc_push 9777] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:19  EMERG [oc_push 9777] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> Feb 21 13:17:20  EMERG [oc_push 9777] sd_backtrace(817) :
> ----------------
> cs147-p
> ----------------
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> ----------------
> cs148-p
> ----------------
> Feb 21 13:17:07  EMERG [main] cdrv_cpg_confchg(573) PANIC: Network
> partition is detected
> Feb 21 13:17:07  EMERG [main] crash_handler(267) sheep exits unexpectedly
> (Aborted).
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> Feb 21 13:17:07  EMERG [main] sd_backtrace(817) :
> ----------------
> cs142-p
> ----------------
> Feb 21 13:17:21  EMERG [oc_push 31470] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:21  EMERG [oc_push 31470] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:21  EMERG [oc_push 31730] do_push_object(866) PANIC: push
> failed but should never fail
> ----------------
> cs150-p
> ----------------
> Feb 21 13:17:22  EMERG [oc_push 6518] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:22  EMERG [oc_push 6518] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> Feb 21 13:17:22  EMERG [oc_push 6518] sd_backtrace(817) :
> ----------------
> cs152-p
> ----------------
> Feb 21 13:17:10  EMERG [oc_push 17997] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:10  EMERG [oc_push 17997] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> Feb 21 13:17:10  EMERG [oc_push 17997] sd_backtrace(817) :
> ----------------
> cs144-p
> ----------------
> Feb 21 13:16:58  EMERG [main] cdrv_cpg_confchg(573) PANIC: Network
> partition is detected
> Feb 21 13:16:58  EMERG [main] crash_handler(267) sheep exits unexpectedly
> (Aborted).
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:59  EMERG [main] sd_backtrace(817) :
> ----------------
> cs151-p
> ----------------
> Feb 21 13:17:16  EMERG [oc_push 27774] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:16  EMERG [oc_push 27774] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> Feb 21 13:17:17  EMERG [oc_push 27774] sd_backtrace(817) :
> ----------------
> cs154-p
> ----------------
> Feb 21 13:17:18  EMERG [oc_push 22150] do_push_object(866) PANIC: push
> failed but should never fail
> Feb 21 13:17:18  EMERG [oc_push 22150] crash_handler(267) sheep exits
> unexpectedly (Aborted).
> Feb 21 13:17:18  EMERG [oc_push 22146] do_push_object(866) PANIC: push
> failed but should never fail
> ----------------
> cs143-p
> ----------------
> Feb 21 13:16:54  EMERG [main] cdrv_cpg_confchg(573) PANIC: Network
> partition is detected
> Feb 21 13:16:54  EMERG [main] crash_handler(267) sheep exits unexpectedly
> (Aborted).
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
> Feb 21 13:16:54  EMERG [main] sd_backtrace(817) :
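
To see at a glance which PANIC each node hit, the grouped output quoted above can be tallied with a small script. The following is only a minimal sketch, assuming the `dshbak -c` layout shown in the message (a line of dashes, the host name, another line of dashes, then that host's log lines) and assuming that long log entries may be wrapped onto a second line by the mail client. The function name `summarize_panics` and the `SAMPLE` constant are hypothetical; the sample lines are copied from the quoted output.

```python
import re
from collections import Counter, defaultdict

RULE_RE = re.compile(r"^-{4,}$")                                  # dshbak separator line
STAMP_RE = re.compile(r"^[A-Z][a-z]{2}\s+\d+\s+\d\d:\d\d:\d\d")   # e.g. "Feb 21 13:16:43"
PANIC_RE = re.compile(r"PANIC:\s*(.+)")                           # text after "PANIC:"


def summarize_panics(text):
    """Return {host: Counter({panic message: count})} from dshbak -c style output."""
    per_host = defaultdict(Counter)
    host = None
    state = "body"   # "body", "header_open" (after first rule), or "header_named"
    pending = None   # current log entry, possibly wrapped over several lines

    def flush():
        # Record the PANIC message of the finished entry, if it has one.
        nonlocal pending
        if pending is not None and host is not None:
            match = PANIC_RE.search(pending)
            if match:
                per_host[host][match.group(1).strip()] += 1
        pending = None

    for raw in text.splitlines():
        line = re.sub(r"^>\s?", "", raw).strip()   # drop mail quoting, if any
        if not line:
            continue
        if RULE_RE.match(line):
            flush()
            state = "header_open" if state == "body" else "body"
        elif state == "header_open":
            host, state = line, "header_named"
            per_host[host]                         # register host even with no PANIC
        elif STAMP_RE.match(line):
            flush()
            pending = line
        elif pending is not None:
            pending += " " + line                  # continuation of a wrapped entry
    flush()
    return dict(per_host)


# Excerpt copied from the quoted output (two of the affected nodes).
SAMPLE = """\
----------------
cs145-p
----------------
Feb 21 13:16:43  EMERG [main] cdrv_cpg_confchg(573) PANIC: Network
partition is detected
Feb 21 13:16:43  EMERG [main] crash_handler(267) sheep exits unexpectedly
(Aborted).
Feb 21 13:16:44  EMERG [main] sd_backtrace(817) :
----------------
cs153-p
----------------
Feb 21 13:17:13  EMERG [oc_push 9673] do_push_object(866) PANIC: push
failed but should never fail
Feb 21 13:17:13  EMERG [oc_push 9673] crash_handler(267) sheep exits
unexpectedly (Aborted).
"""

if __name__ == "__main__":
    for node, panics in summarize_panics(SAMPLE).items():
        for message, count in panics.items():
            print(f"{node}: {message} (x{count})")
```

Fed the full output, a tally like this would separate the nodes that aborted in `cdrv_cpg_confchg` (the corosync driver reporting a network partition) from those that aborted in `do_push_object`. The log lines quoted here do not show whether the two groups share a cause.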