[sheepdog-users] Crash of a node (maybe due to cache)

Valerio Pachera sirio81 at gmail.com
Thu Jun 6 11:53:08 CEST 2013


This happened on my production cluster.
One of the guests receives backup data; the backups start at 20:00.
Yesterday I stopped the cluster and increased the cache from 2000 to 20000.
Later that evening (around 23:00) I noticed the guest was not responding:
sheep was no longer running on that node.
This is what I see in the sheep.log of that node.

Jun 05 07:02:42 [gway 6379] wait_forward_request(176) poll timeout 1,
disks of some nodes or network is busy. Going to poll-wait again
Jun 05 07:02:42 [gway 6363] wait_forward_request(176) poll timeout 1,
disks of some nodes or network is busy. Going to poll-wait again
Jun 05 07:02:42 [gway 6369] wait_forward_request(176) poll timeout 1,
disks of some nodes or network is busy. Going to poll-wait again
Jun 05 07:02:42 [gway 6375] wait_forward_request(176) poll timeout 1,
disks of some nodes or network is busy. Going to poll-wait again
Jun 05 07:02:42 [gway 6412] wait_forward_request(176) poll timeout 1,
disks of some nodes or network is busy. Going to poll-wait again
Jun 05 17:14:44 [main] main(781) shutdown
Jun 05 17:15:12 [main] md_add_disk(161) /mnt/wd_WCAYUEP99298/obj, nr 1
Jun 05 17:15:12 [main] md_add_disk(161) /mnt/wd_WCAWZ1588874, nr 2
Jun 05 17:15:12 [main] send_join_request(1101) IPv4 ip:192.168.6.44 port:7000
Jun 05 17:15:12 [main] for_each_object_in_stale(403)
/mnt/wd_WCAYUEP99298/obj/.stale
Jun 05 17:15:12 [main] for_each_object_in_stale(403) /mnt/wd_WCAWZ1588874/.stale
Jun 05 17:15:12 [main] check_host_env(395) WARN: Allowed open files
1024 too small, suggested 1024000
Jun 05 17:15:12 [main] check_host_env(404) Allowed core file size 0,
suggested unlimited
Jun 05 17:15:12 [main] main(774) sheepdog daemon (version 0.6.0) started
Jun 05 17:15:12 [main] update_cluster_info(877) status = 1, epoch = 8,
finished: 0
Jun 05 17:15:21 [main] main(781) shutdown
Jun 05 17:16:19 [main] md_add_disk(161) /mnt/wd_WCAYUEP99298/obj, nr 1
Jun 05 17:16:19 [main] md_add_disk(161) /mnt/wd_WCAWZ1588874, nr 2
Jun 05 17:16:19 [main] send_join_request(1101) IPv4 ip:192.168.6.44 port:7000
Jun 05 17:16:19 [main] for_each_object_in_stale(403)
/mnt/wd_WCAYUEP99298/obj/.stale
Jun 05 17:16:19 [main] for_each_object_in_stale(403) /mnt/wd_WCAWZ1588874/.stale
Jun 05 17:16:19 [main] check_host_env(395) WARN: Allowed open files
1024 too small, suggested 1024000
Jun 05 17:16:19 [main] check_host_env(404) Allowed core file size 0,
suggested unlimited
Jun 05 17:16:19 [main] main(774) sheepdog daemon (version 0.6.0) started
Jun 05 17:16:19 [main] update_cluster_info(877) status = 1, epoch = 8,
finished: 0

Jun 05 20:00:08 [oc_push 28853] push_cache_object(471) failed to push
object Object is read-only
Jun 05 20:00:08 [oc_push 28853] do_push_object(841) PANIC: push failed
but should never fail
Jun 05 20:00:08 [oc_push 28853] crash_handler(180) sheep exits
unexpectedly (Aborted).
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(833) sheep.c:182: crash_handler
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(847)
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7f9ee33ef02f]
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(847)
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7f9ee29fb474]
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(847)
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7f9ee29fe6ef]
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(833) object_cache.c:841:
do_push_object
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(833) work.c:243: worker_routine
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(847)
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b4f) [0x7f9ee33e6b4f]
Jun 05 20:00:08 [oc_push 28853] sd_backtrace(847)
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6c) [0x7f9ee2aa3a7c]
Jun 05 20:00:08 [oc_push 28853] __dump_stack_frames(743) cannot find gdb
Jun 05 20:00:08 [oc_push 28853] __sd_dump_variable(693) cannot find gdb
Jun 05 20:00:08 [main] crash_handler(487) sheep pid 25024 exited unexpectedly.
Jun 05 23:19:45 [main] md_add_disk(161) /mnt/wd_WCAYUEP99298/obj, nr 1
Jun 05 23:19:45 [main] md_add_disk(161) /mnt/wd_WCAWZ1588874, nr 2
Jun 05 23:19:45 [main] send_join_request(1101) IPv4 ip:192.168.6.44 port:7000
Jun 05 23:19:45 [main] for_each_object_in_stale(403)
/mnt/wd_WCAYUEP99298/obj/.stale
Jun 05 23:19:45 [main] for_each_object_in_stale(403) /mnt/wd_WCAWZ1588874/.stale
Jun 05 23:19:45 [main] check_host_env(395) WARN: Allowed open files
1024 too small, suggested 1024000
Jun 05 23:19:45 [main] check_host_env(404) Allowed core file size 0,
suggested unlimited
Jun 05 23:19:45 [main] main(774) sheepdog daemon (version 0.6.0) started
Jun 05 23:19:45 [main] update_cluster_info(877) status = 1, epoch = 9,
finished: 0
Jun 05 23:22:35 [rw] get_object_sha1(488) fail to get sha1,
/mnt/wd_WCAYUEP99298/obj/.stale/00c8d128000274cb.8
Jun 05 23:22:35 [main] recover_object_main(625) done:1 count:208454,
oid:c8d128000274cb
Jun 05 23:22:35 [rw] get_object_sha1(488) fail to get sha1,
/mnt/wd_WCAYUEP99298/obj/.stale/00c8d128000382cc.8
Jun 05 23:22:35 [main] recover_object_main(625) done:2 count:208454,
oid:c8d128000382cc
Jun 05 23:22:35 [rw] get_object_sha1(488) fail to get sha1,
/mnt/wd_WCAYUEP99298/obj/.stale/00c8d12800001a85.8
Jun 05 23:22:35 [main] recover_object_main(625) done:3 count:208454,
oid:c8d12800001a85
Jun 05 23:22:35 [rw] get_object_sha1(488) fail to get sha1,
/mnt/wd_WCAYUEP99298/obj/.stale/00c8d1280000e12b.8
Jun 05 23:22:35 [main] recover_object_main(625) done:4 count:208454,
oid:c8d1280000e12b
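
In case it helps with reading a log like this one, here is a minimal
sketch in Python (not part of sheepdog) that joins the wrapped lines back
into single records and keeps only the ones that look like failures. It
assumes every record starts with a "Mon DD HH:MM:SS [" prefix as in the
log above; the names join_wrapped and find_events and the keyword list
are made up for illustration.

```python
import re

# Assumption: a record starts with "Mon DD HH:MM:SS [thread ...".
# Any other non-empty line is treated as a wrapped continuation.
RECORD_START = re.compile(r"^[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \[")


def join_wrapped(lines):
    """Merge wrapped continuation lines back into single log records."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line.strip()
    return records


def find_events(records,
                keywords=("PANIC", "failed to push", "exited unexpectedly")):
    """Return the records that contain any of the given keywords."""
    return [r for r in records if any(k in r for k in keywords)]
```

To use it, read sheep.log with readlines(), pass the result to
join_wrapped(), and print whatever find_events() returns.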

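Do you see anything useful? The crash at 20:00:08 coincides with the
start of the backups: push_cache_object fails with "Object is read-only"
and do_push_object then panics.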


