Hi,

on my production cluster I tried to kill one of the 3 nodes and restart sheep right after.
(Sheepdog daemon version 0.5.5_335_g25a93bf)

root@sheepdog004:~# collie node list
M   Id   Host:Port            V-Nodes   Zone
-    0   192.168.6.41:7000         85   688302272
-    1   192.168.6.42:7000         85   705079488
-    2   192.168.6.44:7000         21   738633920

root@sheepdog004:~# collie node info
Id      Size      Used      Use%
 0      1.6 TB    1.0 TB    64%
 1      1.6 TB    978 GB    57%
 2      2.1 TB    236 GB    10%
Total   5.4 TB    2.2 TB    41%

Total virtual image size    1.2 TB

root@sheepdog004:~# collie node kill 2
root@sheepdog004:~# sheep -w size=20000 /mnt/wd_WCAYUEP99298,/mnt/wd_WCAYUEP99298/obj,/mnt/wd_WCAWZ1588874

root@sheepdog004:~# collie node info
Id      Size      Used      Use%
 0      1.6 TB    1.0 TB    64%
 1      1.6 TB    978 GB    57%
 2      466 GB    72 MB      0%
Total   3.7 TB    2.0 TB    53%

root@sheepdog004:~# collie node md info
Id      Size      Use       Path
 0      422 GB    0.0 MB    /mnt/wd_WCAYUEP99298/obj
 1      1.6 TB    980 MB    /mnt/wd_WCAWZ1588874

root@sheepdog004:~# collie node recovery
Nodes In Recovery:
  Id   Host:Port            V-Nodes   Zone
   2   192.168.6.44:7000         21   738633920

sheep.log

May 22 08:54:32 [main] main(752) shutdown
May 22 08:54:38 [main] md_add_disk(164) /mnt/wd_WCAYUEP99298/obj, nr 1
May 22 08:54:38 [main] md_add_disk(164) /mnt/wd_WCAWZ1588874, nr 2
May 22 08:54:38 [main] send_join_request(1082) IPv4 ip:192.168.6.44 port:7000
May 22 08:54:38 [main] check_host_env(381) WARN: Allowed open files 1024 too small, suggested 1024000
May 22 08:54:38 [main] check_host_env(390) Allowed core file size 0, suggested unlimited
May 22 08:54:38 [main] main(745) sheepdog daemon (version 0.5.5_335_g25a93bf) started
May 22 08:54:38 [main] update_cluster_info(862) status = 1, epoch = 4, finished: 0
May 22 08:54:40 [rw 17255] recover_object_work(205) done:0 count:60534, oid:c8d1280002992d
May 22 08:54:42 [rw 17255] recover_object_work(205) done:1 count:60534, oid:c8d1280000081f
May 22 08:54:43 [rw 17255] recover_object_work(205) done:2 count:60534, oid:c8d1280003c3d0
...
May 22 08:54:49 [gway 17253] gateway_read_obj(60) local read 80c8be4d00000000 failed, No object found
May 22 08:54:49 [gway 17253] gateway_read_obj(60) local read 80e149bf00000000 failed, No object found
May 22 08:54:49 [rw 17255] recover_object_work(205) done:19 count:60534, oid:c8d12800018e38
...
May 22 08:55:16 [gway 17253] gateway_read_obj(60) local read 80c8be4d00000000 failed, No object found
May 22 08:55:16 [gway 17253] gateway_read_obj(60) local read 80e149bf00000000 failed, No object found
May 22 08:55:16 [rw 17255] recover_object_work(205) done:109 count:60534, oid:c8d1280000ff6b
...

What do you think? Is everything messed up?