On 05/22/2013 03:07 PM, Valerio Pachera wrote:
> Hi, on my production cluster I tried to kill one of the 3 nodes and
> restart sheep right after.
> (Sheepdog daemon version 0.5.5_335_g25a93bf)
>
> root@sheepdog004:~# collie node list
> M   Id   Host:Port           V-Nodes   Zone
> -   0    192.168.6.41:7000   85        688302272
> -   1    192.168.6.42:7000   85        705079488
> -   2    192.168.6.44:7000   21        738633920
>
> root@sheepdog004:~# collie node info
> Id      Size     Used     Use%
> 0       1.6 TB   1.0 TB   64%
> 1       1.6 TB   978 GB   57%
> 2       2.1 TB   236 GB   10%
> Total   5.4 TB   2.2 TB   41%
> Total virtual image size 1.2 TB
>
> root@sheepdog004:~# collie node kill 2
>
> root@sheepdog004:~# sheep -w size=20000
> /mnt/wd_WCAYUEP99298,/mnt/wd_WCAYUEP99298/obj,/mnt/wd_WCAWZ1588874
>
> root@sheepdog004:~# collie node info
> Id      Size     Used     Use%
> 0       1.6 TB   1.0 TB   64%
> 1       1.6 TB   978 GB   57%
> 2       466 GB   72 MB    0%
> Total   3.7 TB   2.0 TB   53%
>

Node 2 didn't show correct size, looks like a bug.

> root@sheepdog004:~# collie node md info
> Id   Size     Use      Path
> 0    422 GB   0.0 MB   /mnt/wd_WCAYUEP99298/obj
> 1    1.6 TB   980 MB   /mnt/wd_WCAWZ1588874
>
> root@sheepdog004:~# collie node recovery
> Nodes In Recovery:
> Id   Host:Port           V-Nodes   Zone
> 2    192.168.6.44:7000   21        738633920
>
> sheep.log
> May 22 08:54:32 [main] main(752) shutdown
> May 22 08:54:38 [main] md_add_disk(164) /mnt/wd_WCAYUEP99298/obj, nr 1
> May 22 08:54:38 [main] md_add_disk(164) /mnt/wd_WCAWZ1588874, nr 2
> May 22 08:54:38 [main] send_join_request(1082) IPv4 ip:192.168.6.44 port:7000
> May 22 08:54:38 [main] check_host_env(381) WARN: Allowed open files
> 1024 too small, suggested 1024000
> May 22 08:54:38 [main] check_host_env(390) Allowed core file size 0,
> suggested unlimited
> May 22 08:54:38 [main] main(745) sheepdog daemon (version
> 0.5.5_335_g25a93bf) started
> May 22 08:54:38 [main] update_cluster_info(862) status = 1, epoch = 4,
> finished: 0
> May 22 08:54:40 [rw 17255] recover_object_work(205) done:0
> count:60534, oid:c8d1280002992d
> May 22 08:54:42 [rw 17255] recover_object_work(205) done:1
> count:60534, oid:c8d1280000081f
> May 22 08:54:43 [rw 17255] recover_object_work(205) done:2
> count:60534, oid:c8d1280003c3d0
> ...
> May 22 08:54:49 [gway 17253] gateway_read_obj(60) local read
> 80c8be4d00000000 failed, No object found
> May 22 08:54:49 [gway 17253] gateway_read_obj(60) local read
> 80e149bf00000000 failed, No object found
> May 22 08:54:49 [rw 17255] recover_object_work(205) done:19
> count:60534, oid:c8d12800018e38
> ...
> May 22 08:55:16 [gway 17253] gateway_read_obj(60) local read
> 80c8be4d00000000 failed, No object found
> May 22 08:55:16 [gway 17253] gateway_read_obj(60) local read
> 80e149bf00000000 failed, No object found
> May 22 08:55:16 [rw 17255] recover_object_work(205) done:109
> count:60534, oid:c8d1280000ff6b
> ...
>
>
> What do you think?
> Is everything messed up?
>

Not yet. If you see "failed to recover object xxx", then the objects are lost.

Thanks,
Yuan
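
To keep an eye on this while recovery runs, something like the following could scan sheep.log. It is only a sketch: the progress pattern is taken from the recover_object_work lines quoted above (unwrapped onto one line, as they would appear in the log file), while the failure pattern assumes the message literally reads "failed to recover object" followed by the object id, which is an assumption and not a confirmed log format. The name summarize_recovery is a hypothetical helper, not part of sheepdog.

```python
import re

# Progress lines as quoted above, e.g.
#   recover_object_work(205) done:109 count:60534, oid:c8d1280000ff6b
PROGRESS_RE = re.compile(r"recover_object_work\(\d+\) done:(\d+)\s+count:(\d+)")

# ASSUMPTION: wording taken from the reply text; the real message may differ.
FAILED_RE = re.compile(r"failed to recover object\s+(\S+)", re.IGNORECASE)


def summarize_recovery(lines):
    """Return (done, count, failed_oids) from an iterable of sheep.log lines.

    done/count come from the last progress line seen (None if there is none);
    failed_oids lists every object id reported by a failure line.
    """
    done, count, failed = None, None, []
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m:
            done, count = int(m.group(1)), int(m.group(2))
            continue
        m = FAILED_RE.search(line)
        if m:
            failed.append(m.group(1))
    return done, count, failed


if __name__ == "__main__":
    sample = [
        "May 22 08:55:16 [gway 17253] gateway_read_obj(60) local read "
        "80c8be4d00000000 failed, No object found",
        "May 22 08:55:16 [rw 17255] recover_object_work(205) done:109 "
        "count:60534, oid:c8d1280000ff6b",
    ]
    done, count, failed = summarize_recovery(sample)
    print(f"recovered {done}/{count}, failed objects: {failed}")
```

The "local read ... failed, No object found" lines from the gateway are deliberately not counted as failures here; only lines matching the assumed "failed to recover object" wording are. To run it against a real log, pass an open file object in place of the sample list.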