On 07/20/2012 09:42 PM, Bastian Scholz wrote:
> Hi,
>
> yesterday I had a strange behavior of my sheepdog cluster.
> The number of copies is set to two.
>
> # collie node list
>    M   Id   Host:Port        V-Nodes        Zone
>    -    0   10.0.1.61:7000         0  1023475722
>    -    1   10.0.1.61:7001        64  1023475722
>    -    2   10.0.1.62:7000         0  1040252938
>    -    3   10.0.1.62:7001        64  1040252938
>    -    4   10.0.1.62:7002        64  1040252938
>    -    5   10.0.1.62:7003        64  1040252938
>    -    6   10.0.1.63:7000         0  1057030154
>    -    7   10.0.1.63:7001        64  1057030154
>    -    8   10.0.1.63:7002        64  1057030154
>    -    9   10.0.1.63:7003        64  1057030154
>
> I had to shut down 10.0.1.62; the other two servers started
> recovering immediately. While the sheep on 10.0.1.61 was
> still recovering, the failed node came back and its sheep
> daemons were started as well.
>
> At that moment the whole cluster seemed to hang: collie node
> info returned only a few lines and the virtual machines could
> not access their images.
> Two hours later the recovery finished, collie commands
> reacted normally again and I could start the virtual machines,
> but I discovered some strange behavior inside... The logfile
> of the gateway sheep on 10.0.1.63 shows a lot of the
> following errors:
>
> [..]
> Jul 20 15:10:25 [gateway 0] forward_write_obj_req(188) fail 2
> Jul 20 15:10:26 [gateway 2] forward_write_obj_req(188) fail 2
> Jul 20 15:10:26 [gateway 3] forward_write_obj_req(188) fail 2
> Jul 20 15:10:26 [gateway 1] forward_write_obj_req(188) fail 2
> [..]
>
> When I run a collie vdi check, most VDIs give an error
> message:
> [...]
> fix c956c0000022f success
> fix c956c00000230 success
> fix c956c00000231 success
> Failed to read, No object found
>
> Does anyone know whether this is a bug that has already been
> fixed since my older version (0.3.0_431_g2361852), or can
> explain what happened in this situation?
>
> Cheers Bastian
>

There is a fatal bug in the recovery code at g2361852. Please try
v0.4.0 or the latest master; as far as I can tell, no fatal problem
has been found there yet.

Thanks,
Yuan

--
thanks,
Yuan
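
(Editor's note: once every node is upgraded, it is worth repeating the
check for each image before trusting the VDIs again. A minimal sketch
using only the collie commands already mentioned in this thread; the
image names vm-disk-a and vm-disk-b are placeholders for the names
shown by "collie vdi list":)

  # re-run the repair pass on each image after the upgrade
  $ for v in vm-disk-a vm-disk-b; do
        collie vdi check "$v"
    done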