On 07/20/2012 09:42 PM, Bastian Scholz wrote:
> Hi,
>
> yesterday I had a strange behavior of my sheepdog cluster.
> The number of copies is set to two.
>
> # collie node list
>    M   Id   Host:Port        V-Nodes        Zone
>    -    0   10.0.1.61:7000         0  1023475722
>    -    1   10.0.1.61:7001        64  1023475722
>    -    2   10.0.1.62:7000         0  1040252938
>    -    3   10.0.1.62:7001        64  1040252938
>    -    4   10.0.1.62:7002        64  1040252938
>    -    5   10.0.1.62:7003        64  1040252938
>    -    6   10.0.1.63:7000         0  1057030154
>    -    7   10.0.1.63:7001        64  1057030154
>    -    8   10.0.1.63:7002        64  1057030154
>    -    9   10.0.1.63:7003        64  1057030154
>
> I had to shut down 10.0.1.62; the other two servers started
> recovering immediately. While the sheep on 10.0.1.61 was
> still recovering, the failed node came back and its sheep
> daemons were started as well.
>
> At that moment the whole cluster seemed to hang: collie node
> info returned only a few lines and the virtual machines could
> not access their images.
> Two hours later the recovery finished, collie commands
> reacted normally again and I could start the virtual machines,
> but I discovered some strange behavior inside... The logfile
> of the gateway sheep on 10.0.1.63 shows a lot of the
> following errors:
>
> [..]
> Jul 20 15:10:25 [gateway 0] forward_write_obj_req(188) fail 2
> Jul 20 15:10:26 [gateway 2] forward_write_obj_req(188) fail 2
> Jul 20 15:10:26 [gateway 3] forward_write_obj_req(188) fail 2
> Jul 20 15:10:26 [gateway 1] forward_write_obj_req(188) fail 2
> [..]
>
> When I run a collie vdi check, most VDIs give an error
> message:
> [...]
> fix c956c0000022f success
> fix c956c00000230 success
> fix c956c00000231 success
> Failed to read, No object found
>
> Does anyone know whether this is a bug that has already been
> fixed since my older version (0.3.0_431_g2361852), or can
> explain what happened in this situation?
>
> Cheers Bastian
>

There is a fatal bug in the recovery code at g2361852. Please try
v0.4.0 or the latest master; as far as I can tell, no fatal problem
has been found there yet.

Thanks,
Yuan

--
thanks,
Yuan
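
(Editor's note: once every node is upgraded, it is worth repeating the
check for each image before trusting the VDIs again. A minimal sketch
using only the collie commands already mentioned in this thread; the
image names vm-disk-a and vm-disk-b are placeholders for the names
shown by "collie vdi list":)

  # re-run the repair pass on each image after the upgrade
  $ for v in vm-disk-a vm-disk-b; do
        collie vdi check "$v"
    done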