[sheepdog-users] Node failure and data loss

Fri Jul 20 15:42:29 CEST 2012

Hi,

Yesterday my sheepdog cluster behaved strangely.
The number of copies is set to two.

# collie node list
M   Id   Host:Port         V-Nodes       Zone
-    0   10.0.1.61:7000      	 0 1023475722
-    1   10.0.1.61:7001      	64 1023475722
-    2   10.0.1.62:7000      	 0 1040252938
-    3   10.0.1.62:7001      	64 1040252938
-    4   10.0.1.62:7002      	64 1040252938
-    5   10.0.1.62:7003      	64 1040252938
-    6   10.0.1.63:7000      	 0 1057030154
-    7   10.0.1.63:7001      	64 1057030154
-    8   10.0.1.63:7002      	64 1057030154
-    9   10.0.1.63:7003      	64 1057030154
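
To make the layout easier to read, here is a minimal Python
sketch that sums the V-Nodes per zone from the listing above.
The function name vnodes_per_zone is only illustrative and is
not part of collie; it assumes the five-column format shown
here.

```python
from collections import defaultdict

# Copy of the "collie node list" output shown above.
NODE_LIST = """\
M   Id   Host:Port         V-Nodes       Zone
-    0   10.0.1.61:7000       0 1023475722
-    1   10.0.1.61:7001      64 1023475722
-    2   10.0.1.62:7000       0 1040252938
-    3   10.0.1.62:7001      64 1040252938
-    4   10.0.1.62:7002      64 1040252938
-    5   10.0.1.62:7003      64 1040252938
-    6   10.0.1.63:7000       0 1057030154
-    7   10.0.1.63:7001      64 1057030154
-    8   10.0.1.63:7002      64 1057030154
-    9   10.0.1.63:7003      64 1057030154
"""


def vnodes_per_zone(text):
    """Sum the V-Nodes column per zone (hypothetical helper)."""
    totals = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five fields: marker, id, host:port, v-nodes, zone.
        # The header row is skipped because its second field is not a number.
        if len(fields) != 5 or not fields[1].isdigit():
            continue
        totals[fields[4]] += int(fields[3])
    return dict(totals)


summary = vnodes_per_zone(NODE_LIST)
for zone, vnodes in sorted(summary.items()):
    print(zone, vnodes)
```

This gives 64 V-Nodes in the zone of 10.0.1.61 and 192 in each
of the other two zones.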

I had to shut down 10.0.1.62, and the other two servers
started recovering immediately. While the sheep on 10.0.1.61
was still recovering, the failed node came back and its
sheep daemons were started again.

At that moment the whole cluster seemed to hang: collie node
info returned only a few lines and the virtual machines
could not access their images.
Two hours later the recovery finished, the collie commands
responded normally again and I could start the virtual
machines, but I discovered some strange behavior inside
them. The logfile of the gateway sheep on 10.0.1.63 shows a
lot of the following errors:

[..]
Jul 20 15:10:25 [gateway 0] forward_write_obj_req(188) fail 2
Jul 20 15:10:26 [gateway 2] forward_write_obj_req(188) fail 2
Jul 20 15:10:26 [gateway 3] forward_write_obj_req(188) fail 2
Jul 20 15:10:26 [gateway 1] forward_write_obj_req(188) fail 2
[..]

When I run collie vdi check, most VDIs give an error
message:
[...]
fix c956c0000022f success
fix c956c00000230 success
fix c956c00000231 success
Failed to read, No object found

Does anyone know whether this is a bug that has already been
fixed since my older version (0.3.0_431_g2361852), or could
someone explain what happens in this situation?

Cheers,
Bastian