[sheepdog-users] Several difficulties with sheepdog (from 0.4.0-0+tek2b-10 deb package)

Thu Jul 26 19:59:21 CEST 2012

On 2012-07-26 18:53, Jens WEBER wrote:
> In case of a crash, like your network error, you have a problem
> if one node doesn't have a full copy. So 3 nodes must have 3
> copies. Or use redundant network links, so this situation can't
> happen. For me, collie cluster recover and collie cluster cleanup
> sometimes work after a kill or crash.

Ironically, this happened while I was testing the redundant
network links, and a strange switch firmware error killed the
whole network... ;-)

But even if this should almost never happen in a well-designed
network, I think it should be possible to recover from this
kind of corner-case error. The sheep detected that they had
too few living nodes and halted...

In theory they were in a valid state, because they all rejected
further write requests at the same time, but I can't reconnect
them after the network error...

On their own, they won't detect each other again...
collie cluster shutdown won't work, because there are too few hosts...
killing them invalidates the data...

So what I had expected is the following scenario:

If I lose more zones than the number of copies at the same time
-> Shit happens! That is very unlikely under normal
conditions.

But when I lose half or even more of the sheep at the same time,
I think it should be possible to fail into a recoverable state...

> The next step is to write a best-practice guide on how to set up
> a sheepdog cluster the right way. All help is welcome.

I am not very good at writing documentation, but I'll try
my best ;-)

To answer David's question about best practice for updates,
something like this should do it...

The update scenario depends on whether you need a running cluster
the whole time, or whether you can plan a complete shutdown for
some time.

If you need to run the cluster all the time, you have to kill
the sheep on one node, do the update and restart the sheep.
After this, wait for recovery to complete and proceed with
the next node. After finishing all nodes, run ''collie
cluster cleanup''; this removes objects that are no longer
needed on the nodes after successful recovery.
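
The rolling variant could be scripted roughly like this. It is
only a sketch, not a tested tool: how you stop, update and
restart the sheep on a node, and how you check that recovery
has finished, depends on your setup, so those steps are passed
in as callbacks. All function and parameter names here are made
up for illustration; only ''collie cluster cleanup'' comes from
the procedure above.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def collie_cluster_cleanup() -> None:
    """Run 'collie cluster cleanup' on the local node."""
    subprocess.run(["collie", "cluster", "cleanup"], check=True)


def rolling_update(
    nodes: Iterable[str],
    stop_sheep: Callable[[str], None],     # hypothetical: kill the sheep on a node
    update_node: Callable[[str], None],    # hypothetical: install the new package
    start_sheep: Callable[[str], None],    # hypothetical: restart the sheep
    recovery_done: Callable[[], bool],     # hypothetical: True once recovery finished
    run_cleanup: Callable[[], None] = collie_cluster_cleanup,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Update one node at a time and return the nodes in update order."""
    updated: List[str] = []
    for node in nodes:
        stop_sheep(node)
        update_node(node)
        start_sheep(node)
        # Never touch the next node before recovery has completed.
        while not recovery_done():
            sleep(poll_seconds)
        updated.append(node)
    # Only after all nodes are done: remove objects no longer needed.
    run_cleanup()
    return updated
```

The important part is the order: one node at a time, wait for
recovery after each restart, and run the cleanup only once at
the very end.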

If you have a time frame to shut down the cluster completely,
it may be faster to use ''collie cluster shutdown'' (shut down
all connected qemu instances first) to stop all sheep on all
nodes, which leaves the cluster in a clean state.
Then do the updates on all nodes and restart the sheep; the
cluster starts working again once all original inhabitants
are back alive on the farm.
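
As a rough sketch, the full-shutdown variant might look like
this. Again, the helper names and the callbacks for checking
the qemu guests, updating a node and starting the sheep are
hypothetical and stand in for whatever your environment uses;
only ''collie cluster shutdown'' is taken from the description
above.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence


def run_command(argv: Sequence[str]) -> None:
    """Run a command and raise if it exits with an error."""
    subprocess.run(list(argv), check=True)


def full_shutdown_update(
    nodes: Iterable[str],
    guests_stopped: Callable[[], bool],   # hypothetical: True if no qemu uses the cluster
    update_node: Callable[[str], None],   # hypothetical: install the new package
    start_sheep: Callable[[str], None],   # hypothetical: start the sheep on a node
    run: Callable[[Sequence[str]], None] = run_command,
) -> List[str]:
    """Shut the whole cluster down, update every node, then restart all sheep."""
    if not guests_stopped():
        raise RuntimeError("shut down all connected qemu instances first")
    # Stops all sheep on all nodes and leaves the cluster in a clean state.
    run(["collie", "cluster", "shutdown"])
    node_list = list(nodes)
    for node in node_list:
        update_node(node)
    # The cluster resumes once every original node has rejoined.
    for node in node_list:
        start_sheep(node)
    return node_list
```

Updating all nodes before starting any sheep keeps old and new
versions from running side by side in the same cluster.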

Cheers

Bastian