| On 26/07/2012 19:59, Bastian Scholz wrote:
> On 2012-07-26 18:53, Jens WEBER wrote:
>> In case of a crash, like your network error, you have a problem if
>> one node doesn't have a full copy. So 3 nodes must have 3 copies. Or
>> use redundant network links, so this situation can't happen. For me,
>> collie cluster recover and collie cluster cleanup sometimes work
>> after a kill or crash.
>
> Ironically, this happened while I was testing the redundant network
> lines, and a strange switch firmware error killed the whole
> network... ;-)
>
> But even if this should almost never happen in a well-designed
> network, I think it should be possible to recover from
> this kind of corner-case error. The sheep detect that they
> have too few living nodes and halt...
>
> Theoretically they are in a valid state, because they all reject
> further write requests at the same time, but I can't reconnect
> them after the network error...
>
> On their own, they won't detect each other again...
> collie cluster shutdown won't work, because of too few hosts...
> killing them invalidates the data...
>
> So, what I had expected is the following scenario:
>
> If I lose more zones than the number of copies at the same time
> -> Shit happens! That will be very unlikely under normal
> conditions.
>
> But when I lose half or even more of the sheep at the same time, I
> think it should be possible to fail to a recoverable state...
>
>> The next step is to write a best-practice guide on how to set up a
>> sheepdog cluster the right way. All help is welcome.
>
> I am not very good at writing documentation, but I'll try
> my best ;-)
>
> To answer David's question about best practice for updates, something
> like this should do it...
>
> The update scenario depends on whether you need a running cluster the
> whole time, or whether you can plan a complete shutdown for some time.
>
> If you need to run the cluster all the time, you have to kill
> the sheep on one node, make the update and restart the sheep.
> After this, wait for recovery to complete and proceed with
> the next node. After finishing with all nodes, run ''collie
> cluster cleanup''; this removes objects no longer needed on the
> nodes after successful recovery.
>
> If you have a time frame in which you can shut down the cluster
> completely, it may be faster to use ''collie cluster shutdown''
> (shut down all connected qemu instances first) to stop all sheep on
> all nodes, which leaves the cluster in a clean state.
> Then make the updates on all nodes and restart the sheep; the
> cluster starts working again once all original inhabitants are
> back alive on the farm.
>

Thanks, I've put a modified version of this in the wiki.

I'd also like to have more documentation (on the wiki and in the man
pages) on the meaning and implications of several parameters (what are
zones in the sheep command-line arguments? How are they related to the
mode (safe, quorum, unsafe) used when formatting the cluster? etc.).

I think these questions need to be clarified for newcomers, maybe with
some examples of failure scenarios up to the disaster (data loss; when
does this occur in each example config).

What do you think?

David

PS: I've CC'd this email to the dev list since I don't know how many
sheepdog developers are actually registered on this 'sheepdog-users'
list.

> Cheers
>
> Bastian

-- 
David DOUARD
LOGILAB
+33 1 45 32 03 12  david.douard at logilab.fr
+33 1 83 64 25 26  http://www.logilab.fr/id/david.douard
Training - http://www.logilab.fr/formations
Development - http://www.logilab.fr/services
Knowledge management - http://www.cubicweb.org/ |
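
The two update procedures Bastian describes can be sketched as plain control flow, which may help when turning them into a wiki page. This is only an illustration under assumptions: the function names and every callable passed in (stop_sheep, update_node, start_sheep, wait_for_recovery, cleanup, stop_guests, shutdown_cluster) are hypothetical placeholders, not part of sheepdog or collie. The only real commands involved are the ones named in the thread: the cleanup hook would run ''collie cluster cleanup'' and the shutdown_cluster hook would run ''collie cluster shutdown''.

```python
from typing import Callable, Sequence


def rolling_update(
    nodes: Sequence[str],
    stop_sheep: Callable[[str], None],
    update_node: Callable[[str], None],
    start_sheep: Callable[[str], None],
    wait_for_recovery: Callable[[], None],
    cleanup: Callable[[], None],
) -> None:
    """Update one node at a time while the cluster keeps running.

    All callables are hypothetical hooks supplied by the caller.
    """
    for node in nodes:
        stop_sheep(node)       # kill the sheep on this node only
        update_node(node)      # install the new version
        start_sheep(node)      # bring the sheep back
        wait_for_recovery()    # do not touch the next node before recovery ends
    # Only after every node is done: drop objects no longer needed
    # (the thread uses ''collie cluster cleanup'' for this step).
    cleanup()


def offline_update(
    nodes: Sequence[str],
    stop_guests: Callable[[], None],
    shutdown_cluster: Callable[[], None],
    update_node: Callable[[str], None],
    start_sheep: Callable[[str], None],
) -> None:
    """Update all nodes during a planned, complete shutdown.

    All callables are hypothetical hooks supplied by the caller.
    """
    stop_guests()              # shut down all connected qemu instances first
    shutdown_cluster()         # the thread uses ''collie cluster shutdown''
    for node in nodes:
        update_node(node)
    for node in nodes:
        start_sheep(node)      # cluster resumes once all original nodes are back
```

The point of the ordering in rolling_update is that recovery must finish before the next node is taken down, and cleanup runs exactly once at the very end. How each hook is implemented (ssh, init scripts, a package manager) is left open on purpose.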