[sheepdog-users] Several difficulties with sheepdog (from 0.4.0-0+tek2b-10 deb package)

David Douard david.douard at logilab.fr
Thu Jul 26 23:06:18 CEST 2012


On 26/07/2012 19:59, Bastian Scholz wrote:
> On 2012-07-26 18:53, Jens WEBER wrote:
>> In case of a crash, like your network error, you have a problem if
>> one node doesn't have a full copy. So 3 nodes must have 3 copies. Or
>> use redundant network links, so that situation can't happen. For me,
>> collie cluster recover and collie cluster cleanup sometimes work
>> after a kill/crash.
> 
> Ironically, this happened while testing the redundant network
> links, when a strange switch firmware error killed the whole
> network... ;-)
> 
> But even if this should almost never happen in a well-designed
> network, I think it should be possible to recover from
> this kind of corner-case error. The sheep detect that they
> have too few living nodes and halt...
> 
> In theory they are in a valid state, because they reject all
> further write requests at the same time, but I can't reconnect
> them after the network error...
> 
> On their own, they won't detect each other again...
> collie cluster shutdown won't work, because there are too few hosts...
> killing them invalidates the data...
> 
> So, what I would have expected is the following scenario:
> 
> If I lose more zones than the number of copies at the same time
> -> Shit happens! That will be very unlikely under normal
> conditions.
> 
> But when I lose half or even more of the sheep at the same time, I
> think it should be possible to fail into a recoverable state...
> 
>> Next step is to write best-practice-guide how to setup a sheepdog
>> cluster in the right way. All help is welcome.
> 
> I am not very good at writing documentation, but I'll try
> my best ;-)
> 
> To answer David's question about best practice for updates,
> something like this should do it...
> 
> The update scenario depends on whether you need a running cluster
> the whole time, or whether you can plan a complete shutdown for a while.
> 
> If you need to run the cluster all the time, you have to kill
> the sheep on one node, apply the update and restart the sheep.
> After this, wait for recovery to complete and proceed with
> the next node. After finishing with all nodes, run ''collie
> cluster cleanup''; this removes objects that are no longer needed
> on the nodes after a successful recovery.
> 
> If you have a time window to shut down the cluster completely, it
> may be faster to use ''collie cluster shutdown'' (shut down
> all connected qemu instances first) to stop all sheep on all
> nodes, which leaves the cluster in a clean state.
> Then apply the updates on all nodes and restart the sheep; the
> cluster starts working again once all original inhabitants are
> back alive on the farm.
> 

Thanks,

I've put a modified version of this in the wiki.
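
For reference, the two procedures quoted above can be sketched as a small
checklist generator. This is only an illustration, not a tested tool: it
prints steps and executes nothing. The two collie commands are taken from
the text above; the function names and every other step are placeholders
of mine that have to be adapted to the local setup (how sheep is killed
and restarted, how the package is updated, how recovery is monitored).

```python
# Sketch only: builds an ordered checklist, it does not run any command.
# "collie cluster cleanup" and "collie cluster shutdown" come from the
# procedure quoted above; all other steps are plain-language placeholders.


def rolling_update_plan(nodes):
    """Update one node at a time while the cluster keeps running."""
    steps = []
    for node in nodes:
        steps.append(f"[{node}] kill the sheep daemons")
        steps.append(f"[{node}] install the update")
        steps.append(f"[{node}] restart the sheep daemons")
        # Do not touch the next node before recovery has finished.
        steps.append(f"[{node}] wait until recovery has completed")
    # Only after every node is done: remove objects no longer needed.
    steps.append("collie cluster cleanup")
    return steps


def full_shutdown_plan(nodes):
    """Stop the whole cluster cleanly, update every node, start again."""
    steps = [
        "shut down all connected qemu instances",
        "collie cluster shutdown",
    ]
    for node in nodes:
        steps.append(f"[{node}] install the update")
    for node in nodes:
        steps.append(f"[{node}] restart the sheep daemons")
    # The cluster resumes only when all original nodes are back.
    steps.append("check that all original nodes have rejoined the cluster")
    return steps


if __name__ == "__main__":
    for step in rolling_update_plan(["node1", "node2", "node3"]):
        print(step)
```

The rolling variant is deliberately serial: the wait-for-recovery step
sits inside the loop so that a second node is never taken down while the
first one is still recovering.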

I'd also like to have more documentation (on the wiki and in the man
pages) on the meaning and implications of several parameters (what are
zones in the sheep command-line arguments? How do they relate to the mode
(safe, quorum, unsafe) used when formatting the cluster? etc.). I think
these questions need to be clarified for newcomers, maybe with some
examples of failure scenarios up to disaster (data loss; when does this
occur in each example config).
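
To make the kind of example I have in mind concrete, here is a toy model.
Everything in it is an assumption that the developers would need to
confirm or correct: the halt rule attached to each mode is only my
reading of the mode names, the model uses one node per zone, and it
assumes that every copy of an object is stored in a different zone. The
function names are made up for the illustration.

```python
# Toy model, NOT confirmed sheepdog behaviour. Assumed halt rules:
#   safe:   halt when live nodes < copies
#   quorum: halt when live nodes < copies // 2 + 1
#   unsafe: never halt


def cluster_halts(mode, copies, live_nodes):
    """Return True if the cluster would stop serving I/O (assumed rules)."""
    if mode == "safe":
        return live_nodes < copies
    if mode == "quorum":
        return live_nodes < copies // 2 + 1
    if mode == "unsafe":
        return False
    raise ValueError(f"unknown mode: {mode!r}")


def data_may_be_lost(copies, zones_lost):
    """Assuming one copy per zone, an object can lose all of its copies
    only if at least `copies` zones disappear at the same time."""
    return zones_lost >= copies


if __name__ == "__main__":
    copies, total_zones = 3, 4  # example config: one node per zone
    for lost in range(total_zones + 1):
        live = total_zones - lost
        states = ", ".join(
            f"{mode}={'halt' if cluster_halts(mode, copies, live) else 'run'}"
            for mode in ("safe", "quorum", "unsafe")
        )
        risk = "possible" if data_may_be_lost(copies, lost) else "no"
        print(f"zones lost={lost}: {states}, data loss: {risk}")
```

A table like the one this prints, once checked against the real
behaviour, is what I would like to see in the wiki for a few typical
configurations.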

What do you think?

David

PS: I've CC'ed this email to the dev list since I don't know how many
sheepdog developers are actually registered on this 'sheepdog-users' list.

> Cheers
> 
> Bastian
> 
> 
> 


-- 
--
David DOUARD		LOGILAB
+33 1 45 32 03 12	david.douard at logilab.fr
+33 1 83 64 25 26	http://www.logilab.fr/id/david.douard

Training - http://www.logilab.fr/formations
Development - http://www.logilab.fr/services
Knowledge management - http://www.cubicweb.org/