[sheepdog] [PATCH v3] sheep: remove master node

Kai Zhang kyle at zelin.io
Fri Jul 26 05:20:00 CEST 2013


On Jul 24, 2013, at 8:57 PM, MORITA Kazutaka <morita.kazutaka at gmail.com> wrote:

> At Wed, 24 Jul 2013 19:37:53 +0800,
> Kai Zhang wrote:
>> 
>>>> 
>>>> Sorry, my description is not correct.
>>>> What I mean is that the sheepdog cluster cannot recover by itself in this scenario.
>>>> And I'm a little disappointed with this.
>>>> Is there a possibility to solve this?
>>> 
>>> If the number of redundancy is 1, it is possible that only the node D
>>> has the latest data.  Then, it's not safe to start sheepdog
>>> automatically without the node D.
>>> 
>> 
>> If the number of concurrently lost sheep is larger than the number of replicas, data will surely be lost.
>> I think this is reasonable and acceptable, and we have no choice but to increase the number of replicas.
> 
> Note that filesystem doesn't expect that the underlying block storage
> shows old data.  Sheepdog is a block storage system.  Data consistency
> is definitely more important than availability.

This makes sense.

> 
>> 
>> If we have to start the cluster manually, this will sacrifice availability. 
> 
> This happens only when starting Sheepdog, right?  The availability of
> the running cluster is not sacrificed.  What you are insisting is that
> Sheepdog should start up as soon as possible.  We need manual
> operations to start sheep daemons, either way, so I don't think
> automating this scenario is important.

Actually, I think that in production, sheep daemons are usually monitored and restarted by another
daemon, e.g. "supervise".
However, the assumption of manual restarts is also reasonable.
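
For reference, a daemontools-style "run" script for supervise might look like the
following sketch; the store path and the zookeeper address are illustrative
assumptions, not taken from this thread:

```shell
#!/bin/sh
# Hypothetical supervise run script: keep sheep in the foreground (-f)
# so the supervisor can restart it when it exits.  The cluster driver
# address and store directory below are assumed for illustration.
exec /usr/sbin/sheep -f -c zookeeper:127.0.0.1:2181 /var/lib/sheepdog
```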

> 
>> 
>> Another scenario in my mind is: if we shut down all sheep and then restart them, we have to bring all of them back,
>> otherwise the cluster will not work.
> 
> Yes, that's the reason we have a command 'collie cluster recover
> force'.
> 
> What do you think the best behavior is?  It is very easy to change
> the current code so that Sheepdog doesn't wait for the node D, but
> note that Sheepdog cannot know whether the node D is just starting up
> or has gone away completely.  I think the choices we can take are:
> 
> 1. Start Sheepdog without waiting for the node D.
>     - This minimizes the boot time of Sheepdog.
>     - We have a risk to break the filesystem of the guest OSes.
>     - The node D may still be starting up and join later.
> 
> 2. Wait for the node D forever.
>     - This is safe from the point of view of data consistency.
>     - We have to run a manual recovery command if the node D is
>       broken.
>     - This delays the boot time of Sheepdog.
> 
> 3. Wait for the node D for a while.
>     - This is a compromise between 1 and 2.
> 
> 4. Any other ideas…?

I think the 2nd is the best choice if we think consistency is more important than availability
and we have to restart a sheep manually when it crashes.
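
The three startup choices above can be sketched as a small decision function;
this is a hypothetical illustration of the trade-off, not sheepdog's actual code,
and all names and the timeout value are invented:

```python
# Decide whether a restarting cluster may start serving, given which
# nodes from the previous epoch have rejoined.  Policies mirror the
# numbered choices in the thread; everything here is illustrative.

def can_start(prev_epoch_nodes, joined_nodes, policy, waited=0, timeout=60):
    missing = set(prev_epoch_nodes) - set(joined_nodes)
    if not missing:
        return True          # every node is back: always safe to start
    if policy == 1:          # start immediately; risks serving stale data
        return True
    if policy == 2:          # wait forever; consistency over availability
        return False
    if policy == 3:          # compromise: give stragglers `timeout` seconds
        return waited >= timeout
    raise ValueError("unknown policy")
```

With policy 2, the operator must run a manual recovery command (like
'collie cluster recover force') if the missing node is really gone.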

Thanks for your explanation.

>> 
>>> Sheepdog starts if all the nodes in the previous epoch are gathered -
>>> this is necessary to keep strong consistency which is required for
>>> block storage system.  We can relax this rule a bit (e.g. it is okay
>>> to start sheepdog in the above example if the number of redundancy is
>>> larger than one).  It's on my TODO items.
>>> 
>>> 
>> 
>> Based on my above statement, there is still risk of losing data.
>> We cannot avoid it totally.
> 
> The number of lost sheep is 1 (the node D), so it is safe if the
> number of replicas is larger than 1, no?
> 

It is safe for vdis whose number of replicas is larger than 1, but not safe for vdis whose number of 
replicas is just 1.
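
The per-vdi argument above can be written down directly; a minimal sketch,
where `vdi_copies` is a hypothetical map from vdi name to its replica count:

```python
# With nr_lost nodes gone (here, just the node D), a vdi can still be
# served safely only if its replica count exceeds the number of lost
# nodes.  Illustrative only; not sheepdog's actual check.

def safe_without_lost_nodes(vdi_copies, nr_lost):
    return all(copies > nr_lost for copies in vdi_copies.values())
```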

>> 
>> BTW, I think the number of redundancy should be bound to a specific vdi, but not a cluster.
> 
> Ah yes, you are right.
> 

Thanks,
Kyle



