[sheepdog] [PATCH v3] sheep: remove master node

MORITA Kazutaka morita.kazutaka at gmail.com
Wed Jul 24 14:57:05 CEST 2013


At Wed, 24 Jul 2013 19:37:53 +0800,
Kai Zhang wrote:
> 
> >> 
> >> Sorry, my description is not correct.
> >> What I mean is that the sheepdog cluster cannot recover by itself in this scenario.
> >> And I'm a little disappointed by this.
> >> Is there a way to solve this?
> > 
> > If the redundancy level is 1, it is possible that only the node D
> > has the latest data.  In that case, it's not safe to start sheepdog
> > automatically without the node D.
> > 
> 
> If the number of concurrently lost sheep is larger than the number of replicas, data will certainly be lost.
> I think that is reasonable and acceptable, and we have no choice but to increase the number of replicas.

Note that a filesystem doesn't expect the underlying block storage to
return old data; if a stale replica serves reads, the guest filesystem
may see a corrupted journal.  Sheepdog is a block storage system, so
data consistency is definitely more important than availability.

> 
> If we have to start the cluster manually, this will sacrifice availability.

This happens only when starting Sheepdog, right?  The availability of
the running cluster is not sacrificed.  What you are suggesting is
that Sheepdog should start up as soon as possible.  Either way, we
need manual operations to start the sheep daemons, so I don't think
automating this scenario is important.

> 
> Another scenario in my mind is: if we shut down all sheep and then restart them, we have to bring back all of them;
> otherwise the cluster will not work.

Yes, that's the reason we have a command 'collie cluster recover
force'.
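
For example, after restarting the nodes that are still available,
running

    $ collie cluster recover force

makes the cluster start serving with the nodes that are currently
alive instead of waiting for the missing ones.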

What do you think the best behavior is?  It is very easy to change
the current code so that Sheepdog doesn't wait for the node D, but
note that Sheepdog cannot know whether the node D is just starting up
or has gone away for good.  I think the choices we can take are:

 1. Start Sheepdog without waiting for the node D.
     - This minimizes the boot time of Sheepdog.
     - We risk breaking the filesystems of the guest OSes.
     - The node D may just be starting up and join later.

 2. Wait for the node D forever.
     - This is safe from the point of view of data consistency.
     - We have to run a manual recovery command if the node D is
       broken.
     - This delays the boot time of Sheepdog.

 3. Wait for the node D for a while.
     - This is a compromise between 1 and 2 (see the sketch after
       this list).

 4. Any other ideas...?
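
To make option 3 concrete, here is a minimal sketch of what the
startup path could look like.  All names in it (STARTUP_TIMEOUT,
all_epoch_nodes_joined, wait_for_previous_epoch) are invented for
illustration; this is not the actual sheep code:

    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define STARTUP_TIMEOUT 60  /* seconds to wait for missing nodes */

    /* Stub for illustration: the real daemon would check whether every
     * node recorded in the latest epoch has rejoined the cluster. */
    static bool all_epoch_nodes_joined(void)
    {
        return false;
    }

    /* Option 3: wait for the previous epoch's nodes, but only for a
     * bounded time, then fall back to manual recovery so that we never
     * serve possibly stale data automatically. */
    static bool wait_for_previous_epoch(void)
    {
        time_t deadline = time(NULL) + STARTUP_TIMEOUT;

        while (time(NULL) < deadline) {
            if (all_epoch_nodes_joined())
                return true;  /* safe to serve I/O */
            sleep(1);
        }

        fprintf(stderr, "nodes still missing after %d seconds; "
                "run 'collie cluster recover force' to start anyway\n",
                STARTUP_TIMEOUT);
        return false;
    }

    int main(void)
    {
        return wait_for_previous_epoch() ? 0 : 1;
    }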

> 
> > Sheepdog starts if all the nodes in the previous epoch are gathered -
> > this is necessary to keep the strong consistency which is required
> > for a block storage system.  We can relax this rule a bit (e.g. it
> > is okay to start sheepdog in the above example if the redundancy
> > level is larger than one).  It's on my TODO items.
> > 
> > 
> 
> Based on my statement above, there is still a risk of losing data.
> We cannot avoid it completely.

The number of lost sheep is 1 (the node D), so it is safe if the
number of replicas is larger than 1, no?
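
For illustration, the relaxed rule could be as simple as the following
check at startup (a rough sketch; the names are made up and this is
not the actual sheep code):

    #include <assert.h>
    #include <stdbool.h>

    /* Every object has nr_copies replicas, so it survives the loss of
     * up to nr_copies - 1 nodes.  Starting without the missing nodes
     * is safe as long as fewer than nr_copies of them are missing. */
    static bool safe_to_start(int nr_missing, int nr_copies)
    {
        return nr_missing < nr_copies;
    }

    int main(void)
    {
        assert(!safe_to_start(1, 1)); /* the node D holds the only copy */
        assert(safe_to_start(1, 2));  /* another replica is still alive */
        return 0;
    }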

> 
> BTW, I think the redundancy level should be bound to a specific vdi, not to the whole cluster.

Ah yes, you are right.

Thanks,

Kazutaka


