[sheepdog-users] is --nohalt dangerous?

Thu Jul 19 04:09:58 CEST 2012

On 07/19/2012 02:14 AM, Arnold Krille wrote:
> But you do get problems when you write to the last remaining node, that node 
> dies (non-recoverable) and you bring back the other nodes. These nodes don't 
> have a chance of knowing they have invalid data. Well they can know, because 
> they might be shut down uncleanly. But then the remaining nodes know that they 
> have invalid data, so what? You can't go on with that and have to bring in the 
> backup you don't have...

This is exactly why the halt behavior is the default. Without --nohalt, we
don't have this problem.

> For data consistency it would have been better if the cluster stopped writing 
> after more than half of the copies died. And thus forced the admins to fix the 
> nodes well before that even occurs.
> 
> Setting a copy-value of more than one probably meant something for the admin 
> regarding data-security. So it's safe to assume that he wants to protect 
> himself against the scenario of the last node dying with the last consistent 
> data on it.
> 
> So, please give sheepdog real quorum calculation when there are more than two 
> copies wanted.

Quorum still fails if a majority of the nodes go down at the same time
and are non-recoverable; in that case we lose the updates.

We actually have a stronger constraint: if nr_nodes < copies, we
halt the cluster. I think this is the safest choice.
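
To make the comparison concrete, here is a small sketch of the two rules.
It is only an illustration, not sheepdog code: the function names are made
up, and the "quorum" rule is assumed to mean a strict majority of the
configured copies.

```python
def should_halt_strict(nr_nodes: int, copies: int) -> bool:
    """Rule described above: halt as soon as nr_nodes < copies."""
    return nr_nodes < copies


def should_halt_quorum(nr_nodes: int, copies: int) -> bool:
    """Hypothetical majority rule: halt once fewer nodes remain than a
    strict majority of the configured copies (assumption, not sheepdog)."""
    majority = copies // 2 + 1
    return nr_nodes < majority


if __name__ == "__main__":
    copies = 3
    # Walk the cluster down from 3 live nodes to 1 and show both decisions.
    for alive in range(copies, 0, -1):
        print(alive,
              should_halt_strict(alive, copies),
              should_halt_quorum(alive, copies))
```

With copies = 3 and two nodes left, the strict rule already halts while the
majority rule keeps accepting writes, so the strict rule never leaves
updates on fewer nodes than the configured number of copies.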

Thanks,
Yuan