[sheepdog-users] is --nohalt dangerous?

Thu Jul 19 06:38:14 CEST 2012

> On 07/19/2012 02:14 AM, Arnold Krille wrote:
> > But you do get problems when you write to the last remaining node,
> > that node dies (non-recoverable) and you bring back the other nodes.
> > These nodes don't have a chance of knowing they have invalid data. Well
> > they can know, because they might be shut down uncleanly. But then the
> > remaining nodes know that they have invalid data, so what? You can't
> > go on with that and have to bring in the backup you don't have...
> 
> This is exactly why the halt behavior is the default. Without --nohalt, we
> don't have this problem.

But that simply stops all operations, and it does not tolerate any failure on
small systems (2 nodes, copies=2) or (3 nodes, copies=3).

> > For data consistency it would have been better if the cluster stopped
> > writing after more than half of the copies died. And thus forced the
> > admins to fix the nodes well before that even occurs.
> >
> > Setting a copy-value of more than one probably meant something for the
> > admin regarding data-security. So it's safe to assume that he wants to
> > protect himself against the scenario of the last node dying with the
> > last consistent data on it.
> >
> > So, please give sheepdog real quorum calculation when there are more
> > than two copies wanted.
> 
> Quorum will fail in the case where the majority of nodes go down at the same
> time and are non-recoverable; in this case, we lose the updates.

No, because writes/updates are not allowed when you do not have quorum.
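
To make the rule concrete, here is a minimal sketch, assuming quorum means a
strict majority of the configured copies. The function name is hypothetical and
is not part of sheepdog's code.

```python
def write_allowed(copies: int, reachable_copies: int) -> bool:
    """Allow a write only if a strict majority of the copies is reachable.

    Assumption: quorum = more than half of the configured copies.
    """
    return reachable_copies > copies // 2


# copies=3: one copy may be missing, two may not.
print(write_allowed(3, 2))  # True
print(write_allowed(3, 1))  # False
# copies=2: a strict majority of 2 is 2, so no copy may be missing.
print(write_allowed(2, 1))  # False
```

Under this assumption quorum only buys failure tolerance with three or more
copies, which matches the request above for "more than two copies".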

> We actually have a stronger constraint: if nr_nodes < copies, we halt
> the cluster. I think this is the safest choice.

That constraint is simply not acceptable on small systems.
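
A small sketch of why, using only the rule quoted above (halt if
nr_nodes < copies). The helper names are hypothetical, chosen for illustration.

```python
def cluster_halts(nr_nodes: int, copies: int) -> bool:
    """The quoted rule: halt as soon as fewer nodes than copies are alive."""
    return nr_nodes < copies


def tolerated_failures(total_nodes: int, copies: int) -> int:
    """How many nodes may fail before the quoted rule halts the cluster."""
    return max(total_nodes - copies, 0)


for total_nodes, copies in [(2, 2), (3, 3), (5, 3)]:
    print(total_nodes, copies, tolerated_failures(total_nodes, copies))
```

With (2 nodes, copies=2) and (3 nodes, copies=3) the result is 0: the first
node failure halts the cluster.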

- Dietmar