[Sheepdog] [PATCH v3 1/7] sheep: add transient failure detection

MORITA Kazutaka morita.kazutaka at gmail.com
Thu May 3 11:38:54 CEST 2012


At Thu, 3 May 2012 16:29:54 +0800,
HaiTing Yao wrote:
> 
> >
> > If qemu uses cache=writethrough, the I/O will be blocked.  Note that
> > the consistency level required for Sheepdog is quite different from
> > the one for Dynamo.
> >
> >
> 
> Yes, the I/O will be blocked without a cache, but the blocking is not a
> fatal problem.
> 
> With write-through, I can use the object cache to keep the hinted handoff
> on the node hosting the VM without blocking I/O at all. If the temporarily
> failed node comes back, the hinted handoff is then copied to it. This can
> be implemented within days.

After write-through requests are finished, the data must already be
replicated to the proper nodes.  Otherwise, when replicas become
inconsistent, sheepdog has to work out which object is the latest.  I
believe something like an asynchronous flush really does not fit our
block storage.

The simplest way is to update the node list and choose new target nodes
from it.
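
For example, something like the following (only a rough sketch; the
names oid_to_nodes(), NR_COPIES and the ring values are made up for
illustration, this is not the actual sheep code) picks the write
targets from whatever node list is current, so updating the list is
enough to redirect new writes:

  #include <stdint.h>
  #include <stdio.h>

  #define NR_COPIES 3

  struct node {
  	uint64_t id;	/* position on the hash ring (hash of the address) */
  	int alive;	/* cleared when the node leaves the membership */
  };

  /*
   * Walk clockwise from the object's hash and collect up to NR_COPIES
   * alive nodes.  'ring' must be sorted by id.
   */
  static int oid_to_nodes(uint64_t oid_hash, const struct node *ring,
  			int nr_nodes, const struct node **targets)
  {
  	int start = 0, i, found = 0;

  	while (start < nr_nodes && ring[start].id < oid_hash)
  		start++;

  	for (i = 0; i < nr_nodes && found < NR_COPIES; i++) {
  		const struct node *n = &ring[(start + i) % nr_nodes];

  		if (n->alive)
  			targets[found++] = n;
  	}
  	return found;
  }

  int main(void)
  {
  	struct node ring[] = {
  		{ 100, 1 }, { 200, 1 }, { 300, 0 /* failed */ }, { 400, 1 },
  	};
  	const struct node *targets[NR_COPIES];
  	int i, nr = oid_to_nodes(250, ring, 4, targets);

  	for (i = 0; i < nr; i++)
  		printf("copy %d -> node %llu\n", i,
  		       (unsigned long long)targets[i]->id);
  	return 0;
  }

When a node fails or comes back, only the ring contents change; calling
the same function again yields the new targets, and no extra node state
is needed.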

> 
> Some objects will lose one replica if the object is also part of a local
> request. Perhaps the loss of a replica is not fatal, because keeping the
> exact number of copies is difficult with our node management. If we choose
> a replacement node for the failed node to keep the exact number of copies,
> we cannot handle the replacement node failing again without a central node
> and versioned object data.
> 
> The multicast protocol of corosync cannot guarantee that no token is lost.
> Token loss usually leads to a network partition, after which the whole
> cluster cannot be used anymore. Tuning corosync cannot eliminate token
> loss, so sheepdog must face this problem.

Corosync-specific problems should be handled inside the corosync
cluster driver.

> 
> I can get rid of the I/O blocking, but first we must make it clear whether
> we need this kind of failure detection.

IMHO, I don't agree with introducing another node status like
NODE_STATUS_FAIL since it makes the code complicated.  It seems that
we can take a simpler approach.

> 
> 
> 
> > >
> > >
> > > >
> > > > So I think the main benefit of this patchset is to allow us to restart
> > > > sheep daemons without changing node membership, but why do you want to
> > > > avoid transient membership changes?  Sheepdog blocks write I/Os when it
> > > > cannot create full replicas, so basically we should remove failed nodes
> > > > from the node membership ASAP.
> > > >
> > >
> > > Restarting the daemon will lead to two rounds of data recovery. If we
> > > upgrade a cluster holding a lot of data, lazy repair is useful.
> >
> > It is definitely necessary to delay object recovery to avoid an extra
> > data copy on a transient failure.  However, it doesn't look like a good
> > idea to delay changing the node membership, which is used for deciding
> > object placement.
> >
> > Actually, Sheepdog already handles transient node failure gracefully.
> > For example,
> >
> >  Epoch  Nodes
> >     1  A, B, C, D
> >     2  A, B, C       <- node D fails temporarily
> >     3  A, B, C, D
> >
> > If object recovery doesn't run at epoch 2, no objects move between
> > nodes.  I know that handling a transient network partition is a
> > challenging problem with the current implementation, but I'd like to
> > see another approach which doesn't block I/Os for a long time.
> >
> 
> From my test, recovery has usually already started by the time epoch 3
> arrives.

If so, it is a bug and should be fixed with a correct approach.

> 
> 
> >
> > If it is confusing to show frequent node membership changes to users,
> > how about managing two node lists?  One node list is used internally
> > for consistent hashing, and the other one is shown to administrators
> > and doesn't change rapidly.
> >
> >
> I do not think frequent membership changes will confuse users much.
> I just want to avoid transient failures leading to a network partition
> and unnecessary data recovery.

I think the only problem is how to handle a transient network partition.
Currently, Sheepdog kills all the daemons which belong to the smaller
partition to keep strong consistency.  I guess we should kill them only
after a timeout, to ensure that the network partition is not transient.
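
To make the idea concrete, a rough sketch could look like the following
(PARTITION_GRACE_SEC, check_partition() and the exit path are made-up
names and values for illustration, not the existing sheep code):

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define PARTITION_GRACE_SEC 30		/* assumed grace period */

  static time_t partition_since;		/* 0 while the cluster is whole */

  /* called on every membership change notification */
  static void check_partition(bool in_smaller_partition, bool partition_healed)
  {
  	if (partition_healed) {
  		/* the partition was transient; keep the daemon alive */
  		partition_since = 0;
  		return;
  	}

  	if (!in_smaller_partition)
  		return;

  	if (partition_since == 0) {
  		/* start the grace period instead of exiting immediately */
  		partition_since = time(NULL);
  		return;
  	}

  	if (time(NULL) - partition_since >= PARTITION_GRACE_SEC) {
  		/* the partition persisted; leave the cluster to keep
  		 * strong consistency, as the current code does */
  		fprintf(stderr, "network partition persisted, exiting\n");
  		exit(1);
  	}
  }

  int main(void)
  {
  	/* simulate a smaller partition that heals before the timeout */
  	check_partition(true, false);	/* partition detected, timer starts */
  	check_partition(false, true);	/* partition healed, timer cleared */
  	return 0;
  }

If the membership heals within the grace period, nothing happens; only a
partition that persists makes the smaller side leave the cluster.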

Thanks,

Kazutaka


