[Sheepdog] [PATCH v3 1/7] sheep: add transient failure detection

HaiTing Yao yaohaiting.wujue at gmail.com
Fri May 4 09:55:18 CEST 2012


On Thu, May 3, 2012 at 5:38 PM, MORITA Kazutaka
<morita.kazutaka at gmail.com> wrote:

> At Thu, 3 May 2012 16:29:54 +0800,
> HaiTing Yao wrote:
> >
> > >
> > > If qemu uses cache=writethrough, the I/O will be blocked.  Note that
> > > the requested consistency level for Sheepdog is quite different from
> > > the one for Dynamo.
> > >
> > >
> >
> > Yes, the I/O will be blocked without a cache, but the blocking is not
> > the fatal problem.
> >
> > With write-through, I can use the object cache to keep the hinted
> > handoff on the VM-hosting node and not block I/O at all. If the
> > temporarily failed node comes back, the hinted handoff is copied to
> > the node. This can be accomplished within days.
>
> After write-through requests are finished, the data must be replicated
> to the proper replica nodes.  Otherwise, when there is inconsistency
> between replicas, sheepdog needs to find the latest object.  I believe
> something like an asynchronous flush really does not fit our block
> storage.
>

If I do this, I just use the cache to record the hinted handoff for the
failing node. There is no asynchronous flush; the data is written to the
normal nodes directly.

For example, take nodes A, B, C and D.

The VM is hosted on A.

Assume the object should be distributed to B, C and D, and B is failing.

Then the data is written to A, C and D, and the copy meant for B is kept
in A's object cache.
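
In rough code, the write path I have in mind looks like the sketch below.
All of the helpers and types (get_obj_nodes, node_is_alive, write_to_node,
object_cache_store, struct node_id) are made-up names standing in for
whatever sheep really provides; this only shows the idea, it is not a patch.

/*
 * Minimal sketch of the write path above, NOT the real sheep code.
 */
#include <stdint.h>
#include <stddef.h>

#define MAX_COPIES 8

struct node_id {
	uint32_t addr;
	uint16_t port;
};

/* consistent hashing picks the replica nodes (B, C, D in the example) */
extern int get_obj_nodes(uint64_t oid, struct node_id *nodes);
extern int node_is_alive(const struct node_id *node);
extern int write_to_node(const struct node_id *node, uint64_t oid,
			 const void *buf, size_t len, uint64_t offset);
/* stores a replica in the local object cache as a hinted handoff */
extern int object_cache_store(uint64_t oid, const void *buf,
			      size_t len, uint64_t offset);

static int write_with_hinted_handoff(uint64_t oid, const void *buf,
				     size_t len, uint64_t offset)
{
	struct node_id nodes[MAX_COPIES];
	int nr = get_obj_nodes(oid, nodes);

	for (int i = 0; i < nr; i++) {
		int ret;

		if (node_is_alive(&nodes[i]))
			/* normal case: C and D get their replicas directly */
			ret = write_to_node(&nodes[i], oid, buf, len, offset);
		else
			/* B is failing: keep its replica in A's object cache
			 * and push it to B when B comes back */
			ret = object_cache_store(oid, buf, len, offset);

		if (ret != 0)
			return ret;
	}
	return 0;
}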


>
> The simplest way is to update the node list and choose new target
> nodes from it.
>
> >
> > Some objects will lose one replica if the object is also a local
> > request. Perhaps the loss of a replica is not fatal, because keeping
> > strict copies is difficult for our node management. If we choose a
> > replacement node for the failed node to keep strict copies, we cannot
> > deal with the replacement node failing again without a central node
> > and versioning of object data.
> >
> > The multicast policy of corosync cannot guarantee that no token is
> > lost. Token loss usually leads to a network partition, and the whole
> > cluster cannot be used anymore. Tuning corosync cannot solve the token
> > loss problem, so sheepdog must face it.
>
> Corosync specific problems should be handled inside the corosync
> cluster driver.
>
> >
> > I can get rid of the I/O blocking, but first we must make it clear
> > whether we need this kind of failure detection.
>
> IMHO, I don't agree with introducing another node status like
> NODE_STATUS_FAIL since it makes the code complicated.  It seems that
> we can take a simpler approach.
>
> >
> >
> >
> > > >
> > > >
> > > > >
> > > > > So I think the main benefit of this patchset is to allow us to
> > > > > restart sheep daemons without changing node membership, but
> > > > > what's the reason you want to avoid temporary membership changes?
> > > > > Sheepdog blocks write I/Os when it cannot create full replicas,
> > > > > so basically we should remove the failed nodes from node
> > > > > membership ASAP.
> > > > >
> > > >
> > > > Restarting the daemon will lead to data recovery running twice. If
> > > > we upgrade a cluster with a lot of data, lazy repair is useful.
> > >
> > > It is definitely necessary to delay object recovery to avoid an extra
> > > data copy on a transient failure.  However, it doesn't look like a
> > > good idea to delay changing the node membership, which is used for
> > > deciding object placement.
> > >
> > > Actually, Sheepdog already handles transient node failure gracefully.
> > > For example,
> > >
> > >  Epoch  Nodes
> > >     1  A, B, C, D
> > >     2  A, B, C       <- node D fails temporarily
> > >     3  A, B, C, D
> > >
> > > If object recovery doesn't run at epoch 2, there is no object
> > > movement between nodes.  I know that handling a transient network
> > > partition is a challenging problem with the current implementation,
> > > but I'd like to
> > > see another approach which doesn't block I/Os for a long time.
> > >
> >
> > From my test, the recovery has usually already begun running by the
> > time epoch 3 comes.
>
> If so, it is a bug and should be fixed with a correct approach.
>
> >
> >
> > >
> > > If it is confusing to show frequent node membership changes to users,
> > > how about managing two node lists?  One node list is used internally
> > > for consistent hashing, and the other one is shown to administrators
> > > and doesn't change rapidly.
> > >
> > >
> > I do not think the frequent membership changes will confuse users
> > much. I just want to avoid transient failures leading to a network
> > partition and unnecessary data recovery.
>
> I think the problem is only how to handle transient network partition.
> Currently, Sheepdog kills all the daemons which belong to the smaller
> partition to keep strong consistency.  I guess we should kill them
> after timeout to ensure that the network partition is not transient.
>
Yes, I agree with you. The main problem is the network partition.

From my test with the corosync driver, a node may leave and rejoin within
seconds, and sometimes within several minutes. The rejoin does not send a
join request the way a daemon start does: when we start the daemon, the
node sends one join request, but when the node comes back from a network
partition, it sends no request. Currently we cannot handle a node joining
without sending a request; I tried to add this in my patch.

When a node leaves and rejoins quickly, we can either give it a status
such as NODE_STATUS_FAIL/NODE_STATUS_NORMAL, or we can kick it out and
welcome it back again.
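
Roughly, the two options compare like the sketch below. It reuses the
made-up cluster_node/NODE_STATUS types from the sketch above;
push_hinted_handoff, leave_handler and join_handler are placeholders for
whatever the group layer's real paths are called:

extern void push_hinted_handoff(struct cluster_node *node);
extern void leave_handler(struct cluster_node *node);
extern void join_handler(struct cluster_node *node);

/* Option 1: flag the node; epoch and object placement are untouched. */
static void transient_fail(struct cluster_node *node)
{
	node->status = NODE_STATUS_FAIL;	/* stop sending I/O to it */
}

static void transient_recover(struct cluster_node *node)
{
	node->status = NODE_STATUS_NORMAL;
	push_hinted_handoff(node);	/* catch it up from the object cache */
}

/* Option 2: treat it as a real leave followed by a real join; each
 * handler bumps the epoch and schedules a recovery run. */
static void kick_and_welcome(struct cluster_node *node)
{
	leave_handler(node);	/* epoch n   -> n+1, recovery #1 */
	join_handler(node);	/* epoch n+1 -> n+2, recovery #2 */
}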

IMHO, I prefer the former.

1. The latter leads to two epoch increments (epoch++) for the cluster.

2. There are two recovery tasks.

When the node joins again, we are not sure what the recovery work queue
(recovery_wqueue) is doing for the leave event that occurred a moment ago.
If the work queue is still fetching the object lists, the new recovery
must wait for that fetch to finish. Alternatively, we could check whether
a new recovery task has been queued while fetching the object lists.

As you say, driver problems should be controlled by the driver. Should we
add the temporary failure detection to the group layer or to the driver
layer?

Thanks
HaiTing



> Thanks,
>
> Kazutaka
>

