[Sheepdog] [PATCH v3 1/7] sheep: add transient failure detection
MORITA Kazutaka
morita.kazutaka at gmail.com
Thu May 3 11:38:54 CEST 2012
At Thu, 3 May 2012 16:29:54 +0800,
HaiTing Yao wrote:
>
> >
> > If qemu uses cache=writethrough, the I/O will be blocked. Note that
> > the requested consistency level for Sheepdog is quite different from
> > the one for Dynamo.
> >
> >
>
> Yes, the I/O will be blocked without the cache, but the blocking is not a
> fatal problem.
>
> With write-through, I can use the object cache to keep the hinted handoff
> on the VM-hosting node without blocking I/O at all. If the temporarily
> failed node comes back, the hinted handoff is then copied to that node.
> This can be accomplished within days.
After write-through requests are finished, the data must be replicated
to the proper set of nodes. Otherwise, when replicas become
inconsistent, Sheepdog needs to find the latest object. I believe
something like an asynchronous flush really does not fit our block
storage.
The simplest way is to update the node list and choose new target
nodes from it.
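To illustrate what I mean, here is a minimal sketch of choosing targets
from the updated node list with consistent hashing; choose_targets(),
struct node, and the assumption that the list is sorted by ring position
are all mine, not the actual sheep code:

#include <stdint.h>
#include <stddef.h>

struct node {
	uint32_t id;	/* position on the consistent hash ring */
};

/*
 * Pick the first 'copies' nodes that follow the object on the ring.
 * 'nodes' is the current node list, sorted by ring position.
 */
static size_t choose_targets(const struct node *nodes, size_t nr_nodes,
			     uint32_t oid_hash, size_t copies,
			     const struct node **targets)
{
	size_t start = 0, i, n = 0;

	/* Find the first node at or after the object's ring position. */
	for (i = 0; i < nr_nodes; i++)
		if (nodes[i].id >= oid_hash) {
			start = i;
			break;
		}

	/* Walk clockwise until we have enough replicas. */
	for (i = 0; i < nr_nodes && n < copies; i++)
		targets[n++] = &nodes[(start + i) % nr_nodes];

	return n;
}

When a node fails, rebuilding 'nodes' from the new membership and calling
this again yields the new targets; no per-object hinted state is needed.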
>
> Some objects will lose one replica if the object is also a local
> request. Perhaps the loss of a replica is not fatal, because keeping the
> exact number of copies is difficult for our node management. If we choose a
> replacement node for the failed node to keep the exact number of copies, we
> cannot deal with the replacement node failing again without a central node
> and versioned object data.
>
> The multicast policy of corosync cannot guarantee that no token is lost.
> Token loss usually leads to network partition, and then the whole cluster
> cannot be used anymore. Tuning corosync cannot solve the token loss
> problem, so sheepdog must face it.
Corosync-specific problems should be handled inside the corosync
cluster driver.
>
> I can get rid of the I/O blocking, but first we must make it clear whether
> we need this kind of failure detection.
IMHO, I don't agree with introducing another node status like
NODE_STATUS_FAIL since it makes the code complicated. It seems that
we can take a simpler approach.
>
>
>
> > >
> > >
> > > >
> > > > So I think the main benefit of this patchset is to allow us to restart
> > > > sheep daemons without changing node membership, but what's the reason
> > > > you want to avoid temporary membership changes? Sheepdog blocks write
> > > > I/Os when it cannot create full replicas, so basically we should
> > > > remove the failed nodes from node membership ASAP.
> > > >
> > >
> > > Restarting the daemon will lead to two rounds of data recovery. If we
> > > upgrade a cluster that holds a lot of data, lazy repair is useful.
> >
> > It is definitely necessary to delay object recovery to avoid an extra
> > data copy on a transient failure. However, it doesn't look like a good
> > idea to delay changing the node membership, which is used for deciding
> > object placement.
> >
> > Actually, Sheepdog already handles transient node failure gracefully.
> > For example,
> >
> >  Epoch   Nodes
> >    1     A, B, C, D
> >    2     A, B, C        <- node D fails temporarily
> >    3     A, B, C, D
> >
> > If object recovery doesn't run at epoch 2, there is no object movement
> > between nodes. I know that handling a transient network partition is a
> > challenging problem with the current implementation, but I'd like to
> > see another approach that doesn't block I/Os for a long time.
> >
>
> From my tests, recovery has usually already begun by the time epoch 3 arrives.
If so, it is a bug and should be fixed with a correct approach.
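Comparing the membership before and after the transient failure should be
enough to decide whether recovery has to run at all. Roughly (the struct
layout and helper name below are hypothetical, not the real epoch code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical epoch record: the node list that was active at that epoch. */
struct epoch_log {
	int nr_nodes;
	uint32_t node_ids[64];	/* ring positions of the member nodes */
};

/*
 * True when the membership at 'cur' is identical to the membership at
 * 'prev'.  If recovery never ran in between (epoch 2 above), object
 * placement has not changed and nothing needs to move between nodes.
 */
static bool same_membership(const struct epoch_log *prev,
			    const struct epoch_log *cur)
{
	if (prev->nr_nodes != cur->nr_nodes)
		return false;
	return memcmp(prev->node_ids, cur->node_ids,
		      sizeof(prev->node_ids[0]) * prev->nr_nodes) == 0;
}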
>
>
> >
> > If it is confusing to show frequent node membership changes to users,
> > how about managing two node lists? One node list is used internally
> > for consistent hashing, and the other one is shown to administrators
> > and doesn't change rapidly.
> >
> >
> I do not think the frequent membership changes will confuse users much.
> I just want to avoid transient failures leading to network partition and
> unnecessary data recovery.
I think the problem is only how to handle a transient network partition.
Currently, Sheepdog kills all the daemons that belong to the smaller
partition to keep strong consistency. I guess we should kill them only
after a timeout, to ensure that the network partition is not transient.
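Roughly what I have in mind; in_minority_partition() and add_timer() are
placeholders here, not the real cluster driver interface:

#include <stdbool.h>
#include <stdlib.h>

#define PARTITION_GRACE_PERIOD	30	/* seconds; an arbitrary example value */

/* Hypothetical helpers, not the actual sheep/corosync driver API. */
extern bool in_minority_partition(void);
extern void add_timer(int seconds, void (*handler)(void));

/* Called when the timer armed at partition detection expires. */
static void partition_timeout(void)
{
	/*
	 * Still partitioned after the grace period: exit to keep strong
	 * consistency.  A transient partition would have healed by now
	 * and we would simply keep running.
	 */
	if (in_minority_partition())
		exit(1);
}

/* Called from the membership-change path when a partition is detected. */
static void on_partition_detected(void)
{
	add_timer(PARTITION_GRACE_PERIOD, partition_timeout);
}

If the partition heals before the timer fires, the daemons keep running and
only the delayed object recovery remains to be done.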
Thanks,
Kazutaka