[Sheepdog] [PATCH v3 1/7] sheep: add transient failure detection

Thu May 3 10:29:54 CEST 2012

On Thu, May 3, 2012 at 3:35 PM, MORITA Kazutaka
<morita.kazutaka at gmail.com> wrote:

> At Thu, 3 May 2012 10:02:38 +0800,
> HaiTing Yao wrote:
> >
> > On Thu, May 3, 2012 at 3:37 AM, MORITA Kazutaka
> > <morita.kazutaka at gmail.com> wrote:
> >
> > > At Wed,  2 May 2012 15:12:49 +0800,
> > > yaohaiting.wujue at gmail.com wrote:
> > > >
> > > > From: HaiTing Yao <wujue.yht at taobao.com>
> > > >
> > > > Sometimes we need a node to be able to come back after a short while.
> > > >
> > > > We need this when:
> > > >
> > > > 1. the sheepdog daemon is restarted for an upgrade or another purpose
> > > >
> > > > 2. the corosync driver loses its token for a short while
> > >
> > > This is a corosync specific problem, and should be handled by changing
> > > parameters in corosync.conf, I think.
> > >
> >
> > For cluster storage, the storage system should deal with temporary node or
> > network failures. It cannot assume the cluster is always stable. Changing
> > corosync parameters cannot eliminate temporary node failures, for
> > protocol reasons. I am not sure whether zookeeper and other drivers have
> > the same problem, but zookeeper also has a timeout for when the zookeeper
> > server cannot communicate with a node, so I think it cannot avoid the
> > problem under some conditions either.
> >
> > I tried to implement a solution similar to Amazon Dynamo's for temporary
> > node or network failures. Perhaps I should keep the hinted handoff for the
> > failed node on the node hosting the VM, so I reused the object cache to keep
> > the hinted handoff. With the cache, the I/O will not be blocked.
>
> If qemu uses cache=writethrough, the I/O will be blocked.  Note that
> the requested consistency level for Sheepdog is quite different from
> the one for Dynamo.
>
>

Yes, the I/O will be blocked without the cache, but the blocking is not the
fatal problem.

With write-through, I can use the object cache to keep the hinted handoff
on the node hosting the VM and not block I/O at all. When the temporarily
failed node comes back, the hinted handoff is copied to it. This can be
accomplished within days.
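
To make the hinted handoff idea above concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions, not the code in
this patchset: the class name `HintedHandoffBuffer`, the `store` dictionary
standing in for per-node object storage, and the `node_is_up` flag are all
hypothetical.

```python
from collections import defaultdict


class HintedHandoffBuffer:
    """Holds writes destined for nodes that are temporarily unreachable."""

    def __init__(self):
        # node name -> {object id: latest data written while the node was away}
        self.hints = defaultdict(dict)

    def write(self, node, oid, data, node_is_up, store):
        """Write to the replica if it is up, otherwise park a hint locally."""
        if node_is_up:
            store[node][oid] = data
        else:
            self.hints[node][oid] = data

    def replay(self, node, store):
        """Copy every parked hint to the node once it has come back."""
        pending = self.hints.pop(node, {})
        store[node].update(pending)
        return len(pending)


# Hypothetical three-node cluster in which node C is briefly unreachable.
store = {"A": {}, "B": {}, "C": {}}
buf = HintedHandoffBuffer()
for node in ("A", "B", "C"):
    buf.write(node, "obj1", b"v1", node_is_up=(node != "C"), store=store)

# Node C comes back: hand the parked write over to it.
replayed = buf.replay("C", store)
```

The write to C does not block while C is away; it is simply replayed later,
which is the behaviour I am after.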

Some objects will lose one replica if the object is also a local
request. Perhaps the loss of a replica is not fatal, because keeping
strict copies is difficult for our node management. If we choose a
replacement node for the failed node to keep strict copies, we cannot
deal with the replacement node failing again without a central node and
versioned object data.

The multicast policy of corosync cannot guarantee that the token is never
lost. Token loss usually leads to a network partition, after which the whole
cluster cannot be used anymore. Tuning corosync cannot solve the token loss
problem, so sheepdog must face it.

I can get rid of the I/O blocking, but first we must make clear whether we
need this kind of failure detection.

> >
> >
> > >
> > > So I think the main benefit of this patchset is to allow us to restart
> > > sheep daemons without changing node membership, but what's the reason
> > > you want to avoid temporary membership changes?  Sheepdog blocks write
> > > I/Os when it cannot create full replicas, so basically we should
> > > remove the failed nodes from node membership ASAP.
> > >
> >
> > Restarting the daemon leads to two rounds of data recovery. If we
> > upgrade a cluster holding a lot of data, lazy repair is useful.
>
> It is definitely necessary to delay object recovery to avoid an extra
> data copy on transient failure.  However, it doesn't look like a good
> idea to delay changing the node membership, which is used for deciding
> object placement.
>
> Actually, Sheepdog already handles transient node failure gracefully.
> For example,
>
>  Epoch  Nodes
>     1  A, B, C, D
>     2  A, B, C       <- node D fails temporarily
>     3  A, B, C, D
>
> If object recovery doesn't run at epoch 2, no objects move between
> nodes.  I know that handling transient network partitions is a
> challenging problem with the current implementation, but I'd like to
> see another approach which doesn't block I/Os for a long time.
>
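The epoch table above can be illustrated with a simplified consistent-hashing
ring. This is a sketch under assumptions, not Sheepdog's actual placement
code: it uses SHA-1, one point per node (no virtual nodes), three copies, and
hypothetical object names `obj0` .. `obj99`.

```python
import hashlib


def _hash(key):
    # Stable 64-bit hash so results do not depend on Python's hash seed.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


def placement(oid, nodes, copies=3):
    """Return the nodes holding `oid`: the first `copies` nodes clockwise."""
    ring = sorted(nodes, key=_hash)
    # First node at or after the object's hash; wrap to index 0 if none.
    start = next((i for i, n in enumerate(ring) if _hash(n) >= _hash(oid)), 0)
    return [ring[(start + i) % len(ring)] for i in range(min(copies, len(ring)))]


epochs = {1: ["A", "B", "C", "D"], 2: ["A", "B", "C"], 3: ["A", "B", "C", "D"]}
objects = ["obj%d" % i for i in range(100)]

# Objects whose replica set differs between epoch 1 and epoch 3.
# Membership is identical, so nothing has to move if epoch 2 was skipped.
moved_1_to_3 = [o for o in objects
                if placement(o, epochs[1]) != placement(o, epochs[3])]

# Objects whose replica set differs between epoch 1 and epoch 2.
# These are the ones that would be copied if recovery ran at epoch 2.
moved_1_to_2 = [o for o in objects
                if set(placement(o, epochs[1])) != set(placement(o, epochs[2]))]
```

Here `moved_1_to_3` is empty because epochs 1 and 3 have the same members,
while `moved_1_to_2` lists the objects that an eager recovery at epoch 2
would have copied and then copied back.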

From my tests, recovery has usually already begun running by the time
epoch 3 arrives.

>
> If it is confusing to show frequent node membership changes to users,
> how about managing two node lists?  One node list is used internally
> for consistent hashing, and the other one is shown to administrators
> and doesn't change rapidly.
>
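The two-list idea could look roughly like the sketch below. All names here
(`NodeLists`, `grace`, `refresh`) are hypothetical and the 30-second grace
period is an arbitrary assumption; it only shows one list changing at once
and the other changing lazily.

```python
class NodeLists:
    """Keeps an internal node list and a slower-changing visible one."""

    def __init__(self, nodes, grace=30.0):
        self.internal = set(nodes)  # used for consistent hashing, changes at once
        self.visible = set(nodes)   # shown to administrators, changes lazily
        self.grace = grace          # seconds a node may be away before it is hidden
        self.left_at = {}           # node -> time it left the internal list

    def leave(self, node, now):
        """Remove the node from the internal list immediately."""
        self.internal.discard(node)
        self.left_at[node] = now

    def join(self, node, now):
        """Add the node to both lists and forget any pending removal."""
        self.internal.add(node)
        self.visible.add(node)
        self.left_at.pop(node, None)

    def refresh(self, now):
        """Drop nodes from the visible list once they have been gone too long."""
        for node, t in list(self.left_at.items()):
            if now - t >= self.grace:
                self.visible.discard(node)
                del self.left_at[node]


lists = NodeLists(["A", "B", "C", "D"])
lists.leave("D", now=0.0)     # transient failure of D
lists.refresh(now=10.0)       # still inside the grace period
visible_during_blip = sorted(lists.visible)
internal_during_blip = sorted(lists.internal)
lists.join("D", now=12.0)     # D is back before anyone noticed
```

During the short outage the internal list is already A, B, C while the
visible list still shows all four nodes.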
>
I do not think frequent membership changes will confuse users much. I just
want to avoid transient failures leading to network partitions and
unnecessary data recovery.

Thanks
Haiti

> Thanks,
>
> Kazutaka
>
> >
> > When we format the cluster, we can turn temporary failure detection
> > on or off. When it is on, lazy repair is available as an option in
> > place of eager repair.
> >
> > Thanks
> > Haiti
> >
> > >
> > > Thanks,
> > >
> > > Kazutaka
>