At Thu, 3 May 2012 10:02:38 +0800,
HaiTing Yao wrote:
>
> On Thu, May 3, 2012 at 3:37 AM, MORITA Kazutaka
> <morita.kazutaka at gmail.com> wrote:
>
> > At Wed, 2 May 2012 15:12:49 +0800,
> > yaohaiting.wujue at gmail.com wrote:
> > >
> > > From: HaiTing Yao <wujue.yht at taobao.com>
> > >
> > > Sometimes we need a node to be able to come back after a while.
> > >
> > > When we need this:
> > >
> > > 1, restart sheepdog daemon for upgrade or other purpose
> > >
> > > 2, the corosync driver loses its token for a short while
> >
> > This is a corosync specific problem, and should be handled by changing
> > parameters in corosync.conf, I think.
> >
>
> For cluster storage, the storage system should deal with temporary node
> or network failures. It cannot assume the cluster is always stable.
> Changing corosync parameters cannot eliminate temporary node failures,
> for protocol reasons. I am not sure whether zookeeper and other drivers
> have the same problem, but zookeeper also has a timeout after which the
> zookeeper server cannot communicate with the node. I think it also
> cannot avoid the problem under some conditions.
>
> I tried to implement a solution similar to Amazon Dynamo's for temporary
> node or network failures. Perhaps I should keep the hinted handoff of
> the failed node on the VM hosted node, so I reused the object cache to
> keep the hinted handoff. With the cache, the I/O will not be blocked.

If qemu uses cache=writethrough, the I/O will be blocked.  Note that
the consistency level required for Sheepdog is quite different from
the one for Dynamo.

>
> >
> > So I think the main benefit of this patchset is to allow us to restart
> > sheep daemons without changing node membership, but what's the reason
> > you want to avoid temporary membership changes?  Sheepdog blocks write
> > I/Os when it cannot create full replicas, so basically we should
> > remove the failed nodes from node membership ASAP.
> >
>
> Restarting the daemon will lead to two rounds of data recovery. If we
> upgrade a cluster with a lot of data, the lazy repair is useful.

It is definitely necessary to delay object recovery to avoid an extra
data copy on a transient failure.  However, it doesn't look like a
good idea to delay changing the node membership, which is used for
deciding object placement.

Actually, Sheepdog already handles transient node failure gracefully.
For example,

 Epoch  Nodes
     1  A, B, C, D
     2  A, B, C     <- node D fails temporarily
     3  A, B, C, D

If object recovery doesn't run at epoch 2, no objects move between
nodes.  I know that handling a transient network partition is a
challenging problem with the current implementation, but I'd like to
see another approach which doesn't block I/Os for a long time.

If it is confusing to show frequent node membership changes to users,
how about managing two node lists?  One node list is used internally
for consistent hashing, and the other one is shown to administrators
and doesn't change rapidly.

Thanks,

Kazutaka

>
> When we format the cluster, we can switch the temporary failure
> detection on or off. When it is on, lazy repair is available as an
> option instead of eager repair.
>
> Thanks
> Haiti
>
>
> >
> > Thanks,
> >
> > Kazutaka
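
As a rough illustration of the epoch example in the reply above, here is a minimal consistent-hashing sketch in Python. It is not Sheepdog's code: the ring layout, the helper names `build_ring` and `place`, the virtual-node count, and the three-copy setting are all assumptions made for the illustration. It only shows that placement depends on the current node list alone, so epoch 3 maps every object exactly as epoch 1 did, and nothing has to move if recovery was skipped at epoch 2.

```python
import hashlib
from bisect import bisect_right


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes=64):
    """Return a sorted list of (position, node) pairs (hypothetical layout)."""
    return sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))


def place(ring, oid, copies=3):
    """Walk clockwise from hash(oid) and pick the first `copies` distinct nodes."""
    points = [p for p, _ in ring]
    start = bisect_right(points, _hash(oid))
    chosen = []
    for k in range(len(ring)):
        node = ring[(start + k) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
            if len(chosen) == copies:
                break
    return chosen


# Node membership per epoch, as in the table in the mail.
epochs = {1: ["A", "B", "C", "D"], 2: ["A", "B", "C"], 3: ["A", "B", "C", "D"]}
rings = {e: build_ring(nodes) for e, nodes in epochs.items()}
objects = [f"obj-{i}" for i in range(1000)]
placement = {e: {o: place(rings[e], o) for o in objects} for e in epochs}

# Objects whose placement differs between epoch 1 and epoch 3.
moved_1_to_3 = [o for o in objects if placement[1][o] != placement[3][o]]
# Objects that had a replica on D and so would be copied if recovery ran at epoch 2.
affected_at_2 = [o for o in objects if "D" in placement[1][o]]

print("objects placed differently at epoch 3 vs epoch 1:", len(moved_1_to_3))
print("objects that eager recovery at epoch 2 would touch:", len(affected_at_2))
```

The second count is the extra copy work that eager recovery at epoch 2 would do and that becomes unnecessary once D rejoins at epoch 3.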