At Thu, 3 May 2012 16:29:54 +0800, HaiTing Yao wrote:
> > If qemu uses cache=writethrough, the I/O will be blocked. Note that
> > the requested consistency level for Sheepdog is quite different from
> > the one for Dynamo.
> >
>
> Yes, the I/O will be blocked without cache, but the block is not the
> fatal problem.
>
> With write-through, I can use the object cache to keep the hinted
> handoff on the VM's host node without blocking I/O at all. If the
> temporarily failed node comes back, then copy the hinted handoff to
> that node. This can be accomplished within days.

After write-through requests are finished, the data must be replicated
to the proper multiple nodes. Otherwise, when there is inconsistency
between replicas, sheepdog needs to find the latest object. I believe
something like asynchronous flush really does not fit our block
storage. The simplest way is to update the node list and choose new
target nodes from it.

> Some objects will lose one replica if the object is also a local
> request. Perhaps the loss of a replica is not fatal, because keeping
> strict copy counts is difficult for our node management. If we choose
> a replacement node for the failed node to keep the strict copy count,
> we cannot deal with the replacement node failing again without a
> central node and versioning of object data.
>
> The multicast policy of corosync cannot guarantee that no token is
> lost. Token loss usually leads to network partition, and the whole
> cluster cannot be used anymore. Tuning corosync cannot solve the
> token loss problem, so sheepdog must face this problem.

Corosync-specific problems should be handled inside the corosync
cluster driver.

> I can get rid of the I/O blocking, but first we must make it clear
> whether we need this kind of failure detection.

IMHO, I don't agree with introducing another node status like
NODE_STATUS_FAIL since it makes the code complicated. It seems that we
can take a simpler approach.

> > > > So I think the main benefit of this patchset is to allow us to
> > > > restart sheep daemons without changing node membership, but
> > > > what's the reason you want to avoid temporary membership
> > > > changes? Sheepdog blocks write I/Os when it cannot create full
> > > > replicas, so basically we should remove the failed nodes from
> > > > the node membership ASAP.
> > >
> > > Restarting the daemon will lead to two rounds of data recovery.
> > > If we upgrade a cluster with much data, the lazy repair is useful.
> >
> > It is definitely necessary to delay object recovery to avoid an
> > extra data copy against transient failure. However, it doesn't look
> > like a good idea to delay changing the node membership which is
> > used for deciding object placement.
> >
> > Actually, Sheepdog already handles transient node failure
> > gracefully. For example,
> >
> >   Epoch   Nodes
> >     1     A, B, C, D
> >     2     A, B, C        <- node D fails temporarily
> >     3     A, B, C, D
> >
> > If object recovery doesn't run at epoch 2, there is no object move
> > between nodes. I know that handling transient network partitions is
> > a challenging problem with the current implementation, but I'd like
> > to see another approach which doesn't block I/Os for a long time.
>
> From my test, the recovery has usually already begun running by the
> time epoch 3 comes.

If so, it is a bug and should be fixed with a correct approach.
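To sketch what I mean (illustrative C only -- none of the names below
are actual sheep code, they are made up for this example): when the
epoch changes, recovery can be skipped whenever the new membership is
identical to the one before the transient failure, because consistent
hashing then places every object exactly where it already is.

  /*
   * Illustrative sketch only -- not actual sheep code; the types and
   * helpers below are made up for this example.
   */
  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  struct node_id {
          uint8_t addr[16];
          uint16_t port;
  };

  struct node_list {
          int nr_nodes;
          struct node_id nodes[64];       /* kept sorted by node id */
  };

  /* hypothetical helpers assumed to exist elsewhere */
  extern const struct node_list *membership_at(uint32_t epoch);
  extern uint32_t last_fully_recovered_epoch(void);
  extern void start_recovery(uint32_t epoch);

  static bool same_membership(const struct node_list *a,
                              const struct node_list *b)
  {
          return a->nr_nodes == b->nr_nodes &&
                 memcmp(a->nodes, b->nodes,
                        a->nr_nodes * sizeof(a->nodes[0])) == 0;
  }

  /* called on every epoch change, e.g. when node D rejoins at epoch 3 */
  void on_epoch_change(uint32_t new_epoch)
  {
          const struct node_list *cur = membership_at(new_epoch);
          const struct node_list *old =
                  membership_at(last_fully_recovered_epoch());

          /*
           * Same members as at the last fully recovered epoch (epoch 1
           * in the example above) and nothing has moved in between, so
           * there is nothing to recover.
           */
          if (cur && old && same_membership(cur, old))
                  return;

          start_recovery(new_epoch);
  }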
> > If it is confusing to show frequent node membership changes to
> > users, how about managing two node lists? One node list is used
> > internally for consistent hashing, and the other one is shown to
> > administrators and doesn't change rapidly.
>
> I do not think the frequent membership changes will give users much
> confusion. I just want to avoid transient failures leading to network
> partition and unnecessary data recovery.

I think the problem is only how to handle transient network partition.
Currently, Sheepdog kills all the daemons which belong to the smaller
partition to keep strong consistency. I guess we should kill them after
a timeout to ensure that the network partition is not transient.

Thanks,

Kazutaka
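P.S. To make the timeout idea a bit more concrete, here is a rough
sketch (illustrative only; the helpers below do not exist in the
current code and the names are made up):

  /*
   * Illustrative sketch only -- instead of killing the daemons in the
   * smaller partition immediately, arm a timer and only leave the
   * cluster if the partition is still there when the timer fires.
   */
  #include <stdbool.h>
  #include <stddef.h>

  #define PARTITION_GRACE_SEC 30          /* example value, to be tuned */

  /* hypothetical helpers assumed to exist elsewhere */
  extern bool in_smaller_partition(void);
  extern void add_timer(void (*cb)(void *), void *data, unsigned int sec);
  extern void leave_cluster(void);

  static void partition_grace_expired(void *data)
  {
          (void)data;

          /* The partition healed in the meantime, so it was transient. */
          if (!in_smaller_partition())
                  return;

          /*
           * Still partitioned after the grace period; give up as we do
           * today to keep strong consistency.
           */
          leave_cluster();
  }

  /* called when a membership change puts this daemon in the smaller
   * partition */
  void on_smaller_partition(void)
  {
          add_timer(partition_grace_expired, NULL, PARTITION_GRACE_SEC);
  }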