At Wed, 6 Jun 2012 18:38:55 -0400,
Christoph Hellwig wrote:
>
> I'm trying to understand the use case for the leave_list and all code
> associated with it.
>
> From my reading the intention is to allow a cluster to start as long
> as all the original nodes tried to join the cluster.  What makes an
> original node that tried to join the cluster but failed special over
> one that never tried to join?  It's not going to help us with getting
> copies from it without a manual recover at least.

How can we know whether an added node will join or fail without trying
to add it to the cluster?  The idea behind waiting for all the original
nodes is that we need to ensure that no other node has the latest data.

I'm using the script below to test master transfer.  Is it possible to
pass the test without leave_list?  If so, that's great, but I think it
would be difficult.

====
#!/bin/bash

set -ex

DRIVER=${DRIVER:-local}

for i in 0 1; do
    sheep/sheep /store/$i -z $i -p 700$i -c $DRIVER
    sleep 1
done

# start Sheepdog with two nodes
collie/collie cluster format -c 2

for i in 2 3 4; do
    # add one node after killing one existing node
    pkill -f "sheep /store/$((i - 2))"
    sleep 1
    sheep/sheep /store/$i -z $i -p 700$i -c $DRIVER
    sleep 1
done

# kill all existing nodes
for i in 3 4; do
    pkill -f "sheep /store/$i"
    sleep 1
done

for i in 0 1 2 3 4; do
    sheep/sheep /store/$i -z $i -p 700$i -c $DRIVER
    sleep 1
done

echo check whether Sheepdog is running with only one node
collie/collie cluster info -p 7004

# add the other nodes
for i in 0 1 2 3; do
    sheep/sheep /store/$i -z $i -p 700$i -c $DRIVER
    sleep 1
done

echo check whether all nodes have the same cluster info
for i in 0 1 2 3 4; do
    collie/collie cluster info -p 700$i
done
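
To make the start condition discussed above concrete, here is a minimal sketch of the check as described in this thread: the cluster may start only once every original node is accounted for, either because it joined or because it tried to join and failed (and so sits on the leave_list). This is only an illustration, not Sheepdog's actual code; the function name `can_start`, its arguments, and the node names are all hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch, NOT Sheepdog's implementation.
#
# can_start "<original nodes>" "<joined nodes>" "<leave_list nodes>"
#
# Prints "yes" when every original node has either joined or has tried to
# join and failed (i.e. is on the leave_list); prints "no" otherwise.
can_start() {
    local original=$1 joined=$2 left=$3
    local node

    for node in $original; do
        case " $joined $left " in
            *" $node "*)
                # node is accounted for: it joined, or it tried and failed
                ;;
            *)
                # node has not been seen yet, so it may hold newer data
                echo no
                return 0
                ;;
        esac
    done

    echo yes
}
```

For example, `can_start "a b c" "a" "b c"` prints `yes`, while `can_start "a b c" "a" "b"` prints `no`, because node `c` has not been seen yet and might still have the latest data.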