[sheepdog] [PATCH] sheep: fix a epoch mismatch bug

Mon May 28 11:53:57 CEST 2012

On 05/28/2012 05:24 PM, Liu Yuan wrote:

> This is a nasty fallout from removing register/un-register group_fd; it can
> be observed with the following script:
> 
> Join a new node while another node leaves in the meantime
> ==============================
> for i in 0 1 2; do sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done
> collie/collie cluster format -c 3
> collie/collie vdi create test0 100M -P
> sleep 1
> for i in 3; do sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i;sleep 1;done
> for i in 1; do pkill -f "sheep/sheep -d /home/tailai.ly/sheepdog/store/$i -z $i -p 700$i";done;
> ==============================
> 
> The culprit is that we fail to increment sys->epoch, because the system status
> of the newly joined node is still SD_STATUS_WAIT_FOR_FORMAT before
> __sd_join_done() is called.
> 
> The fix is simple: add a new status that indicates "I have already joined;
> although I still need to update other state, I am capable of recovering".

Hmm, this patch introduces a regression: it can't handle multiple nodes joining
correctly. I'm still investigating what causes the problem.

Thanks,
Yuan