[sheepdog] [PATCH 0/3] fix master transfers

Tue Aug 7 17:15:38 CEST 2012

A newly joining node with a higher epoch than the one on the current
master gets a CJ_RES_MASTER_TRANSFER, which should make it take over
mastership.  Currently this case only kills the current master, which
might cause incorrect epoch log entries on other nodes and thus cause
additional crashes down the road.

This series makes sure all nodes die in this case and can be restarted
by the management tool so that we can get back to a healthy cluster
quickly.  I'd love to be able to totally reset the state inside a sheep
daemon for this case, but so far I have not found an easy way to do it.

A simple test case for master transfers is below:

#!/bin/bash

set -e
set -x

CLUSTER="-c local"

BASEDIR=/mnt/sheepdog

SRCDIR=/home/hch/work/sheepdog
SHEEP="${SRCDIR}/sheep/sheep ${CLUSTER}"
COLLIE="${SRCDIR}/collie/collie"

killall sheep || true

mkdir -p ${BASEDIR}
rm -rf ${BASEDIR}/7???

# start three sheep and format the cluster
for i in `seq 7000 7002`; do
    ${SHEEP} -p $i -z $i ${BASEDIR}/${i} -P ${BASEDIR}/${i}/sheep.pid
    sleep 1
done
${COLLIE} cluster format

# start three more sheep
for i in `seq 7003 7005`; do
    ${SHEEP} -p $i -z $i ${BASEDIR}/${i} -P ${BASEDIR}/${i}/sheep.pid
    sleep 1
done

# stop three sheep
for i in `seq 7003 7005`; do
    kill `cat ${BASEDIR}/${i}/sheep.pid`
    rm ${BASEDIR}/${i}/sheep.pid
done

sleep 1

# and shut the cluster down
${COLLIE} cluster shutdown

# restart the three sheep that were stopped earlier
for i in `seq 7005 -1 7003`; do
    ${SHEEP} -p $i -z $i ${BASEDIR}/${i} -P ${BASEDIR}/${i}/sheep.pid
    sleep 1
done

# and restart the first three sheep
for i in `seq 7000 7002`; do
    ${SHEEP} -p $i -z $i ${BASEDIR}/${i} -P ${BASEDIR}/${i}/sheep.pid
    sleep 1
done
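
To see which daemons survived the final restart, a small helper along
these lines could be appended to the script.  This is only a sketch and
not part of the series: count_live_sheep is a hypothetical name, and it
assumes the pid file layout used above (${BASEDIR}/<port>/sheep.pid).
Note that a pid file left behind by a daemon that died is simply counted
as "not running".

```shell
#!/bin/bash

# Hypothetical helper, not part of the patch series: count the sheep
# daemons whose pid file still points at a running process.
count_live_sheep() {
    local basedir=$1 live=0 pidfile pid

    for pidfile in "${basedir}"/7???/sheep.pid; do
        # An unmatched glob stays literal; skip anything that is not a file.
        [ -f "${pidfile}" ] || continue
        pid=$(cat "${pidfile}")
        # kill -0 sends no signal, it only checks that the process exists.
        if kill -0 "${pid}" 2>/dev/null; then
            live=$((live + 1))
        fi
    done

    echo "${live}"
}
```

Calling count_live_sheep "${BASEDIR}" after the last loop prints the
number of daemons that are still running, which can be compared against
what the master transfer is expected to leave behind.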