[sheepdog] [PATCH v3 1/2] sheep: handle block/unblock/notify error

Tue Jul 9 06:58:17 CEST 2013

On Tue, Jul 09, 2013 at 12:41:09PM +0800, Liu Yuan wrote:
> On Mon, Jul 08, 2013 at 09:07:14PM -0700, Kai Zhang wrote:
> > group.c uses three broadcast operations: block, unblock and notify.
> > These broadcast operations are implemented by the cluster drivers.
> > For example, corosync implements them with cpg_mcast_joined(), while
> > zookeeper uses sequential nodes.
> > They can fail if the network is unavailable for a while.
> > 
> > However, the current group.c doesn't handle errors from block/unblock/notify
> > events and just ignores them.
> > 
> > This patch adds a new error, SD_RES_CLUSTER_ERROR, to indicate these errors.
> > 
> > Signed-off-by: Kai Zhang <kyle at zelin.io>
> > ---
> >  include/sheep.h           |    1 +
> >  include/sheepdog_proto.h  |    1 +
> >  sheep/cluster.h           |   10 +++++++---
> >  sheep/cluster/corosync.c  |   17 ++++++++++-------
> >  sheep/cluster/local.c     |   10 +++++++---
> >  sheep/cluster/shepherd.c  |   12 ++++++++----
> >  sheep/cluster/zookeeper.c |   13 ++++++++-----
> >  sheep/group.c             |   40 +++++++++++++++++++++++++++++++++++-----
> >  8 files changed, 77 insertions(+), 27 deletions(-)
> > 
> > diff --git a/include/sheep.h b/include/sheep.h
> > index 0d3fae4..3541012 100644
> > --- a/include/sheep.h
> > +++ b/include/sheep.h
> > @@ -204,6 +204,7 @@ static inline const char *sd_strerror(int err)
> >  		[SD_RES_JOIN_FAILED] = "Node has failed to join cluster",
> >  		[SD_RES_HALT] = "IO has halted as there are too few living nodes",
> >  		[SD_RES_READONLY] = "Object is read-only",
> > +		[SD_RES_CLUSTER_ERROR] = "Cluster error",
> >  
> >  		/* from internal_proto.h */
> >  		[SD_RES_OLD_NODE_VER] = "Request has an old epoch",
> > diff --git a/include/sheepdog_proto.h b/include/sheepdog_proto.h
> > index 156457a..4e9c84e 100644
> > --- a/include/sheepdog_proto.h
> > +++ b/include/sheepdog_proto.h
> > @@ -71,6 +71,7 @@
> >  #define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog */
> >  #define SD_RES_HALT          0x19 /* Sheepdog is stopped doing IO */
> >  #define SD_RES_READONLY      0x1A /* Object is read-only */
> > +#define SD_RES_CLUSTER_ERROR 0x1B /* Cluster error */
> 
> "Cluster driver error"
> 
> >  
> >  /* errors above 0x80 are sheepdog-internal */
> 
> SD_RES_CLUSTER_ERROR should be above 0x80 because it is a sheepdog-internal error.

And we should teach the gateway to retry on this error until it succeeds; collie
and QEMU have no interest in this error.
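
Roughly, the retry could look like the sketch below. This is only an
illustration, not the actual gateway code: gateway_retry() and fake_broadcast()
are hypothetical names, and the numeric value given to SD_RES_CLUSTER_ERROR is
an assumption chosen only so that it sits above 0x80.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the real codes are defined in sheepdog's headers. */
#define SD_RES_SUCCESS        0x00
#define SD_RES_CLUSTER_ERROR  0x91 /* assumed value, picked to be above 0x80 */

/* Hypothetical stand-in for a cluster driver broadcast (block/unblock/notify)
 * that fails a few times, e.g. while the network is unavailable. */
static int failures_left = 3;

static int fake_broadcast(void)
{
	if (failures_left > 0) {
		failures_left--;
		return SD_RES_CLUSTER_ERROR;
	}
	return SD_RES_SUCCESS;
}

/* Hypothetical gateway-side helper: repeat the operation for as long as it
 * reports a cluster driver error, and pass any other result through
 * unchanged. The number of calls made is stored in *attempts. */
static int gateway_retry(int (*op)(void), int *attempts)
{
	int ret;

	assert(op != NULL && attempts != NULL);

	*attempts = 0;
	do {
		ret = op();
		(*attempts)++;
	} while (ret == SD_RES_CLUSTER_ERROR);

	return ret;
}
```

With this shape the internal error never leaves the gateway, so clients only
ever see the final result. A real version would probably also want a delay
between attempts; that is left out here to keep the sketch short.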

Thanks
Yuan