[Sheepdog] Cluster appears down but nodes report different epochs

Wed Nov 9 03:44:52 CET 2011

At Tue, 8 Nov 2011 10:56:51 -0500,
Shawn Moore wrote:
> 
> > Probably, we need to add support for using different NICs for data
> > I/Os and monitoring.
> 
> We currently have four 1Gb NICs bonded together using mode 4
> (LACP/802.3ad).  On another note, I am still doing more testing, but
> it almost looks like the TOTAL cluster speed might be limited to 1Gb
> instead of up to 4Gb (I know I can't get that much in truth though,
> but should be higher than 1Gb).  Does anyone have any insight into
> this?  We are using enterprise grade switches with two cards (18Gb/s
> fabric).

How did you measure the total cluster speed?  Could your disks be a
bottleneck?
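
One possibility worth ruling out when measuring (an assumption, not
something confirmed in this thread): with Linux bonding mode 4, each
frame is assigned to one slave by a transmit hash, so traffic between
a single pair of hosts stays on one 1Gb link no matter how many NICs
are bonded.  The sketch below is a simplified model of the default
layer2 policy (XOR of the MAC addresses, modulo the slave count).  The
function name and the MAC addresses are made up for illustration; the
real kernel hash may fold in more fields depending on the version and
on xmit_hash_policy.

```python
def layer2_slave(src_mac: str, dst_mac: str, n_slaves: int) -> int:
    """Simplified model of the bonding layer2 transmit hash.

    XORs the last byte of the source and destination MAC addresses
    and reduces the result modulo the number of slaves in the bond.
    """
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_slaves


# Hypothetical addresses: one sender talking to two different peers.
sender = "00:16:3e:00:00:a1"
peer_a = "00:16:3e:00:00:a2"
peer_b = "00:16:3e:00:00:a3"

# The same pair of hosts always hashes to the same slave, so a test
# between two machines cannot exceed the speed of one link.
print(layer2_slave(sender, peer_a, 4))
print(layer2_slave(sender, peer_b, 4))
```

If this is what is happening, a test between two nodes would top out
near 1Gb, and only traffic spread across several peers at once could
use more than one link.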

> 
> 
> >> Shawn, did you format with -H or --nohalt option? If not, might be some
> >> bug in halt path.
> 
> Yes, we need that option as we will have two zones with copies being
> some even number.  So if we don't use -H and one zone goes offline,
> the cluster will quit serving data.

But the cluster info output on blade161 says:

  [root@blade161 ~]# collie cluster info
  Cluster status: The sheepdog is stopped doing IO, short of living nodes

Since you formatted with --nohalt, the cluster should not have stopped
I/O, so I think there are some bugs in the halt path.

Thanks,

Kazutaka