At Tue, 22 Nov 2011 21:26:48 +0900,
MORITA Kazutaka wrote:
>
> At Sat, 19 Nov 2011 11:53:09 +0900,
> MORITA Kazutaka wrote:
> >
> > At Fri, 18 Nov 2011 11:23:13 +0000,
> > Chris Webb wrote:
> > >
> > > MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
> > >
> > > > I have finished all of the cluster driver implementation we planned, so I
> > > > am thinking of releasing 0.3.0 this weekend.  If you have pending patches
> > > > for 0.3.0, please send them by Nov 18th.  I'll spend this week
> > > > testing Sheepdog.
> > >
> > > Hi. I thought I would make myself useful and grab the head of master today
> > > (currently 45eb24f01f8a) to do some testing of the code you're about to cut
> > > into a release.
> > >
> > > I repeated the tests that caused problems a week or so ago, and everything
> > > seemed great. I created a small cluster, did reads+writes through
> > > collie, booted some guests, heavily exercised the attr support for locking,
> > > and so on. Everything seems very solid so far.
> > >
> > > As a final test, I thought I'd try failing a node, as I haven't done that for a
> > > while. This didn't work as I'd hoped, I'm afraid. I then reproduced the problem
> > > at a smaller scale on my small three-node cluster (three sheep per node).
> > >
> > > The nodes are numbered 002{6,7,8} with IP addresses 172.16.101.{7,9,11}
> > > respectively, with sheep running on ports 7000, 7001 and 7002 on each node,
> > > started as sheep -D -p 700{0,1,2}. Sheepdog and corosync IO (both multicast
> > > and unicast) happens on interface eth1 in what follows.
> > >
> > > I fail node 0028 by taking out its network connection:
> > >
> > >   0028# ip link set eth1 down
> > >
> > > At this point, all the sheep on this node have gone away (correctly):
> > >
> > >   0028# ps ax | grep [s]heep
> > >   0028#
> > >
> > > but the log suggests they may not have died in quite the intended way:
> > >
> > >   0028# tail /mnt/sheep-0028-00/sheep.log
> > >   Nov 18 10:35:00 cluster_queue_request(221) 0x17304e0 89
> > >   Nov 18 10:35:00 do_lookup_vdi(234) looking for 3f6d01e8-b627-4b1d-b2b8-e33daa571c25 (5c29e4)
> > >   Nov 18 10:35:00 cluster_queue_request(221) 0x17304e0 89
> > >   Nov 18 10:35:00 do_lookup_vdi(234) looking for 83ec9d3c-962d-47dc-9f9f-f4e4f4668767 (cc54cb)
> > >   Nov 18 10:35:00 cluster_queue_request(221) 0x17304e0 89
> > >   Nov 18 10:35:00 do_lookup_vdi(234) looking for 83ec9d3c-962d-47dc-9f9f-f4e4f4668767 (cc54cb)
> > >   Nov 18 10:40:17 ob_open(445) failed to open /mnt/sheep-0028-00/obj/00000001/20cc54cb5ed7a388: No such file or directory
> > >   Nov 18 10:40:23 ob_open(445) failed to open /mnt/sheep-0028-00/obj/00000001/20cc54cb5ed7a389: No such file or directory
> > >   Nov 18 10:48:11 sd_leave_handler(1291) network partition bug: this sheep should have exited
> > >   Nov 18 10:48:13 log_sigsegv(358) logger pid 2082 exiting abnormally
> >
> > Similar problems have been reported as corosync bugs on this list before,
> > but this one looks like a Sheepdog bug.
> >
> > > On the other nodes, nothing works any more, e.g.
> > >
> > >   0026# collie vdi list
> > >     name   id   size   used   shared   creation time   vdi id
> > >   ------------------------------------------------------------------
> > >   failed to read object, 805c29e400000000 Remote node has a new epoch
> > >   failed to read a inode header
> > >   failed to read object, 80cc54cb00000000 Remote node has a new epoch
> > >   failed to read a inode header
> > >
> > > Presumably something about Sheepdog fail-over has broken recently? I don't seem
> > > to be able to get a cluster with a failed node to work in any configuration I
> > > try, so I guess this might be very easy to reproduce, but in case it isn't, the
> > > sheep.log files from this simple test are at
> > >
> > >   http://cdw.me.uk/tmp/sheep-0026-00.log
> > >   http://cdw.me.uk/tmp/sheep-0026-01.log
> > >   http://cdw.me.uk/tmp/sheep-0026-02.log
> > >   http://cdw.me.uk/tmp/sheep-0027-00.log
> > >   http://cdw.me.uk/tmp/sheep-0027-01.log
> > >   http://cdw.me.uk/tmp/sheep-0027-02.log
> > >   http://cdw.me.uk/tmp/sheep-0028-00.log
> > >   http://cdw.me.uk/tmp/sheep-0028-01.log
> > >   http://cdw.me.uk/tmp/sheep-0028-02.log
> > >
> > > There seem to be a lot of EMISSING-type errors in there, both during normal
> > > running and after the failure. Perhaps these are a clue?
> >
> > Thanks a lot!  These will help us.
> >
> > We changed the cluster management a few months ago, and I think that may
> > have caused some problems.
>
> It seems that we need a bit more time to fix this bug.  It's not good
> to keep many patches pending for a long time, so I'll pull the next branch
> and create fixes on it for 0.3.0.

Hi Chris,

I've sent some fixes.  Can you try your tests with the devel branch?

Thanks,

Kazutaka
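For anyone who wants to repeat the node-failure test described above, the commands quoted in the thread can be collected into a short script. This is only a sketch assembled from Chris's report: the store directories under /mnt/sheep-<node>-<nn> and the positional store argument to sheep are inferred from the log paths, and the NODE variable is just an illustrative placeholder.

    #!/bin/sh
    NODE=0028                        # this node's name: 0026, 0027 or 0028

    # Start three sheep daemons on this node, one per port, each with its
    # own store directory (directory layout inferred from the logs above).
    for i in 0 1 2; do
        sheep -D -p 700$i /mnt/sheep-$NODE-0$i
    done

    # To fail the node, drop the interface carrying sheepdog/corosync traffic:
    ip link set eth1 down

    # Check that the local daemons have exited, and look at how they died:
    ps ax | grep [s]heep
    tail /mnt/sheep-$NODE-00/sheep.log

    # From one of the surviving nodes, see whether the cluster still responds:
    collie vdi list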