[Sheepdog] Sheepdog 0.3.0 schedule and 0.4.0 plan

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Tue Nov 22 13:26:48 CET 2011


At Sat, 19 Nov 2011 11:53:09 +0900,
MORITA Kazutaka wrote:
> 
> At Fri, 18 Nov 2011 11:23:13 +0000,
> Chris Webb wrote:
> > 
> > MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp> writes:
> > 
> > > I have finished all of the cluster driver implementation we planned, so I'm
> > > thinking of releasing 0.3.0 this weekend.  If you have pending patches
> > > for 0.3.0, please send them by Nov 18th.  I'll spend this week
> > > testing Sheepdog.
> > 
> > Hi. I thought I would make myself useful and grab the head of master today
> > (currently 45eb24f01f8a) and do some testing of the code you're about to cut
> > into a release.
> > 
> > I repeated my tests that caused problems a week or so ago, and everything
> > seemed great. I created a small cluster, and did reads+writes through
> > collie, booted some guests, heavily exercised the attr support for locking
> > and so on. Everything seems very solid so far.
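> > 
> > (By the attr locking test I mean roughly the following, if I remember the
> > collie flags right: -x sets the key exclusively, failing if it already
> > exists, and -d deletes it; the VDI name and lock value are just examples:)
> > 
> >   0026# collie vdi setattr -x test lock held-by-0026   # fails if 'lock' already exists
> >   0026# collie vdi getattr test lock
> >   0026# collie vdi setattr -d test lock                # release the lock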
> > 
> > As a final test, I thought I'd try failing a node, as I haven't done that for a
> > while. This didn't work as I'd hoped, I'm afraid. I then reproduced the problem
> > at a smaller scale on my small three-node (with three sheep per node) cluster.
> > 
> > The nodes are numbered 002{6,7,8} with IP addresses 172.16.101.{7,9,11}
> > respectively, with sheep running on ports 7000, 7001 and 7002 on each node as
> > sheep -D -p 700{0,1,2}. Sheepdog and corosync IO (both multicast and
> > unicast) happen on interface eth1 in what follows.
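> > 
> > (Concretely, that means something like the following on each node, with the
> > store directories matching the log paths further down, and corosync already
> > running on each node:)
> > 
> >   0026# sheep -D -p 7000 /mnt/sheep-0026-00
> >   0026# sheep -D -p 7001 /mnt/sheep-0026-01
> >   0026# sheep -D -p 7002 /mnt/sheep-0026-02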
> > 
> > I fail node 0028 by taking out its network connection:
> > 
> >   0028# ip link set eth1 down
> > 
> > At this point, all the sheep on this node have gone away (correctly):
> > 
> >   0028# ps ax | grep [s]heep
> >   0028# 
> > 
> > but the log suggests they may not have died in quite the intended way:
> > 
> >   0028# tail /mnt/sheep-0028-00/sheep.log 
> >   Nov 18 10:35:00 cluster_queue_request(221) 0x17304e0 89
> >   Nov 18 10:35:00 do_lookup_vdi(234) looking for 3f6d01e8-b627-4b1d-b2b8-e33daa571c25 (5c29e4)
> >   Nov 18 10:35:00 cluster_queue_request(221) 0x17304e0 89
> >   Nov 18 10:35:00 do_lookup_vdi(234) looking for 83ec9d3c-962d-47dc-9f9f-f4e4f4668767 (cc54cb)
> >   Nov 18 10:35:00 cluster_queue_request(221) 0x17304e0 89
> >   Nov 18 10:35:00 do_lookup_vdi(234) looking for 83ec9d3c-962d-47dc-9f9f-f4e4f4668767 (cc54cb)
> >   Nov 18 10:40:17 ob_open(445) failed to open /mnt/sheep-0028-00/obj/00000001/20cc54cb5ed7a388: No such file or directory
> >   Nov 18 10:40:23 ob_open(445) failed to open /mnt/sheep-0028-00/obj/00000001/20cc54cb5ed7a389: No such file or directory
> >   Nov 18 10:48:11 sd_leave_handler(1291) network partition bug: this sheep should have exited
> >   Nov 18 10:48:13 log_sigsegv(358) logger pid 2082 exiting abnormally
> 
> Similar problems have been reported as corosync bugs on this list before,
> but this one looks like a Sheepdog bug.
> 
> > 
> > On the other nodes, nothing works any more, e.g.
> > 
> >   0026# collie vdi list
> >     name        id    size    used  shared    creation time   vdi id
> >   ------------------------------------------------------------------
> >   failed to read object, 805c29e400000000 Remote node has a new epoch
> >   failed to read a inode header
> >   failed to read object, 80cc54cb00000000 Remote node has a new epoch
> >   failed to read a inode header
> > 
> > Presumably something about sheepdog fail-over has broken recently? I don't seem
> > to be able to get a cluster with a failed node to work in any configuration I
> > try, so I guess this might be very easy to reproduce, but in case not, the
> > sheep.log files from this simple test are at
> > 
> >   http://cdw.me.uk/tmp/sheep-0026-00.log
> >   http://cdw.me.uk/tmp/sheep-0026-01.log
> >   http://cdw.me.uk/tmp/sheep-0026-02.log
> >   http://cdw.me.uk/tmp/sheep-0027-00.log
> >   http://cdw.me.uk/tmp/sheep-0027-01.log
> >   http://cdw.me.uk/tmp/sheep-0027-02.log
> >   http://cdw.me.uk/tmp/sheep-0028-00.log
> >   http://cdw.me.uk/tmp/sheep-0028-01.log
> >   http://cdw.me.uk/tmp/sheep-0028-02.log
> > 
> > There seem to be a lot of EMISSING-type errors in there, both during normal
> > running and after the failure. Perhaps these are a clue?
> 
> Thanks a lot!  These will help us.
> 
> We changed the cluster management code a few months ago, and I think that
> change may be causing these problems.

It seems that we need a bit more time to fix this bug.  It's not good
to keep many patches pending for a long time, so I'll pull the next branch
and create fixes for 0.3.0 on top of it.
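
By the way, when you hit the "Remote node has a new epoch" errors again,
it would help to compare which epoch each surviving sheep is at; something
like the following against one sheep on each remaining node should show it:

  0026# collie cluster info -p 7000
  0027# collie cluster info -p 7000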

Thanks,

Kazutaka


