On 10/24/2011 07:46 PM, MORITA Kazutaka wrote:
> At Mon, 24 Oct 2011 17:59:07 +0800,
> Yibin Shen wrote:
>>
>> On Mon, Oct 24, 2011 at 5:43 PM, Liu Yuan <namei.unix at gmail.com> wrote:
>>> On 10/24/2011 05:29 PM, zituan at taobao.com wrote:
>>>
>>>> From: Yibin Shen <zituan at taobao.com>
>>>>
>>>> In some situations, the left node can also receive the confchg event,
>>>> which may be caused by corosync, and that will lead to a network partition.
>>>> This patch fixes it.
>>>>
>>>
>>>
>>> So what kind of situation, and why can the left node still process requests
>>> when it has already 'left'?
>> Actually, none of them have left; both corosync & sheep are kept running,
>> and I think sheep is temporarily disconnected from corosync.
>> On node '10.232.134.8', we found this log:
>> Oct 19 17:35:49 sd_leave_handler(1852) leave ip: 10.232.134.8, pid: 25813
>> Oct 19 17:35:49 sd_leave_handler(1854) [0] ip: 10.232.134.1, pid: 2882
>> Oct 19 17:35:49 sd_leave_handler(1854) [1] ip: 10.232.134.2, pid: 27161
>> Oct 19 17:35:49 sd_leave_handler(1854) [2] ip: 10.232.134.3, pid: 25154
>> Oct 19 17:35:49 sd_leave_handler(1854) [3] ip: 10.232.134.4, pid: 24918
>> Oct 19 17:35:49 sd_leave_handler(1854) [4] ip: 10.232.134.5, pid: 7697
>> Oct 19 17:35:49 sd_leave_handler(1854) [5] ip: 10.232.134.6, pid: 7533
>> Oct 19 17:35:49 sd_leave_handler(1854) [6] ip: 10.232.134.7, pid: 25044
>> Oct 19 17:35:49 sd_leave_handler(1854) [7] ip: 10.232.134.9, pid: 25224
>> Oct 19 17:35:49 sd_leave_handler(1854) [8] ip: 10.232.134.10, pid: 24326
>> Oct 19 17:35:49 sd_leave_handler(1854) [9] ip: 10.232.134.11, pid: 24071
>> Oct 19 17:35:49 sd_leave_handler(1854) [a] ip: 10.232.134.12, pid: 24390
>> Oct 19 17:35:49 sd_leave_handler(1854) [b] ip: 10.232.134.13, pid: 25168
>> Oct 19 17:35:49 sd_leave_handler(1854) [c] ip: 10.232.134.14, pid: 20454
>> Oct 19 17:35:49 sd_leave_handler(1854) [d] ip: 10.232.134.15, pid: 19971
>> Oct 19 17:35:49 sd_leave_handler(1854) [e] ip: 10.232.134.17, pid: 20015
>> Oct 19 17:35:49 sd_leave_handler(1854) [f] ip: 10.232.134.19, pid: 20610
>> Oct 19 17:35:49 sd_leave_handler(1867) allow new confchg, 0xff8ba0
>> Oct 19 17:35:49 start_cpg_event_work(1655) 0 1
>> Oct 19 17:35:49 cpg_event_fn(1463) 0xff8ba0, 1 2
>> Oct 19 17:35:49 check_majority(1307) majority nodes are alive
>> Oct 19 17:35:49 cpg_event_done(1501) 0xff8ba0
>>
>> CC Yunkai Zhang
>> He can give us some detailed information inside corosync

So can a node that has accidentally left (for whatever reason) but is still
healthy rejoin the cluster instead of being panicked out?

Thanks,
Yuan