When I first started with sheepdog, I had partitioning issues with corosync. Zookeeper has been much more reliable for me. We have one of our nodes in a separate building, and multicast wasn't reliably transiting the switches in between nodes. On 08/07/2013 09:58 PM, Liu Yuan wrote: > On Thu, Aug 08, 2013 at 02:23:11AM +0200, Valerio Pachera wrote: >> Auch! This time things went bad: >> cluster has stopped. >> >> But it seem not to be related to the recover of the "small" disk I >> inserted back (/mnt/sheep/dsk01): it didn't get full. >> It seems something about the command attr I run yesterday. >> >> sheepdog001 >> /dev/mapper/vg00-sheepdsk01 167G 70G >> 97G 43% /mnt/sheep/dsk01 >> /dev/sdb1 466G 367G >> 100G 79% /mnt/sheep/dsk02 >> /dev/sdc1 1,9T 1,2T >> 679G 64% /mnt/sheep/dsk03 >> >> What I find unusual is a call trace on node 'sheepdog001' (the other >> nodes have no call trace). >> I can see that from dmesg, /var/log/messages and /var/log/syslog >> [note: 1][note: 4] >> >> On all nodes I find corsync messages in /var/log/syslog, all at the >> same time 21:00 [note: 2] >> In sheepdog001 there's some extra message right after the call trace [note: 4] >> >> I report also what's int the sheep.log file of the nodes [note: 3] >> >> I checked the filesystem of /mnt/sheep/dsk01 device and seems fine [note: 5] >> >> Checking the smart of the disk I notice a value that might means problems: >> 183 Runtime_Bad_Block >> >> I'm running a full test to check the disk status. >> >> [note: 1] >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page >> allocation failure: order:4, mode:0xd0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm: >> attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace: >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725491] >> [<ffffffff810b88d7>] ? warn_alloc_failed+0x11a/0x12d >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725504] >> [<ffffffff810c9c7c>] ? next_zone+0x25/0x27 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725515] >> [<ffffffff810bb605>] ? __alloc_pages_nodemask+0x704/0x7aa >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725528] >> [<ffffffff810eaa2d>] ? kmem_getpages+0x4c/0xd9 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725538] >> [<ffffffff810ebdde>] ? fallback_alloc+0x13e/0x1e2 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725548] >> [<ffffffff8111312c>] ? getxattr+0x84/0x126 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725555] >> [<ffffffff810ec2e5>] ? __kmalloc+0xba/0x112 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725564] >> [<ffffffff810364e8>] ? should_resched+0x5/0x23 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725573] >> [<ffffffff8111312c>] ? getxattr+0x84/0x126 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725582] >> [<ffffffff810ead50>] ? kmem_cache_free+0x2d/0x69 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725592] >> [<ffffffff81105943>] ? user_path_at_empty+0x53/0x7b >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725603] >> [<ffffffff811138d3>] ? sys_lgetxattr+0x3f/0x5c >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725612] >> [<ffffffff8134f0f5>] ? page_fault+0x25/0x30 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725620] >> [<ffffffff81353b52>] ? system_call_fastpath+0x16/0x1b >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725626] Mem-Info: >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725630] Node 0 DMA per-cpu: >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725637] CPU 0: hi: >> 0, btch: 1 usd: 0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725643] CPU 1: hi: >> 0, btch: 1 usd: 0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725648] Node 0 DMA32 per-cpu: >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725655] CPU 0: hi: >> 186, btch: 31 usd: 0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725660] CPU 1: hi: >> 186, btch: 31 usd: 0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725665] Node 0 Normal per-cpu: >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725671] CPU 0: hi: >> 186, btch: 31 usd: 0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725676] CPU 1: hi: >> 186, btch: 31 usd: 0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725690] active_anon:12964 >> inactive_anon:12371 isolated_anon:0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725693] >> active_file:169712 inactive_file:547532 isolated_file:0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725697] unevictable:949 >> dirty:1378 writeback:0 unstable:0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725700] free:45231 >> slab_reclaimable:147488 slab_unreclaimable:6686 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725703] mapped:3632 >> shmem:1940 pagetables:663 bounce:0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725710] Node 0 DMA >> free:15880kB min:264kB low:328kB high:396kB active_anon:0kB >> inactive_anon:0kB active_file:0kB inactive_file:20kB unevictable:0kB >> isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB >> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:8kB >> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB >> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725739] lowmem_reserve[]: >> 0 3383 3888 3888 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725748] Node 0 DMA32 >> free:152592kB min:58576kB low:73220kB high:87864kB active_anon:45376kB >> inactive_anon:19188kB active_file:604448kB inactive_file:2116456kB >> unevictable:88kB isolated(anon):0kB isolated(file):0kB >> present:3464992kB mlocked:88kB dirty:4388kB writeback:0kB >> mapped:2364kB shmem:820kB slab_reclaimable:415320kB >> slab_unreclaimable:7688kB kernel_stack:224kB pagetables:716kB >> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 >> all_unreclaimable? no >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725779] lowmem_reserve[]: >> 0 0 505 505 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725788] Node 0 Normal >> free:12452kB min:8740kB low:10924kB high:13108kB active_anon:6480kB >> inactive_anon:30296kB active_file:74400kB inactive_file:73652kB >> unevictable:3708kB isolated(anon):0kB isolated(file):0kB >> present:517120kB mlocked:3708kB dirty:1124kB writeback:0kB >> mapped:12164kB shmem:6940kB slab_reclaimable:174624kB >> slab_unreclaimable:19056kB kernel_stack:1208kB pagetables:1936kB >> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 >> all_unreclaimable? no >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725818] lowmem_reserve[]: 0 0 0 0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725826] Node 0 DMA: 2*4kB >> 2*8kB 1*16kB 3*32kB 4*64kB 3*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB >> 1*4096kB = 15880kB >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725848] Node 0 DMA32: >> 25550*4kB 5749*8kB 11*16kB 4*32kB 0*64kB 0*128kB 0*256kB 0*512kB >> 0*1024kB 2*2048kB 0*4096kB = 152592kB >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725869] Node 0 Normal: >> 2051*4kB 19*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB >> 0*2048kB 1*4096kB = 12452kB >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725891] 721148 total >> pagecache pages >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725895] 1311 pages in swap cache >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725901] Swap cache stats: >> add 18691, delete 17380, find 3363212/3363399 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725906] Free swap = 7760204kB >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725910] Total swap = 7815584kB >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.785649] 1015792 pages RAM >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.785656] 32718 pages reserved >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.785664] 636786 pages shared >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.785671] 298954 pages non-shared >> Aug 7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop >> messages from pid 23853 due to rate-limiting > Looks to me not a big problem, just allocation failed. > >> [note: 2] >> Aug 7 21:00:07 sheepdog004 corosync[4365]: [TOTEM ] Retransmit >> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b >> 758c 758d 758e 758f 7590 7591 7592 >> Aug 7 21:00:07 sheepdog004 corosync[4365]: [TOTEM ] Retransmit >> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b >> 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 >> 759a 759b 759c >> ... > Hello corosync guys, is this normal? Sheep daemon detected a network partition. > >> [note: 3] >> sheepdog001 >> ... >> Aug 07 21:09:53 [main] recover_object_main(626) done:208768 >> count:414381, oid:c8d13200032c4f >> Aug 07 21:09:53 [main] recover_object_main(626) done:208769 >> count:414381, oid:c8d13600033c77 >> Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition >> is detected > It looks like a corosync problem. > >> Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted). >> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7ff842f2102f] >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7ff84252d474] >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7ff8425306ef] >> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7ff842a8cd74] >> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler >> Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop >> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc) >> [0x7ff842519eac] >> Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498] >> Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb >> Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb >> Aug 07 21:09:54 [main] crash_handler(487) sheep pid 23853 exited unexpectedly. >> >> sheepdog002 >> ... >> Aug 07 08:17:17 [gway 26470] wait_forward_request(176) poll timeout 1, >> disks of some nodes or network is busy. Going to poll-wait again >> Aug 07 08:17:17 [gway 26476] wait_forward_request(176) poll timeout 1, >> disks of some nodes or network is busy. Going to poll-wait again >> Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition >> is detected >> Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted). >> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7fd1e9afd02f] >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7fd1e9109474] >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7fd1e910c6ef] >> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7fd1e9668d74] >> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler >> Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop >> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main >> Aug 07 21:09:54 [main] sd_backtrace(848) >> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc) >> [0x7fd1e90f5eac] >> Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498] >> Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb >> Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb >> Aug 07 21:09:54 [main] crash_handler(487) sheep pid 20833 exited unexpectedly. >> >> sheepdog003 sheepdog003 >> seep.log is empty. >> >> [note:4] >> sheepdog001 syslog >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page >> allocation failure: order:4, mode:0xd0 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm: >> attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1 >> Aug 7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace: >> ... >> Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message >> Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message >> Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message >> ... >> Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message >> Aug 7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop >> messages from pid 23853 due to rate-limiting >> Aug 7 21:00:07 sheepdog001 corosync[23846]: [TOTEM ] Retransmit >> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b >> 758c 758d 758e 758f 7590 7591 7592 >> Aug 7 21:00:07 sheepdog001 corosync[23846]: [TOTEM ] Retransmit >> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b >> 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 >> 7597 7598 7599 759a 759b 759c > Again the same corosync messages. Cced corosync list. > > Thanks > Yuan -------------- next part -------------- A non-text attachment was scrubbed... Name: ajhobbs.vcf Type: text/x-vcard Size: 353 bytes Desc: ajhobbs.vcf URL: <http://lists.wpkg.org/pipermail/sheepdog-users/attachments/20130808/f0899e2d/attachment.vcf> |