Auch! This time things went bad: cluster has stopped. But it seem not to be related to the recover of the "small" disk I inserted back (/mnt/sheep/dsk01): it didn't get full. It seems something about the command attr I run yesterday. sheepdog001 /dev/mapper/vg00-sheepdsk01 167G 70G 97G 43% /mnt/sheep/dsk01 /dev/sdb1 466G 367G 100G 79% /mnt/sheep/dsk02 /dev/sdc1 1,9T 1,2T 679G 64% /mnt/sheep/dsk03 What I find unusual is a call trace on node 'sheepdog001' (the other nodes have no call trace). I can see that from dmesg, /var/log/messages and /var/log/syslog [note: 1][note: 4] On all nodes I find corsync messages in /var/log/syslog, all at the same time 21:00 [note: 2] In sheepdog001 there's some extra message right after the call trace [note: 4] I report also what's int the sheep.log file of the nodes [note: 3] I checked the filesystem of /mnt/sheep/dsk01 device and seems fine [note: 5] Checking the smart of the disk I notice a value that might means problems: 183 Runtime_Bad_Block I'm running a full test to check the disk status. [note: 1] Aug 7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page allocation failure: order:4, mode:0xd0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm: attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace: Aug 7 15:34:05 sheepdog001 kernel: [3543736.725491] [<ffffffff810b88d7>] ? warn_alloc_failed+0x11a/0x12d Aug 7 15:34:05 sheepdog001 kernel: [3543736.725504] [<ffffffff810c9c7c>] ? next_zone+0x25/0x27 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725515] [<ffffffff810bb605>] ? __alloc_pages_nodemask+0x704/0x7aa Aug 7 15:34:05 sheepdog001 kernel: [3543736.725528] [<ffffffff810eaa2d>] ? kmem_getpages+0x4c/0xd9 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725538] [<ffffffff810ebdde>] ? fallback_alloc+0x13e/0x1e2 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725548] [<ffffffff8111312c>] ? getxattr+0x84/0x126 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725555] [<ffffffff810ec2e5>] ? __kmalloc+0xba/0x112 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725564] [<ffffffff810364e8>] ? should_resched+0x5/0x23 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725573] [<ffffffff8111312c>] ? getxattr+0x84/0x126 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725582] [<ffffffff810ead50>] ? kmem_cache_free+0x2d/0x69 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725592] [<ffffffff81105943>] ? user_path_at_empty+0x53/0x7b Aug 7 15:34:05 sheepdog001 kernel: [3543736.725603] [<ffffffff811138d3>] ? sys_lgetxattr+0x3f/0x5c Aug 7 15:34:05 sheepdog001 kernel: [3543736.725612] [<ffffffff8134f0f5>] ? page_fault+0x25/0x30 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725620] [<ffffffff81353b52>] ? system_call_fastpath+0x16/0x1b Aug 7 15:34:05 sheepdog001 kernel: [3543736.725626] Mem-Info: Aug 7 15:34:05 sheepdog001 kernel: [3543736.725630] Node 0 DMA per-cpu: Aug 7 15:34:05 sheepdog001 kernel: [3543736.725637] CPU 0: hi: 0, btch: 1 usd: 0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725643] CPU 1: hi: 0, btch: 1 usd: 0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725648] Node 0 DMA32 per-cpu: Aug 7 15:34:05 sheepdog001 kernel: [3543736.725655] CPU 0: hi: 186, btch: 31 usd: 0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725660] CPU 1: hi: 186, btch: 31 usd: 0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725665] Node 0 Normal per-cpu: Aug 7 15:34:05 sheepdog001 kernel: [3543736.725671] CPU 0: hi: 186, btch: 31 usd: 0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725676] CPU 1: hi: 186, btch: 31 usd: 0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725690] active_anon:12964 inactive_anon:12371 isolated_anon:0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725693] active_file:169712 inactive_file:547532 isolated_file:0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725697] unevictable:949 dirty:1378 writeback:0 unstable:0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725700] free:45231 slab_reclaimable:147488 slab_unreclaimable:6686 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725703] mapped:3632 shmem:1940 pagetables:663 bounce:0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725710] Node 0 DMA free:15880kB min:264kB low:328kB high:396kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:20kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:8kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Aug 7 15:34:05 sheepdog001 kernel: [3543736.725739] lowmem_reserve[]: 0 3383 3888 3888 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725748] Node 0 DMA32 free:152592kB min:58576kB low:73220kB high:87864kB active_anon:45376kB inactive_anon:19188kB active_file:604448kB inactive_file:2116456kB unevictable:88kB isolated(anon):0kB isolated(file):0kB present:3464992kB mlocked:88kB dirty:4388kB writeback:0kB mapped:2364kB shmem:820kB slab_reclaimable:415320kB slab_unreclaimable:7688kB kernel_stack:224kB pagetables:716kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Aug 7 15:34:05 sheepdog001 kernel: [3543736.725779] lowmem_reserve[]: 0 0 505 505 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725788] Node 0 Normal free:12452kB min:8740kB low:10924kB high:13108kB active_anon:6480kB inactive_anon:30296kB active_file:74400kB inactive_file:73652kB unevictable:3708kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:3708kB dirty:1124kB writeback:0kB mapped:12164kB shmem:6940kB slab_reclaimable:174624kB slab_unreclaimable:19056kB kernel_stack:1208kB pagetables:1936kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Aug 7 15:34:05 sheepdog001 kernel: [3543736.725818] lowmem_reserve[]: 0 0 0 0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725826] Node 0 DMA: 2*4kB 2*8kB 1*16kB 3*32kB 4*64kB 3*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB 1*4096kB = 15880kB Aug 7 15:34:05 sheepdog001 kernel: [3543736.725848] Node 0 DMA32: 25550*4kB 5749*8kB 11*16kB 4*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 0*4096kB = 152592kB Aug 7 15:34:05 sheepdog001 kernel: [3543736.725869] Node 0 Normal: 2051*4kB 19*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 12452kB Aug 7 15:34:05 sheepdog001 kernel: [3543736.725891] 721148 total pagecache pages Aug 7 15:34:05 sheepdog001 kernel: [3543736.725895] 1311 pages in swap cache Aug 7 15:34:05 sheepdog001 kernel: [3543736.725901] Swap cache stats: add 18691, delete 17380, find 3363212/3363399 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725906] Free swap = 7760204kB Aug 7 15:34:05 sheepdog001 kernel: [3543736.725910] Total swap = 7815584kB Aug 7 15:34:05 sheepdog001 kernel: [3543736.785649] 1015792 pages RAM Aug 7 15:34:05 sheepdog001 kernel: [3543736.785656] 32718 pages reserved Aug 7 15:34:05 sheepdog001 kernel: [3543736.785664] 636786 pages shared Aug 7 15:34:05 sheepdog001 kernel: [3543736.785671] 298954 pages non-shared Aug 7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop messages from pid 23853 due to rate-limiting [note: 2] Aug 7 21:00:07 sheepdog004 corosync[4365]: [TOTEM ] Retransmit List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b 758c 758d 758e 758f 7590 7591 7592 Aug 7 21:00:07 sheepdog004 corosync[4365]: [TOTEM ] Retransmit List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 759a 759b 759c ... [note: 3] sheepdog001 ... Aug 07 21:09:53 [main] recover_object_main(626) done:208768 count:414381, oid:c8d13200032c4f Aug 07 21:09:53 [main] recover_object_main(626) done:208769 count:414381, oid:c8d13600033c77 Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition is detected Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted). Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7ff842f2102f] Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7ff84252d474] Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7ff8425306ef] Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg Aug 07 21:09:54 [main] sd_backtrace(848) /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7ff842a8cd74] Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc) [0x7ff842519eac] Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498] Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb Aug 07 21:09:54 [main] crash_handler(487) sheep pid 23853 exited unexpectedly. sheepdog002 ... Aug 07 08:17:17 [gway 26470] wait_forward_request(176) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again Aug 07 08:17:17 [gway 26476] wait_forward_request(176) poll timeout 1, disks of some nodes or network is busy. Going to poll-wait again Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition is detected Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted). Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7fd1e9afd02f] Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7fd1e9109474] Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7fd1e910c6ef] Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg Aug 07 21:09:54 [main] sd_backtrace(848) /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7fd1e9668d74] Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main Aug 07 21:09:54 [main] sd_backtrace(848) /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc) [0x7fd1e90f5eac] Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498] Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb Aug 07 21:09:54 [main] crash_handler(487) sheep pid 20833 exited unexpectedly. sheepdog003 sheepdog003 seep.log is empty. [note:4] sheepdog001 syslog Aug 7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page allocation failure: order:4, mode:0xd0 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm: attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1 Aug 7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace: ... Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message ... Aug 7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message Aug 7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop messages from pid 23853 due to rate-limiting Aug 7 21:00:07 sheepdog001 corosync[23846]: [TOTEM ] Retransmit List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b 758c 758d 758e 758f 7590 7591 7592 Aug 7 21:00:07 sheepdog001 corosync[23846]: [TOTEM ] Retransmit List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599 759a 759b 759c [note: 5] root at sheepdog001:~# xfs_repair -n /dev/mapper/vg00-sheepdsk01 Phase 1 - find and verify superblock... Phase 2 - using internal log - scan filesystem freespace and inode maps... - found root inode chunk Phase 3 - for each AG... - scan (but don't clear) agi unlinked lists... - process known inodes and perform inode discovery... - agno = 0 - agno = 1 - agno = 2 - agno = 3 - process newly discovered inodes... Phase 4 - check for duplicate blocks... - setting up duplicate extent list... - check for inodes claiming duplicate blocks... - agno = 0 - agno = 1 - agno = 2 - agno = 3 No modify flag set, skipping phase 5 Phase 6 - check inode connectivity... - traversing filesystem ... - traversal finished ... - moving disconnected inodes to lost+found ... Phase 7 - verify link counts... No modify flag set, skipping filesystem flush and exiting. |