[sheepdog-users] Single disk getting full

Liu Yuan namei.unix at gmail.com
Thu Aug 8 03:58:10 CEST 2013


On Thu, Aug 08, 2013 at 02:23:11AM +0200, Valerio Pachera wrote:
> Auch! This time things went bad:
> cluster has stopped.
> 
> But it seem not to be related to the recover of the "small" disk I
> inserted back (/mnt/sheep/dsk01): it didn't get full.
> It seems something about the command attr I run yesterday.
> 
> sheepdog001
> /dev/mapper/vg00-sheepdsk01                             167G   70G
> 97G  43% /mnt/sheep/dsk01
> /dev/sdb1                                               466G  367G
> 100G  79% /mnt/sheep/dsk02
> /dev/sdc1                                               1,9T  1,2T
> 679G  64% /mnt/sheep/dsk03
> 
> What I find unusual is a call trace on node 'sheepdog001' (the other
> nodes have no call trace).
> I can see that from dmesg, /var/log/messages and /var/log/syslog
> [note: 1][note: 4]
> 
> On all nodes I find corsync messages in /var/log/syslog, all at the
> same time 21:00 [note: 2]
> In sheepdog001 there's some extra message right after the call trace [note: 4]
> 
> I report also what's int the sheep.log file of the nodes [note: 3]
> 
> I checked the filesystem of /mnt/sheep/dsk01 device and seems fine [note: 5]
> 
> Checking the smart of the disk I notice a value that might means problems:
>   183 Runtime_Bad_Block
> 
> I'm running a full test to check the disk status.
> 
> [note: 1]
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page
> allocation failure: order:4, mode:0xd0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm:
> attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace:
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725491]
> [<ffffffff810b88d7>] ? warn_alloc_failed+0x11a/0x12d
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725504]
> [<ffffffff810c9c7c>] ? next_zone+0x25/0x27
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725515]
> [<ffffffff810bb605>] ? __alloc_pages_nodemask+0x704/0x7aa
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725528]
> [<ffffffff810eaa2d>] ? kmem_getpages+0x4c/0xd9
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725538]
> [<ffffffff810ebdde>] ? fallback_alloc+0x13e/0x1e2
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725548]
> [<ffffffff8111312c>] ? getxattr+0x84/0x126
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725555]
> [<ffffffff810ec2e5>] ? __kmalloc+0xba/0x112
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725564]
> [<ffffffff810364e8>] ? should_resched+0x5/0x23
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725573]
> [<ffffffff8111312c>] ? getxattr+0x84/0x126
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725582]
> [<ffffffff810ead50>] ? kmem_cache_free+0x2d/0x69
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725592]
> [<ffffffff81105943>] ? user_path_at_empty+0x53/0x7b
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725603]
> [<ffffffff811138d3>] ? sys_lgetxattr+0x3f/0x5c
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725612]
> [<ffffffff8134f0f5>] ? page_fault+0x25/0x30
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725620]
> [<ffffffff81353b52>] ? system_call_fastpath+0x16/0x1b
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725626] Mem-Info:
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725630] Node 0 DMA per-cpu:
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725637] CPU    0: hi:
> 0, btch:   1 usd:   0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725643] CPU    1: hi:
> 0, btch:   1 usd:   0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725648] Node 0 DMA32 per-cpu:
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725655] CPU    0: hi:
> 186, btch:  31 usd:   0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725660] CPU    1: hi:
> 186, btch:  31 usd:   0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725665] Node 0 Normal per-cpu:
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725671] CPU    0: hi:
> 186, btch:  31 usd:   0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725676] CPU    1: hi:
> 186, btch:  31 usd:   0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725690] active_anon:12964
> inactive_anon:12371 isolated_anon:0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725693]
> active_file:169712 inactive_file:547532 isolated_file:0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725697]  unevictable:949
> dirty:1378 writeback:0 unstable:0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725700]  free:45231
> slab_reclaimable:147488 slab_unreclaimable:6686
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725703]  mapped:3632
> shmem:1940 pagetables:663 bounce:0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725710] Node 0 DMA
> free:15880kB min:264kB low:328kB high:396kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:20kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:8kB
> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725739] lowmem_reserve[]:
> 0 3383 3888 3888
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725748] Node 0 DMA32
> free:152592kB min:58576kB low:73220kB high:87864kB active_anon:45376kB
> inactive_anon:19188kB active_file:604448kB inactive_file:2116456kB
> unevictable:88kB isolated(anon):0kB isolated(file):0kB
> present:3464992kB mlocked:88kB dirty:4388kB writeback:0kB
> mapped:2364kB shmem:820kB slab_reclaimable:415320kB
> slab_unreclaimable:7688kB kernel_stack:224kB pagetables:716kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725779] lowmem_reserve[]:
> 0 0 505 505
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725788] Node 0 Normal
> free:12452kB min:8740kB low:10924kB high:13108kB active_anon:6480kB
> inactive_anon:30296kB active_file:74400kB inactive_file:73652kB
> unevictable:3708kB isolated(anon):0kB isolated(file):0kB
> present:517120kB mlocked:3708kB dirty:1124kB writeback:0kB
> mapped:12164kB shmem:6940kB slab_reclaimable:174624kB
> slab_unreclaimable:19056kB kernel_stack:1208kB pagetables:1936kB
> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725818] lowmem_reserve[]: 0 0 0 0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725826] Node 0 DMA: 2*4kB
> 2*8kB 1*16kB 3*32kB 4*64kB 3*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB
> 1*4096kB = 15880kB
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725848] Node 0 DMA32:
> 25550*4kB 5749*8kB 11*16kB 4*32kB 0*64kB 0*128kB 0*256kB 0*512kB
> 0*1024kB 2*2048kB 0*4096kB = 152592kB
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725869] Node 0 Normal:
> 2051*4kB 19*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
> 0*2048kB 1*4096kB = 12452kB
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725891] 721148 total
> pagecache pages
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725895] 1311 pages in swap cache
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725901] Swap cache stats:
> add 18691, delete 17380, find 3363212/3363399
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725906] Free swap  = 7760204kB
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725910] Total swap = 7815584kB
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785649] 1015792 pages RAM
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785656] 32718 pages reserved
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785664] 636786 pages shared
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785671] 298954 pages non-shared
> Aug  7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop
> messages from pid 23853 due to rate-limiting

Looks to me not a big problem, just allocation failed.

> 
> [note: 2]
> Aug  7 21:00:07 sheepdog004 corosync[4365]:   [TOTEM ] Retransmit
> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
> 758c 758d 758e 758f 7590 7591 7592
> Aug  7 21:00:07 sheepdog004 corosync[4365]:   [TOTEM ] Retransmit
> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
> 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599
> 759a 759b 759c
> ...

Hello corosync guys, is this normal? Sheep daemon detected a network partition.

> [note: 3]
> sheepdog001
> ...
> Aug 07 21:09:53 [main] recover_object_main(626) done:208768
> count:414381, oid:c8d13200032c4f
> Aug 07 21:09:53 [main] recover_object_main(626) done:208769
> count:414381, oid:c8d13600033c77
> Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition
> is detected

It looks like a corosync problem.

> Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted).
> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7ff842f2102f]
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7ff84252d474]
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7ff8425306ef]
> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7ff842a8cd74]
> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler
> Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop
> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc)
> [0x7ff842519eac]
> Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498]
> Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb
> Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb
> Aug 07 21:09:54 [main] crash_handler(487) sheep pid 23853 exited unexpectedly.
> 
> sheepdog002
> ...
> Aug 07 08:17:17 [gway 26470] wait_forward_request(176) poll timeout 1,
> disks of some nodes or network is busy. Going to poll-wait again
> Aug 07 08:17:17 [gway 26476] wait_forward_request(176) poll timeout 1,
> disks of some nodes or network is busy. Going to poll-wait again
> Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition
> is detected
> Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted).
> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7fd1e9afd02f]
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7fd1e9109474]
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7fd1e910c6ef]
> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7fd1e9668d74]
> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler
> Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop
> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main
> Aug 07 21:09:54 [main] sd_backtrace(848)
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc)
> [0x7fd1e90f5eac]
> Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498]
> Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb
> Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb
> Aug 07 21:09:54 [main] crash_handler(487) sheep pid 20833 exited unexpectedly.
> 
> sheepdog003 sheepdog003
>   seep.log is empty.
> 
> [note:4]
> sheepdog001 syslog
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page
> allocation failure: order:4, mode:0xd0
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm:
> attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1
> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace:
> ...
> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
> ...
> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
> Aug  7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop
> messages from pid 23853 due to rate-limiting
> Aug  7 21:00:07 sheepdog001 corosync[23846]:   [TOTEM ] Retransmit
> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
> 758c 758d 758e 758f 7590 7591 7592
> Aug  7 21:00:07 sheepdog001 corosync[23846]:   [TOTEM ] Retransmit
> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
> 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596
>  7597 7598 7599 759a 759b 759c

Again the same corosync messages. Cced corosync list.

Thanks
Yuan



More information about the sheepdog-users mailing list