[sheepdog-users] Single disk getting full

Andrew J. Hobbs ajhobbs at desu.edu
Thu Aug 8 14:44:50 CEST 2013


When I first started with sheepdog, I had partitioning issues with 
corosync.  Zookeeper has been much more reliable for me.

We have one of our nodes in a separate building, and multicast wasn't 
reliably transiting the switches in between nodes.


On 08/07/2013 09:58 PM, Liu Yuan wrote:
> On Thu, Aug 08, 2013 at 02:23:11AM +0200, Valerio Pachera wrote:
>> Auch! This time things went bad:
>> cluster has stopped.
>>
>> But it seem not to be related to the recover of the "small" disk I
>> inserted back (/mnt/sheep/dsk01): it didn't get full.
>> It seems something about the command attr I run yesterday.
>>
>> sheepdog001
>> /dev/mapper/vg00-sheepdsk01                             167G   70G
>> 97G  43% /mnt/sheep/dsk01
>> /dev/sdb1                                               466G  367G
>> 100G  79% /mnt/sheep/dsk02
>> /dev/sdc1                                               1,9T  1,2T
>> 679G  64% /mnt/sheep/dsk03
>>
>> What I find unusual is a call trace on node 'sheepdog001' (the other
>> nodes have no call trace).
>> I can see that from dmesg, /var/log/messages and /var/log/syslog
>> [note: 1][note: 4]
>>
>> On all nodes I find corsync messages in /var/log/syslog, all at the
>> same time 21:00 [note: 2]
>> In sheepdog001 there's some extra message right after the call trace [note: 4]
>>
>> I report also what's int the sheep.log file of the nodes [note: 3]
>>
>> I checked the filesystem of /mnt/sheep/dsk01 device and seems fine [note: 5]
>>
>> Checking the smart of the disk I notice a value that might means problems:
>>    183 Runtime_Bad_Block
>>
>> I'm running a full test to check the disk status.
>>
>> [note: 1]
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page
>> allocation failure: order:4, mode:0xd0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm:
>> attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace:
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725491]
>> [<ffffffff810b88d7>] ? warn_alloc_failed+0x11a/0x12d
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725504]
>> [<ffffffff810c9c7c>] ? next_zone+0x25/0x27
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725515]
>> [<ffffffff810bb605>] ? __alloc_pages_nodemask+0x704/0x7aa
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725528]
>> [<ffffffff810eaa2d>] ? kmem_getpages+0x4c/0xd9
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725538]
>> [<ffffffff810ebdde>] ? fallback_alloc+0x13e/0x1e2
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725548]
>> [<ffffffff8111312c>] ? getxattr+0x84/0x126
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725555]
>> [<ffffffff810ec2e5>] ? __kmalloc+0xba/0x112
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725564]
>> [<ffffffff810364e8>] ? should_resched+0x5/0x23
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725573]
>> [<ffffffff8111312c>] ? getxattr+0x84/0x126
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725582]
>> [<ffffffff810ead50>] ? kmem_cache_free+0x2d/0x69
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725592]
>> [<ffffffff81105943>] ? user_path_at_empty+0x53/0x7b
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725603]
>> [<ffffffff811138d3>] ? sys_lgetxattr+0x3f/0x5c
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725612]
>> [<ffffffff8134f0f5>] ? page_fault+0x25/0x30
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725620]
>> [<ffffffff81353b52>] ? system_call_fastpath+0x16/0x1b
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725626] Mem-Info:
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725630] Node 0 DMA per-cpu:
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725637] CPU    0: hi:
>> 0, btch:   1 usd:   0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725643] CPU    1: hi:
>> 0, btch:   1 usd:   0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725648] Node 0 DMA32 per-cpu:
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725655] CPU    0: hi:
>> 186, btch:  31 usd:   0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725660] CPU    1: hi:
>> 186, btch:  31 usd:   0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725665] Node 0 Normal per-cpu:
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725671] CPU    0: hi:
>> 186, btch:  31 usd:   0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725676] CPU    1: hi:
>> 186, btch:  31 usd:   0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725690] active_anon:12964
>> inactive_anon:12371 isolated_anon:0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725693]
>> active_file:169712 inactive_file:547532 isolated_file:0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725697]  unevictable:949
>> dirty:1378 writeback:0 unstable:0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725700]  free:45231
>> slab_reclaimable:147488 slab_unreclaimable:6686
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725703]  mapped:3632
>> shmem:1940 pagetables:663 bounce:0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725710] Node 0 DMA
>> free:15880kB min:264kB low:328kB high:396kB active_anon:0kB
>> inactive_anon:0kB active_file:0kB inactive_file:20kB unevictable:0kB
>> isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB
>> dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:8kB
>> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
>> bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725739] lowmem_reserve[]:
>> 0 3383 3888 3888
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725748] Node 0 DMA32
>> free:152592kB min:58576kB low:73220kB high:87864kB active_anon:45376kB
>> inactive_anon:19188kB active_file:604448kB inactive_file:2116456kB
>> unevictable:88kB isolated(anon):0kB isolated(file):0kB
>> present:3464992kB mlocked:88kB dirty:4388kB writeback:0kB
>> mapped:2364kB shmem:820kB slab_reclaimable:415320kB
>> slab_unreclaimable:7688kB kernel_stack:224kB pagetables:716kB
>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
>> all_unreclaimable? no
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725779] lowmem_reserve[]:
>> 0 0 505 505
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725788] Node 0 Normal
>> free:12452kB min:8740kB low:10924kB high:13108kB active_anon:6480kB
>> inactive_anon:30296kB active_file:74400kB inactive_file:73652kB
>> unevictable:3708kB isolated(anon):0kB isolated(file):0kB
>> present:517120kB mlocked:3708kB dirty:1124kB writeback:0kB
>> mapped:12164kB shmem:6940kB slab_reclaimable:174624kB
>> slab_unreclaimable:19056kB kernel_stack:1208kB pagetables:1936kB
>> unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
>> all_unreclaimable? no
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725818] lowmem_reserve[]: 0 0 0 0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725826] Node 0 DMA: 2*4kB
>> 2*8kB 1*16kB 3*32kB 4*64kB 3*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB
>> 1*4096kB = 15880kB
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725848] Node 0 DMA32:
>> 25550*4kB 5749*8kB 11*16kB 4*32kB 0*64kB 0*128kB 0*256kB 0*512kB
>> 0*1024kB 2*2048kB 0*4096kB = 152592kB
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725869] Node 0 Normal:
>> 2051*4kB 19*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
>> 0*2048kB 1*4096kB = 12452kB
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725891] 721148 total
>> pagecache pages
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725895] 1311 pages in swap cache
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725901] Swap cache stats:
>> add 18691, delete 17380, find 3363212/3363399
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725906] Free swap  = 7760204kB
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725910] Total swap = 7815584kB
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785649] 1015792 pages RAM
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785656] 32718 pages reserved
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785664] 636786 pages shared
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.785671] 298954 pages non-shared
>> Aug  7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop
>> messages from pid 23853 due to rate-limiting
> Looks to me not a big problem, just allocation failed.
>
>> [note: 2]
>> Aug  7 21:00:07 sheepdog004 corosync[4365]:   [TOTEM ] Retransmit
>> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
>> 758c 758d 758e 758f 7590 7591 7592
>> Aug  7 21:00:07 sheepdog004 corosync[4365]:   [TOTEM ] Retransmit
>> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
>> 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596 7597 7598 7599
>> 759a 759b 759c
>> ...
> Hello corosync guys, is this normal? Sheep daemon detected a network partition.
>
>> [note: 3]
>> sheepdog001
>> ...
>> Aug 07 21:09:53 [main] recover_object_main(626) done:208768
>> count:414381, oid:c8d13200032c4f
>> Aug 07 21:09:53 [main] recover_object_main(626) done:208769
>> count:414381, oid:c8d13600033c77
>> Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition
>> is detected
> It looks like a corosync problem.
>
>> Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted).
>> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7ff842f2102f]
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7ff84252d474]
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7ff8425306ef]
>> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7ff842a8cd74]
>> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler
>> Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop
>> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc)
>> [0x7ff842519eac]
>> Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498]
>> Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb
>> Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb
>> Aug 07 21:09:54 [main] crash_handler(487) sheep pid 23853 exited unexpectedly.
>>
>> sheepdog002
>> ...
>> Aug 07 08:17:17 [gway 26470] wait_forward_request(176) poll timeout 1,
>> disks of some nodes or network is busy. Going to poll-wait again
>> Aug 07 08:17:17 [gway 26476] wait_forward_request(176) poll timeout 1,
>> disks of some nodes or network is busy. Going to poll-wait again
>> Aug 07 21:09:54 [main] cdrv_cpg_confchg(602) PANIC: Network partition
>> is detected
>> Aug 07 21:09:54 [main] crash_handler(181) sheep exits unexpectedly (Aborted).
>> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:183: crash_handler
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libpthread.so.0(+0xf02f) [0x7fd1e9afd02f]
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x34) [0x7fd1e9109474]
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libc.so.6(abort+0x17f) [0x7fd1e910c6ef]
>> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:602: cdrv_cpg_confchg
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /usr/lib/libcpg.so.4(cpg_dispatch+0x594) [0x7fd1e9668d74]
>> Aug 07 21:09:54 [main] sd_backtrace(834) corosync.c:744: corosync_handler
>> Aug 07 21:09:54 [main] sd_backtrace(834) event.c:209: do_event_loop
>> Aug 07 21:09:54 [main] sd_backtrace(834) sheep.c:795: main
>> Aug 07 21:09:54 [main] sd_backtrace(848)
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfc)
>> [0x7fd1e90f5eac]
>> Aug 07 21:09:54 [main] sd_backtrace(848) sheep() [0x405498]
>> Aug 07 21:09:54 [main] __dump_stack_frames(744) cannot find gdb
>> Aug 07 21:09:54 [main] __sd_dump_variable(694) cannot find gdb
>> Aug 07 21:09:54 [main] crash_handler(487) sheep pid 20833 exited unexpectedly.
>>
>> sheepdog003 sheepdog003
>>    seep.log is empty.
>>
>> [note:4]
>> sheepdog001 syslog
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725454] attr: page
>> allocation failure: order:4, mode:0xd0
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725468] Pid: 14171, comm:
>> attr Not tainted 3.2.0-4-amd64 #1 Debian 3.2.46-1
>> Aug  7 15:34:05 sheepdog001 kernel: [3543736.725474] Call Trace:
>> ...
>> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
>> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
>> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
>> ...
>> Aug  7 16:28:04 sheepdog001 sheep: enqueue: log area overrun, dropping message
>> Aug  7 16:28:04 sheepdog001 rsyslogd-2177: imuxsock begins to drop
>> messages from pid 23853 due to rate-limiting
>> Aug  7 21:00:07 sheepdog001 corosync[23846]:   [TOTEM ] Retransmit
>> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
>> 758c 758d 758e 758f 7590 7591 7592
>> Aug  7 21:00:07 sheepdog001 corosync[23846]:   [TOTEM ] Retransmit
>> List: 757f 7580 7581 7582 7583 7584 7585 7586 7587 7588 7589 758a 758b
>> 758c 758d 758e 758f 7590 7591 7592 7593 7594 7595 7596
>>   7597 7598 7599 759a 759b 759c
> Again the same corosync messages. Cced corosync list.
>
> Thanks
> Yuan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ajhobbs.vcf
Type: text/x-vcard
Size: 353 bytes
Desc: ajhobbs.vcf
URL: <http://lists.wpkg.org/pipermail/sheepdog-users/attachments/20130808/f0899e2d/attachment-0005.vcf>


More information about the sheepdog-users mailing list