[sheepdog] [sbd] I/O stuck shortly after starting writes

Liu Yuan namei.unix at gmail.com
Mon Jul 21 07:43:44 CEST 2014


On Sun, Jul 20, 2014 at 01:08:12AM -0400, Miles Fidelman wrote:
> A little late to the party here, but just saw this... question at end..
> 
> 
> On Mon Jun 2 05:52:30 CEST 2014, Liu Yuan <namei.unix at gmail.com> wrote:
> 
> > On Sun, Jun 01, 2014 at 09:56:00PM +0200, Marcin Mirosław wrote:
> > > Hi!
> > > I'm launching three sheeps locally, creating vdi with EC 2:1. Next I'm
> > > starting sbd0 block device. mkfs.xfs /dev/sbd0 && mount ...
> > > Next step is starting simple dd command: dd if=/dev/zero
> > > of=/mnt/test/zero bs=4M count=2000
> > > After short moment I've got man sheep stuck in D state:
> > > sheepdog  4126  1.2  9.8 2199564 151052 ?      Sl   21:41   0:06
> > > /usr/sbin/sheep -n --port 7000 -z 0 /mnt/sdb1 --pidfile
> > > /run/sheepdog/sheepdog.sdb1
> > > sheepdog  4127  0.0  0.0  34468   396 ?        Ds   21:41   0:00
> > > /usr/sbin/sheep -n --port 7000 -z 0 /mnt/sdb1 --pidfile
> > > /run/sheepdog/sheepdog.sdb1
> > > sheepdog  4179  0.2  6.7 1855792 103780 ?      Sl   21:41   0:01
> > > /usr/sbin/sheep -n --port 7001 -z 1 /mnt/sdc1 --pidfile
> > > /run/sheepdog/sheepdog.sdc1
> > > sheepdog  4180  0.0  0.0  34468   396 ?        Ss   21:41   0:00
> > > /usr/sbin/sheep -n --port 7001 -z 1 /mnt/sdc1 --pidfile
> > > /run/sheepdog/sheepdog.sdc1
> > > sheepdog  4231  0.3  7.3 1863228 111700 ?      Sl   21:41   0:01
> > > /usr/sbin/sheep -n --port 7002 -z 2 /mnt/sdd1 --pidfile
> > > /run/sheepdog/sheepdog.sdd1
> > > sheepdog  4232  0.0  0.0  34468   400 ?        Ss   21:41   0:00
> > > /usr/sbin/sheep -n --port 7002 -z 2 /mnt/sdd1 --pidfile
> > > /run/sheepdog/sheepdog.sdd1
> > >
> > > Also dd stucks:
> > > root      4326  0.2  0.3  14764  4664 pts/1    D+   21:44   0:01 dd
> > > if=/dev/zero of=/mnt/test/zero bs=4M count=2000
> > >
> > > There is in dmesg:
> > >
> > > [ 6386.240000] INFO: rcu_sched self-detected stall on CPU { 0}  (t=6001 jiffies g=139833 c=139832 q=86946)
> > > [ 6386.240000] sending NMI to all CPUs:
> > > [ 6386.240000] NMI backtrace for cpu 0
> > > [ 6386.240000] CPU: 0 PID: 4286 Comm: sbd_submiter Tainted: P           O 3.12.20-gentoo #1
> > > [ 6386.240000] Hardware name: Gigabyte Technology Co., Ltd. 965P-S3/965P-S3, BIOS F14A 07/31/2008
> > > [ 6386.240000] task: ffff88001cff3960 ti: ffff88002f78a000 task.ti: ffff88002f78a000
> > > [ 6386.240000] RIP: 0010:[<ffffffff811d4542>]  [<ffffffff811d4542>] __const_udelay+0x12/0x30
> > > [ 6386.240000] RSP: 0000:ffff88005f403dc8  EFLAGS: 00000006
> > > [ 6386.240000] RAX: 0000000001062560 RBX: 0000000000002710 RCX: 0000000000000006
> > > [ 6386.240000] RDX: 0000000001140694 RSI: 0000000000000002 RDI: 0000000000418958
> > > [ 6386.240000] RBP: ffff88005f403de8 R08: 000000000000000a R09: 00000000000002bc
> > > [ 6386.240000] R10: 0000000000000000 R11: 00000000000002bb R12: ffffffff8149eec0
> > > [ 6386.240000] R13: ffffffff8149eec0 R14: ffff88005f40d700 R15: 00000000000153a2
> > > [ 6386.240000] FS:  0000000000000000(0000) GS:ffff88005f400000(0000) knlGS:0000000000000000
> > > [ 6386.240000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > [ 6386.240000] CR2: 00007f3bf6907000 CR3: 0000000001488000 CR4: 00000000000007f0
> > > [ 6386.240000] Stack:
> > > [ 6386.240000]  ffff88005f403de8 ffffffff8102d00a 0000000000000000 ffffffff814c43b8
> > > [ 6386.240000]  ffff88005f403e58 ffffffff810967ac ffff88001bcb1800 0000000000000001
> > > [ 6386.240000]  ffff88005f403e18 ffffffff81098407 ffff88002f78a000 0000000000000000
> > > [ 6386.240000] Call Trace:
> > > [ 6386.240000]  <IRQ>
> > > [ 6386.240000]  [<ffffffff8102d00a>] ? arch_trigger_all_cpu_backtrace+0x5a/0x80
> > > [ 6386.240000]  [<ffffffff810967ac>] rcu_check_callbacks+0x2fc/0x570
> > > [ 6386.240000]  [<ffffffff81098407>] ? acct_account_cputime+0x17/0x20
> > > [ 6386.240000]  [<ffffffff810494d3>] update_process_times+0x43/0x80
> > > [ 6386.240000]  [<ffffffff81082621>] tick_sched_handle.isra.12+0x31/0x40
> > > [ 6386.240000]  [<ffffffff81082764>] tick_sched_timer+0x44/0x70
> > > [ 6386.240000]  [<ffffffff8105dc4a>] __run_hrtimer.isra.29+0x4a/0xd0
> > > [ 6386.240000]  [<ffffffff8105e415>] hrtimer_interrupt+0xf5/0x230
> > > [ 6386.240000]  [<ffffffff8102b7f6>] local_apic_timer_interrupt+0x36/0x60
> > > [ 6386.240000]  [<ffffffff8102bc0e>] smp_apic_timer_interrupt+0x3e/0x60
> > > [ 6386.240000]  [<ffffffff8136aaca>] apic_timer_interrupt+0x6a/0x70
> > > [ 6386.240000]  <EOI>
> > > [ 6386.240000]  [<ffffffff811d577d>] ? __write_lock_failed+0xd/0x20
> > > [ 6386.240000]  [<ffffffff813690f2>] _raw_write_lock+0x12/0x20
> > > [ 6386.240000]  [<ffffffffa030579b>] sheep_aiocb_submit+0x2db/0x360 [sbd]
> > > [ 6386.240000]  [<ffffffffa030544e>] ? sheep_aiocb_setup+0x13e/0x1b0 [sbd]
> > > [ 6386.240000]  [<ffffffffa0304740>] 0xffffffffa030473f
> > > [ 6386.240000]  [<ffffffff8105b510>] ? finish_wait+0x80/0x80
> > > [ 6386.240000]  [<ffffffffa03046c0>] ? 0xffffffffa03046bf
> > > [ 6386.240000]  [<ffffffff8105af4b>] kthread+0xbb/0xc0
> > > [ 6386.240000]  [<ffffffff8105ae90>] ? kthread_create_on_node+0x120/0x120
> > > [ 6386.240000]  [<ffffffff81369d7c>] ret_from_fork+0x7c/0xb0
> > > [ 6386.240000]  [<ffffffff8105ae90>] ? kthread_create_on_node+0x120/0x120
> > > [ 6386.240000] Code: c8 5d c3 66 0f 1f 44 00 00 55 48 89 e5 ff 15 ae 2d
> > > 2d 00 5d c3 0f 1f 40 00 55 48 8d 04 bd 00 00 00 00 65 48 8b 14 25 20 0d
> > > 01 00 <48> 8d 14 92 48 89 e5 48 8d 14 92 f7 e2 48 8d 7a 01 ff 15 7f 2d
> > > [ 6386.240208] NMI backtrace for cpu 1
> > > [ 6386.240213] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O 3.12.20-gentoo #1
> > > [ 6386.240215] Hardware name: Gigabyte Technology Co., Ltd. 965P-S3/965P-S3, BIOS F14A 07/31/2008
> > > [ 6386.240218] task: ffff88005d07b960 ti: ffff88005d09a000 task.ti: ffff88005d09a000
> > > [ 6386.240220] RIP: 0010:[<ffffffff8100b536>]  [<ffffffff8100b536>] default_idle+0x6/0x10
> > > [ 6386.240227] RSP: 0018:ffff88005d09bea8  EFLAGS: 00000286
> > > [ 6386.240229] RAX: 00000000ffffffed RBX: ffff88005d09bfd8 RCX: 0100000000000000
> > > [ 6386.240231] RDX: 0100000000000000 RSI: 0000000000000000 RDI: 0000000000000001
> > > [ 6386.240234] RBP: ffff88005d09bea8 R08: 0000000000000000 R09: 0000000000000000
> > > [ 6386.240236] R10: ffff88005f480000 R11: 0000000000000e1e R12: ffffffff814c43b0
> > > [ 6386.240238] R13: ffff88005d09bfd8 R14: ffff88005d09bfd8 R15: ffff88005d09bfd8
> > > [ 6386.240241] FS:  0000000000000000(0000) GS:ffff88005f480000(0000) knlGS:0000000000000000
> > > [ 6386.240243] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > [ 6386.240246] CR2: 00007f31fe05d000 CR3: 000000002a0f5000 CR4: 00000000000007e0
> > > [ 6386.240247] Stack:
> > > [ 6386.240249]  ffff88005d09beb8 ffffffff8100bc56 ffff88005d09bf18 ffffffff81073c7a
> > > [ 6386.240253]  0000000000000000 ffff88005d09bfd8 ffff88005d09bef8 15e35c4b103d3891
> > > [ 6386.240256]  0000000000000001 0000000000000001 0000000000000001 0000000000000000
> > > [ 6386.240260] Call Trace:
> > > [ 6386.240264]  [<ffffffff8100bc56>] arch_cpu_idle+0x16/0x20
> > > [ 6386.240269]  [<ffffffff81073c7a>] cpu_startup_entry+0xda/0x1c0
> > > [ 6386.240273]  [<ffffffff8102a1f1>] start_secondary+0x1e1/0x240
> > > [ 6386.240275] Code: 21 ff ff ff 90 48 b8 00 00 00 00 01 00 00 00 48 89
> > > 07 e9 0e ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 55 48 89 e5
> > > fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 fe 48 c7 c7 40 e7 58 81
> >
> > Actually, this is a notorious memory deadlock and can't be solved if you run
> > sbd and sheep on the same node. It is similar to the problem you get with NFS
> > when the NFS server and client run on the same machine.
> >
> > Deadlock:
> > Suppose SBD has dirty pages to write out to its sheep backend. But sheep
> > itself needs some clean pages to hold this dirty data before it can do the
> > actual disk I/O, so it waits for the kernel to clean some dirty pages.
> > Unfortunately, those dirty pages need sheep's help to be written out.
> >
> > Solution:
> > never run the SBD client and the sheep daemon on the same node.
> >
> > Thanks
> > Yuan
> 
> Here I was, all excited to start experimenting with SBDs, to see if
> they could work as block devices for Xen, and I come across this
> caveat: "never run client SBD and sheep daemon on the same node."
> 
> This would seem to negate the whole purpose of exposing SBDs - i.e.,
> making Sheepdog usable with things other than KVM/Qemu.  Unless I'm
> radically mistaken, it's standard to mount a Sheepdog volume on the
> same node as a daemon - that's the basic architecture.  Seems like
> something is very broken if one can't use SBDs the same way.
> 
> Does this caveat also apply to mounting iSCSI or NBD volumes?
> 
> Miles Fidelman

Yes, the same deadlock problem exists for iSCSI, NBD, sheepdog and NFS. "Never
run them on the same node" is just the blunt solution; you actually can run SBD
and a storage node on the same machine. With careful tuning of the dirty-page
settings you can probably minimize the deadlock problem, even if you can't get
rid of it completely. See /proc/sys/vm.
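
For example, something along these lines (untested, and the numbers are only
illustrative; pick values that fit your RAM and workload) keeps the amount of
dirty page cache small, so writeback starts early and sheep is less likely to
block waiting for free memory:

  # throttle writers after ~64 MB of dirty data instead of a percentage of RAM
  sysctl -w vm.dirty_bytes=67108864
  # start background writeback after ~16 MB of dirty data
  sysctl -w vm.dirty_background_bytes=16777216
  # run the flusher threads more often and expire dirty data sooner
  sysctl -w vm.dirty_writeback_centisecs=100
  sysctl -w vm.dirty_expire_centisecs=300
  # keep a larger reserve of free pages for allocations made during writeback
  sysctl -w vm.min_free_kbytes=65536

The same knobs can also be set by writing to the files under /proc/sys/vm/ or
made persistent in /etc/sysctl.conf.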

Thanks
Yuan


