[sheepdog-users] About node cable disconnection

Valerio Pachera sirio81 at gmail.com
Thu Dec 19 11:26:16 CET 2013


Hi, this is the test I run

cluster 4 nodes 2 nic each.
A guest running on node id 0 (that is also zookeeper leader).
eth0 is bridged, and user by zookeeper.
eth1 is I/O nic.

I removed the cable from eth0.
The guest continues to work as nothing happened.
I write a small file (inside the guest) of 10M and I get no errors.
I run 'sync' and it hangs.
I can stop it by ctrl+c and no errors are shown.

After 120 seconds it pops up the usual message
  'Task XXX blocked for more than 120 seconds'

I reconnect the cable and, as soon the node rejoin the cluster, I get lots
of
  'Buffer I/O error on device vda1, logical block 511XX'
in the console.

0) is it correct to receive the Buffer I/O error when the node is back?

In /var/log/messages I get this:

Dec 19 10:27:52 wheezytemplate kernel: [  360.741188] jbd2/vda1-8     D
ffff88001fd13780     0   150      2 0x00000000
Dec 19 10:27:52 wheezytemplate kernel: [  360.741202]  ffff88001c24d6d0
0000000000000046 ffff88001e392898 ffff88001fa06f60
Dec 19 10:27:52 wheezytemplate kernel: [  360.741215]  0000000000013780
ffff88001be01fd8 ffff88001be01fd8 ffff88001c24d6d0
Dec 19 10:27:52 wheezytemplate kernel: [  360.741225]  ffff88001e392898
ffffffff81065dbd ffff88001f914980 ffff88001fd13fd0
Dec 19 10:27:52 wheezytemplate kernel: [  360.741235] Call Trace:
Dec 19 10:27:52 wheezytemplate kernel: [  360.741254]  [<ffffffff81065dbd>]
? timekeeping_get_ns+0xd/0x2a
Dec 19 10:27:52 wheezytemplate kernel: [  360.741266]  [<ffffffff8111c835>]
? wait_on_buffer+0x28/0x28
Dec 19 10:27:52 wheezytemplate kernel: [  360.741276]  [<ffffffff8134bf7c>]
? io_schedule+0x59/0x71
Dec 19 10:27:52 wheezytemplate kernel: [  360.741285]  [<ffffffff8111c83b>]
? sleep_on_buffer+0x6/0xa
Dec 19 10:27:52 wheezytemplate kernel: [  360.741293]  [<ffffffff8134c3bf>]
? __wait_on_bit+0x3e/0x71
Dec 19 10:27:52 wheezytemplate kernel: [  360.741301]  [<ffffffff8134c461>]
? out_of_line_wait_on_bit+0x6f/0x78
Dec 19 10:27:52 wheezytemplate kernel: [  360.741309]  [<ffffffff8111c835>]
? wait_on_buffer+0x28/0x28
Dec 19 10:27:52 wheezytemplate kernel: [  360.741320]  [<ffffffff8105f87d>]
? autoremove_wake_function+0x2a/0x2a
Dec 19 10:27:52 wheezytemplate kernel: [  360.741371]  [<ffffffffa010b9c0>]
? jbd2_journal_commit_transaction+0xb9b/0x1057 [jbd2]
Dec 19 10:27:52 wheezytemplate kernel: [  360.741389]  [<ffffffff8100d02f>]
? load_TLS+0x7/0xa
Dec 19 10:27:52 wheezytemplate kernel: [  360.741403]  [<ffffffff8100d69e>]
? __switch_to+0x133/0x258
Dec 19 10:27:52 wheezytemplate kernel: [  360.741418]  [<ffffffff8134d03c>]
? _raw_spin_lock_irqsave+0x9/0x25
Dec 19 10:27:52 wheezytemplate kernel: [  360.741433]  [<ffffffff81052212>]
? lock_timer_base.isra.29+0x23/0x47
Dec 19 10:27:52 wheezytemplate kernel: [  360.741456]  [<ffffffffa010f0ae>]
? kjournald2+0xc0/0x20a [jbd2]
Dec 19 10:27:52 wheezytemplate kernel: [  360.741471]  [<ffffffff8105f853>]
? add_wait_queue+0x3c/0x3c
Dec 19 10:27:52 wheezytemplate kernel: [  360.741490]  [<ffffffffa010efee>]
? commit_timeout+0x5/0x5 [jbd2]
Dec 19 10:27:52 wheezytemplate kernel: [  360.741506]  [<ffffffff8105f201>]
? kthread+0x76/0x7e
Dec 19 10:27:52 wheezytemplate kernel: [  360.741521]  [<ffffffff81354174>]
? kernel_thread_helper+0x4/0x10
Dec 19 10:27:52 wheezytemplate kernel: [  360.741537]  [<ffffffff8105f18b>]
? kthread_worker_fn+0x139/0x139
Dec 19 10:27:52 wheezytemplate kernel: [  360.741550]  [<ffffffff81354170>]
? gs_change+0x13/0x13
Dec 19 10:29:00 wheezytemplate kernel: vda1, logical block 49398
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
524288 size 524288 starting block 49664)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
1048576 size 524288 starting block 49792)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
1572864 size 524288 starting block 49920)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
2097152 size 524288 starting block 50048)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
2621440 size 524288 starting block 50176)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
3145728 size 524288 starting block 50304)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
3670016 size 524288 starting block 50432)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
4194304 size 524288 starting block 50560)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
4718592 size 524288 starting block 50688)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
5242880 size 524288 starting block 50816)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
5767168 size 524288 starting block 50944)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
6291456 size 524288 starting block 51072)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
6815744 size 524288 starting block 51200)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
7340032 size 524288 starting block 51328)
Dec 19 10:29:00 wheezytemplate kernel: [  426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
7864320 size 524288 starting block 51456)
Dec 19 10:29:00 wheezytemplate kernel: [  428.812824] JBD2: Detected IO
errors while flushing file data on vda1-8

1) I've been writing two more file and rebooted the guest and, after ti,
they were there.
     That's the most important thing :-)


Now I wish to focus on sheep.log

First look at cluster info

Epoch Time
Version

2013-12-19 10:28:57     14 [192.168.10.4:7000, 192.168.10.5:7000,
192.168.10.6:7000, 192.168.10.7:7000]

2013-12-19 10:23:44     13 [192.168.10.5:7000, 192.168.10.6:7000,
192.168.10.7:7000]

2013-12-19 08:39:58     12 [192.168.10.4:7000, 192.168.10.5:7000,
192.168.10.6:7000, 192.168.10.7:7000]

This is sheep.log of the disconnected node
(...) means the last row repeats many times.

Dec 19 10:24:17  ERROR [gway 8654] wait_forward_request(461) fail
5ddf8800000000, Request has an old epoch
Dec 19 10:24:17  ERROR [gway 8654] wait_forward_request(461) fail
5ddf8800000000, Request has an old epoch
Dec 19 10:24:17  ERROR [gway 8655] wait_forward_request(461) fail
5ddf8800000400, Request has an old epoch
Dec 19 10:24:17  ERROR [gway 8653] wait_forward_request(461) fail
5ddf8800000600, Request has an old epoch
Dec 19 10:24:17  ERROR [gway 8685] wait_forward_request(461) fail
5ddf8800000420, Request has an old epoch
Dec 19 10:24:17  ERROR [gway 8685] wait_forward_request(461) fail
5ddf8800000420, Request has an old epoch
Dec 19 10:24:41  ERROR [gway 8683] sheep_exec_req(1008) failed Request has
an old epoch
Dec 19 10:24:41  ERROR [gway 8683] sheep_exec_req(1008) failed Request has
an old epoch
Dec 19 10:24:41  ERROR [gway 8654] sheep_exec_req(1008) failed Request has
an old epoc
...
Dec 19 10:28:57  ERROR [main] zk_event_handler(1194) detect a session
timeout. reconnecting...
Dec 19 10:28:57   INFO [main] send_join_request(777) IPv4 ip:192.168.10.4
port:7000
Dec 19 10:28:57   INFO [main] zk_find_master(803) detect master leave,
start to compete master
Dec 19 10:28:57   INFO [main] zk_event_handler(1196) reconnected
Dec 19 10:28:57   INFO [main] recover_object_main(841) object
5ddf8800000400 is recovered (1/216)
Dec 19 10:28:57   INFO [main] recover_object_main(841) object
5ddf8800000600 is recovered (2/216)
Dec 19 10:28:58   INFO [main] recover_object_main(841) object
5ddf88000000b1 is recovered (3/216)
...

2) There are tons of the same message:

grep -c 'sheep_exec_req(1008) failed Request has an old epoch'
/var/sheep/sheep.log

1141544


This is sheep.log of another node:

Dec 19 10:24:17  ERROR [main] check_request_epoch(151) old node version 13,
12 (WRITE_PEER)
Dec 19 10:24:41  ERROR [main] check_request_epoch(151) old node version 13,
12 (READ_PEER)
Dec 19 10:24:41  ERROR [main] check_request_epoch(151) old node version 13,
12 (READ_PEER)
...
Dec 19 10:28:57   INFO [main] recover_object_main(841) object
5ddf8800000052 is recovered (1/197)
Dec 19 10:28:57   INFO [main] recover_object_main(841) object
5ddf8800000094 is recovered (2/197)
...

sheep.log size is 96M


3) also this message is repeated many many times

grep -c 'check_request_epoch(151) old node version 13, 12 (READ_PEER)'
/var/sheep/sheep.log
223375

sheep.log size is 50M



4) I fear that, if the cable stay disconnecter for long time, sheep.log
will grow to much.


Make your considerations and tell me if there's something that may be
consider as a bug or wish list and I'll file it in launch pad.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.wpkg.org/pipermail/sheepdog-users/attachments/20131219/644ec30c/attachment-0004.html>


More information about the sheepdog-users mailing list