[sheepdog-users] About node cable disconnection
Valerio Pachera
sirio81 at gmail.com
Thu Dec 19 11:26:16 CET 2013
Hi, this is the test I run
cluster 4 nodes 2 nic each.
A guest running on node id 0 (that is also zookeeper leader).
eth0 is bridged, and user by zookeeper.
eth1 is I/O nic.
I removed the cable from eth0.
The guest continues to work as nothing happened.
I write a small file (inside the guest) of 10M and I get no errors.
I run 'sync' and it hangs.
I can stop it by ctrl+c and no errors are shown.
After 120 seconds it pops up the usual message
'Task XXX blocked for more than 120 seconds'
I reconnect the cable and, as soon the node rejoin the cluster, I get lots
of
'Buffer I/O error on device vda1, logical block 511XX'
in the console.
0) is it correct to receive the Buffer I/O error when the node is back?
In /var/log/messages I get this:
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741188] jbd2/vda1-8 D
ffff88001fd13780 0 150 2 0x00000000
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741202] ffff88001c24d6d0
0000000000000046 ffff88001e392898 ffff88001fa06f60
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741215] 0000000000013780
ffff88001be01fd8 ffff88001be01fd8 ffff88001c24d6d0
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741225] ffff88001e392898
ffffffff81065dbd ffff88001f914980 ffff88001fd13fd0
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741235] Call Trace:
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741254] [<ffffffff81065dbd>]
? timekeeping_get_ns+0xd/0x2a
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741266] [<ffffffff8111c835>]
? wait_on_buffer+0x28/0x28
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741276] [<ffffffff8134bf7c>]
? io_schedule+0x59/0x71
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741285] [<ffffffff8111c83b>]
? sleep_on_buffer+0x6/0xa
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741293] [<ffffffff8134c3bf>]
? __wait_on_bit+0x3e/0x71
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741301] [<ffffffff8134c461>]
? out_of_line_wait_on_bit+0x6f/0x78
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741309] [<ffffffff8111c835>]
? wait_on_buffer+0x28/0x28
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741320] [<ffffffff8105f87d>]
? autoremove_wake_function+0x2a/0x2a
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741371] [<ffffffffa010b9c0>]
? jbd2_journal_commit_transaction+0xb9b/0x1057 [jbd2]
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741389] [<ffffffff8100d02f>]
? load_TLS+0x7/0xa
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741403] [<ffffffff8100d69e>]
? __switch_to+0x133/0x258
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741418] [<ffffffff8134d03c>]
? _raw_spin_lock_irqsave+0x9/0x25
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741433] [<ffffffff81052212>]
? lock_timer_base.isra.29+0x23/0x47
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741456] [<ffffffffa010f0ae>]
? kjournald2+0xc0/0x20a [jbd2]
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741471] [<ffffffff8105f853>]
? add_wait_queue+0x3c/0x3c
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741490] [<ffffffffa010efee>]
? commit_timeout+0x5/0x5 [jbd2]
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741506] [<ffffffff8105f201>]
? kthread+0x76/0x7e
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741521] [<ffffffff81354174>]
? kernel_thread_helper+0x4/0x10
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741537] [<ffffffff8105f18b>]
? kthread_worker_fn+0x139/0x139
Dec 19 10:27:52 wheezytemplate kernel: [ 360.741550] [<ffffffff81354170>]
? gs_change+0x13/0x13
Dec 19 10:29:00 wheezytemplate kernel: vda1, logical block 49398
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
524288 size 524288 starting block 49664)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
1048576 size 524288 starting block 49792)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
1572864 size 524288 starting block 49920)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
2097152 size 524288 starting block 50048)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
2621440 size 524288 starting block 50176)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
3145728 size 524288 starting block 50304)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
3670016 size 524288 starting block 50432)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
4194304 size 524288 starting block 50560)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
4718592 size 524288 starting block 50688)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
5242880 size 524288 starting block 50816)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
5767168 size 524288 starting block 50944)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
6291456 size 524288 starting block 51072)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
6815744 size 524288 starting block 51200)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
7340032 size 524288 starting block 51328)
Dec 19 10:29:00 wheezytemplate kernel: [ 426.107078] EXT4-fs warning
(device vda1): ext4_end_bio:251: I/O error writing to inode 260693 (offset
7864320 size 524288 starting block 51456)
Dec 19 10:29:00 wheezytemplate kernel: [ 428.812824] JBD2: Detected IO
errors while flushing file data on vda1-8
1) I've been writing two more file and rebooted the guest and, after ti,
they were there.
That's the most important thing :-)
Now I wish to focus on sheep.log
First look at cluster info
Epoch Time
Version
2013-12-19 10:28:57 14 [192.168.10.4:7000, 192.168.10.5:7000,
192.168.10.6:7000, 192.168.10.7:7000]
2013-12-19 10:23:44 13 [192.168.10.5:7000, 192.168.10.6:7000,
192.168.10.7:7000]
2013-12-19 08:39:58 12 [192.168.10.4:7000, 192.168.10.5:7000,
192.168.10.6:7000, 192.168.10.7:7000]
This is sheep.log of the disconnected node
(...) means the last row repeats many times.
Dec 19 10:24:17 ERROR [gway 8654] wait_forward_request(461) fail
5ddf8800000000, Request has an old epoch
Dec 19 10:24:17 ERROR [gway 8654] wait_forward_request(461) fail
5ddf8800000000, Request has an old epoch
Dec 19 10:24:17 ERROR [gway 8655] wait_forward_request(461) fail
5ddf8800000400, Request has an old epoch
Dec 19 10:24:17 ERROR [gway 8653] wait_forward_request(461) fail
5ddf8800000600, Request has an old epoch
Dec 19 10:24:17 ERROR [gway 8685] wait_forward_request(461) fail
5ddf8800000420, Request has an old epoch
Dec 19 10:24:17 ERROR [gway 8685] wait_forward_request(461) fail
5ddf8800000420, Request has an old epoch
Dec 19 10:24:41 ERROR [gway 8683] sheep_exec_req(1008) failed Request has
an old epoch
Dec 19 10:24:41 ERROR [gway 8683] sheep_exec_req(1008) failed Request has
an old epoch
Dec 19 10:24:41 ERROR [gway 8654] sheep_exec_req(1008) failed Request has
an old epoc
...
Dec 19 10:28:57 ERROR [main] zk_event_handler(1194) detect a session
timeout. reconnecting...
Dec 19 10:28:57 INFO [main] send_join_request(777) IPv4 ip:192.168.10.4
port:7000
Dec 19 10:28:57 INFO [main] zk_find_master(803) detect master leave,
start to compete master
Dec 19 10:28:57 INFO [main] zk_event_handler(1196) reconnected
Dec 19 10:28:57 INFO [main] recover_object_main(841) object
5ddf8800000400 is recovered (1/216)
Dec 19 10:28:57 INFO [main] recover_object_main(841) object
5ddf8800000600 is recovered (2/216)
Dec 19 10:28:58 INFO [main] recover_object_main(841) object
5ddf88000000b1 is recovered (3/216)
...
2) There are tons of the same message:
grep -c 'sheep_exec_req(1008) failed Request has an old epoch'
/var/sheep/sheep.log
1141544
This is sheep.log of another node:
Dec 19 10:24:17 ERROR [main] check_request_epoch(151) old node version 13,
12 (WRITE_PEER)
Dec 19 10:24:41 ERROR [main] check_request_epoch(151) old node version 13,
12 (READ_PEER)
Dec 19 10:24:41 ERROR [main] check_request_epoch(151) old node version 13,
12 (READ_PEER)
...
Dec 19 10:28:57 INFO [main] recover_object_main(841) object
5ddf8800000052 is recovered (1/197)
Dec 19 10:28:57 INFO [main] recover_object_main(841) object
5ddf8800000094 is recovered (2/197)
...
sheep.log size is 96M
3) also this message is repeated many many times
grep -c 'check_request_epoch(151) old node version 13, 12 (READ_PEER)'
/var/sheep/sheep.log
223375
sheep.log size is 50M
4) I fear that, if the cable stay disconnecter for long time, sheep.log
will grow to much.
Make your considerations and tell me if there's something that may be
consider as a bug or wish list and I'll file it in launch pad.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.wpkg.org/pipermail/sheepdog-users/attachments/20131219/644ec30c/attachment-0004.html>
More information about the sheepdog-users
mailing list