[Sheepdog] Sheepdog+iscsi high availability

joby xavier jobycxa at gmail.com
Mon Apr 16 13:29:58 CEST 2012


When I shut down networking on "node a", or shut the node down completely,
ucarp switches its virtual IP to "node b", so the iSCSI traffic should then
go through "node b". Both nodes have the same IQN.

Here are the logs from both nodes:

*node a*


Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.91:7000: Network is unreachable
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.222:7000: Network is unreachable
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.117:7000: Network is unreachable
Apr 16 16:50:42 check_majority(709) the majority of nodes are not alive
Apr 16 16:50:42 __sd_leave(736) perhaps a network partition has occurred?
Apr 16 16:50:42 log_sigexit(361) sheep pid 8954 exiting.
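
As a side note, the node a log above reads like a plain majority check: once
node a loses its network it can reach none of the other three nodes, so it
counts only itself and exits. A minimal sketch of that rule, assuming a
strict "more than half" quorum and a four-node cluster; the function name is
illustrative and this is not Sheepdog's actual check_majority code.

```python
def has_majority(reachable: int, total: int) -> bool:
    """Return True when strictly more than half of the nodes are reachable.

    Illustrative helper only (an assumption, not taken from Sheepdog).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 2 * reachable > total


# Node a with its network down: it can only count itself.
isolated = has_majority(reachable=1, total=4)
# The three remaining nodes can still see each other.
survivors = has_majority(reachable=3, total=4)
print(isolated, survivors)
```

Under this rule an even split (2 of 4) also fails the check, which is the
usual way to keep both halves of a partition from carrying on independently.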

*node b*


Apr 16 16:50:42 recover_object(1412) done:0 count:159, oid:65958b000000db
Apr 16 16:50:48 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:48 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:49 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:49 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:51 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:51 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:51 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:51 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:51 do_recover_object(1363) can not recover oid 65958b000000db
Apr 16 16:50:52 recover_object(1412) done:1 count:159, oid:65958b00000143
Apr 16 16:50:52 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:52 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:52 do_recover_object(1363) can not recover oid 65958b00000143
Apr 16 16:50:52 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:54 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:54 recover_object(1412) done:2 count:159, oid:65958b000000d6
Apr 16 16:50:54 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:54 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:54 do_recover_object(1363) can not recover oid 65958b000000d6
Apr 16 16:50:54 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 recover_object(1412) done:3 count:159, oid:65958b000000e7
Apr 16 16:50:56 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:56 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:56 do_recover_object(1363) can not recover oid 65958b000000e7
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:57 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:58 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2

Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 recover_object(1412) done:4 count:159, oid:65958b00000117
Apr 16 16:50:59 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:59 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:59 do_recover_object(1363) can not recover oid 65958b00000117
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:51:00 recover_object(1412) done:5 count:159, oid:65958b000000dc
Apr 16 16:51:00 do_recover_object(1363) can not recover oid 65958b000000dc
Apr 16 16:51:00 recover_object(1412) done:6 count:159, oid:65958b000000cc
Apr 16 16:51:00 do_recover_object(1363) can not recover oid 65958b000000cc
Apr 16 16:51:01 recover_object(1412) done:7 count:159, oid:65958b00000145
Apr 16 16:51:01 recover_object(1412) done:8 count:159, oid:65958b0000017b
Apr 16 16:51:01 recover_object(1412) done:9 count:159, oid:65958b0000000b
Apr 16 16:51:01 recover_object(1412) done:10 count:159, oid:65958b000000d5
Apr 16 16:51:01 recover_object(1412) done:11 count:159, oid:65958b00000022
Apr 16 16:51:01 do_recover_object(1363) can not recover oid 65958b00000022
Apr 16 16:51:02 recover_object(1412) done:12 count:159, oid:65958b00000131
Apr 16 16:51:02 do_recover_object(1363) can not recover oid 65958b00000131
Apr 16 16:51:02 fix_object_consistency(738) failed to read object 2
Apr 16 16:51:03 recover_object(1412) done:13 count:159, oid:65958b00000101
Apr 16 16:51:03 do_recover_object(1363) can not recover oid 65958b00000101
Apr 16 16:51:04 recover_object(1412) done:14 count:159, oid:65958b00000159
Apr 16 16:51:04 do_recover_object(1363) can not recover oid 65958b00000159
Apr 16 16:51:05 recover_object(1412) done:15 count:159, oid:65958b00000115
Apr 16 16:51:05 recover_object(1412) done:16 count:159, oid:65958b000000f7
Apr 16 16:51:05 do_recover_object(1363) can not recover oid 65958b000000f7
Apr 16 16:51:06 recover_object(1412) done:17 count:159, oid:65958b000000c7
Apr 16 16:51:06 do_recover_object(1363) can not recover oid 65958b000000c7
Apr 16 16:51:06 fix_object_consistency(738) failed to read object 2
Apr 16 16:51:07 recover_object(1412) done:18 count:159, oid:65958b00000182
Apr 16 16:51:07 do_recover_object(1363) can not recover oid 65958b00000182
Apr 16 16:51:08 recover_object(1412) done:19 count:159, oid:65958b00000129
Apr 16 16:51:08 do_recover_object(1363) can not recover oid 65958b00000129


Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:44 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:46 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2

Apr 16 16:59:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:59:39 fix_object_consistency(738) failed to read object 2
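
The node b log is long, so it may help to condense it with a short script.
This is a minimal sketch, assuming the log format shown above; the function
name summarize and the result keys are made up for illustration, and an oid
is counted as failed only when a "can not recover oid" line names it. Mail
clients may wrap long log lines, so whitespace is collapsed before matching.

```python
import re
from collections import Counter

# Patterns follow the sheep log lines pasted in this message.
REFUSED = re.compile(r"connect_to\(\d+\) failed to connect to (\d+\.\d+\.\d+\.\d+):\d+")
ATTEMPT = re.compile(r"recover_object\(\d+\) done:\d+ count:\d+, oid:([0-9a-f]+)")
FAILED = re.compile(r"can not recover oid ([0-9a-f]+)")


def summarize(log_text: str) -> dict:
    """Count unreachable peers and list the oids that could not be recovered."""
    # Join wrapped lines by collapsing every run of whitespace to one space.
    flat = " ".join(log_text.split())
    peers = Counter(REFUSED.findall(flat))
    attempted = ATTEMPT.findall(flat)
    failed = sorted(set(FAILED.findall(flat)))
    return {
        "refused_peers": dict(peers),
        "attempted": len(attempted),
        "failed": failed,
    }


sample = """\
Apr 16 16:50:42 recover_object(1412) done:0 count:159, oid:65958b000000db
Apr 16 16:50:51 connect_to(227) failed to connect to 192.168.1.29:7000:
Connection refused
Apr 16 16:50:51 do_recover_object(1363) can not recover oid 65958b000000db
Apr 16 16:51:01 recover_object(1412) done:7 count:159, oid:65958b00000145
"""
print(summarize(sample))
```

Run against the full log, this gives a quick view of which peer is refusing
connections and how many of the recovery attempts ended in failure.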



Thanks,
Joby Xavier

On Mon, Apr 16, 2012 at 3:07 PM, Huxinwei <huxinwei at huawei.com> wrote:

> When the failover failed, did you use the ucarp hook to restart the
> SCSI target on ‘node b’?
>
> Also, do you have logs from both target nodes? They would be very helpful.
>
> Thanks.
>
>
> *From:* sheepdog-bounces at lists.wpkg.org [mailto:
> sheepdog-bounces at lists.wpkg.org] *On Behalf Of *joby xavier
> *Sent:* Monday, April 16, 2012 4:59 PM
> *To:* sheepdog at lists.wpkg.org
> *Subject:* [Sheepdog] Sheepdog+iscsi high availability
>
> Hi,
>
> We would like to set up iSCSI high availability with Sheepdog distributed
> storage.
>
> Here is our system setup. OS: Ubuntu. Four nodes run Sheepdog distributed
> storage, and we share this storage over iSCSI from two of the nodes, using
> a virtual IP set up with ucarp. The two nodes use the same IQN. The iSCSI
> storage is mounted as an LVM partition (sdc).
>
> node a
> node b
> node c
> node d
> node x is the initiator
> Nodes a and b share a common virtual IP so that if 'node a' fails, 'node
> b' can serve as the iSCSI target; both have the same IQN.
>
> Problem: when a failover happens, i.e. the iSCSI target switches from node a
> to node b, the iSCSI disk fails on the initiator, 'node x'.
>
> Here is /var/log/messages:
>
> Apr 16 10:57:14 prox1 kernel: scsi7 : iSCSI Initiator over TCP/IP
> Apr 16 10:57:14 prox1 kernel: scsi 7:0:0:0: RAID              IET
> Controller       0001 PQ: 0 ANSI: 5
> Apr 16 10:57:14 prox1 kernel: scsi 7:0:0:1: Direct-Access     IET
> VIRTUAL-DISK     0001 PQ: 0 ANSI: 5
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] 2252800 512-byte logical
> blocks: (1.15 GB/1.07 GiB)
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Write Protect is off
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> Apr 16 10:57:14 prox1 kernel: sdc: unknown partition table
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Attached SCSI disk
>
> Apr 16 10:59:47 prox1 kernel: connection2:0: detected conn error (1020)
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 00 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 00 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 08 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 00 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
>
> root@prox1:~# pvdisplay
>   /dev/sdc: read failed after 0 of 4096 at 1153368064: Input/output error
>   /dev/sdc: read failed after 0 of 4096 at 1153425408: Input/output error
>
> Sheepdog with single-node iSCSI
> (https://github.com/collie/sheepdog/wiki/General-protocol-support) works
> well.
>
> Should we change anything in lvm.conf? Should we use multipath-tools? Is this
> the right procedure?
>
>
> Thanks,
>
>
> Joby Xavier
>
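
The numbers in the initiator log quoted above can be cross-checked with a
little arithmetic. A sketch, assuming the standard READ(10) layout (opcode
0x28, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes
7-8) and the 512-byte block size reported by the kernel; decode_read10 is a
hypothetical helper, not part of any tool mentioned in this thread.

```python
import struct

BLOCK_SIZE = 512            # "512-byte logical blocks" from the kernel log
TOTAL_BLOCKS = 2252800      # block count reported for sdc
DEVICE_BYTES = BLOCK_SIZE * TOTAL_BLOCKS


def decode_read10(cdb_hex: str) -> tuple:
    """Decode a SCSI READ(10) CDB into (starting LBA, number of blocks)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    # opcode, flags, LBA (4 bytes), group, transfer length (2 bytes), control
    _op, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return lba, blocks


# The two distinct CDBs that appear in /var/log/messages.
for cdb in ("28 00 00 00 00 00 00 00 08 00", "28 00 00 00 00 08 00 00 08 00"):
    lba, blocks = decode_read10(cdb)
    print(f"LBA {lba}, {blocks} blocks = {blocks * BLOCK_SIZE} bytes")

# Distance of the failing pvdisplay reads from the end of the device.
for offset in (1153368064, 1153425408):
    print(f"offset {offset} is {DEVICE_BYTES - offset} bytes before the end")
```

The failing kernel reads are 4096-byte requests at LBA 0 and LBA 8, and the
two pvdisplay offsets sit 64 KiB and 8 KiB before the end of the
1153433600-byte device, so the errors show up at both ends of the disk
rather than in one region.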