[Sheepdog] Sheepdog+iscsi high availability

joby xavier jobycxa at gmail.com
Tue Apr 17 11:39:14 CEST 2012


I have 5 sheepdog nodes, formatted with "collie cluster format --copies=2".
After one node fails, all the other nodes still show as active:

root@nodeb:~# collie node list
M   Id   Host:Port         V-Nodes       Zone
-    0   192.168.1.27:7000       64  453093568
-    1   192.168.1.91:7000       64 1526835392
-    2   192.168.1.117:7000      64 1963043008
-    3   192.168.1.222:7000      64 -570316608
root@prox3:~# collie vdi list
  Name        Id    Size    Used  Shared    Creation time   VDI id
  tom          1  500 MB  500 MB  0.0 MB 2012-04-17 11:29   65958b
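
To see how far recovery got on node b, the sheep log quoted further down can be tallied with a small script. This is only a sketch: the regular expressions are written against the exact log lines in this thread, the function name summarize is made up for illustration, and other sheepdog versions may format these messages differently.

```python
import re
from collections import Counter

# Patterns copied from the sheep log lines quoted in this thread (assumption:
# the message format is stable; check it against your own sheep.log).
PROGRESS = re.compile(r"recover_object\(\d+\) done:(\d+) count:(\d+), oid:([0-9a-f]+)")
FAILED = re.compile(r"do_recover_object\(\d+\) can not recover oid ([0-9a-f]+)")
REFUSED = re.compile(r"connect_to\(\d+\) failed to connect to (\d+\.\d+\.\d+\.\d+:\d+)")


def summarize(lines):
    """Tally recovery attempts, failed objects, and unreachable peers."""
    attempted, failed = set(), set()
    unreachable = Counter()
    total = None
    for line in lines:
        line = line.lstrip("> ")  # tolerate mail-quoted log lines
        m = PROGRESS.search(line)
        if m:
            total = int(m.group(2))      # "count:" is the number of objects to recover
            attempted.add(m.group(3))
            continue
        m = FAILED.search(line)
        if m:
            failed.add(m.group(1))
            continue
        m = REFUSED.search(line)
        if m:
            unreachable[m.group(1)] += 1
    return {
        "total": total,
        "attempted": len(attempted),
        "failed": len(failed),
        "unreachable": dict(unreachable),
    }


# A few lines taken from the node b log in this thread.
sample = [
    "Apr 16 16:50:42 recover_object(1412) done:0 count:159, oid:65958b000000db",
    "Apr 16 16:50:51 connect_to(227) failed to connect to 192.168.1.29:7000:",
    "Apr 16 16:50:51 do_recover_object(1363) can not recover oid 65958b000000db",
    "Apr 16 16:50:52 recover_object(1412) done:1 count:159, oid:65958b00000143",
]
print(summarize(sample))
```

Run against the full log, this shows which peer address recovery keeps trying to reach and how many of the objects were reported as unrecoverable.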

Thanks,


On Tue, Apr 17, 2012 at 6:40 AM, Huxinwei <huxinwei at huawei.com> wrote:

> How many replicas do you have in the cluster?
>
> The log indicates that recovery failed due to “No object found”.
>
> *From:* joby xavier [mailto:jobycxa at gmail.com]
> *Sent:* Monday, April 16, 2012 7:30 PM
> *To:* Huxinwei
> *Cc:* sheepdog at lists.wpkg.org
> *Subject:* Re: [Sheepdog] Sheepdog+iscsi high availability
>
> When I shut down networking on "node a", or shut the node down completely,
> ucarp switches its virtual IP to "node b", so the iSCSI traffic should then
> go through "node b". Both nodes have the same IQN.
>
> The logs follow.
>
> *node a*
>
>
> Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.91:7000:
> Network is unreachable
> Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.222:7000:
> Network is unreachable
> Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.117:7000:
> Network is unreachable
> Apr 16 16:50:42 check_majority(709) the majority of nodes are not alive
> Apr 16 16:50:42 __sd_leave(736) perhaps a network partition has occurred?
> Apr 16 16:50:42 log_sigexit(361) sheep pid 8954 exiting.
>
> *node b*
>
>
> Apr 16 16:50:42 recover_object(1412) done:0 count:159, oid:65958b000000db
> Apr 16 16:50:48 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:48 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:49 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:49 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:51 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:51 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:51 connect_to(227) failed to connect to 192.168.1.29:7000:
> Connection refused
> Apr 16 16:50:51 recover_object_from_replica(1240) failed to connect to
> 192.168.1.29:7000
> Apr 16 16:50:51 do_recover_object(1363) can not recover oid 65958b000000db
> Apr 16 16:50:52 recover_object(1412) done:1 count:159, oid:65958b00000143
> Apr 16 16:50:52 connect_to(227) failed to connect to 192.168.1.29:7000:
> Connection refused
> Apr 16 16:50:52 recover_object_from_replica(1240) failed to connect to
> 192.168.1.29:7000
> Apr 16 16:50:52 do_recover_object(1363) can not recover oid 65958b00000143
> Apr 16 16:50:52 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:54 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:54 recover_object(1412) done:2 count:159, oid:65958b000000d6
> Apr 16 16:50:54 connect_to(227) failed to connect to 192.168.1.29:7000:
> Connection refused
> Apr 16 16:50:54 recover_object_from_replica(1240) failed to connect to
> 192.168.1.29:7000
> Apr 16 16:50:54 do_recover_object(1363) can not recover oid 65958b000000d6
> Apr 16 16:50:54 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:56 recover_object(1412) done:3 count:159, oid:65958b000000e7
> Apr 16 16:50:56 connect_to(227) failed to connect to 192.168.1.29:7000:
> Connection refused
> Apr 16 16:50:56 recover_object_from_replica(1240) failed to connect to
> 192.168.1.29:7000
> Apr 16 16:50:56 do_recover_object(1363) can not recover oid 65958b000000e7
> Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:57 fix_object_consistency(738) failed to read object 66
> Apr 16 16:50:58 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
>
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 recover_object(1412) done:4 count:159, oid:65958b00000117
> Apr 16 16:50:59 connect_to(227) failed to connect to 192.168.1.29:7000:
> Connection refused
> Apr 16 16:50:59 recover_object_from_replica(1240) failed to connect to
> 192.168.1.29:7000
> Apr 16 16:50:59 do_recover_object(1363) can not recover oid 65958b00000117
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
> Apr 16 16:51:00 recover_object(1412) done:5 count:159, oid:65958b000000dc
> Apr 16 16:51:00 do_recover_object(1363) can not recover oid 65958b000000dc
> Apr 16 16:51:00 recover_object(1412) done:6 count:159, oid:65958b000000cc
> Apr 16 16:51:00 do_recover_object(1363) can not recover oid 65958b000000cc
> Apr 16 16:51:01 recover_object(1412) done:7 count:159, oid:65958b00000145
> Apr 16 16:51:01 recover_object(1412) done:8 count:159, oid:65958b0000017b
> Apr 16 16:51:01 recover_object(1412) done:9 count:159, oid:65958b0000000b
> Apr 16 16:51:01 recover_object(1412) done:10 count:159, oid:65958b000000d5
> Apr 16 16:51:01 recover_object(1412) done:11 count:159, oid:65958b00000022
> Apr 16 16:51:01 do_recover_object(1363) can not recover oid 65958b00000022
> Apr 16 16:51:02 recover_object(1412) done:12 count:159, oid:65958b00000131
> Apr 16 16:51:02 do_recover_object(1363) can not recover oid 65958b00000131
> Apr 16 16:51:02 fix_object_consistency(738) failed to read object 2
> Apr 16 16:51:03 recover_object(1412) done:13 count:159, oid:65958b00000101
> Apr 16 16:51:03 do_recover_object(1363) can not recover oid 65958b00000101
> Apr 16 16:51:04 recover_object(1412) done:14 count:159, oid:65958b00000159
> Apr 16 16:51:04 do_recover_object(1363) can not recover oid 65958b00000159
> Apr 16 16:51:05 recover_object(1412) done:15 count:159, oid:65958b00000115
> Apr 16 16:51:05 recover_object(1412) done:16 count:159, oid:65958b000000f7
> Apr 16 16:51:05 do_recover_object(1363) can not recover oid 65958b000000f7
> Apr 16 16:51:06 recover_object(1412) done:17 count:159, oid:65958b000000c7
> Apr 16 16:51:06 do_recover_object(1363) can not recover oid 65958b000000c7
> Apr 16 16:51:06 fix_object_consistency(738) failed to read object 2
> Apr 16 16:51:07 recover_object(1412) done:18 count:159, oid:65958b00000182
> Apr 16 16:51:07 do_recover_object(1363) can not recover oid 65958b00000182
> Apr 16 16:51:08 recover_object(1412) done:19 count:159, oid:65958b00000129
> Apr 16 16:51:08 do_recover_object(1363) can not recover oid 65958b00000129
>
>
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:44 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:46 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
> Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
>
> Apr 16 16:59:39 fix_object_consistency(738) failed to read object 2
> Apr 16 16:59:39 fix_object_consistency(738) failed to read object 2
>
>
>
> Thanks,
> Joby Xavier
>
> On Mon, Apr 16, 2012 at 3:07 PM, Huxinwei <huxinwei at huawei.com> wrote:
>
> When the failover failed, had you used the ucarp hook to restart
> the SCSI target on ‘node b’?
>
> Also, do you have logs from both target nodes? They would be very helpful.
>
> Thanks.
>
> *From:* sheepdog-bounces at lists.wpkg.org [mailto:
> sheepdog-bounces at lists.wpkg.org] *On Behalf Of *joby xavier
> *Sent:* Monday, April 16, 2012 4:59 PM
> *To:* sheepdog at lists.wpkg.org
> *Subject:* [Sheepdog] Sheepdog+iscsi high availability
>
> Hi,
>
> We would like to set up iSCSI high availability on top of sheepdog
> distributed storage.
>
> Here is our setup: the OS is Ubuntu. Four nodes run sheepdog
> distributed storage. Two of those nodes export the storage over iSCSI
> behind a virtual IP managed by ucarp, and both use the same IQN. The
> iSCSI storage is mounted as an LVM partition (sdc).
>
> node a
> node b
> node c
> node d
> node x is the initiator
> node a and node b share a common virtual IP, so that if 'node a' fails,
> 'node b' serves as the iSCSI target; both have the same IQN.
>
> Problem: when a failover happens, i.e. the iSCSI target switches from node a
> to node b, the iSCSI disk fails on the initiator, 'node x'.
>
> Here is /var/log/messages:
>
> Apr 16 10:57:14 prox1 kernel: scsi7 : iSCSI Initiator over TCP/IP
> Apr 16 10:57:14 prox1 kernel: scsi 7:0:0:0: RAID              IET
> Controller       0001 PQ: 0 ANSI: 5
> Apr 16 10:57:14 prox1 kernel: scsi 7:0:0:1: Direct-Access     IET
> VIRTUAL-DISK     0001 PQ: 0 ANSI: 5
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] 2252800 512-byte logical
> blocks: (1.15 GB/1.07 GiB)
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Write Protect is off
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> Apr 16 10:57:14 prox1 kernel: sdc: unknown partition table
> Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Attached SCSI disk
>
> Apr 16 10:59:47 prox1 kernel: connection2:0: detected conn error (1020)
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 00 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 00 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 08 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error
> [current]
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered
> read error
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00
> 00 00 00 00 08 00
> Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
>
> root@prox1:~# pvdisplay
>   /dev/sdc: read failed after 0 of 4096 at 1153368064: Input/output error
>   /dev/sdc: read failed after 0 of 4096 at 1153425408: Input/output error
>
> Sheepdog with a single-node iSCSI target (
> https://github.com/collie/sheepdog/wiki/General-protocol-support) works
> well.
>
> Should we change anything in lvm.conf? Should we use multipath-tools? Is this
> the right procedure?
>
>
> Thanks,
>
>
> Joby Xavier
>



-- 
Joby C Xavier
Mob +919895150333