How many replicas do you have in the cluster? The log indicates that recovery failed due to “No object found”.

From: joby xavier [mailto:jobycxa at gmail.com]
Sent: Monday, April 16, 2012 7:30 PM
To: Huxinwei
Cc: sheepdog at lists.wpkg.org
Subject: Re: [Sheepdog] Sheepdog+iscsi high availability

When I shut down networking on "node a", or shut the node down completely, ucarp switches its virtual IP to "node b", so iSCSI communication should then go through "node b". Both nodes have the same IQN. Here are the logs:

node a:
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.91:7000: Network is unreachable
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.222:7000: Network is unreachable
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.117:7000: Network is unreachable
Apr 16 16:50:42 check_majority(709) the majority of nodes are not alive
Apr 16 16:50:42 __sd_leave(736) perhaps a network partition has occurred?
Apr 16 16:50:42 log_sigexit(361) sheep pid 8954 exiting.
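Node a's log shows its sheep daemon exiting because it could not reach a majority of the cluster. The Python sketch below is a simplified, hypothetical model of that kind of quorum check, not sheepdog's actual `check_majority()` code: a node keeps running only if it, plus the peers it can reach, form a strict majority of the membership. The cluster size of four comes from the setup described in this thread; the helper names are illustrative assumptions.

```python
import re

# Hypothetical quorum model (NOT sheepdog's implementation): a node stays
# up only if it and its reachable peers form a strict majority.
def has_majority(cluster_size: int, reachable_peers: int) -> bool:
    members_in_partition = reachable_peers + 1  # count this node itself
    return 2 * members_in_partition > cluster_size

# The three connect failures from node a's log above.
NODE_A_LOG = """\
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.91:7000: Network is unreachable
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.222:7000: Network is unreachable
Apr 16 16:50:42 connect_to(227) failed to connect to 192.168.1.117:7000: Network is unreachable
"""

def unreachable_peers(log: str) -> list:
    # Extract the peer IP from each "failed to connect to IP:port" entry.
    return re.findall(r"failed to connect to ([\d.]+):\d+", log)

peers = unreachable_peers(NODE_A_LOG)
cluster_size = 4  # assumption: nodes a, b, c and d run sheep
reachable = (cluster_size - 1) - len(peers)
print(peers, reachable, has_majority(cluster_size, reachable))
```

Under this model an isolated node in a four-node cluster always shuts itself down, and a 2/2 split leaves neither half with a majority.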
node b:
Apr 16 16:50:42 recover_object(1412) done:0 count:159, oid:65958b000000db
Apr 16 16:50:48 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:48 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:49 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:49 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:50 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:51 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:51 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:51 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:51 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:51 do_recover_object(1363) can not recover oid 65958b000000db
Apr 16 16:50:52 recover_object(1412) done:1 count:159, oid:65958b00000143
Apr 16 16:50:52 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:52 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:52 do_recover_object(1363) can not recover oid 65958b00000143
Apr 16 16:50:52 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:54 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:54 recover_object(1412) done:2 count:159, oid:65958b000000d6
Apr 16 16:50:54 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:54 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:54 do_recover_object(1363) can not recover oid 65958b000000d6
Apr 16 16:50:54 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 recover_object(1412) done:3 count:159, oid:65958b000000e7
Apr 16 16:50:56 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:56 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:56 do_recover_object(1363) can not recover oid 65958b000000e7
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:56 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:57 fix_object_consistency(738) failed to read object 66
Apr 16 16:50:58 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 recover_object(1412) done:4 count:159, oid:65958b00000117
Apr 16 16:50:59 connect_to(227) failed to connect to 192.168.1.29:7000: Connection refused
Apr 16 16:50:59 recover_object_from_replica(1240) failed to connect to 192.168.1.29:7000
Apr 16 16:50:59 do_recover_object(1363) can not recover oid 65958b00000117
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:50:59 fix_object_consistency(738) failed to read object 2
Apr 16 16:51:00 recover_object(1412) done:5 count:159, oid:65958b000000dc
Apr 16 16:51:00 do_recover_object(1363) can not recover oid 65958b000000dc
Apr 16 16:51:00 recover_object(1412) done:6 count:159, oid:65958b000000cc
Apr 16 16:51:00 do_recover_object(1363) can not recover oid 65958b000000cc
Apr 16 16:51:01 recover_object(1412) done:7 count:159, oid:65958b00000145
Apr 16 16:51:01 recover_object(1412) done:8 count:159, oid:65958b0000017b
Apr 16 16:51:01 recover_object(1412) done:9 count:159, oid:65958b0000000b
Apr 16 16:51:01 recover_object(1412) done:10 count:159, oid:65958b000000d5
Apr 16 16:51:01 recover_object(1412) done:11 count:159, oid:65958b00000022
Apr 16 16:51:01 do_recover_object(1363) can not recover oid 65958b00000022
Apr 16 16:51:02 recover_object(1412) done:12 count:159, oid:65958b00000131
Apr 16 16:51:02 do_recover_object(1363) can not recover oid 65958b00000131
Apr 16 16:51:02 fix_object_consistency(738) failed to read object 2
Apr 16 16:51:03 recover_object(1412) done:13 count:159, oid:65958b00000101
Apr 16 16:51:03 do_recover_object(1363) can not recover oid 65958b00000101
Apr 16 16:51:04 recover_object(1412) done:14 count:159, oid:65958b00000159
Apr 16 16:51:04 do_recover_object(1363) can not recover oid 65958b00000159
Apr 16 16:51:05 recover_object(1412) done:15 count:159, oid:65958b00000115
Apr 16 16:51:05 recover_object(1412) done:16 count:159, oid:65958b000000f7
Apr 16 16:51:05 do_recover_object(1363) can not recover oid 65958b000000f7
Apr 16 16:51:06 recover_object(1412) done:17 count:159, oid:65958b000000c7
Apr 16 16:51:06 do_recover_object(1363) can not recover oid 65958b000000c7
Apr 16 16:51:06 fix_object_consistency(738) failed to read object 2
Apr 16 16:51:07 recover_object(1412) done:18 count:159, oid:65958b00000182
Apr 16 16:51:07 do_recover_object(1363) can not recover oid 65958b00000182
Apr 16 16:51:08 recover_object(1412) done:19 count:159, oid:65958b00000129
Apr 16 16:51:08 do_recover_object(1363) can not recover oid 65958b00000129
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:44 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:46 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:52:49 fix_object_consistency(738) failed to read object 2
Apr 16 16:59:39 fix_object_consistency(738) failed to read object 2
Apr 16 16:59:39 fix_object_consistency(738) failed to read object 2

Thanks,
Joby Xavier

On Mon, Apr 16, 2012 at 3:07 PM, Huxinwei <huxinwei at huawei.com> wrote:

When the failover failed, did you use the ucarp hook to restart the SCSI target on 'node b'? Also, do you have logs from both target nodes? They would be very helpful. Thanks.

From: sheepdog-bounces at lists.wpkg.org [mailto:sheepdog-bounces at lists.wpkg.org] On Behalf Of joby xavier
Sent: Monday, April 16, 2012 4:59 PM
To: sheepdog at lists.wpkg.org
Subject: [Sheepdog] Sheepdog+iscsi high availability

Hi,

We would like to set up iSCSI high availability with sheepdog distributed storage. Here is our system setup:

OS: Ubuntu.
Four nodes run sheepdog distributed storage, and we share this storage over iSCSI through two of them, using a virtual IP set up with ucarp. The two nodes use the same IQN. We mounted the iSCSI storage as an LVM partition (sdc).

node a
node b
node c
node d
node x is the initiator

Node a and node b share a common virtual IP because if 'node a' fails, 'node b' should serve as the iSCSI target; both have the same IQN.
Problem: when a failover happens, i.e. iSCSI switches from the first node to the second, the iSCSI disk fails on the initiator, 'node x'. Here is /var/log/messages:

Apr 16 10:57:14 prox1 kernel: scsi7 : iSCSI Initiator over TCP/IP
Apr 16 10:57:14 prox1 kernel: scsi 7:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
Apr 16 10:57:14 prox1 kernel: scsi 7:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] 2252800 512-byte logical blocks: (1.15 GB/1.07 GiB)
Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Write Protect is off
Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 16 10:57:14 prox1 kernel: sdc: unknown partition table
Apr 16 10:57:14 prox1 kernel: sd 7:0:0:1: [sdc] Attached SCSI disk
Apr 16 10:59:47 prox1 kernel: connection2:0: detected conn error (1020)
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error [current]
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered read error
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error [current]
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered read error
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error [current]
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered read error
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00 00 08 00 00 08 00
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Result: hostbyte=invalid driverbyte=DRIVER_SENSE
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Sense Key : Medium Error [current]
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Add. Sense: Unrecovered read error
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
Apr 16 10:59:51 prox1 kernel: sd 7:0:0:1: [sdc] Unhandled sense code

root@prox1:~# pvdisplay
/dev/sdc: read failed after 0 of 4096 at 1153368064: Input/output error
/dev/sdc: read failed after 0 of 4096 at 1153425408: Input/output error

Sheepdog with single-node iSCSI (https://github.com/collie/sheepdog/wiki/General-protocol-support) works well.

Should we change anything in lvm.conf? Should we use multipath-tools? Is this the right procedure?

Thanks,
Joby Xavier
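The initiator's kernel log identifies the failed requests only by their raw READ(10) CDBs. The Python sketch below decodes them using the standard SBC READ(10) layout (opcode 0x28, 4-byte big-endian logical block address in bytes 2 to 5, 2-byte big-endian transfer length in bytes 7 and 8); the function name `decode_read10` is illustrative. With the 512-byte logical blocks reported for sdc, every failing request in the log turns out to be a 4096-byte read at LBA 0 or LBA 8, i.e. within the first 8 KiB of the disk.

```python
# Decode a SCSI READ(10) CDB as printed by the Linux sd driver.
# Layout (SBC): byte 0 = opcode 0x28, bytes 2-5 = LBA (big-endian),
# byte 6 = group number, bytes 7-8 = transfer length in blocks (big-endian).
def decode_read10(cdb_hex: str) -> tuple:
    cdb = bytes.fromhex(cdb_hex)  # spaces between hex byte pairs are ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks

BLOCK_SIZE = 512  # from "[sdc] 2252800 512-byte logical blocks"

# The two distinct CDBs that appear in the failing requests above.
for cdb in ("28 00 00 00 00 00 00 00 08 00", "28 00 00 00 00 08 00 00 08 00"):
    lba, blocks = decode_read10(cdb)
    print(f"LBA {lba}: {blocks} blocks = {blocks * BLOCK_SIZE} bytes "
          f"at byte offset {lba * BLOCK_SIZE}")
```

The decoding only shows where the initiator was reading; the "Medium Error / Unrecovered read error" sense data itself came back from the target after the failover.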