[sheepdog] Issue with "-m unsafe", copies and zones
Shawn Moore
smmoore at gmail.com
Tue Oct 2 16:20:53 CEST 2012
I have been testing the 0.5.0 release and believe I have found
regressions related to "-m unsafe" when just one zone out of three
is cut off. The last time I know this worked was when the option was
"-H" (no halt), before it became "-m OPTION".
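For reference, my assumption is that the same cluster would previously
have been formatted roughly like this (a hypothetical reconstruction
based on the old "-H" flag; other flags may have differed in older
releases):

# collie cluster format -b farm -c 3 -H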
I have 6 nodes (2 per zone across 3 zones). Each zone is on its own
switch, with zone 0's switch tying them all together.
# collie node list
M Id Host:Port V-Nodes Zone
- 0 172.16.1.151:7000 64 0
- 1 172.16.1.152:7000 64 0
- 2 172.16.1.153:7000 64 1
- 3 172.16.1.154:7000 64 1
- 4 172.16.1.155:7000 64 2
- 5 172.16.1.159:7000 64 2
The cluster was formatted as follows:
# collie cluster format -b farm -c 3 -m unsafe
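(For clarity: "-b farm" selects the farm backend store, "-c 3" asks for
three copies of each object, and "-m unsafe", as I understand it, tells
the cluster to keep serving I/O even when fewer zones remain than there
are copies.)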
# collie cluster info
Cluster status: running
Cluster created at Mon Oct 1 15:40:55 2012
Epoch Time Version
2012-10-01 15:40:55 1 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000, 172.16.1.155:7000,
172.16.1.159:7000]
I created a 40 GB vdi from each node.
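The creation commands aren't shown above; each vdi was made along these
lines (shown with collie's vdi create subcommand; qemu-img create
sheepdog:NAME SIZE should be equivalent):

# collie vdi create test151 40G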
# collie vdi list
Name     Id    Size    Used  Shared     Creation time  VDI id  Copies  Tag
test159   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  279f76       3
test153   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27a9a8       3
test152   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27ab5b       3
test151   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27ad0e       3
test155   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27b3da       3
test154   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27b58d       3
# collie node info
Id Size Used Use%
0 476 GB 117 GB 24%
1 476 GB 123 GB 25%
2 476 GB 136 GB 28%
3 476 GB 104 GB 21%
4 476 GB 117 GB 24%
5 476 GB 123 GB 25%
Total 2.8 TB 720 GB 25%
Then I kill the uplink interface for zone 2 from the zone 0 switch.
This leaves zones 0/1 talking to each other and zone 2 talking only to
itself.
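(For anyone reproducing this without switch access, I believe an
equivalent partition can be induced with iptables on each zone 2 node,
i.e. on 172.16.1.155 and 172.16.1.159:

# for ip in 172.16.1.151 172.16.1.152 172.16.1.153 172.16.1.154; do
    iptables -A INPUT -s $ip -j DROP
    iptables -A OUTPUT -d $ip -j DROP
  done
)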
# collie cluster info
Cluster status: running
Cluster created at Mon Oct 1 15:40:55 2012
Epoch Time Version
2012-10-02 09:04:28 3 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000]
2012-10-02 09:04:28 2 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000, 172.16.1.159:7000]
2012-10-01 15:40:55 1 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000, 172.16.1.155:7000,
172.16.1.159:7000]
# collie node info
Id Size Used Use%
0 476 GB 117 GB 24%
1 476 GB 123 GB 25%
2 476 GB 136 GB 28%
3 476 GB 104 GB 21%
Total 1.9 TB 480 GB 25%
At this point, every node in zones 0/1 starts logging the following
every second:
Oct 02 09:04:28 [rw 128323] get_vdi_copy_number(82) No VDI copy entry for 0 found
The command below hangs until killed, for every vdi:
# collie vdi object test151
So I try to check the vdis, and they all fail the same way:
# collie vdi check test151
[main] get_vnode_next_idx(106) PANIC: can't find next new idx
Aborted
When I bring the link between zones 0/1 and zone 2 back up, the sheep
processes have died, reporting:
Oct 02 09:04:28 [main] cdrv_cpg_confchg(599) PANIC: Network partition is detected
Oct 02 09:04:28 [main] crash_handler(439) sheep pid 6780 exited unexpectedly.
Shouldn't zone 2 have remained running due to the "-m unsafe" option?
I understand the risks of network partitioning and want this behavior,
as I can handle it myself.
And I can't understand why zones 0/1 were affected at all, given that
they still held two of the three copies, and especially with "-m unsafe".
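(My understanding of the placement, which may be wrong: with "-c 3" and
three zones, each object gets one copy per zone, so after zone 2 drops,
every object should still have two of its three copies reachable in
zones 0/1.)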
Let me know if you need any more information or would like me to re-run
the test a different way.