[sheepdog] Issue with "-m unsafe", copies and zones
Shawn Moore
smmoore at gmail.com
Tue Oct 2 16:20:53 CEST 2012
I have been testing the 0.5.0 release and believe I have found
regressions related to "-m unsafe" when just one zone out of three
is cut off. The last time I know this worked was when the option was
"-H" (no halt), before it became "-m OPTION".
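For reference, my assumption is that the same cluster would previously
have been formatted roughly like this (a hypothetical reconstruction
based on the old "-H" flag; other flags may have differed in older
releases):

# collie cluster format -b farm -c 3 -H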
I have 6 nodes (2 per zone across 3 zones). Each zone is on its own
switch, with zone 0's switch tying them all together.
# collie node list
M Id Host:Port V-Nodes Zone
- 0 172.16.1.151:7000 64 0
- 1 172.16.1.152:7000 64 0
- 2 172.16.1.153:7000 64 1
- 3 172.16.1.154:7000 64 1
- 4 172.16.1.155:7000 64 2
- 5 172.16.1.159:7000 64 2
The cluster was formatted as follows:
# collie cluster format -b farm -c 3 -m unsafe
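(For clarity: "-b farm" selects the farm backend store, "-c 3" asks for
three copies of each object, and "-m unsafe", as I understand it, tells
the cluster to keep serving I/O even when fewer zones remain than there
are copies.)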
# collie cluster info
Cluster status: running
Cluster created at Mon Oct 1 15:40:55 2012
Epoch Time Version
2012-10-01 15:40:55 1 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000, 172.16.1.155:7000,
172.16.1.159:7000]
I created a 40 GB vdi from each node.
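The creation commands aren't shown above; each vdi was made along these
lines (shown with collie's vdi create subcommand; qemu-img create
sheepdog:NAME SIZE should be equivalent):

# collie vdi create test151 40G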
# collie vdi list
Name     Id    Size    Used  Shared     Creation time  VDI id  Copies  Tag
test159   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  279f76       3
test153   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27a9a8       3
test152   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27ab5b       3
test151   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27ad0e       3
test155   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27b3da       3
test154   1   40 GB   40 GB  0.0 MB  2012-10-01 16:46  27b58d       3
# collie node info
Id Size Used Use%
0 476 GB 117 GB 24%
1 476 GB 123 GB 25%
2 476 GB 136 GB 28%
3 476 GB 104 GB 21%
4 476 GB 117 GB 24%
5 476 GB 123 GB 25%
Total 2.8 TB 720 GB 25%
Then I kill the uplink interface for zone 2 from the zone 0 switch.
This leaves zones 0/1 talking to each other and zone 2 talking only to
itself.
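(For anyone reproducing this without switch access, I believe an
equivalent partition can be induced with iptables on each zone 2 node,
i.e. on 172.16.1.155 and 172.16.1.159:

# for ip in 172.16.1.151 172.16.1.152 172.16.1.153 172.16.1.154; do
    iptables -A INPUT -s $ip -j DROP
    iptables -A OUTPUT -d $ip -j DROP
  done
)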
# collie cluster info
Cluster status: running
Cluster created at Mon Oct 1 15:40:55 2012
Epoch Time Version
2012-10-02 09:04:28 3 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000]
2012-10-02 09:04:28 2 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000, 172.16.1.159:7000]
2012-10-01 15:40:55 1 [172.16.1.151:7000, 172.16.1.152:7000,
172.16.1.153:7000, 172.16.1.154:7000, 172.16.1.155:7000,
172.16.1.159:7000]
# collie node info
Id Size Used Use%
0 476 GB 117 GB 24%
1 476 GB 123 GB 25%
2 476 GB 136 GB 28%
3 476 GB 104 GB 21%
Total 1.9 TB 480 GB 25%
At this point, every node in zones 0/1 starts logging the following
every second:
Oct 02 09:04:28 [rw 128323] get_vdi_copy_number(82) No VDI copy entry for 0 found
The command below hangs until killed, for every vdi:
# collie vdi object test151
So I try to check the vdis, and they all fail the same way:
# collie vdi check test151
[main] get_vnode_next_idx(106) PANIC: can't find next new idx
Aborted
When I bring the link between zones 0/1 and zone 2 back up, the sheep
processes have died, reporting:
Oct 02 09:04:28 [main] cdrv_cpg_confchg(599) PANIC: Network partition is detected
Oct 02 09:04:28 [main] crash_handler(439) sheep pid 6780 exited unexpectedly.
Shouldn't zone 2 have remained running due to the "-m unsafe" option?
I understand the risks of network partitioning and want this behavior,
as I can handle it myself.
And I can't understand why zones 0/1 were affected at all, given that
they still held two of the three copies, and especially with "-m unsafe".
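(My understanding of the placement, which may be wrong: with "-c 3" and
three zones, each object gets one copy per zone, so after zone 2 drops,
every object should still have two of its three copies reachable in
zones 0/1.)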
Let me know if you need any more information or would like me to re-run
the test a different way.