[Sheepdog] [PATCH v2 1/3] sheep: introduce SD_STATUS_HALT

Thu Oct 13 21:26:58 CEST 2011

We have been experimenting with sheepdog and have been impressed by what we have seen so far.  However, we have been relying on this "bug" (the cluster keeps running when the desired number of copies cannot be maintained) and see it as a critical feature.  Could this at least become a user-configurable option, set at the cluster level when the cluster is formatted?  We have the following situation:

+ good nodes, - failed nodes, 2 zones (Z1,Z2) each with two nodes, copies=2
Design requirement: the system must be able to function completely off of zone 1 alone or zone 2 alone in a failure situation.  They are redundant data centers, so to speak, but both help performance when everything is up.

   0       1      2     3
Z1 +       -      -     +
Z1 +       -      -     +
Z2 +  -->  + -->  - --> #
Z2 +       +      -     # <-- permanently down
           ^
           |
At stage 1, only these zone 2 nodes have the latest data.
At stage 3, we will have a cluster recovered without the data tracked at stage 1.
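
The option we are asking for could look roughly like the sketch below. This is a hypothetical model, not sheepdog code: the names may_serve_io and allow_degraded are invented for illustration, and the strict rule (stop when fewer zones are alive than 'copies') is our reading of the patch description quoted at the end of this message.

```python
def may_serve_io(alive_zones, copies, allow_degraded=False):
    """Decide whether the cluster should keep serving IO.

    alive_zones    -- set of zone names with at least one good node
    copies         -- redundancy level chosen at cluster format time
    allow_degraded -- hypothetical per-cluster option: keep running
                      even when 'copies' cannot be maintained
    """
    if len(alive_zones) >= copies:
        return True           # full redundancy is possible
    if allow_degraded and len(alive_zones) > 0:
        return True           # run degraded on whatever is left
    return False              # strict behaviour: stop serving IO


# Two zones, copies=2, zone 1 down (stage 1 in the diagram above).
print(may_serve_io({"Z2"}, copies=2))                       # strict rule
print(may_serve_io({"Z2"}, copies=2, allow_degraded=True))  # requested option
```

With the option off, the cluster stops as soon as one data center is lost; with it on, the surviving zone keeps working and the administrator accepts the risk shown in the diagram.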

For the case you describe, we would do one of two things:
1 - Restore our data from backups we maintain
2 - Accept loss of data and revert to last good data in other zone

Our decision would be based on local factors, such as how long we operated off of one data center, and lost productivity time versus the time needed to restore data and bring the system back up.

For us, in situations like this, it would be very helpful if the collie cluster info command showed the actual date/time each epoch was created, instead of showing the same date/time for all epochs.  Knowing the elapsed time between epochs would let us estimate the amount of data change in our production environment.  That, in turn, could help determine whether reverting to a previous epoch or restoring from backup would be least costly for us.
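
As a rough illustration of how we would use per-epoch creation times, here is a small sketch. It assumes the times can be obtained somehow as (epoch number, timestamp) pairs; the function name epoch_gaps and the sample timestamps are made up and do not come from any real cluster or from collie output.

```python
from datetime import datetime


def epoch_gaps(epochs):
    """Return (from_epoch, to_epoch, elapsed) for consecutive epochs.

    epochs -- list of (epoch_number, creation_time) pairs
    """
    ordered = sorted(epochs)
    return [(a_num, b_num, b_time - a_time)
            for (a_num, a_time), (b_num, b_time) in zip(ordered, ordered[1:])]


# Made-up creation times, purely for illustration.
sample = [
    (1, datetime(2011, 10, 1, 8, 0)),
    (2, datetime(2011, 10, 1, 9, 30)),
    (3, datetime(2011, 10, 3, 9, 30)),
]
for start, end, elapsed in epoch_gaps(sample):
    print(f"epoch {start} -> {end}: {elapsed}")
```

A long gap before the epoch we would revert to suggests a lot of changed data, which would push us toward restoring from backup instead.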

For us, a more likely scenario is as follows:

   0       1      2     
Z1 +       -      +
Z1 +       -      +
Z2 +  -->  + -->  + 
Z2 +       +      +

Epoch 0: everything is working normally.
Epoch 1: zone 1 goes down for a short period of time for maintenance, or because of an unexpected disaster.  Data continuity is OK, and customers continue to work with minimal downtime as some VMs are restarted.
Epoch 2: zone 1 is brought back into the system, data is replicated and redundancy is restored.  Customers don't know or care, because it all happened in the background and did not keep them from working.

Another feature that would be extremely helpful in our environment is the ability to specify a number of copies per zone.  In our case, we have two zones: data center 1 and data center 2.  They are connected via fiber and, for all practical purposes, are one location unless a data center is down for upgrades or because of a failure.

We would like to be able to keep several copies per zone, for example three copies per zone, like a RAID6 per location.  That way, if Z1 (data center 1) is down for maintenance (or crashes unexpectedly, or the fiber is cut, etc.) and zone 2 (data center 2) then suffers host or drive failures, up to the number of copies per zone minus one, we would still be able to operate, like a degraded RAID6, until zone 2 again holds the correct number of copies for that zone.  At that point we would have intra-zone redundancy again.

When Z1 (data center 1) is brought back up, it could then be synced with zone 2, and redundancy at the zone level (data center level) would eventually be restored.
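
To make the per-zone idea concrete, here is a minimal sketch of the availability rule we have in mind. It is only a model under a worst-case assumption (every failed host held a copy of the same object); the function names and the zones dictionary layout are invented for illustration and are not part of sheepdog.

```python
def zone_can_serve(failed_hosts, copies_per_zone):
    """A zone tolerates up to copies_per_zone - 1 host failures (worst case)."""
    return failed_hosts <= copies_per_zone - 1


def cluster_can_serve(zones, copies_per_zone):
    """zones maps zone name -> failed host count, or None if the zone is offline."""
    return any(failed is not None and zone_can_serve(failed, copies_per_zone)
               for failed in zones.values())


def zones_needing_resync(zones):
    """Zones that are offline or have lost hosts and must be re-replicated."""
    return sorted(name for name, failed in zones.items()
                  if failed is None or failed > 0)


# Z1 is down for maintenance and Z2 has lost two hosts, three copies per zone.
state = {"Z1": None, "Z2": 2}
print(cluster_can_serve(state, copies_per_zone=3))
print(zones_needing_resync(state))
```

In this example the cluster keeps running on zone 2 alone, and both zones are flagged for resynchronization once hosts come back.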

Another issue we have encountered relates to cleanup of unused objects.  It appears that many orphaned objects are currently left behind in previous epochs.  For us this is a huge waste of space that we cannot afford: we are trying to use SSDs, which have limited space and will wear out more quickly if, after node recovery, data is replicated that was possibly already on the disk in an older epoch.

We would be glad to work with you in any way possible to help test any of these changes in our environment.  We are eager to deploy this in our production environment as soon as the features and stability are ready.

-----Original Message-----
From: sheepdog-bounces at lists.wpkg.org [mailto:sheepdog-bounces at lists.wpkg.org] On Behalf Of Liu Yuan
Sent: Tuesday, October 11, 2011 5:27 AM
To: morita.kazutaka at lab.ntt.co.jp
Cc: sheepdog at lists.wpkg.org
Subject: [Sheepdog] [PATCH v2 1/3] sheep: introduce SD_STATUS_HALT

From: Liu Yuan <tailai.ly at taobao.com>

Currently, sheepdog will serve IO requests even if the number of nodes is less than 'copies'.

When the number of nodes (or zones) is less than the number of copies specified by
the collie-cluster-format command, the sheepdog cluster should stop serving IO requests.

This is necessary to solve the subtle case below:

+ good nodes, - failed nodes.

0       1      2     3
+       -      -     +
+  -->  - -->  - --> +
+       +      -     # <-- permanently down.
        ^
        |
this node has the latest data
at stage 3, we will have a cluster recovered without the data tracked at stage 1.

While the nodes are in SD_STATUS_HALT, sheepdog can still serve configuration changes
and do the recovery job.
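
The effect of the rule on the four stages in the diagram can be traced with a small model. This is a sketch, not the patch itself: the Status enum only stands in for the real status flags, the node names n1..n3 are invented, and copies=3 is an assumption, since the description does not state the value used in the example.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    HALT = "halt"   # stands in for SD_STATUS_HALT; not the real flag value


def cluster_status(alive_nodes, copies):
    """HALT when fewer nodes are alive than the configured number of copies."""
    return Status.OK if len(alive_nodes) >= copies else Status.HALT


COPIES = 3  # assumed; one copy per node in the three-node diagram

stages = [
    {"n1", "n2", "n3"},   # stage 0: all nodes good
    {"n3"},               # stage 1: only n3 is left
    set(),                # stage 2: everything is down
    {"n1", "n2"},         # stage 3: n3 is permanently down
]

timeline = [cluster_status(alive, COPIES) for alive in stages]
for stage, status in enumerate(timeline):
    print(f"stage {stage}: {status.name}")
```

Because the model halts at stage 1, n3 never accepts writes that only it holds, so nothing is missing when the cluster comes back without n3 at stage 3.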