[Sheepdog] [PATCH v2 1/3] sheep: introduce SD_STATUS_HALT

Fri Oct 14 11:38:17 CEST 2011

At Fri, 14 Oct 2011 14:49:55 +0800,
Liu Yuan wrote:
> 
> On 10/14/2011 03:26 AM, Rymer, Rodney R wrote:
> > We have been experimenting with sheepdog, and have been impressed by what we have seen so far.  However, we have been taking advantage of this "bug" (running while the desired number of copies cannot be maintained), and see it as a critical feature.  Could this at least become a user configuration option at the cluster level during the cluster format?  We have the following situation:
> >
> 
> Yes, it is better to have it as an option. I agree. Kazum, how about you?

Looks good to me.

> 
> > + good nodes, - failed nodes, 2 zones (Z1,Z2) each with two nodes, copies=2
> > Design requirements - system must be able to completely function off of only zone1, or zone2 (in a failure situation), they are redundant data centers, so to speak, but help performance when all is up.
> >
> >     0       1      2     3
> > Z1 +       -      -     +
> > Z1 +       -      -     +
> > Z2 +  -->   + -->   - -->  #
> > Z2 +       +      -     #<-- permanently down.
> > these nodes have the latest data in zone 2
> > at stage 3, we will have a cluster recovered without the data tracked at stage 1.
> >
> > For the case you describe, we would do one of two things:
> > 1 - Restore our data from backups we maintain
> 
> For a local image or a dedicated backup node, we have discussed it 
> previously. We more or less came to an agreement that
> 
> 1) we would use read-ahead to implement the local image *after* write-back 
> is implemented. This will improve availability on a per-node basis.
> 2) dedicated nodes for backing up the whole cluster would be very costly 
> and cause trouble when the data grows to a considerably large scale. 
> But it is okay for small clusters, as Kazum said, and he mentioned he 
> would accept a patch for it.
> 
> > 2 - Accept loss of data and revert to last good data in other zone
> 
> So for this method, I am not sure if we can restore *safely* to the 
> previous epoch if the incremental data is permanently unrecoverable. 
> Kazum, would you please elaborate on it a bit?

Sheepdog flushes all the outstanding I/Os before incrementing epoch,
so the previous epoch data is something like a global snapshot of the
entire Sheepdog cluster; I think it is safe to revert to the previous
epoch.
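
To make the reasoning concrete, here is a toy model of "flush, then
increment epoch". It is only a sketch and is not Sheepdog code: the
ToyCluster class and all of its method names are hypothetical, and
it keeps everything in memory on a single process.

```python
class ToyCluster:
    """Toy, single-process model of epoch handling (hypothetical names)."""

    def __init__(self):
        self.epoch = 1
        self.pending = []     # writes accepted but not yet applied
        self.current = {}     # object id -> data visible in the current epoch
        self.snapshots = {}   # epoch number -> frozen copy of object state

    def write(self, oid, data):
        # Queue an outstanding I/O request.
        self.pending.append((oid, data))

    def flush(self):
        # Apply every outstanding write, in order.
        for oid, data in self.pending:
            self.current[oid] = data
        self.pending.clear()

    def increment_epoch(self):
        # Flush first, so the state frozen for the old epoch
        # contains no half-finished I/O.
        self.flush()
        self.snapshots[self.epoch] = dict(self.current)
        self.epoch += 1

    def revert_to(self, epoch):
        # Drop anything written after the given epoch was frozen.
        self.pending.clear()
        self.current = dict(self.snapshots[epoch])


cluster = ToyCluster()
cluster.write("obj-a", "v1")
cluster.increment_epoch()        # epoch 1 is frozen with obj-a = v1
cluster.write("obj-a", "v2")
cluster.flush()
cluster.revert_to(1)             # later writes are lost, state is consistent
```

In this model, the state frozen for an epoch never contains a
partially applied write, which is the property the revert relies on.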

> 
> > Our decision would be based on local factors, such as how long we operated off of one data center, lost productivity time versus time to restore data and bring system back up.
> >
> > For us, in situations like this, it would be very helpful to have the collie cluster info command actually show the date/time of the creation of each epoch, instead of showing the same date/time for all epochs.  It would be helpful to know this to determine the amount of elapsed time between each epoch which would relate to the amount of data change in our production environment.  This could prove helpful in determining when reversion to previous epoch versus restore of backup would be least costly in our environment.
> >
> > For us, a more likely scenario is as follows:
> >
> >     0       1      2
> > Z1 +       -      +
> > Z1 +       -      +
> > Z2 +  -->   + -->   +
> > Z2 +       +      +
> >
> > Epoch 0, everything working normally.
> > Epoch 1, Zone 1 goes down for a short period of time for maintenance, or an unexpected disaster.  Data continuity OK, customers continue to work with minimal downtime, as some vms are restarted.
> > Epoch 2, zone 1 is brought back into system, data is replicated and redundancy is restored, customers don't know or care, because situation happened in the background and did not keep them from working.
> >
> > Another feature that would be extremely helpful in our environment would be to be able to have a number of copies specified per zone.  So, in our case, we have two zones, data center 1 and data center 2.  They are both connected via fiber, and for all practical purposes are one location when we don't have a data center down for upgrades or failures.  However, we would like to be able to have several copies per zone.  For example, we could have three copies per zone, like a RAID6 per location.  That way, if Z1 (data center 1) is down for maintenance (or crashes unexpectedly, or fiber cut, etc), and we had a host failure, or several host failures (or drives) up to the number of copies per zone minus one in zone 2 (data center 2), we would still be able to operate, like a degraded RAID6 until zone 2 (data center 2) was able to achieve the correct number of copies for that zone, at which point we would have inner zone redundancy again.  When Z1 (data center 1) is brought back up, it could then be synched with Zone 2 and we would eventually have redundancy at the zone level (data center level) restored.
> >
> > Another issue we have encountered is related to cleanup of unused objects.  It appears that currently, many orphaned objects are left in previous epochs.  For us this is a huge waste of space we cannot afford, because we are trying to use SSDs which have limited space, and will wear out quicker replicating data that was possibly already on the disk in an older epoch after node recovery.
> >
> > We would be glad to work with you in any way possible to help with testing any of these changes in our environment.  We are eager to implement this into our production environment as soon as the features and stability are ready.
> >
> 
> Good idea. I am very fond of this feature; it makes sheepdog more 
> symmetrically attractive. IIUC, we need to implement a 'sub sheepdog 
> cluster' concept.
> 
> zone            0                   1
>                 +                   +
>                 |                   |
> subzone   -------------       -------------
>           |     |     |       |     |     |
>           +     +     +       +     +     +
>           0'    1'    2'      0'    1'    2'
> 
> I cannot easily estimate how hard it would be to implement. I'll look at 
> this concept after Kazum refactors the membership code, probably after 
> the 0.3.0 release.
> Kazum, any comments on this idea?

I don't yet fully understand the idea.  When we want to have 2 copies
in zone 0 and 1 copy in zone 1, how would we specify that with subzones?
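
For comparison, a flat per-zone copy count (without subzones) could
be written down as below. This is only a sketch to make the question
concrete: place_object, nodes_by_zone and copies_per_zone are
hypothetical names, and the hash ranking is a generic
highest-random-weight scheme, not the placement Sheepdog uses.

```python
import hashlib


def place_object(oid, nodes_by_zone, copies_per_zone):
    """Pick replica nodes per zone for one object (hypothetical helper).

    nodes_by_zone:   zone id -> list of live node names
    copies_per_zone: zone id -> number of copies wanted in that zone
    """
    placement = {}
    for zone, copies in copies_per_zone.items():
        nodes = nodes_by_zone.get(zone, [])
        # Rank the zone's nodes by a hash of (object, node) so that the
        # choice is deterministic and spreads objects across nodes.
        ranked = sorted(
            nodes,
            key=lambda n: hashlib.sha1(f"{oid}:{n}".encode()).hexdigest(),
        )
        # If the zone has fewer live nodes than requested copies, the
        # zone simply runs degraded with the nodes that remain.
        placement[zone] = ranked[:copies]
    return placement


nodes = {0: ["n0", "n1", "n2"], 1: ["n3", "n4", "n5"]}
wanted = {0: 2, 1: 1}            # 2 copies in zone 0, 1 copy in zone 1
print(place_object("obj-a", nodes, wanted))
```

With a table like this, "2 copies in zone 0 and 1 copy in zone 1"
is stated directly; what I would like to understand is how the
subzone layout would express the same thing.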

Thanks,

Kazutaka