[sheepdog] [PATCH 2/2] collie: add a check&repair command

Mon Jun 25 13:01:01 CEST 2012

On 06/25/2012 06:51 PM, Christoph Hellwig wrote:
> On Wed, Jun 20, 2012 at 06:16:02PM +0800, Liu Yuan wrote:
>>   "With the following scenarios, object replicas could have different
>>    contents:
>>
>>   - a gateway node fails while forwarding write requests
>>   - total node failure happens while writing objects
>>
>>   In such cases, it is okay for VMs not to read the latest data from
>>   the inconsistent objects because the VMs received EIO from them
>>   before.  However, the objects' inconsistency still needs to be fixed
>>   so that the VMs won't read different data from the objects next
>>   time."
>>
>> So when either of those two cases happens, users are expected to run:
>>
>> $ collie check affected_vdi_name
> 
> Requiring manual user intervention when a node goes down in a
> distributed storage system is entirely unacceptable.  I'm happy to kill
> the dumb version of the consistency fix, but in exchange sheepdog needs
> to have a better internal method to deal with this failure instead of
> bailing out.
> 

This command is a last resort for fixing consistency, like fsck, and it
doesn't conflict with adding an automatic recovery mechanism built into
the sheep core.

The biggest reason to remove fix_object_consistency() is that it is
simply broken in some cases.

Thanks,
Yuan