[sheepdog] questions about sheepdog write policy

passedwind bailovereal at qq.com
Sat Jun 11 04:18:52 CEST 2016


root at tgs3:~# dog vdi check galera-db-data-01
 
 98.4 %  49 GB / 50 GB  object ea1c7900000000 is inconsistent
 98.4 %  49 GB / 50 GB  object ea1c7900000002 is inconsistent
 98.4 %  49 GB / 50 GB  object ea1c7900000003 is inconsistent
 98.5 %  49 GB / 50 GB  object ea1c7900000001 is inconsistent
 98.8 %  49 GB / 50 GB  object ea1c7900000034 is inconsistent
 98.8 %  49 GB / 50 GB  object ea1c7900000035 is inconsistent
 98.8 %  49 GB / 50 GB  object ea1c7900000036 is inconsistent
 98.9 %  49 GB / 50 GB  object ea1c790000003e is inconsistent
 98.9 %  49 GB / 50 GB  object ea1c790000003f is inconsistent
 98.9 %  49 GB / 50 GB  object ea1c7900000040 is inconsistent
 99.1 %  50 GB / 50 GB  object ea1c7900000056 is inconsistent
 99.1 %  50 GB / 50 GB  object ea1c7900000057 is inconsistent
 99.1 %  50 GB / 50 GB  object ea1c7900000058 is inconsistent
 99.2 %  50 GB / 50 GB  object ea1c7900000068 is inconsistent
 99.2 %  50 GB / 50 GB  object ea1c7900000076 is inconsistent
 99.3 %  50 GB / 50 GB  object ea1c7900000078 is inconsistent
 99.3 %  50 GB / 50 GB  object ea1c7900000c80 is inconsistent
 99.3 %  50 GB / 50 GB  object ea1c7900000c86 is inconsistent
 99.3 %  50 GB / 50 GB  object ea1c7900000c87 is inconsistent
 99.4 %  50 GB / 50 GB  object ea1c7900000c88 is inconsistent
 99.4 %  50 GB / 50 GB  object ea1c7900001900 is inconsistent
 99.4 %  50 GB / 50 GB  object ea1c7900001905 is inconsistent
 99.4 %  50 GB / 50 GB  object ea1c7900001906 is inconsistent
 99.4 %  50 GB / 50 GB  object ea1c7900001909 is inconsistent
 99.5 %  50 GB / 50 GB  object ea1c7900001913 is inconsistent
 99.6 %  50 GB / 50 GB  object ea1c7900001911 is inconsistent
 99.6 %  50 GB / 50 GB  object ea1c7900001920 is inconsistent
 99.8 %  50 GB / 50 GB  object ea1c7900001937 is inconsistent
 99.8 %  50 GB / 50 GB  object ea1c7900002580 is inconsistent
 99.9 %  50 GB / 50 GB  object ea1c790000258c is inconsistent
 99.9 %  50 GB / 50 GB  object ea1c7900002590 is inconsistent
100.0 %  50 GB / 50 GB  object ea1c7900002595 is inconsistent
100.0 %  50 GB / 50 GB
finish check&repair galera-db-data-01
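
To compare one run of the check against the next, it may help to pull the object IDs out of the output. The following is only a minimal sketch: it assumes the message format shown above ("object <id> is inconsistent"), the function name is hypothetical, and the sample text is an abbreviated stand-in for real output.

```python
import re

# Matches the per-object message printed in the "dog vdi check" output above.
INCONSISTENT = re.compile(r"object ([0-9a-f]+) is inconsistent")

def inconsistent_objects(output):
    """Return the reported object IDs in order of appearance, without duplicates."""
    seen = []
    for oid in INCONSISTENT.findall(output):
        if oid not in seen:
            seen.append(oid)
    return seen

# Abbreviated sample (progress bars shortened for illustration only).
sample = """\
 98.4 % [=====>  ] 49 GB / 50 GB      object ea1c7900000000 is inconsistent
 98.4 % [=====>  ] 49 GB / 50 GB      object ea1c7900000002 is inconsistent
100.0 % [========] 50 GB / 50 GB
finish check&repair galera-db-data-01
"""
print(inconsistent_objects(sample))
```

Saving the list from two consecutive runs and comparing them would show whether a repair pass changed anything.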

------------

What can I do to eliminate these inconsistent objects? Running "dog vdi check" has no effect.

Thanks!

------------------ Original ------------------
From:  "Dong Wu";<archer.wudong at gmail.com>;
Date:  May 27, 2016
To:  "Hitoshi Mitake"<mitake.hitoshi at gmail.com>; 
Cc:  "sheepdog"<sheepdog at lists.wpkg.org>; 
Subject:  Re: [sheepdog] questions about sheepdog write policy



2016-05-27 16:14 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:
>
>
> On Thu, May 26, 2016 at 4:19 PM, Dong Wu <archer.wudong at gmail.com> wrote:
>>
>> Thanks for your reply.
>>
>> 2016-05-26 10:34 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:
>> >
>> >
>> > On Tue, May 24, 2016 at 6:46 PM, Dong Wu <archer.wudong at gmail.com>
>> > wrote:
>> >>
>> >> Hi Mitake,
>> >> I have questions about sheepdog's write policy.
>> >> For replication, sheepdog writes 3 copies by default and is strongly
>> >> consistent.
>> >> My questions are:
>> >> 1) If some replicas are written successfully and others fail, will it
>> >> retry the write until all 3 replicas succeed? And if fewer than 3
>> >> nodes are left, will it write fewer than 3 replicas and return
>> >> success to the client?
>> >
>> >
>> > In the case of disk and network I/O errors, sheep returns an error to
>> > its client immediately. In some cases (e.g. an epoch increase caused by
>> > a node join/leave), it will retry.
>>
>> Will the client retry? If the error is caused by only one of the
>> replicas (e.g. that replica's disk has failed) and the other two are
>> written successfully, is returning an error to the client reasonable?
>> Why not just return success to the client and then recover the failed
>> replica?
>
>
> It is to ensure that the 3 replicas are consistent. Sheepdog's interface is
> a virtual disk, so consistency is more important than availability.

sheepdog has no journal to guarantee the replicas' consistency, so it
should just return an error to the client when any replica write fails,
wait until consistency is recovered, and only then continue writing.

Without any consistency log, sheepdog's recovery logic will scan all of
the replicas' objects and check whether each object needs recovery. Am I
right?
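
To make the question concrete, here is a minimal sketch of what such a full scan amounts to when there is no log to narrow the search. It is an illustration only and not sheepdog's recovery code: each node is modeled as a plain dict from object ID to bytes, and all names are hypothetical.

```python
import hashlib

def digest(data):
    """Content hash used to compare copies without comparing raw bytes."""
    return hashlib.sha1(data).hexdigest()

def scan_for_inconsistent(replicas):
    """replicas: list of dicts mapping object ID -> bytes, one dict per node.

    Every object ID on every node is visited, because nothing records
    which objects were being written when a failure happened.
    """
    all_ids = set()
    for store in replicas:
        all_ids.update(store)
    bad = []
    for oid in sorted(all_ids):
        # A missing copy counts as its own distinct state (None).
        states = {digest(s[oid]) if oid in s else None for s in replicas}
        if len(states) > 1:
            bad.append(oid)
    return bad

node_a = {"obj1": b"new", "obj2": b"same"}
node_b = {"obj1": b"old", "obj2": b"same"}
node_c = {"obj1": b"old", "obj2": b"same"}
print(scan_for_inconsistent([node_a, node_b, node_c]))  # ['obj1']
```

The cost grows with the total number of objects rather than with the number of failed writes, which is the trade-off of having no journal.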

>
>>
>>
>> >
>> >>
>> >> 2) If some replica writes succeed and others fail, and failure is
>> >> returned to the client, how is the replicas' data consistency
>> >> handled (the nodes where the write succeeded have new data, the
>> >> nodes where it failed have old data)? If the client reads the same
>> >> block, will it read new data or old data?
>> >
>> >
>> > In such a case, we need to repair consistency with the "dog vdi check"
>> > command.
>> > Note that in such a case the failed VDIs won't be accessed from VMs
>> > anymore because they will be used in read-only mode.
>>
>> Does this mean data can't be read from this VDI until recovery is done?
>> I remember that in old versions of sheepdog, the read I/O path first
>> checked the replicas' consistency and then read the data,
>> but I can't find that logic anymore in the latest version.
>
>
> The data can be read and actually it would work well in many cases.

Consider this case: a write request succeeds on replicaA but fails on
replicaB, and an error is then returned to the client (so replicaC never
receives the write request). replicaA now has new data and replicaC has
old data. When the client then reads the VDI before "dog vdi check" is
run, will it always read the new data, always the old data, or sometimes
new and sometimes old?

And will the object on replicaB be partly new and partly old? How is
atomicity of an object write guaranteed?
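
The scenario can be written down as a small model. This is a sketch under stated assumptions, not a description of sheepdog's implementation: it assumes the write is forwarded to the replicas one after another and stops at the first failure, as in the case above, and the Replica class and replicated_write function are hypothetical. It does not model a partially written object on replicaB.

```python
class Replica:
    def __init__(self, name, fail_writes=False):
        self.name = name
        self.fail_writes = fail_writes
        self.data = b"old"  # every replica starts with the old data

    def write(self, data):
        if self.fail_writes:
            raise IOError("write failed on " + self.name)
        self.data = data

def replicated_write(replicas, data):
    """Write to each replica in order; report failure at the first error."""
    for r in replicas:
        try:
            r.write(data)
        except IOError:
            return False  # later replicas never see the request
    return True

a = Replica("replicaA")
b = Replica("replicaB", fail_writes=True)
c = Replica("replicaC")
ok = replicated_write([a, b, c], b"new")
print(ok, a.data, b.data, c.data)
```

After the failed write, what a read returns in this model depends entirely on which replica serves it, which is exactly the ambiguity being asked about.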

> I'm not sure about the feature of the old version, but it seems to be costly
> for the ordinary read path. But reviving it as an option would be reasonable.
> What do you think?

yes, it is costly.

>
> Thanks,
> Hitoshi
>
>>
>>
>> >
>> > Thanks,
>> > Hitoshi
>> >
>> >>
>> >>
>> >> Thanks a lot.
>> >
>> >
>
>

Thanks,
Dong Wu
-- 
sheepdog mailing list
sheepdog at lists.wpkg.org
https://lists.wpkg.org/mailman/listinfo/sheepdog