[sheepdog] questions about sheepdog write policy

Dong Wu archer.wudong at gmail.com
Thu Jun 16 11:34:28 CEST 2016


Thank you very much for your answers.

2016-06-14 10:09 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:
> Hi Dong, really sorry for my late reply.
>
> On Fri, May 27, 2016 at 5:53 PM, Dong Wu <archer.wudong at gmail.com> wrote:
>>
>> 2016-05-27 16:14 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:
>> >
>> >
>> > On Thu, May 26, 2016 at 4:19 PM, Dong Wu <archer.wudong at gmail.com>
>> > wrote:
>> >>
>> >> Thanks for your reply.
>> >>
>> >> 2016-05-26 10:34 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:
>> >> >
>> >> >
>> >> > On Tue, May 24, 2016 at 6:46 PM, Dong Wu <archer.wudong at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi Mitake,
>> >> >> I have some questions about sheepdog's write policy.
>> >> >> For replication, sheepdog writes 3 copies by default and provides
>> >> >> strong consistency. My doubts are:
>> >> >> 1) If some replicas are written successfully and others fail, will
>> >> >> it retry the write until all 3 replicas succeed? And if fewer than
>> >> >> 3 nodes are left, will it write fewer than 3 replicas and still
>> >> >> return success to the client?
>> >> >
>> >> >
>> >> > In the case of a disk or network I/O error, sheep returns an error
>> >> > to its client immediately. In some cases (e.g. an epoch increase
>> >> > caused by a node join/leave), it will retry.
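The policy described above (fail fast on hard I/O errors, retry only across epoch changes) can be sketched as follows. This is an illustrative sketch, not sheepdog's actual code; the names `IOError_`, `EpochChanged`, `forward_write`, and `MAX_RETRIES` are all hypothetical:

```python
MAX_RETRIES = 3

class IOError_(Exception):
    """Disk or network I/O failure (hypothetical name)."""

class EpochChanged(Exception):
    """Cluster membership changed mid-request (hypothetical name)."""

def forward_write(obj_id, data, send_to_replicas):
    """Forward a write; fail fast on I/O errors, retry on epoch changes."""
    for _ in range(MAX_RETRIES):
        try:
            return send_to_replicas(obj_id, data)
        except IOError_:
            # Hard I/O error: report it to the client immediately.
            raise
        except EpochChanged:
            # A node joined or left; retry against the new cluster view.
            continue
    raise RuntimeError("epoch kept changing; giving up")
```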
>> >>
>> >> Will the client retry? If the error is caused by only one of the
>> >> replicas (e.g. that replica's disk has failed) and the other two are
>> >> OK and were written successfully, is returning an error to the client
>> >> reasonable? Why not just return success to the client and then
>> >> recover the failed replica?
>> >
>> >
>> > It is for ensuring the 3 replicas are consistent. Sheepdog's
>> > interface is a virtual disk, so consistency is more important than
>> > availability.
>>
>> Sheepdog has no journal to guarantee the replicas' consistency, so it
>> has to return an error to the client when any replica write fails,
>> wait until consistency is recovered, and only then continue writing.
>>
>> Without any consistency log, sheepdog's recovery logic will scan all
>> the replicas' objects and check whether each object needs to be
>> recovered. Am I right?
>
>
> Yes, right. Sheepdog provides a very minimal metadata transaction
> (assigning VIDs to VDIs) with virtual synchrony, but beyond that it
> basically doesn't provide metadata transactions.
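A minimal sketch of such journal-less recovery, assuming each replica can be enumerated as a map from object ID to object contents (the names `fingerprint` and `objects_needing_recovery` are illustrative, not sheepdog's): with no consistency log, every object must be compared across replicas to find the ones that need repair.

```python
import hashlib

def fingerprint(data):
    """Cheap content fingerprint used to compare copies of an object."""
    return hashlib.sha1(data).hexdigest()

def objects_needing_recovery(replicas):
    """replicas: list of dicts mapping object_id -> bytes.

    An object needs recovery if some replica is missing it, or the
    copies that do exist disagree.
    """
    all_ids = set()
    for r in replicas:
        all_ids.update(r)
    damaged = []
    for oid in sorted(all_ids):
        copies = [fingerprint(r[oid]) for r in replicas if oid in r]
        if len(copies) != len(replicas) or len(set(copies)) != 1:
            damaged.append(oid)
    return damaged
```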
>
>>
>>
>> >
>> >>
>> >>
>> >> >
>> >> >>
>> >> >> 2) If some replica writes succeed, others fail, and failure is
>> >> >> returned to the client, how is the replicas' data consistency
>> >> >> handled (a node where the write succeeded has new data, but a node
>> >> >> where it failed has old data)? If the client reads the same block,
>> >> >> will it read new data or old data?
>> >> >
>> >> >
>> >> > In such a case, we need to repair consistency with the "dog vdi
>> >> > check" command. Note that in such a case the failed VDIs won't be
>> >> > accessed from VMs anymore because they will be used in read-only
>> >> > mode.
>> >>
>> >> Does this mean data can't be read from this VDI until recovery is
>> >> done? I remember that in an old version of sheepdog, the read I/O
>> >> path first checked the replicas' consistency and then read the data,
>> >> but I can't find that logic anymore in the latest version.
>> >
>> >
>> > The data can be read, and in fact it would work well in many cases.
>>
>> Consider this case: a write request succeeds on replica A but fails on
>> replica B, so an error is returned to the client (and replica C did
>> not receive the write request). Replica A now has new data while
>> replica C still has old data. If the client then reads the VDI, will
>> it always read the new data, always read the old data, or sometimes
>> read new data and sometimes old data before "dog vdi check"?
>
>
> In that inconsistent state the read results will sometimes be old and
> sometimes be the latest. This is because sheepdog issues read requests
> to a randomly chosen replica for load balancing.
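Why reads are inconsistent before "dog vdi check" can be seen with a small sketch (hypothetical names, not sheepdog's code): the gateway picks a replica at random, so after the partial write above the client may observe either version.

```python
import random

def read_object(replicas, rng=random):
    """Return the contents of one randomly chosen replica."""
    return rng.choice(replicas)

# After the partial write above: replica A holds new data, while the
# replicas whose write failed or was never sent still hold old data.
replicas = [b"new", b"old", b"old"]
```

Repeated calls to `read_object(replicas)` will return `b"new"` on some reads and `b"old"` on others, matching the behavior described above.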
>
>>
>>
>> And could the object on replica B end up part new and part old? How is
>> an atomic object write guaranteed?
>
>
> Updating of each replica is atomic. It is guaranteed simply by atomic
> file updates using the rename(2) technique. A mixed state of old and
> new parts within a single replica cannot happen.
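The write-temp-then-rename technique mentioned here can be sketched as follows (a generic illustration of the pattern, not sheepdog's actual code): the new contents are written to a temporary file, flushed to disk, and then renamed over the old file. Because rename(2) is atomic on POSIX filesystems, a reader sees either the complete old object or the complete new one, never a mixture.

```python
import os

def atomic_update(path, data):
    """Replace the object file at `path` with `data` atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make the new contents durable first
    os.rename(tmp, path)      # atomic on POSIX: readers see old or new
```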
>
> Thanks,
> Hitoshi
>
>>
>>
>> > I'm not sure about that feature of the old version, but it seems
>> > costly for the ordinary read path. Still, reviving it as an option
>> > would be reasonable. What do you think?
>>
>> Yes, it is costly.
>>
>> >
>> > Thanks,
>> > Hitoshi
>> >
>> >>
>> >>
>> >> >
>> >> > Thanks,
>> >> > Hitoshi
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks a lot.
>> >> >
>> >> >
>> >
>> >
>>
>> Thanks,
>> Dong Wu
>
>


More information about the sheepdog mailing list