[sheepdog] questions about sheepdog write policy

Hitoshi Mitake mitake.hitoshi at gmail.com
Tue Jun 14 04:09:59 CEST 2016


Hi Dong, really sorry for my late reply.

On Fri, May 27, 2016 at 5:53 PM, Dong Wu <archer.wudong at gmail.com> wrote:

> 2016-05-27 16:14 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:
> >
> >
> > On Thu, May 26, 2016 at 4:19 PM, Dong Wu <archer.wudong at gmail.com> wrote:
> >>
> >> Thanks for your reply.
> >>
> >> 2016-05-26 10:34 GMT+08:00 Hitoshi Mitake <mitake.hitoshi at gmail.com>:
> >> >
> >> >
> >> > On Tue, May 24, 2016 at 6:46 PM, Dong Wu <archer.wudong at gmail.com>
> >> > wrote:
> >> >>
> >> >> Hi, Mitake,
> >> >> I have questions about sheepdog's write policy.
> >> >> For replication, sheepdog writes 3 copies by default and provides
> >> >> strong consistency.
> >> >> My doubts are:
> >> >> 1) If some replicas are written successfully and others fail, will it
> >> >> retry the write until all 3 replicas succeed? And if fewer than 3
> >> >> nodes are left, will it write fewer than 3 replicas and still return
> >> >> success to the client?
> >> >
> >> >
> >> > In the case of a disk or network I/O error, sheep returns an error to
> >> > its client immediately. In some cases (e.g. an epoch increase caused
> >> > by a node join/leave), it will retry.
> >>
> >> Will the client retry? If the error is caused by only one of the
> >> replicas (e.g. that replica's disk has an error) while the other two
> >> are OK and were written successfully, is returning an error to the
> >> client reasonable? Why not just return success to the client and then
> >> recover the failed replica?
> >
> >
> > It is for ensuring the 3 replicas are consistent. Sheepdog's interface is
> > a virtual disk, so consistency is more important than availability.
>
> Sheepdog has no journal to guarantee the replicas' consistency, so it
> must return an error to the client when any replica write fails, wait
> until consistency is recovered, and only then continue writing.
>
> Without any consistency log, sheepdog's recovery logic will scan all the
> replicas' objects and check whether each object needs to be recovered.
> Am I right?
>

Yes, right. Sheepdog provides a very minimal metadata transaction
(assigning VIDs to VDIs) based on virtual synchrony, but it basically
doesn't provide general metadata transactions.
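
To illustrate the consequence: without a journal, the only way to decide
whether an object needs recovery is to compare the stored replica copies
themselves. Below is a toy, standalone C program (my own illustration, not
sheepdog code; the file names are made up) showing the kind of
byte-for-byte comparison such a scan boils down to:

    /* Toy illustration (not sheepdog code): compare two replica copies
     * of an object byte-for-byte and report whether a repair would be
     * needed.  Real recovery also has to enumerate all objects and talk
     * to peer nodes, which is why it is expensive. */
    #include <stdio.h>
    #include <stdlib.h>

    static int files_differ(const char *a, const char *b)
    {
        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
        int ca, cb, differ = 0;

        if (!fa || !fb) {
            perror("fopen");
            exit(1);
        }
        do {
            ca = fgetc(fa);
            cb = fgetc(fb);
            if (ca != cb) {      /* mismatch or different lengths */
                differ = 1;
                break;
            }
        } while (ca != EOF);

        fclose(fa);
        fclose(fb);
        return differ;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s replicaA replicaB\n", argv[0]);
            return 1;
        }
        printf(files_differ(argv[1], argv[2]) ?
               "objects differ: recovery needed\n" : "objects match\n");
        return 0;
    }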


>
> >
> >>
> >>
> >> >
> >> >>
> >> >> 2) If some replicas write successfully, others fail, and failure is
> >> >> returned to the client, how is the replicas' data consistency handled
> >> >> (nodes where the write succeeded have new data, nodes where it failed
> >> >> have old data)? If the client reads the same block, will it read new
> >> >> data or old data?
> >> >
> >> >
> >> > In such a case, we need to repair consistency with the "dog vdi check"
> >> > command.
> >> > Note that in such a case the failed VDIs won't be accessed from VMs
> >> > anymore because they will be used in read-only mode.
> >>
> >> Does this mean data can't be read from this VDI until recovery is done?
> >> I remember that in an old version of sheepdog, the read I/O path first
> >> checked the replicas' consistency and then read the data,
> >> but I can't find that logic anymore in the latest version.
> >
> >
> > The data can be read, and it actually works well in many cases.
>
> Consider such a case: a write request succeeds on replica A but fails on
> replica B, so an error is returned to the client (and replica C never
> receives the write request). Now replica A has new data and replica C has
> old data. When the client then reads the VDI, will it always read the new
> data, always read the old data, or sometimes read new data and sometimes
> old data before "dog vdi check"?
>

The read results will be inconsistent: sometimes old and sometimes the
latest. This is because sheepdog issues each read request to a randomly
chosen replica for load balancing.
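
Here is a minimal sketch (my own illustration, not sheepdog's actual code)
of why that happens: each read goes to one randomly picked replica, so a
stale copy may be chosen until "dog vdi check" repairs the VDI:

    /* Illustration of random replica selection for reads.  We pretend
     * replica 0 holds the new data and replicas 1-2 still hold the old
     * data after a partially failed write. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NR_COPIES 3

    int main(void)
    {
        const char *replica_data[NR_COPIES] = { "new", "old", "old" };

        srand(time(NULL));
        for (int i = 0; i < 5; i++) {
            int idx = rand() % NR_COPIES;   /* random read target */
            printf("read #%d from replica %d -> %s data\n",
                   i, idx, replica_data[idx]);
        }
        return 0;
    }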


>
> And will the object on replica B be part new and part old? How is the
> atomicity of an object write guaranteed?
>

Updating each replica is atomic. It is simply guaranteed by atomically
updating the file with the rename(2) technique. A mixed state of old and
new parts in a single replica cannot happen.
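
For reference, here is the general shape of that technique as a small
standalone C program (the file names are made up for illustration): write
the new contents to a temporary file, flush it, then rename(2) it over the
old file, so a reader sees either the whole old object or the whole new one:

    /* Write-then-rename pattern: rename(2) within the same filesystem
     * atomically replaces the destination, so no reader can observe a
     * half-written object file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void atomic_update(const char *path, const char *buf, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || write(fd, buf, len) != (ssize_t)len ||
            fsync(fd) < 0 || close(fd) < 0) {
            perror("write tmp file");
            exit(1);
        }
        if (rename(tmp, path) < 0) {    /* the atomic step */
            perror("rename");
            exit(1);
        }
    }

    int main(void)
    {
        const char *data = "new object contents\n";
        atomic_update("object.data", data, strlen(data));
        return 0;
    }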

Thanks,
Hitoshi


>
> > I'm not sure about that feature of the old version, but it seems too
> > costly for the ordinary read path. But reviving it as an option would be
> > reasonable. What do you think?
>
> Yes, it is costly.
>
> >
> > Thanks,
> > Hitoshi
> >
> >>
> >>
> >> >
> >> > Thanks,
> >> > Hitoshi
> >> >
> >> >>
> >> >>
> >> >> Thanks a lot.
> >> >
> >> >
> >
> >
>
> Thanks,
> Dong Wu
>