<div dir="ltr">Hi Dong, really sorry for my late reply.<div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 27, 2016 at 5:53 PM, Dong Wu <span dir="ltr"><<a href="mailto:archer.wudong@gmail.com" target="_blank">archer.wudong@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">2016-05-27 16:14 GMT+08:00 Hitoshi Mitake <<a href="mailto:mitake.hitoshi@gmail.com">mitake.hitoshi@gmail.com</a>>:<br>

><br>

><br>

> On Thu, May 26, 2016 at 4:19 PM, Dong Wu <<a href="mailto:archer.wudong@gmail.com">archer.wudong@gmail.com</a>> wrote:<br>

>><br>

>> Thanks for your reply.<br>

>><br>

>> 2016-05-26 10:34 GMT+08:00 Hitoshi Mitake <<a href="mailto:mitake.hitoshi@gmail.com">mitake.hitoshi@gmail.com</a>>:<br>

>> ><br>

>> ><br>

>> > On Tue, May 24, 2016 at 6:46 PM, Dong Wu <<a href="mailto:archer.wudong@gmail.com">archer.wudong@gmail.com</a>><br>

>> > wrote:<br>

>> >><br>

>> >> Hi Mitake,<br>

>> >> I have some questions about sheepdog's write policy.<br>

>> >> For replication, sheepdog writes 3 copies by default and provides<br>

>> >> strong consistency.<br>

>> >> My questions are:<br>

>> >> 1) If some replicas are written successfully and others fail, will it<br>

>> >> keep retrying the write until all 3 replicas succeed? And if fewer<br>

>> >> than 3 nodes are left, will it write fewer than 3 replicas<br>

>> >> and return success to the client?<br>

>> ><br>

>> ><br>

>> > In the case of a disk or network I/O error, sheep returns an error to its<br>

>> > client immediately. In some cases (e.g. an epoch increase caused by a node<br>

>> > joining or leaving), it will retry.<br>

>><br>

>> Will the client retry? If the error is caused by only one of the<br>

>> replicas (e.g. that replica's disk has failed) and the other two were<br>

>> written successfully, is it reasonable to return an error to the client? Why<br>

>> not just return success to the client and then recover the failed replica?<br>

><br>

><br>

> It is to ensure that the 3 replicas are consistent. Sheepdog's interface is a<br>

> virtual disk, so consistency is more important than availability.<br>

<br>

</span>Sheepdog has no journal to guarantee consistency across replicas, so it<br>

has to return an error to the client when any replica write fails, wait<br>

until consistency is recovered, and only then continue writing.<br>

<br>

Without any consistency log, sheepdog's recovery logic will scan all the<br>

replicas' objects and check whether each object needs to be recovered. Am I<br>

right?<br></blockquote><div><br></div><div>Yes, that's right. Sheepdog provides only a very minimal metadata transaction (assigning VIDs to VDIs) using virtual synchrony; apart from that, it basically doesn't provide metadata transactions.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

><br>

>><br>

>><br>

>> ><br>

>> >><br>

>> >> 2) If some replicas are written successfully and others fail, and failure<br>

>> >> is returned to the client, how is the replicas' data consistency handled<br>

>> >> (the nodes where the write succeeded have new data, but the failed nodes<br>

>> >> have old data)? If the client reads the same block, will it read the new<br>

>> >> data or the old data?<br>

>> ><br>

>> ><br>

>> > In such a case, we need to repair consistency with the "dog vdi check"<br>

>> > command.<br>

>> > Note that in such a case the failed VDIs won't be accessed from VMs<br>

>> > anymore<br>

>> > because they will be used in read-only mode.<br>

>><br>

>> Does this mean data can't be read from this VDI until recovery is done?<br>

>> I remember that in older versions of sheepdog, the read I/O path first<br>

>> checked the replicas' consistency and then read the data,<br>

>> but I can't find that logic anymore in the latest version.<br>

><br>

><br>

> The data can be read and actually it would work well in many cases.<br>

<br>

</span>Consider this case: a write request succeeds on replicaA but<br>

fails on replicaB, and an error is returned to the client (so replicaC never<br>

receives the write request). Now replicaA has the new data and replicaC has<br>

the old data. When the client then reads the VDI, will it always read the<br>

new data, always the old data, or sometimes the new data and sometimes the old data<br>

before "dog vdi check"?<br></blockquote><div><br></div><div>The reads will be inconsistent: the result will sometimes be the old data and sometimes the latest data. This is because sheepdog issues read requests to the replicas in a random manner for load balancing.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

And could the object on replicaB end up partly new and partly old? How is<br>

an object write guaranteed to be atomic?<br></blockquote><div><br></div><div>The update of each replica is atomic. This is guaranteed simply by updating the file atomically with the rename(2) technique. A mixed state of old and new parts within a single replica cannot happen.</div><div><br></div><div>Thanks,</div><div>Hitoshi</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> I'm not sure about that feature of the old version, but it seems costly<br>

> for the ordinary read path. Still, reviving it as an option would be reasonable.<br>

> What do you think?<br>

<br>

</span>yes, it is costly.<br>

<span class=""><br>

><br>

> Thanks,<br>

> Hitoshi<br>

><br>

>><br>

>><br>

>> ><br>

>> > Thanks,<br>

>> > Hitoshi<br>

>> ><br>

>> >><br>

>> >><br>

>> >> Thanks a lot.<br>

>> ><br>

>> ><br>

><br>

><br>

<br>

</span>Thanks,<br>

Dong Wu<br>

</blockquote></div><br></div></div>
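<div>For reference, the rename(2) technique mentioned in the reply above can be sketched roughly as follows. This is only a minimal illustration in Python, not sheepdog's actual code (sheepdog itself is written in C); the helper name atomic_write and the temporary file prefix are hypothetical. The idea is to write the new content to a temporary file in the same directory, flush it to disk, and then rename it over the target, so a reader sees either the complete old object or the complete new one.</div>
<pre>
```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data (bytes) in one atomic step.

    Hypothetical helper for illustration only; not taken from sheepdog.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target so that
    # the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Push the new content to stable storage before it becomes visible.
            os.fsync(tmp.fileno())
        # os.replace maps to rename(2) on POSIX: the name switches from the
        # old file to the new one in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure the original file is untouched; drop the temporary file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        obj = os.path.join(d, "object")
        atomic_write(obj, b"old data")
        atomic_write(obj, b"new data")
        with open(obj, "rb") as f:
            print(f.read())
```
</pre>
<div>Keeping the temporary file in the same directory matters because rename(2) is only atomic within a single filesystem.</div>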