<div dir="ltr">Hi Hitoshi,<div><br></div><div>Thanks for your suggestion.</div><div><br></div><div>I will update "dog cluster info" soon.<br><div class="gmail_extra"><br><br><div class="gmail_quote">2014-05-16 22:37 GMT+08:00 Hitoshi Mitake <span dir="ltr"><<a href="mailto:mitake.hitoshi@gmail.com" target="_blank">mitake.hitoshi@gmail.com</a>></span>:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">At Fri, 16 May 2014 16:17:20 +0800,<br>

Robin Dong wrote:<br>

><br>

<div class="">> Hi, Kazutaka and Hitoshi<br>

><br>

> Could you give some suggestions about this patchset?<br>

<br>

</div>I really like your idea. It should be useful for machines with a bunch of<br>

disks. Of course, there are some points for improvement<br>

(e.g. the introduced #ifdefs should be removed in the future), but it<br>

seems to be a good first step.<br>

<br>

Reviewed-by: Hitoshi Mitake <<a href="mailto:mitake.hitoshi@lab.ntt.co.jp">mitake.hitoshi@lab.ntt.co.jp</a>><br>

<br>

BTW, I have one request: could you update "dog cluster info" to<br>

print information about node changes?<br>

<br>

Thanks,<br>

Hitoshi<br>

<div><div class="h5"><br>

><br>

><br>

> 2014-05-09 16:18 GMT+08:00 Liu Yuan <<a href="mailto:namei.unix@gmail.com">namei.unix@gmail.com</a>>:<br>

><br>

> > On Wed, May 07, 2014 at 06:25:37PM +0800, Robin Dong wrote:<br>

> > > From: Robin Dong <<a href="mailto:sanbai@taobao.com">sanbai@taobao.com</a>><br>

> > ><br>

> > > When a disk fails in a sheepdog cluster, at present the data is recovered<br>

> > > by moving data within that one node only. This process is very slow if the<br>

> > > corrupted disk is very large (for example, 4TB).<br>

> > ><br>

> > > For example, suppose the cluster has three nodes (A, B, C), every node has<br>

> > > two disks, and every disk is 4TB in size. The cluster uses 8:4 erasure coding.<br>

> > > When a disk on node A is corrupted, node A tries to fetch 8 copies to<br>

> > > regenerate each corrupted piece of data. To regenerate 4TB of data, it<br>

> > > fetches 4 * 8 = 32TB of data from remote nodes, which is very inefficient.<br>

> > ><br>

> > > The solution to speed up recovery is to generate vnodes from disks, so that<br>

> > > the failure of one disk causes the whole cluster to reweight and move data.<br>

> > ><br>

> > > Taking the example above, all the vnodes in the hash ring are generated from<br>

> > > disks. Therefore, when a disk is gone, all the vnodes after it do the recovery<br>

> > > work; that is, almost all the disks in the cluster share the 4TB of data.<br>

> > > This means the cluster uses the 5 remaining disks to store the regenerated<br>

> > > data, so each disk only needs to receive 4 / 5 = 0.8TB of data.<br>

> > ><br>

> ><br>

> > Kazutaka and Hitoshi, any comments? Providing a means for users to configure<br>

> > the disk instead of the whole node as the basic ring unit might be interesting<br>

> > to some users who care more about recovery performance.<br>

> ><br>

> > Thanks<br>

> > Yuan<br>

> ><br>

><br>

><br>

><br>

> --<br>

> --<br>

> Best Regard<br>

> Robin Dong<br>

</div></div>

><br>

</blockquote></div><br><br clear="all"><div><br></div>-- <br>--<br>Best Regard<br>Robin Dong

</div></div></div>