[sheepdog] [PATCH v2 0/5] using disks to generate vnodes instead of nodes
Hitoshi Mitake
mitake.hitoshi at gmail.com
Fri May 16 16:37:13 CEST 2014
At Fri, 16 May 2014 16:17:20 +0800,
Robin Dong wrote:
>
> Hi, Kazutaka and Hitoshi,
>
> Could you give some suggestions about this patchset?
I really like your idea. It should be very useful for machines with a
bunch of disks. Of course there are some points for improvement
(e.g. the introduced #ifdefs should be removed in the future), but it
seems to be a good first step.
Reviewed-by: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
BTW, I have one request: could you update "dog cluster info" to print
information about node changes?
Thanks,
Hitoshi
>
>
> 2014-05-09 16:18 GMT+08:00 Liu Yuan <namei.unix at gmail.com>:
>
> > On Wed, May 07, 2014 at 06:25:37PM +0800, Robin Dong wrote:
> > > From: Robin Dong <sanbai at taobao.com>
> > >
> > > When a disk fails in a sheepdog cluster, at present only the data on
> > > one node is moved to recover it. This process is very slow if the
> > > corrupted disk is very large (for example, 4TB).
> > >
> > > For example, suppose the cluster has three nodes (A, B and C), every
> > > node has two disks, and every disk is 4TB. The cluster uses an 8:4
> > > erasure code. When a disk on node A is corrupted, node A has to fetch
> > > 8 strips to regenerate each piece of corrupted data. To regenerate the
> > > 4TB of data, it fetches 4 * 8 = 32TB from remote nodes, which is very
> > > inefficient.
> > >
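The fetch cost above, restated as a tiny illustrative program (the
numbers simply mirror the example in the mail):

#include <stdio.h>

int main(void)
{
	double disk_tb = 4.0;  /* size of the failed disk */
	int strips_read = 8;   /* strips fetched per rebuilt strip (8:4 EC) */

	/* node A alone fetches 4 * 8 = 32TB to rebuild one 4TB disk */
	printf("fetched by node A: %.0f TB\n", disk_tb * strips_read);
	return 0;
}
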
> > > The solution to accelerate recovery is to use disks to generate the
> > > vnodes, so that the failure of one disk causes the whole cluster to
> > > reweight and move data.
> > >
> > > Taking the example above, all the vnodes in the hash ring are
> > > generated from disks. Therefore, when a disk is gone, all the vnodes
> > > after it take part in the recovery work; that is, almost all the disks
> > > in the cluster share the 4TB of data. In other words, the cluster uses
> > > the 5 remaining disks to store the regenerated data, so each disk only
> > > needs to receive 4 / 5 = 0.8TB.
> > >
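If I read the patchset right, the core change amounts to placing
(node, disk, index) tuples on the hash ring instead of per-node
tuples, so a dead disk's objects are re-spread over every surviving
disk (4 / 5 = 0.8TB each in this example). A minimal sketch of that
idea; struct vnode, hash64() and VNODES_PER_DISK are illustrative
names, not the actual Sheepdog code:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define VNODES_PER_DISK 64

struct vnode {
	uint64_t hash;     /* position on the hash ring */
	uint32_t node_id;  /* owning node */
	uint32_t disk_id;  /* owning disk within that node */
};

/* 64-bit FNV-1a, a stand-in for whatever hash the ring really uses */
static uint64_t hash64(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = 14695981039346656037ULL;

	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Generate vnodes per disk instead of per node: when one disk dies,
 * only its own vnodes leave the ring, and the objects they owned are
 * re-spread over all remaining disks in the cluster.
 */
static int gen_disk_vnodes(uint32_t node_id, uint32_t disk_id,
			   struct vnode *out)
{
	for (int i = 0; i < VNODES_PER_DISK; i++) {
		uint64_t key[3] = { node_id, disk_id, (uint64_t)i };

		out[i].hash = hash64(key, sizeof(key));
		out[i].node_id = node_id;
		out[i].disk_id = disk_id;
	}
	return VNODES_PER_DISK;
}

int main(void)
{
	struct vnode v[VNODES_PER_DISK];
	int n = gen_disk_vnodes(0, 1, v);

	printf("disk 0/1 -> %d vnodes, first at %016llx\n",
	       n, (unsigned long long)v[0].hash);
	return 0;
}
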
> >
> > Kazutaka and Hitoshi, any comments? Providing a means for users to
> > configure the disk, instead of the whole node, as the basic ring unit
> > might be interesting to users who care more about recovery performance.
> >
> > Thanks
> > Yuan
> >
>
>
>
> --
> Best Regards
> Robin Dong
>