On Fri, May 4, 2012 at 11:36 AM, Huxinwei <huxinwei at huawei.com> wrote: > I tried Strategy 2 like this:**** > > for (v=0, o = 0; v < SD_MAX_VNODES; v++, o++) {**** > > if (o >= nr) o = 0;**** > > vnodes[v].id = UINT64_MAX / SD_MAX_VNODES * v;**** > > vnodes[v].node = &(nodes[o]);**** > > vnodes[v].node_idx = o;**** > > }**** > > for (v=0; v < SD_MAX_VNODES; v++) {**** > > c = ((unsigned int)(v+random()))%SD_MAX_VNODES;**** > > vnode = vnodes[v];**** > > vnodes[v] = vnodes[c];**** > > vnodes[c] = vnode;**** > > }**** > > ** ** > > It surely helps to ensure an equally distribution.**** > > But on the other hand, it’s important to track the mapping from vnode to > node then, to minimize the data movement when join/leave happens.**** > > It’s basically how swift (from openstack) do. And I found it overwhelm for > sheepdog. > With strategy 2, data movement are are distributed to all of nodes when node join/leave. I have not found idea to decrease the impact. > **** > > ** ** > > Strategy 3 is basically how glusterfs do.**** > > But, recording hash space also means you’ll face a lot of data movement > again.**** > > glusterfs use “link” extensively to avoid data moving. Should we do that > too ? > I think recording hash space like this. Then the data movement maybe a little. Use one 4MB object to store the hash information. The information include hash token and bitmap for every node. The bitmap marks how much hash space has been used. When new node joins the cluster, check all of the bitmaps of every node and find the right hash space for the new node. The object can have multi copies with the same contents. For example, nodes A B C D E. Every bitmap have 8 bits. Node C bitmap: 1 1 1 1 0 0 0 0 Node D bitmap: 0 0 1 1 0 0 0 0 Then I can assign the blank hash space between C and D to the new coming node. It is just a simple assumption. I have not got the right solution. With this assumption, all of the nodes read/write one object that records hash infomation, so there is race condition probably. > **** > > ** ** > > BTW: I found the easiest way to improve distribution while maintain the > simplicity of current algo. Is to increase VNODES_PER_NODE.**** > > The downside is, of course, more memory to use for the vnode_list. L > Yes, I agreed with you. Perhaps every strategy has its proper condition and has different configurations to tune its efficiency. Thanks Haiti > **** > > ** ** > > FYI.**** > ------------------------------ > > Xinwei Hu > 华为技术有限公司 Huawei Technologies Co., Ltd. > **[image: Company_logo]** > > Phone: > Fax: > Mobile: > Email: huxinwei at huawei.com > 地址:深圳市龙岗区坂田华为基地 邮编:518129 > Huawei Technologies Co., Ltd. > Bantian, Longgang District,Shenzhen 518129, P.R.China > http://www.huawei.com **** > ------------------------------ > > 本邮件及其附件含有华为公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁 > 止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中 > 的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件! > This e-mail and its attachments contain confidential information from > HUAWEI, which > is intended only for the person or entity whose address is listed above. > Any use of the > information contained herein in any way (including, but not limited to, > total or partial > disclosure, reproduction, or dissemination) by persons other than the > intended > recipient(s) is prohibited. If you receive this e-mail in error, please > notify the sender by > phone or email immediately and delete it!**** > > *From:* sheepdog-bounces at lists.wpkg.org [mailto: > sheepdog-bounces at lists.wpkg.org] *On Behalf Of *HaiTing Yao > *Sent:* Friday, May 04, 2012 10:24 AM > *To:* sheepdog at lists.wpkg.org > *Subject:* [Sheepdog] Add more hash strategy**** > > ** ** > > Hi,**** > > **** > > I read codes about hash now and tried to do something. I need some > comments to do it properly. **** > > **** > > Strategy 1 (We used now)**** > > **** > > Random tokens per node**** > > **** > > Partition by random token. The partition token is produced by hash > function with node IP/port.**** > > **** > > Strategy 2**** > > **** > > Random tokens per node**** > > **** > > Equal sized partitions. The hash space is divided into Q equally sized > partitions. Q is the number of nodes or vnodes.**** > > **** > > Strategy 3**** > > **** > > Random tokens per node**** > > **** > > Equal sized partitions. When there is node join/leaving, the equal size > may be broken to minimize the data movement.**** > > **** > > For strategy 3, need write hash space to disk. I think I can use the > object to record the infomation.**** > > **** > > Do we need add the strategy 2/3, or there is better strategies? **** > > **** > > Thanks **** > > Haiti**** > > **** > > **** > > **** > -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.wpkg.org/pipermail/sheepdog/attachments/20120504/1512fb9d/attachment.html> |