[sheepdog] [PATCH V2 00/11] INTRODUCE

Yunkai Zhang yunkai.me at gmail.com
Tue Aug 21 19:16:49 CEST 2012


On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka
<morita.kazutaka at lab.ntt.co.jp> wrote:
> At Mon, 20 Aug 2012 23:34:10 +0800,
> Yunkai Zhang wrote:
>>
>> In fact, I have thought about this method, but we would face nearly the
>> same problem:
>>
>> After a sheep joins back, it has to know which objects are dirty, and
>> it has to do the cleanup work (because old versions of objects remain in
>> its working directory). This method does not seem to save any steps, and
>> it will do extra recovery work.
>
> Can you give me a concrete example?
>
> I created a really naive patch to disable object recovery with my
> idea:


Hi Kazutaka,

I have read this patch and run a simple test with it; it works most of
the time.

But write operations get blocked in wait_forward_request(), so I think
there are some corner cases we still need to handle.

I think I understand the idea now; it is simple and clever.
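
To check my understanding, here is a minimal standalone sketch of the
scheduling rule in the quoted patch: an object is recovered only after a
request has touched it, and otherwise the caller has to retry later (the
patch re-arms a timer for that). This is not sheepdog code. The struct
lazy_recovery and both functions are hypothetical names, and moving a
touched oid to the front by swapping is my simplification of what
finish_schedule_oids() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for sheep's struct recovery_work. */
struct lazy_recovery {
	uint64_t *oids;       /* every object that may need recovery */
	size_t count;         /* number of entries in oids */
	size_t done;          /* objects recovered so far */
	size_t nr_scheduled;  /* objects touched by requests so far */
};

/*
 * A request touched 'oid': move it into the scheduled prefix of the list.
 * Returns false if the oid is unknown or was already scheduled.
 */
static bool lazy_schedule(struct lazy_recovery *lr, uint64_t oid)
{
	for (size_t i = lr->nr_scheduled; i < lr->count; i++) {
		if (lr->oids[i] == oid) {
			uint64_t tmp = lr->oids[lr->nr_scheduled];

			lr->oids[lr->nr_scheduled] = oid;
			lr->oids[i] = tmp;
			lr->nr_scheduled++;
			return true;
		}
	}
	return false;
}

/*
 * Pick the next object to recover.  Returns false when nothing is
 * scheduled, in which case the caller should try again later.
 */
static bool lazy_recover_next(struct lazy_recovery *lr, uint64_t *oid)
{
	assert(lr->done <= lr->nr_scheduled);
	assert(lr->nr_scheduled <= lr->count);

	if (lr->done < lr->nr_scheduled) {
		*oid = lr->oids[lr->done++];
		return true;
	}
	return false;
}
```

In this model, untouched objects are never picked, which matches the
"recovery doesn't start until the object is touched" lines in the test
output quoted below.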


Could you provide a more mature version of this patch? We would really
like to use it in our cluster as soon as possible.


Thank you!

>
> ==
> diff --git a/sheep/recovery.c b/sheep/recovery.c
> index 5164aa7..8bf032f 100644
> --- a/sheep/recovery.c
> +++ b/sheep/recovery.c
> @@ -35,6 +35,7 @@ struct recovery_work {
>         uint64_t *oids;
>         uint64_t *prio_oids;
>         int nr_prio_oids;
> +       int nr_scheduled_oids;
>
>         struct vnode_info *old_vinfo;
>         struct vnode_info *cur_vinfo;
> @@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
>                                 oid);
>                         return;
>                 }
> -       /* The oid is currently being recovered */
> -       if (rw->oids[rw->done] == oid)
> -               return;
>         rw->nr_prio_oids++;
>         rw->prio_oids = xrealloc(rw->prio_oids,
>                                  rw->nr_prio_oids * sizeof(uint64_t));
> @@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
>  done:
>         free(rw->prio_oids);
>         rw->prio_oids = NULL;
> +       rw->nr_scheduled_oids += rw->nr_prio_oids;
>         rw->nr_prio_oids = 0;
>  }
>
> +static struct timer recovery_timer;
> +
> +static void recover_next_object(void *arg)
> +{
> +       struct recovery_work *rw = arg;
> +
> +       if (rw->nr_prio_oids)
> +               finish_schedule_oids(rw);
> +
> +       if (rw->done < rw->nr_scheduled_oids) {
> +               /* Try recover next object */
> +               queue_work(sys->recovery_wqueue, &rw->work);
> +               return;
> +       }
> +
> +       /* There is no objects to be recovered.  Try again later */
> +       recovery_timer.callback = recover_next_object;
> +       recovery_timer.data = rw;
> +       add_timer(&recovery_timer, 1); /* FIXME */
> +}
> +
>  static void recover_object_main(struct work *work)
>  {
>         struct recovery_work *rw = container_of(work, struct recovery_work,
> @@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
>         resume_wait_obj_requests(rw->oids[rw->done++]);
>
>         if (rw->done < rw->count) {
> -               if (rw->nr_prio_oids)
> -                       finish_schedule_oids(rw);
> -
> -               /* Try recover next object */
> -               queue_work(sys->recovery_wqueue, &rw->work);
> +               recover_next_object(rw);
>                 return;
>         }
>
> @@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
>         resume_wait_recovery_requests();
>         rw->work.fn = recover_object_work;
>         rw->work.done = recover_object_main;
> -       queue_work(sys->recovery_wqueue, &rw->work);
> +       recover_next_object(rw);
>         return;
>  }
>
> ==
>
> I ran the following test, and object recovery was disabled correctly
> for both join and leave case.
>
> ==
> #!/bin/bash
>
> for i in 0 1 2 3; do
>     ./sheep/sheep /store/$i -z $i -p 700$i -c local
> done
>
> sleep 1
> ./collie/collie cluster format
>
> ./collie/collie vdi create test 4G
>
> echo " * objects will be created on node[0-2] *"
> md5sum /store/[0,1,2,3]/obj/807c2b2500000000
>
> pkill -f "./sheep/sheep /store/1"
> sleep 3
>
> echo " * recovery doesn't start until the object is touched *"
> md5sum /store/[0,2,3]/obj/807c2b2500000000
>
> ./collie/collie vdi snapshot test  # invoke recovery of the vdi object
> echo " * the object is recovered *"
> md5sum /store/[0,2,3]/obj/807c2b2500000000
>
> ./sheep/sheep /store/1 -z 1 -p 7001 -c local
> sleep 3
>
> echo " * recovery doesn't start until the object is touched *"
> md5sum /store/[0,1,2,3]/obj/807c2b2500000000
>
> ./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
> echo " * the object is recovered *"
> md5sum /store/[0,1,2,3]/obj/807c2b2500000000
> ==
>
> [Output]
>
> using backend farm store
>  * objects will be created on node[0-2] *
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b2500000000
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b2500000000
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b2500000000
>  * recovery doesn't start until the object is touched *
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b2500000000
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b2500000000
>  * the object is recovered *
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b2500000000
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b2500000000
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b2500000000
>  * recovery doesn't start until the object is touched *
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b2500000000
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b2500000000
>   Name        Id    Size    Used  Shared    Creation time   VDI id  Tag
> s test         1  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49   7c2b25
>   test         2  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49   7c2b26
>  * the object is recovered *
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b2500000000
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/1/obj/807c2b2500000000
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b2500000000
>
>
> I couldn't read an old object at all.
>
> Thanks,
>
> Kazutaka



-- 
Yunkai Zhang
Work at Taobao


