[sheepdog] Is it necessary for outstanding io block leave/join event?

Sat May 19 12:02:31 CEST 2012

On 05/19/2012 02:01 AM, MORITA Kazutaka wrote:

> At Thu, 17 May 2012 15:48:07 +0800,
> Liu Yuan wrote:
>>
>> On 05/17/2012 03:40 PM, MORITA Kazutaka wrote:
>>
>>> I thought that one advantage of the simple_store driver was that it
>>> uses the link() syscall to copy objects from older local epochs to the
>>> current epoch, so we could avoid many I/Os in the recovery process.
>>> However, it seems that the link operation of the farm driver is not
>>> called at all in my environment.  Does Farm do the recovery process in
>>> a different way from the simple_store driver?
>>
>>
>> farm_link() will be called for multiple-node events and in very
>> unusual corner cases. Actually, for the case you describe Farm works in
>> a more optimal way: nothing is done for objects that are not to be
>> migrated to other nodes, which saves a link() system call compared
>> with simple store.
> 
> If that is true, I would like to see the implementation in the recovery
> core code instead of in the farm driver.  But does the optimization work
> correctly?  I couldn't find the code which tries to avoid the
> redundant link calls, and actually the farm driver couldn't recover
> objects correctly with the following testcase:
> 

I can't simply code it in the core recovery code because simple store
and farm don't agree on the underlying layout (simple store assumes
epoch/oid, while farm assumes only oid as its naming method). But if we
remove simple store, we might get better core code.
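
To make the layout difference concrete, here is a minimal sketch. The
directory names and the object ID are hypothetical and only mimic the two
naming schemes (epoch/oid versus oid only); they are not the actual
on-disk paths of either driver.

```bash
#!/bin/bash
# Illustration only: the paths and the oid below are made up for this sketch.
set -e
root=$(mktemp -d)
oid=00000001

# epoch/oid naming (simple store style): one directory per epoch.
mkdir -p "$root/simple/1" "$root/simple/2"
echo "payload" > "$root/simple/1/$oid"
# An object that stays on this node still needs a name in the new epoch,
# which costs one link() call (issued here by ln).
ln "$root/simple/1/$oid" "$root/simple/2/$oid"

# oid-only naming (farm style): the name does not depend on the epoch,
# so an object that stays on this node needs no operation at all.
mkdir -p "$root/farm"
echo "payload" > "$root/farm/$oid"

# Both epoch names refer to the same inode, so no data was copied.
if [ "$root/simple/1/$oid" -ef "$root/simple/2/$oid" ]; then
    echo "epoch 1 and epoch 2 share one inode"
fi
```

Because the two schemes name the same object differently, a shared
"skip the link" shortcut in the core code would have to know which
scheme is in use.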

> [Testcase script]
> ==
> #!/bin/bash
> 
> set -ex
> 
> STORE=$1
> 
> # start three sheep daemons
> for i in 0 1 2; do
>     ./sheep/sheep /store/$i -z $i -p 700$i -W
> done
> 
> sleep 1
> ./collie/collie cluster format -c 2 -b $STORE
> 
> # create a pre-allocated vdi
> ./collie/collie vdi create test 80M -P
> 
> # stop the 3rd sheep
> pkill -f "sheep /store/2"
> 
> # write data to the vdi
> cat /dev/urandom | ./collie/collie vdi write test
> 
> # restart the 3rd sheep
> ./sheep/sheep /store/2 -z 2 -p 7002 -W
> 
> # wait for object recovery to finish
> sleep 10
> 
> # show md5sum of the vdi on each node
> for i in 0 1 2; do
>     ./collie/collie vdi read test -p 700$i | md5sum
> done
> ==
>

Very good test script. I've drafted a patch for it; with this patch,
farm works as well as expected.

> [Results]
> 
>  $ ./testcase.sh simple
>  ...
>  (snip)
>  ...
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7000
>  + md5sum
>  6ebd372401d0848734293709bb7b3cb7  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7001
>  + md5sum
>  6ebd372401d0848734293709bb7b3cb7  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7002
>  + md5sum
>  6ebd372401d0848734293709bb7b3cb7  -
> 
>  $ ./testcase.sh farm
>  ...
>  (snip)
>  ...
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7000
>  + md5sum
>  ef8bd9bbc1f140979405ac08abd24541  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7001
>  + md5sum
>  dee273206981c7f821061310eac90cd3  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7002
>  + md5sum
>  ca74a3b2e031a20b03c3baa4af9ab9c5  -
> 
>>
>> This helps Farm outperform simple store for recovery, because most
>> objects are not migrated at all during a recovery.
> 
> I'm fine with dropping the simple driver if the above kinds of
> problems are planned to be fixed in the farm driver.  I hope that
> correctness will be regarded as more important than performance.
> 

Sure, I think farm can meet the needs of correctness. There might be
some bugs lingering, like the example above, but that doesn't
necessarily mean farm can't fix them.
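
To catch this kind of bug automatically, the md5sum comparison in the
test case could be turned into a pass/fail check. A rough sketch follows;
the helper name all_lines_identical is my own and not part of sheepdog.

```bash
#!/bin/bash
# all_lines_identical (hypothetical helper): reads lines from stdin and
# succeeds only if there is exactly one distinct line.
all_lines_identical() {
    [ "$(sort -u | wc -l)" -eq 1 ]
}

# Example with the three checksums reported for the simple store run.
printf '%s\n' \
    "6ebd372401d0848734293709bb7b3cb7  -" \
    "6ebd372401d0848734293709bb7b3cb7  -" \
    "6ebd372401d0848734293709bb7b3cb7  -" \
    | all_lines_identical && echo "replicas are consistent"
```

In the test script, the output of the final loop over the three ports
could be piped into this helper, so that with "set -e" the script fails
as soon as the nodes return different data.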

Thanks,
Yuan