[sheepdog] Is it necessary for outstanding io block leave/join event?

Liu Yuan namei.unix at gmail.com
Sat May 19 12:02:31 CEST 2012


On 05/19/2012 02:01 AM, MORITA Kazutaka wrote:

> At Thu, 17 May 2012 15:48:07 +0800,
> Liu Yuan wrote:
>>
>> On 05/17/2012 03:40 PM, MORITA Kazutaka wrote:
>>
>>> I thought that one advantage of the simple_store driver was that it
>>> uses the link() syscall to copy objects from older local epochs to the
>>> current epoch, so we could avoid many I/Os in the recovery process.
>>> However, it seems that the link operation of the farm driver is not
>>> called at all in my environment.  Does Farm do the recovery process in
>>> a different way from the simple_store driver?
>>
>>
>> farm_link() will be called for multi-node events and in some very
>> unusual corner cases. Actually, for the case you describe, Farm works in
>> a more optimal way: it performs no operation at all on objects that
>> don't need to be migrated to other nodes, which saves a link() system
>> call compared with simple store.
> 
> If it is true, I wanted to see the implementation in the recovery core
> code instead of in the farm driver.  But does the optimization work
> correctly?  I couldn't find the code which tries to avoid the
> redundant link calls, and actually the farm driver couldn't recover
> objects correctly with the following testcase:
> 


I can't simply code it in the core recovery code because simple store
and farm don't agree on the underlying layout (simple store names
objects by epoch/oid, while farm uses only the oid). But if we remove
simple store, we might get better core code.
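
To illustrate the naming difference, here is a minimal shell sketch. The
helper functions and directory names are made up for illustration; they
are not the actual sheepdog code or on-disk paths. It only shows why an
epoch/oid layout needs a link() per unchanged object on every epoch
change, while an oid-only layout can leave such objects untouched.

```shell
#!/bin/bash
# Hypothetical layout sketch: names and paths are assumptions for
# illustration, not the real sheepdog store drivers.

# simple store style: one directory per epoch, object named by oid inside it
simple_path() {   # usage: simple_path <base> <epoch> <oid>
    printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# farm style: object named by oid only, independent of the epoch
farm_path() {     # usage: farm_path <base> <oid>
    printf '%s/%s\n' "$1" "$2"
}

# Carry an unchanged object from one epoch to the next in the epoch/oid
# layout. ln creates a hard link (the link() syscall), so no object data
# is copied, but one call per object is still required.
simple_carry_over() {   # usage: simple_carry_over <base> <old_epoch> <new_epoch> <oid>
    mkdir -p "$1/$3"
    ln "$(simple_path "$1" "$2" "$4")" "$(simple_path "$1" "$3" "$4")"
}
```

With the oid-only naming, the path returned by farm_path stays valid
across epochs, so an object that stays on the same node needs no
carry-over step at all.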

> [Testcase script]
> ==
> #!/bin/bash
> 
> set -ex
> 
> STORE=$1
> 
> # start three sheep daemons
> for i in 0 1 2; do
>     ./sheep/sheep /store/$i -z $i -p 700$i -W
> done
> 
> sleep 1
> ./collie/collie cluster format -c 2 -b $STORE
> 
> # create a pre-allocated vdi
> ./collie/collie vdi create test 80M -P
> 
> # stop the 3rd sheep
> pkill -f "sheep /store/2"
> 
> # write data to the vdi
> cat /dev/urandom | ./collie/collie vdi write test
> 
> # restart the 3rd sheep
> ./sheep/sheep /store/2 -z 2 -p 7002 -W
> 
> # wait for object recovery to finish
> sleep 10
> 
> # show md5sum of the vdi on each node
> for i in 0 1 2; do
>     ./collie/collie vdi read test -p 700$i | md5sum
> done
> ==
>


Very good test script. I've drafted a patch for it; with this patch,
farm works as expected.
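
The final comparison step of the script could also be checked
automatically instead of by eye. The helper below is a hypothetical
addition, not part of the original test script: it reads md5sum-style
lines on stdin and succeeds only if every checksum is identical.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original test script.
# Reads md5sum-style lines ("<checksum>  <name>") on stdin and returns 0
# only if at least one line was read and all checksums are identical.
all_checksums_match() {
    local sum rest first="" count=0
    # The "|| [ -n "$sum" ]" part handles a last line without a newline.
    while read -r sum rest || [ -n "$sum" ]; do
        count=$((count + 1))
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            return 1
        fi
    done
    [ "$count" -gt 0 ]
}

# Possible use at the end of the test script (same commands as its loop):
#   for i in 0 1 2; do
#       ./collie/collie vdi read test -p 700$i | md5sum
#   done | all_checksums_match || echo "replicas differ"
```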

 
> [Results]
> 
>  $ ./testcase.sh simple
>  ...
>  (snip)
>  ...
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7000
>  + md5sum
>  6ebd372401d0848734293709bb7b3cb7  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7001
>  + md5sum
>  6ebd372401d0848734293709bb7b3cb7  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7002
>  + md5sum
>  6ebd372401d0848734293709bb7b3cb7  -
> 
>  $ ./testcase.sh farm
>  ...
>  (snip)
>  ...
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7000
>  + md5sum
>  ef8bd9bbc1f140979405ac08abd24541  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7001
>  + md5sum
>  dee273206981c7f821061310eac90cd3  -
>  + for i in 0 1 2
>  + ./collie/collie vdi read test -p 7002
>  + md5sum
>  ca74a3b2e031a20b03c3baa4af9ab9c5  -
> 
>>
>> This helps Farm outperform simple store for recovery, because most
>> objects do not need to be migrated at all during a recovery.
> 
> I'm fine with dropping the simple driver if the above kinds of
> problems are planned to be fixed in the farm driver.  I wish
> correctness would be regarded as more important than performance.
> 


Sure, I think farm can meet the correctness requirements. There may be
some lingering bugs like the example above, but that doesn't mean they
can't be fixed in farm.

Thanks,
Yuan


