[sheepdog] Is it necessary for outstanding io block leave/join event?

Thu May 17 03:56:15 CEST 2012

On 05/17/2012 01:57 AM, MORITA Kazutaka wrote:

> At Wed, 16 May 2012 13:27:31 -0400,
> Christoph Hellwig wrote:
>>
>> On Thu, May 17, 2012 at 01:33:14AM +0900, MORITA Kazutaka wrote:
>>>> I think the rationale is to not change the cluster configuration while
>>>> I/O is in progress.  With the vnode_info structure making the cluster
>>>> state during an I/O operation explicit (together with hdr->epoch)
>>>> I suspect this isn't needed any more, but I want to do a full blown
>>>> audit of the I/O path first.
>>>
>>> The reason is that the recovery algorithm assumes that all objects in
>>> the older epoch are immutable, which means only the objects in the
>>> current epoch are writable.  If outstanding I/Os update objects in the
>>> previous epoch after they are recovered to the current epoch, their
>>> replicas result in inconsistent state.
>>

This assumption does not seem necessary, at least for Farm, where I/O is
always routed to objects in the working directory.

Letting outstanding I/O block confchg events risks both deadlock and
livelock.

We hit a catastrophic problem when doing heavy I/O on every node while
the cluster was in recovery.

Every node issues I/O requests while doing recovery. Outstanding I/O and
the unfinished confchg events block each other (nearly a deadlock), all
nodes are busy retrying the pending I/Os (livelock), and recovery
requests are mostly denied service. Neither the outstanding I/O nor
recovery moves on to completion.
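
To make the mutual blocking concrete, here is a toy model in C. It is
only a sketch under simplifying assumptions, not sheepdog code: every
name (toy_node, toy_step, toy_run) is hypothetical, a single node stands
in for the cluster, and recovery is treated as finishing instantly once
the confchg event has been handled.

```c
#include <stdbool.h>

/* Hypothetical toy state for one node; none of these names exist in sheepdog. */
struct toy_node {
	int  outstanding_io;   /* I/O requests issued but not yet completed */
	bool confchg_pending;  /* a leave/join event is waiting to be processed */
	bool recovered;        /* objects are recovered for the current epoch */
	int  retries;          /* how often pending I/O had to be retried */
};

/* One scheduling round. Returns true if anything made progress. */
static bool toy_step(struct toy_node *n)
{
	bool progress = false;

	/* Rule 1: the confchg event is handled only when no I/O is outstanding. */
	if (n->confchg_pending && n->outstanding_io == 0) {
		n->confchg_pending = false;
		n->recovered = true;   /* assumption: recovery finishes at once */
		progress = true;
	}

	/* Rule 2: I/O completes only after recovery; otherwise it is retried. */
	if (n->outstanding_io > 0) {
		if (n->recovered && !n->confchg_pending) {
			n->outstanding_io--;
			progress = true;
		} else {
			n->retries++;
		}
	}

	return progress;
}

/* Run max_rounds rounds; return how many of them made progress. */
static int toy_run(struct toy_node *n, int max_rounds)
{
	int progressed = 0;

	for (int i = 0; i < max_rounds; i++)
		if (toy_step(n))
			progressed++;

	return progressed;
}
```

Starting with outstanding_io > 0, confchg_pending set and recovered
clear, no round makes progress and only the retry counter grows, which
is the shape of the stall described above. Starting with no outstanding
I/O, the event is handled in the first round.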

We are trying to solve the problem, but the code is half-baked. Once the
code is done and public, we can continue the discussion on it.

I think both the recovery code and its assumptions need to be revisited,
and this is possibly a long-term issue.

Thanks,
Yuan