[Sheepdog] [PATCH 1/2] sheep: handle node change event first
Liu Yuan
namei.unix at gmail.com
Sun Apr 1 07:24:23 CEST 2012
On 04/01/2012 12:16 PM, MORITA Kazutaka wrote:
> At Sun, 01 Apr 2012 11:58:00 +0800,
> Liu Yuan wrote:
>>
>> On 04/01/2012 11:41 AM, MORITA Kazutaka wrote:
>>
>>> At Sat, 31 Mar 2012 18:31:00 +0800,
>>> Liu Yuan wrote:
>>>>
>>>> On 03/31/2012 06:23 PM, MORITA Kazutaka wrote:
>>>>
>>>>> Many bad effects. For example, imagine that join messages are
>>>>> processed in a different order than on other nodes.
>>>>
>>>>
>>>> Maybe not. I notice that every call to start_cpg_event_work() drains
>>>> the cpg queue, so this change assures us that confchg will be handled
>>>> for sure, regardless of other requests.
>>>
>>> No, membership change events are blocked until all outstanding I/O
>>> requests are flushed or the previous membership change event is
>>> finished. There are cases where the cpg queue is not empty after
>>> start_cpg_event_work() has been called.
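>>>
>>> To make that concrete, the gating looks roughly like this (just a
>>> sketch; the names are illustrative, not the exact ones in the sheep
>>> code):
>>>
>>> #include <stdbool.h>
>>>
>>> enum cpg_event_type { CPG_EVENT_CONCHG, CPG_EVENT_REQUEST };
>>>
>>> struct cpg_event {
>>>     enum cpg_event_type ctype;
>>>     struct cpg_event *next;       /* simple FIFO of queued events */
>>> };
>>>
>>> static struct cpg_event *cpg_queue; /* head of the event queue */
>>> static int nr_outstanding_io;       /* in-flight I/O requests */
>>> static bool confchg_running;        /* previous confchg unfinished */
>>>
>>> static void process_event(struct cpg_event *ev) { (void)ev; /* ... */ }
>>>
>>> static void start_cpg_event_work(void)
>>> {
>>>     struct cpg_event *ev = cpg_queue;
>>>
>>>     if (!ev)
>>>         return;
>>>     /* A confchg at the head stays queued while I/Os are outstanding
>>>      * or an earlier confchg is still running, so the queue can be
>>>      * non-empty when this function returns. */
>>>     if (ev->ctype == CPG_EVENT_CONCHG &&
>>>         (nr_outstanding_io > 0 || confchg_running))
>>>         return;
>>>     cpg_queue = ev->next;
>>>     if (ev->ctype == CPG_EVENT_CONCHG)
>>>         confchg_running = true;   /* cleared when the work completes */
>>>     process_event(ev);
>>> }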
>>>
>>>>
>>>> We both run dd in guests and loop creating a new vdi and deleting
>>>> that vdi during the join/leave test.
>>>>
>>>> All seems good so far... look at the sequence for joining 60 nodes.
>>>
>>> This is a timing problem. I think the problem would happen in other
>>> environments.
>>>
>>> Let's take another approach. Here is my suggestion:
>>>
>>> - Use different queues for I/O requests and membership events.
>>> - When the membership queue is empty, we can process I/O requests as
>>>   usual.
>>> - When the membership queue is not empty, flush all outstanding I/Os.
>>>   New I/O requests are blocked until the membership queue becomes
>>>   empty.
>>> - SD_OP_SHUTDOWN and SD_OP_MAKE_FS should be pushed to the membership
>>>   queue, and other operations to the I/O request queue (a sketch of
>>>   this dispatch rule follows).
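>>>
>>> A minimal sketch of that dispatch rule (again, the queue and helper
>>> names here are illustrative, not actual sheep code):
>>>
>>> #include <stdbool.h>
>>>
>>> struct event_queue { int len; /* ... */ };
>>>
>>> static struct event_queue io_queue;         /* ordinary I/O requests */
>>> static struct event_queue membership_queue; /* confchg, SHUTDOWN, MAKE_FS */
>>> static int nr_outstanding_io;               /* I/Os already in flight */
>>>
>>> static bool queue_empty(struct event_queue *q) { return q->len == 0; }
>>> static void process_io(struct event_queue *q) { (void)q; /* ... */ }
>>> static void process_membership(struct event_queue *q) { (void)q; /* ... */ }
>>>
>>> static void dispatch(void)
>>> {
>>>     if (queue_empty(&membership_queue)) {
>>>         /* No membership event pending: I/O proceeds as usual. */
>>>         process_io(&io_queue);
>>>         return;
>>>     }
>>>     /* A membership event is pending: stop feeding new I/Os and wait
>>>      * for the outstanding ones to drain, then handle the event. */
>>>     if (nr_outstanding_io == 0)
>>>         process_membership(&membership_queue);
>>> }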
>>>
>>
>>
>> I considered split queues. Initially, I planned to solve it that way.
>> But after analysis, I don't think it is necessary.
>>
>> I think we need to handle the membership change first, before flushing
>> I/O requests, because I/O requests don't know the routing; we need to
>> feed them the freshest membership.
>>
>> The timing problem doesn't exist at all. The cluster driver assures us
>> of the order of events, and I just reorder notify & confchg relative to
>> I/O events. The internal order of notify and confchg is maintained;
>> these two patches and the whole cpg working mechanism allow both notify
>> & confchg to work as 'once it happens, it is handled immediately'.
>
> Consider the following simple case:
>
> 1. There are two nodes, A and B.
> 2. Only node A has outstanding I/O.
> 3. New nodes, C and D, join Sheepdog.
> 4. Node B processes two join messages, "join C" and "join D".
> 5. Node A doesn't process the join messages until the outstanding
> I/O is finished. If you push the join messages to the head of the
> cpg queue, the messages are processed in the reverse order ("join
> D" first) because start_cpg_event_work() processes events from the
> head of the queue.
>
No, this doesn't happen. With my patch, step 5 becomes:

5. Node A processes 'join C' first, then 'join D', because
1) the cluster driver broadcasts the join messages in this order, and
2) start_cpg_event_work() strictly serializes the sequence, which
guarantees processing 'join C' before 'join D'. That is, 'join D'
cannot be inserted ahead of 'join C' until 'join C' has been processed.
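
In code terms, the serialization is enforced at completion time; here
is a sketch (the names are illustrative, not the actual sheep code):

#include <stdbool.h>

/* At most one confchg work item is in flight at any moment. */
static bool confchg_running;

void start_cpg_event_work(void);    /* retries the head of the queue */

/* Completion handler for a confchg work item (e.g. 'join C'). Only
 * here may the next confchg ('join D') be dispatched, so confchg
 * events are handled strictly in arrival order. */
static void confchg_done(void)
{
    confchg_running = false;
    start_cpg_event_work();
}
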
Thanks,
Yuan