[Sheepdog] [PATCH 1/2] sheep: handle node change event first

MORITA Kazutaka morita.kazutaka at gmail.com
Sun Apr 1 06:16:26 CEST 2012


At Sun, 01 Apr 2012 11:58:00 +0800,
Liu Yuan wrote:
> 
> On 04/01/2012 11:41 AM, MORITA Kazutaka wrote:
> 
> > At Sat, 31 Mar 2012 18:31:00 +0800,
> > Liu Yuan wrote:
> >>
> >> On 03/31/2012 06:23 PM, MORITA Kazutaka wrote:
> >>
> >>> Many bad effects.  For example, imagine that join messages are
> >>> processed in the different order with other nodes.
> >>
> >>
> >> Maybe not. I notice that every call to start_cpg_event_work() will drain
> >> the cpg queue, so this change ensures that confchg will be
> >> handled for sure, regardless of other requests.
> > 
> > No, membership change events are blocked until all outstanding I/O
> > requests are flushed or the previous membership change event is
> > finished.  There are cases where the cpg queue is not empty after
> > start_cpg_event_work() has been called.
> > 
> >>
> >> We both run dd in guests and loop creating a new vdi and deleting
> >> that vdi during the join/leave test.
> >>
> >> All seems good so far... look at the sequence for joining 60 nodes
> > 
> > This is a timing problem.  I think the problem could happen in other
> > environments.
> > 
> > Let's take another approach.  Here is my suggestion:
> > 
> >  - Use different queues for I/O requests and membership events.
> >  - When the membership queue is empty, we can process I/O requests as
> >    usual.
> >  - When the membership queue is not empty, flush all outstanding I/Os.
> >    New I/O requests are blocked until the membership queue becomes
> >    empty.
> >  - SD_OP_SHUTDOWN and SD_OP_MAKE_FS should be pushed to the membership
> >    queue, and other operations are pushed to the I/O request queue.
> > 
> 
> 
> I considered split queues. Initially, I planned to solve it that way.
> But after analysis, I don't think it is necessary.
> 
> I think we need to handle the membership change first, before flushing
> I/O requests, because I/O requests don't know the routing; we need to
> feed them the freshest membership.
> 
> The timing problem doesn't exist at all. The cluster driver assures us
> of the order of events, and I just reorder notify & confchg with I/O
> events. The internal order of notify and confchg is maintained; these
> two patches and the whole cpg working mechanism allow both notify &
> confchg to work as 'once it happens, it is handled immediately'.

Consider the following simple case:

 1. There are two nodes, A and B.
 2. Only node A has the outstanding I/O.
 3. New nodes, C and D, join to Sheepdog.
 4. Node B processes two join messages, "join C" and "join D".
 5. Node A doesn't process the join messages until the outstanding
    I/O is finished.  If you push the join messages to the head of the
    cpg queue, the messages are processed in the reverse order ("join
    D" first) because start_cpg_event_work processes events from the
    head of the queue.

Thanks,

Kazutaka


