[Sheepdog] Need inputs on performing some operations atomically

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Mon Oct 4 20:08:35 CEST 2010


At Sun, 3 Oct 2010 21:03:08 +0530,
Narendra Prasad Madanapalli wrote:
> 
> Thanks, Kazutaka, for explaining the relationship between epochs and
> objs. I still have gray areas to be clarified. Please see my comments
> inline below.
> 
> >>
> >> I need some more clarification on epochs, objects & cluster nodes. I
> >> have observed the following:
> >>
> >> 1. When a single node is started, /store_dir contains only epoch/,
> >> obj/, & sheep.log
> >> 2. Then I ran format copies=3; it created a file epoch/00000001 and
> >> a dir obj/00000001/
> >> 3. Now I started a new node; it created the following files on the new node:
> >> /store_dir/epoch/00000001
> >> /store_dir/epoch/00000002
> >> /store_dir/obj/00000002
> >> /store_dir/epoch/00000002/list
> >
> > `epoch` is a version number of the node membership.
> >
> > In your case,
> >
> >  epoch   node membership
> >     1   only one node (1st node)
> >     2   two nodes (1st node and 2nd node)
> >
> > and /store_dir/epoch/{epoch number} contains this information.
> >
> > Sheepdog objects are stored in the /store_dir/obj/{current epoch}/
> > directory.  If the node membership is changed and the epoch number is
> > incremented, all objects are re-stored to the newer directory
> > according to the current epoch number.  This behavior supports strong
> > data consistency and prevents VMs from reading older objects after
> > writing newer data.
> >
> > /store_dir/epoch/{epoch number}/list contains all the object IDs which
> > must be recovered from the older epoch.  This information is created
> > just after the node membership is changed.
> >
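As a side note, here is a minimal sketch in C of how the per-epoch object
path described above could be composed.  The helper name get_obj_path()
and the exact format string are illustrative assumptions, not the actual
sheep code:

==

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: build the path of an object for a given epoch,
     * following the layout described above:
     *   /store_dir/obj/{epoch}/{object id}
     * The directory name is the zero-padded decimal epoch, and the file
     * name is the 16-digit hexadecimal object id (e.g. 80a5fae100000000). */
    static void get_obj_path(char *buf, size_t len, const char *store_dir,
                             uint32_t epoch, uint64_t oid)
    {
        snprintf(buf, len, "%s/obj/%08" PRIu32 "/%016" PRIx64,
                 store_dir, epoch, oid);
    }

    /*
     * get_obj_path(path, sizeof(path), "/store_dir", 2, 0x80a5fae100000000ULL)
     * yields "/store_dir/obj/00000002/80a5fae100000000".
     */

==
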
> >> 4. At this stage, I created a new VM. This creates a vdi on either node.
> >> /store_dir/epoch/00000002/80a5fae100000000
> >
> > 80a5fae100000000 is the vdi object file of the newly created VM.  The
> > filename is its object id.
> >
> >> 5. On the 2nd node, I took the n/w down. The following additional files
> >> were created:
> >> 1st Node:
> >> /store_dir/obj/00000003/80a5fae100000000
> >> /store_dir/obj/00000003/list
> >>
> >> 2nd Node:
> >> /store_dir/obj/00000004
> >> /store_dir/obj/00000004
> >
> > On the 1st node, the epoch information should be
> >
> >  epoch   node membership
> >     1   only one node (1st node)
> >     2   two nodes (1st node and 2nd node)
> >     3   only one node (1st node)
> >
> > and on the 2nd node,
> >
> >  epoch   node membership
> >     1   only one node (1st node)
> >     2   two nodes (1st node and 2nd node)
> >     3   only one node (1st node or 2nd node)
> >     4   no node in the sheepdog cluster
> >
> > This situation is the same as a network partition, and sheepdog cannot
> > handle that now...  The current sheepdog implementation assumes that all
> > nodes have the same epoch information.  In this case, the 2nd node
> > should die just after the network becomes unavailable.
> 
> I noticed the sheep daemon keeps running even after the n/w failure. Do
> we need to exit sheep gracefully?

Yes, that is a simple approach.  A much better approach is to stop
incrementing the epoch of the unreachable sheep daemon and make the
daemon unusable until the network is back.
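
As a rough sketch only (the flag, the hook, and the check below are
assumptions for illustration, not existing sheep identifiers), the idea
could look like this:

==

    #include <stdbool.h>

    /* hypothetical global flag: set when this node loses contact with the
     * cluster, cleared when the network comes back */
    static bool network_down;

    /* assumed hook, called from the membership change handler when this
     * node finds itself isolated because of a link failure */
    static void handle_network_failure(void)
    {
        /* do not bump the epoch here; just refuse to serve requests
         * until the node rejoins the cluster */
        network_down = true;
    }

    /* guard at the top of the request handling path; the caller would
     * return an error to the client instead of touching any objects */
    static int check_node_usable(void)
    {
        if (network_down)
            return -1;
        return 0;
    }

==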

> 
> >>
> >> However, this is still a gray area to me as I cannot get a clear idea
> >> of it. I understand sd_deliver is responsible for cluster events and
> >> sd_conchg is for any configuration changes in cluster nodes.
> >>
> >> It would be great if you could provide insights into the algorithm and
> >> details of the relationship among epoch, obj & cluster nodes. I believe
> >> this would shed light on solving the atomic operation problems.
> >
> > The atomic update problem is how to recover the object file to a
> > consistent state (the object file may be partially updated if a total
> > node failure occurs).  I think implementing logging features could
> > solve the problem.
> 
> 1. Total Node Failure:
> Is this case something like the following?
> Assume n+1 nodes (one master node), all n nodes fail, and the object
> being updated is not available on the master node.

No.  Total node failure means that all the nodes in the sheepdog cluster
fail at the same time because of a sudden power failure, etc.  Even if
some nodes in the cluster fail during I/O operations, Sheepdog selects
other nodes and can continue the operations.  But if all nodes fail at
the same time, Sheepdog must recover from the inconsistent state when it
starts up next time.

> Or is this something different from the above?
> 
> 2. Logging Features:
> The current code base already logs the list of VDIs to be recovered
> when the node membership changes.
> Is any other information required to be logged when the node membership
> changes that can be used to solve the atomic operation problems?
> 

Sorry, I meant something like write-ahead logging or journaling in file
systems.

> 
> I would appreciate it if you could provide pointers on logging features
> and how to use them later for taking appropriate action. Please provide
> code-level details to start with, as I am a bit confused and cannot
> visualize under what conditions recovery should be performed when a node
> failure occurs while performing create_vdi_obj() or verify_object().

Performing create_vdi_obj() atomically is difficult because the
operation accesses objects on different nodes.  Currently, I don't have
a smart idea for handling that.

Updating a vdi object atomically in store_queue_request_local() wouldn't
be difficult.  The JBD code in the Linux kernel (fs/jbd) and the WAL
code in SQLite (src/wal.c) could be good examples.  Moreover, I think we
can implement atomicity in a much simpler way because we can assume that
there is no concurrent access to the same vdi object.

Here is an example in pseudo code:

==

    /* in store_queue_request_local */

    /* recovery: a journal left over from a crash is replayed first if it
     * is complete (has the end mark), then removed */
    if (journal_file_exists() && journal_has_end_mark()) {

        apply_journal_to_target_vdi_object();

        remove_journal_file();
    }

    if (opcode == SD_OP_READ_OBJ) {

        read_data_from_target_vdi_object();

    } else if (opcode == SD_OP_WRITE_OBJ) {

        /* log the update before touching the vdi object */
        create_journal_file();
        write_offset_and_size_to_journal();
        write_data_to_journal();
        write_end_mark_to_journal();

        write_data_to_target_vdi_object();

        remove_journal_file();
    }

==
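
A slightly more concrete sketch of the write path, using plain POSIX
calls (the helper name journaled_write(), the journal layout, and the
end-mark value below are illustrative assumptions, not existing sheep
code):

==

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define JOURNAL_END_MARK 0x4a454e44U  /* "JEND", an assumed magic value */

    struct journal_head {
        uint64_t offset;  /* offset of the update in the vdi object */
        uint64_t size;    /* length of the data that follows */
    };

    /* log the intent (offset, size, data) and an end mark to the journal,
     * then apply the update to the vdi object and drop the journal */
    static int journaled_write(const char *journal_path, const char *obj_path,
                               const void *data, uint64_t offset, uint64_t size)
    {
        struct journal_head head = { .offset = offset, .size = size };
        uint32_t end = JOURNAL_END_MARK;
        int fd;

        /* 1. write what we are about to do, and make it durable */
        fd = open(journal_path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, &head, sizeof(head)) != (ssize_t)sizeof(head) ||
            write(fd, data, size) != (ssize_t)size ||
            write(fd, &end, sizeof(end)) != (ssize_t)sizeof(end) ||
            fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd);

        /* 2. apply the update to the real vdi object */
        fd = open(obj_path, O_WRONLY);
        if (fd < 0)
            return -1;
        if (pwrite(fd, data, size, offset) != (ssize_t)size || fsync(fd) < 0) {
            close(fd);
            return -1;
        }
        close(fd);

        /* 3. the update is durable; the journal is no longer needed */
        unlink(journal_path);
        return 0;
    }

==

On the next startup (or at the top of store_queue_request_local), if the
journal file exists and ends with the mark, the logged data can simply be
re-applied to the vdi object before serving requests; if the end mark is
missing, the journal is incomplete, so it is discarded and the vdi object
is left untouched.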

Thanks,

Kazutaka


