[Sheepdog] Need inputs on performing some operations atomically

MORITA Kazutaka morita.kazutaka at lab.ntt.co.jp
Tue Sep 21 11:40:10 CEST 2010


At Sun, 19 Sep 2010 08:30:16 +0530,
Narendra Prasad Madanapalli wrote:
> 
> Thanks Kazutaka.
> 
> I need some more clarifications on epoch, objects & cluster nodes. I
> have observed the following
> 
> 1. When a single node is started, /store_dir contains only epoch/,
> obj/, & sheep.log
> 2. Then I ran format copies=3; it created a file epoch/00000001 and
> a dir obj/00000001/
> 3. Now I started a new node; it created the following files on the new node:
> /store_dir/epoch/00000001
> /store_dir/epoch/00000002
> /store_dir/obj/00000002
> /store_dir/epoch/00000002/list

`epoch` is a version number of the node membership.

In your case,

 epoch   node membership
     1   only one node (1st node)
     2   two nodes (1st node and 2nd node)

and /store_dir/epoch/{epoch number} contains this information.

Sheepdog objects are stored in the /store_dir/obj/{current epoch}/
directory.  When the node membership changes and the epoch number is
incremented, all objects are re-stored into the newer directory
according to the current epoch number.  This behavior supports strong
data consistency and prevents VMs from reading older objects after
they have written newer data.

/store_dir/epoch/{epoch number}/list contains all the object IDs that
must be recovered from the older epoch.  This information is created
just after the node membership changes.
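
To make this concrete, here is a rough sketch (a simplification of mine,
not the actual code in sheep/store.c; STORE_DIR and obj_path() are
hypothetical names) of how an object's path depends on the current epoch:

/* illustrative sketch only, not the sheep implementation */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define STORE_DIR "/store_dir"

/* an object's on-disk path includes the epoch it was stored under */
static void obj_path(char *buf, size_t len, uint32_t epoch, uint64_t oid)
{
        snprintf(buf, len, "%s/obj/%08"PRIu32"/%016"PRIx64,
                 STORE_DIR, epoch, oid);
}

int main(void)
{
        char path[256];

        /* the vdi object from this thread, stored under epoch 2 */
        obj_path(path, sizeof(path), 2, 0x80a5fae100000000ULL);
        printf("%s\n", path);  /* /store_dir/obj/00000002/80a5fae100000000 */
        return 0;
}

With that layout, recovery after a membership change is essentially
copying every object id named in the old epoch's list file into the new
epoch directory.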

> 4. At this stage, I created a new VM. This creates a vdi on either node.
> /store_dir/epoch/00000002/80a5fae100000000

80a5fae100000000 is a vdi object file of the newly created VM.  The
filename is its object id.
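
For reference, if I remember the id layout correctly (treat the constants
below as an assumption of mine, not a quote of the headers), a vdi object
id is the vdi id (vid) shifted into the upper 32 bits with the VDI bit
set, so 80a5fae100000000 corresponds to vid 0xa5fae1:

/* assumed id layout; the authoritative definitions are in the sheep headers */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define VDI_BIT         (UINT64_C(1) << 63)
#define VDI_SPACE_SHIFT 32

static uint64_t vid_to_vdi_oid(uint32_t vid)
{
        return VDI_BIT | ((uint64_t)vid << VDI_SPACE_SHIFT);
}

int main(void)
{
        /* prints 80a5fae100000000, the filename seen above */
        printf("%016"PRIx64"\n", vid_to_vdi_oid(0xa5fae1));
        return 0;
}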

> 5. On the 2nd node I took the network down. The following additional
> files were created:
> 1st Node:
> /store_dir/obj/00000003/80a5fae100000000
> /store_dir/obj/00000003/list
> 
> 2nd Node:
> /store_dir/obj/00000004
> /store_dir/obj/00000004

On the 1st node, the epoch information should be

 epoch   node membership
     1   only one node (1st node)
     2   two nodes (1st node and 2nd node)
     3   only one node (1st node)

and on the 2nd node,

 epoch   node membership
     1   only one node (1st node)
     2   two nodes (1st node and 2nd node)
     3   only one node (1st node or 2nd node)
     4   no node in the sheepdog cluster

This situation is the same as a network partition, and sheepdog cannot
handle that yet...  The current sheepdog implementation assumes that
all nodes have the same epoch information.  In this case, the 2nd node
should die just after the network becomes unavailable.
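
In other words, a node that notices its locally stored epoch no longer
matches what the rest of the cluster agrees on must stop serving I/O.
A minimal sketch of that reaction (check_epoch() and its arguments are
hypothetical, just to illustrate the idea):

/* illustrative sketch only */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void check_epoch(uint32_t local_epoch, uint32_t cluster_epoch)
{
        if (local_epoch != cluster_epoch) {
                /*
                 * This node's view of the membership has diverged, e.g.
                 * it kept incrementing its epoch while it was cut off.
                 * It must not serve I/O with stale epoch information,
                 * so the safe reaction is to die.
                 */
                fprintf(stderr, "epoch mismatch: local %"PRIu32", cluster %"PRIu32"\n",
                        local_epoch, cluster_epoch);
                exit(1);
        }
}

int main(void)
{
        check_epoch(2, 2);  /* views agree, keep running */
        check_epoch(4, 3);  /* the partitioned 2nd node: dies here */
        return 0;
}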

> 
> However, this is still a gray area to me, as I cannot get a clear idea
> of it. I understand sd_deliver is responsible for cluster events and
> sd_conchg is for any configuration changes in cluster nodes.
> 
> It would be great if you could provide insights into the algorithm and
> details of the relationship among epoch, obj & cluster nodes. I believe
> this would shed light on solving the atomic operation problems.

The atomic update problem is how to recover the object file to a
consistent state (the object file may be partially updated if a total
node failure occurs).  I think implementing a logging feature could
solve the problem.
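
To illustrate the kind of atomicity we need, here is a minimal sketch of
one possible approach (write to a temporary file, then rename() it over
the object); this is a simplification of mine, not a design for sheep,
and a real solution would probably also need a log that survives crashes:

/* illustrative sketch only */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int atomic_update(const char *path, const void *buf, size_t len)
{
        char tmp[256];
        int fd;

        snprintf(tmp, sizeof(tmp), "%s.tmp", path);

        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -1;

        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
                close(fd);
                unlink(tmp);
                return -1;
        }
        close(fd);

        /*
         * rename() is atomic, so after a crash the object file is either
         * the complete old version or the complete new one, never a
         * partially written mix.
         */
        if (rename(tmp, path) < 0) {
                unlink(tmp);
                return -1;
        }
        return 0;
}

int main(void)
{
        const char data[] = "new object contents\n";

        return atomic_update("80a5fae100000000", data, sizeof(data) - 1);
}

A log (journal) approach would give the same guarantee and could also
record enough information for the next master to finish or discard a
half-done update such as create_vdi_obj().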


Thanks,

Kazutaka


> Thanks,
> Narendra.
> 
> On Wed, Sep 15, 2010 at 2:42 AM, MORITA Kazutaka
> <morita.kazutaka at lab.ntt.co.jp> wrote:
> > At Sun, 12 Sep 2010 19:41:34 +0530,
> > Narendra Prasad Madanapalli wrote:
> >>
> >> Hi,
> >>
> >> I found there are two functions that are to be executed atomically in
> >> sheep. These functions are below:
> >>
> >> 1. sheep/store.c:
> >>                         /* FIXME: need to update atomically */
> >> /*                      ret = verify_object(fd, NULL, 0, 1); */
> >> /*                      if (ret < 0) { */
> >> /*                              eprintf("failed to set checksum, %"PRIx64"\n", oid); */
> >> /*                              ret = SD_RES_EIO; */
> >> /*                              goto out; */
> >> /*                      } */
> >>
> >> 2. sheep/vdi.c:
> >> /* TODO: should be performed atomically */
> >> static int create_vdi_obj(uint32_t epoch, char *name, uint32_t
> >> new_vid, uint64_t size,
> >>                           uint32_t base_vid, uint32_t cur_vid, uint32_t copies,
> >>                           uint32_t snapid, int is_snapshot)
> >> {
> >>
> >> My understanding is that these two functions get executed in
> >> worker_routine() in response to queue_request() & queue_work().
> >>
> >> Solution for verify_object()
> >> Since this operates on a file descriptor, I think this can be performed
> >> with the help of a file locking mechanism.
> >
> > No.  Basically we don't need a lock mechanism for sheepdog objects;
> > all objects are categorized into the following two groups:
> >  - only one virtual machine can access the object
> >  - all virtual machines can read the object, but no one updates it
> >
> > What we need to do here is an atomic update of the vdi objects.  For
> > example, if a total node failure happens while updating a vdi object, we
> > need to roll back to the previous correct state.
> >
> >>
> >> Solution for create_vdi_obj()
> >> This can be fixed by introducing a global pthread lock.
> >
> > Same as above.  If the master node fails while creating a new vdi
> > object, the next master needs to take over the work or, more simply,
> > delete the object and return an error code to the administrator.
> >
> >
> > Thanks,
> >
> > Kazutaka
> >
> -- 
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog


