[Sheepdog] Need inputs on performing some operations atomically

Narendra Prasad Madanapalli narendramind at gmail.com
Sun Oct 3 17:33:08 CEST 2010


Thanks Kazutaka for explaining the relationship between epochs and
objects. I still have some gray areas to clarify. Please see my
comments inline below.

>>
>> I need some more clarifications on epoch, objects & cluster nodes. I
>> have observed the following
>>
>> 1. When a single node is started, /store_dir contains only epoch/,
>> obj/, & sheep.log
>> 2. Then I ran 'format copies=3'; it created a file epoch/00000001 and
>> a dir obj/00000001/
>> 3. Now I started a new node, it created the following files in the new node:
>> /store_dir/epoch/00000001
>> /store_dir/epoch/00000002
>> /store_dir/obj/00000002
>> /store_dir/epoch/00000002/list
>
> `epoch` is a version number of the node membership.
>
> In your case,
>
>  epoch   node membership
>     1   only one node (1st node)
>     2   two nodes (1st node and 2nd node)
>
> and /store_dir/epoch/{epoch number} contains this information.
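
Just to confirm my reading of the on-disk layout: both trees are keyed
by the same zero-padded epoch number, something like the following (my
own sketch, not actual sheepdog code; STORE_DIR is an assumption based
on the listings above):

#include <stdio.h>

#define STORE_DIR "/store_dir"   /* base directory, per the listing above */

/* "/store_dir/epoch/00000002"-style path for a given epoch. */
static void epoch_path(char *buf, size_t len, int epoch)
{
        snprintf(buf, len, "%s/epoch/%08d", STORE_DIR, epoch);
}

/* "/store_dir/obj/00000002"-style path for a given epoch. */
static void obj_path(char *buf, size_t len, int epoch)
{
        snprintf(buf, len, "%s/obj/%08d", STORE_DIR, epoch);
}
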
>
> Sheepdog objects are stored in the /store_dir/obj/{current epoch}/
> directory.  If the node membership changes and the epoch number is
> incremented, all objects are re-stored to the newer directory
> according to the current epoch number.  This behavior supports strong
> data consistency and prevents VMs from reading older objects after
> writing newer data.
>
> /store_dir/epoch/{epoch number}/list contains all the object IDs that
> must be recovered from the older epoch.  This information is created
> just after the node membership changes.
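
So, if I follow, recovery walks the list file and brings each object
forward from the previous epoch's directory. Roughly like this (a
simplified sketch of my understanding; I am assuming the list file
holds one hex object ID per line, and copy_file() is a placeholder for
the real transfer, which presumably can read from a remote node rather
than the local disk):

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

extern int copy_file(const char *src, const char *dst);  /* placeholder */

/* Sketch: bring forward the objects listed for the new epoch. */
static int restore_objects(int new_epoch)
{
        char list[256], from[256], to[256], line[64];
        FILE *fp;
        uint64_t oid;

        snprintf(list, sizeof(list), "/store_dir/epoch/%08d/list",
                 new_epoch);
        fp = fopen(list, "r");
        if (!fp)
                return -1;

        while (fgets(line, sizeof(line), fp)) {
                oid = strtoull(line, NULL, 16);
                snprintf(from, sizeof(from),
                         "/store_dir/obj/%08d/%016" PRIx64,
                         new_epoch - 1, oid);
                snprintf(to, sizeof(to),
                         "/store_dir/obj/%08d/%016" PRIx64,
                         new_epoch, oid);
                copy_file(from, to);
        }
        fclose(fp);
        return 0;
}
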
>
>> 4. At this stage, I created a new VM. This creates a vdi object on
>> both nodes:
>> /store_dir/epoch/00000002/80a5fae100000000
>
> 80a5fae100000000 is a vdi object file of the newly created VM.  The
> filename is its object id.
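
Out of curiosity, I tried to decode that ID. My understanding (the
macro names and bit positions below are my own assumption, please
correct me if the layout is different) is that the top bit marks a vdi
object and the vdi id sits above bit 32:

#include <inttypes.h>
#include <stdio.h>

#define VDI_BIT         (UINT64_C(1) << 63)  /* assumed: set for vdi objects */
#define VDI_SPACE_SHIFT 32                   /* assumed: vdi id position     */

int main(void)
{
        uint64_t oid = UINT64_C(0x80a5fae100000000);
        uint32_t vid = (uint32_t)((oid & ~VDI_BIT) >> VDI_SPACE_SHIFT);

        /* prints: vdi object, vid = a5fae1 */
        printf("%s object, vid = %" PRIx32 "\n",
               (oid & VDI_BIT) ? "vdi" : "data", vid);
        return 0;
}
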
>
>> 5. On the 2nd node, I brought the network down. The following
>> additional files were created:
>> 1st Node:
>> /store_dir/obj/00000003/80a5fae100000000
>> /store_dir/obj/00000003/list
>>
>> 2nd Node:
>> /store_dir/obj/00000004
>
> In the 1st node, epoch information should be
>
>  epoch   node membership
>     1   only one node (1st node)
>     2   two nodes (1st node and 2nd node)
>     3   only one node (1st node)
>
> and in the 2nd node,
>
>  epoch   node membership
>     1   only one node (1st node)
>     2   two nodes (1st node and 2nd node)
>     3   only one node (1st node or 2nd node)
>     4   no node in the sheepdog cluster
>
> This situation is the same as a network partition, and sheepdog cannot
> handle that now...  The current sheepdog implementation assumes that
> all nodes have the same epoch information.  In this case, the 2nd node
> should die just after the network becomes unavailable.

I noticed the sheep daemon keeps running even after the network
failure. Do we need to make sheep exit gracefully?
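
If a graceful exit is the right behavior, I imagine a check like this
wherever the daemon learns the cluster's current epoch (a sketch with
invented names, not existing sheepdog code):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: refuse to keep serving with a diverged epoch history. */
static void check_epoch_or_die(unsigned local_epoch, unsigned cluster_epoch)
{
        if (local_epoch != cluster_epoch) {
                fprintf(stderr,
                        "epoch mismatch (local %u, cluster %u), exiting\n",
                        local_epoch, cluster_epoch);
                exit(1);
        }
}
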

>>
>> However, this is still a gray area to me, as I cannot get a clear
>> idea of it. I understand sd_deliver is responsible for cluster events
>> and sd_confchg is for any configuration changes in the cluster nodes.
>>
>> It would be great if you could provide insights into the algorithm
>> and details of the relationship among epochs, objects & cluster
>> nodes. I believe this would shed light on solving the atomic
>> operation problems.
>
> The atomic update problem is how to recover the object file to a
> consistent state (the object file may be partially updated if a total
> node failure occurs).  I think implementing logging features could
> solve the problem.

1. Total Node Failure:
Is this case something like the following?
Assume n+1 nodes (one of them the master node), all n of the other
nodes fail, and the object being updated is not available on the master
node.
Or is it something different from the above?

2. Logging Features:
The current code base already logs the list of VDIs to be recovered
when the node membership changes.
Is any other information required to be logged when the node membership
changes that could be used to solve the atomic operation problems? (See
the sketch below for what I have in mind.)
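
To make question 2 concrete, here is the kind of write-ahead scheme I
imagine; all of the names below are hypothetical, not existing sheepdog
code:

#include <stdint.h>

enum rec_state { REC_INTENT, REC_COMMIT };

/* One journal record per object update (hypothetical layout). */
struct journal_rec {
        uint64_t oid;      /* object being updated     */
        uint64_t offset;   /* byte offset of the write */
        uint32_t length;   /* length of the write      */
        uint32_t state;    /* REC_INTENT or REC_COMMIT */
};

/* Placeholders for the actual I/O paths, not sheepdog APIs. */
extern int journal_append(const struct journal_rec *rec); /* append + fsync */
extern int obj_write(uint64_t oid, const void *buf,
                     uint64_t off, uint32_t len);          /* write + fsync */

/* Crash-safe update: an intent record without a matching commit
 * record identifies a possibly torn write. */
static int atomic_obj_update(uint64_t oid, const void *buf,
                             uint64_t off, uint32_t len)
{
        struct journal_rec rec = { oid, off, len, REC_INTENT };

        if (journal_append(&rec) < 0)
                return -1;
        if (obj_write(oid, buf, off, len) < 0)
                return -1;
        rec.state = REC_COMMIT;
        return journal_append(&rec);
}

With that, the only extra information to log per update would be the
(oid, offset, length) triple plus the intent/commit state.
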


I would appreciate it if you could provide pointers on the logging
features and how to use them later for taking appropriate action.
Please provide code-level details to start with, as I am a bit confused
and cannot visualize under what conditions recovery should be performed
when a node failure occurs while performing create_vdi_obj() or
verify_object().
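
For instance, is the intended recovery condition something like the
following startup scan (continuing the journal sketch above, again with
invented helper names)?

/* Continuing the journal sketch above. */
extern struct journal_rec *journal_first(void);
extern struct journal_rec *journal_next(struct journal_rec *rec);
extern int find_commit(const struct journal_rec *rec);
extern void schedule_recovery(uint64_t oid);

/* Sketch: on restart, an intent record without a matching commit
 * marks an object with a possibly torn write; it must be recovered
 * from a replica before being served again. */
static void scan_journal(void)
{
        struct journal_rec *rec;

        for (rec = journal_first(); rec != NULL; rec = journal_next(rec)) {
                if (rec->state == REC_INTENT && !find_commit(rec))
                        schedule_recovery(rec->oid);
        }
}
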

Thanks,
Narendra.


