[sheepdog] [RFC] create onode before uploading object completed

Liu Yuan namei.unix at gmail.com
Mon Jan 6 10:30:49 CET 2014


On Mon, Jan 06, 2014 at 05:16:22PM +0800, Robin Dong wrote:
> Hi All,
> 
> At present, the implementation of the Swift interface for creating an
> object in sheepdog is:
> 
> 1. lock container
> 2. check whether an onode with the same object name already exists
> 3. unlock container
> 4. upload object
> 5. create onode
> 
> This sequence has a problem: if two clients upload the same object
> concurrently, two objects with the same name will be created in the
> container. To avoid duplicated names, we must put the "create onode"
> operation inside the container lock region.
> 
> Therefore we need to change the process of creating an object to:
> 
> 1. lock container
> 2. check whether the onode already exists
> 3. allocate data space for the object, create the onode, then write it to disk
> 4. unlock container
> 5. upload object
> 
> This routine avoids creating duplicated objects; a rough sketch follows.
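>
> In C, the new flow could look roughly like the sketch below. All the
> helper names and the SD_RES_OBJ_TAKEN code are illustrative assumptions,
> not existing sheepdog functions:
>
>     /* Sketch of the proposed create path; helpers are hypothetical. */
>     static int create_object(struct container *c, const char *name,
>                              const void *data, size_t len)
>     {
>         struct onode *onode;
>         int ret;
>
>         lock_container(c);                     /* step 1 */
>         if (onode_lookup(c, name)) {           /* step 2 */
>             unlock_container(c);
>             return SD_RES_OBJ_TAKEN;           /* lost the race */
>         }
>         /* step 3: reserve space, create the onode in INIT state and
>          * persist it while still holding the container lock */
>         onode = onode_create(c, name, len, ONODE_INIT);
>         unlock_container(c);                   /* step 4 */
>         if (!onode)
>             return SD_RES_NO_SPACE;
>
>         ret = upload_object(onode, data, len); /* step 5 */
>         if (ret != SD_RES_SUCCESS)
>             return ret;  /* onode stays INIT; user can DELETE and retry */
>
>         return onode_set_status(onode, ONODE_COMPLETED);
>     }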
> 
> There is one exception in the new routine: if the client aborts the upload,
> we will be left with an "upload uncompleted" onode.
> I think this problem is easy to solve: we can add a status field to the
> onode. A new onode will be set to "INIT", and after the upload completes,
> the onode will be set to "COMPLETED".
> So, when a user tries to GET an uncompleted object through the Swift
> interface, sheep will find that the onode is still "INIT", meaning "not
> completed", return "partial content" for the HTTP request, and the user
> can DELETE the object and upload it again.
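>
> For example, the status field and the GET-side check might look like this
> (the onode layout, the PARTIAL_CONTENT status and the helpers around it
> are assumptions of mine, not the actual sheep code):
>
>     enum onode_status {
>         ONODE_INIT,        /* created, upload not finished yet */
>         ONODE_COMPLETED,   /* upload finished, safe to GET */
>     };
>
>     struct onode {
>         char name[256];    /* object name */
>         uint8_t status;    /* enum onode_status */
>         /* ... data extents, size, ctime, etc. ... */
>     };
>
>     static int get_object(struct http_request *req, struct onode *onode)
>     {
>         if (onode->status != ONODE_COMPLETED) {
>             /* upload never finished: report partial content so the
>              * user can DELETE the object and upload it again */
>             http_response_header(req, PARTIAL_CONTENT);
>             return -1;
>         }
>         return serve_object(req, onode);  /* hypothetical helper */
>     }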
> 
> 
> Is there any suggestion for this new design ?
> 

I think this RFC means much more than solving that one problem: it basically
changes the semantics of simultaneous PUT, and the GET semantics a bit.

"
Amazon's semantics for simultaneous PUT:

Amazon S3 never adds partial objects; if you receive a success response, Amazon S3 added the entire object to the bucket.

Amazon S3 is a distributed system. If it receives multiple write requests for the same object simultaneously, it overwrites all but the last object written. Amazon S3 does not provide object locking; if you need this, make sure to build it into your application layer or use versioning instead.
"

But with this RFC,
- we add basic object locking for multiple writers, so that only the *first*
  one can succeed.
- we add an extra error code for GET: it can return 'partial content' if the
  object is not fully created (see the example below).
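
Roughly, the client-visible behavior would become (the 409 for the losing
PUT is my assumption; the RFC only specifies 'partial content' for GET):

    PUT /v1/a/c/obj    client A                    -> 201 Created
    PUT /v1/a/c/obj    client B, racing with A     -> 409 Conflict
    GET /v1/a/c/obj    before A finishes uploading -> 206 Partial Content
    GET /v1/a/c/obj    after A finishes            -> 200 OK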

This change looks good to me because
- we don't need to handle partial writes (due to power failure, system error,
  etc.) and can return 'partial content' to the client directly
- it saves clients from having to take care of concurrent puts of the same
  object, and makes sure only the earliest put will succeed.

What do others think of it?

Thanks
Yuan


