[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption

Liu Yuan namei.unix at gmail.com
Thu Jan 30 03:07:37 CET 2014


On Thu, Jan 30, 2014 at 10:53:39AM +0900, Hitoshi Mitake wrote:
> At Thu, 30 Jan 2014 02:21:54 +0800,
> Liu Yuan wrote:
> > 
> > On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > 
> > > Current sheepdog cannot handle corruption of inode objects. For
> > > example, if members like name or nr_copies of sd_inode are broken by
> > > silent data corruption on disks, even initialization of sheep
> > > processes fails, because sheep and dog themselves interpret the
> > > content of inode objects.
> > 
> > Is there any resource that confirms so-called 'silent data corruption'? Modern
> > disks have built-in error correction codes (RS) for each sector, so as far as I
> > know a disk returns either EIO or the full data. I've never seen a real 'silent
> > data corruption' in person. I know many people suspect it can happen, but I think
> > we need real proof of it because most of the time it is a false positive.
> 
> This paper is a major source on "silent data corruption":
> https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
> 
> Of course the corruption happens rarely. But it can happen so we
> should handle it. So we have the majority voting mechanism of "dog vdi
> check", no?
> 
> > 
> > > To detect such corruption of inode objects, this patch adds a
> > > new cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > > passed as an option of cluster format (dog cluster format -i), sheep
> > > processes belonging to the cluster do the following:
> > > 
> > > - when the sheep updates an inode object, it stores the sha1 value of
> > >   the object in an xattr (default_write())
> > > - when the sheep reads an inode object, it calculates the sha1 value of
> > >   the inode object. Then it compares the calculated value with the
> > >   stored one. If these values differ, the read fails with an error
> > >   (default_read()).
> > > 
> > > This checking mechanism prevents interpretation of corrupted inode
> > > objects by sheep.
> > 
> > I don't think we should implement this check in the sheep. It's better to do
> > it in dog as a 'check' plugin because
> > 
> > - no need to introduce an incompatible physical layout (extra xattr)
> 
> This patch doesn't produce incompatibility. The used xattr is "user.obj.sha1".
> 
> In addition, the check is done on a cluster which is formatted with
> the new -i option. It doesn't affect existing clusters.
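
The write-time hashing and read-time verification described in the commit log
can be sketched roughly as follows. This is only an illustration, not the patch
itself: ToyStore, CorruptedObjectError and the in-memory dictionaries are
hypothetical stand-ins for the backend store and its xattrs. Only the xattr
name "user.obj.sha1" comes from the thread.

```python
import hashlib

XATTR_SHA1 = "user.obj.sha1"  # xattr name quoted in the thread


class CorruptedObjectError(Exception):
    """Raised when the stored and recomputed digests differ (hypothetical)."""


class ToyStore:
    """In-memory stand-in for a backend store; not sheepdog's real code."""

    def __init__(self):
        self.data = {}    # oid -> object bytes
        self.xattrs = {}  # oid -> {xattr name -> bytes}

    def write(self, oid, payload):
        # On update, record the SHA-1 of the object alongside it.
        self.data[oid] = bytes(payload)
        digest = hashlib.sha1(payload).digest()
        self.xattrs.setdefault(oid, {})[XATTR_SHA1] = digest

    def read(self, oid):
        # On read, recompute the SHA-1 and compare it with the stored one.
        payload = self.data[oid]
        stored = self.xattrs[oid][XATTR_SHA1]
        if hashlib.sha1(payload).digest() != stored:
            raise CorruptedObjectError(f"object {oid!r} failed sha1 check")
        return payload
```

A read of an object whose bytes changed behind the store's back raises
CorruptedObjectError instead of returning data that would then be interpreted.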

> > - dog is in a better position to do quorum recovery. Even if all the hashes
> >   are valid, you can't ensure consistency between different
> >   copies.
> 
> No. As I described in the commit log, dog cannot detect corruption of
> an inode object by itself. E.g. if store_policy is corrupted, an ordinary
> replicated vdi can be seen as a hypervolume. If nr_copies is
> corrupted, vdi checking itself becomes bogus.

Why not? Our check first gets the hash of each inode, so any data corruption
in some inode(s) will be detected, and possibly fixed if it is in the minority, no?
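
A minimal sketch of the hash-based majority check referred to above, assuming
each replica of an inode is available as a byte string. The function name
majority_check and its return shape are hypothetical and are not taken from
dog's actual code.

```python
import hashlib
from collections import Counter


def majority_check(replicas):
    """Compare replica hashes and single out the minority.

    Returns (winning_digest, bad_indices). If no digest is held by a
    strict majority, returns (None, all indices) since nothing can be trusted.
    """
    if not replicas:
        raise ValueError("need at least one replica")
    digests = [hashlib.sha1(r).hexdigest() for r in replicas]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes * 2 <= len(digests):
        # No strict majority: the check cannot decide which copy is good.
        return None, list(range(len(digests)))
    bad = [i for i, d in enumerate(digests) if d != winner]
    return winner, bad
```

With three copies and one corrupted, the corrupted index is reported and the
other two can serve as the repair source. With only two differing copies the
check cannot pick a winner.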

> > - as you mentioned, performance is awful. Checking only inodes is basically
> >   useless: users care about their data, not your internal metadata. If we add
> >   it to data, performance is unacceptably bad.
> 
> The point is that we must keep sheep from interpreting corrupted
> inode objects. If an inode object is corrupted, vdi checking cannot be
> done, as I described above.

I don't see any reason dog can't check for corrupted inodes.

> > - current backend store (plain store + MD) is far more complex than it appears.
> >   I have recently spent days fixing a tricky data loss bug. So basically I don't
> >   want to add more tricky code to it without compelling reason(s). Right now,
> >   we should keep the backend store 100% reliable before we can do anything useful
> 
> Which bug? I want to see the patch for it.

Patch is ready and I'm trying to write a test for it.

>
> BTW, on second thought, I think the current patch is okay for
> detecting corruption of inode objects. Voluntary recovery by the sheep
> process (which is mentioned as future work) wouldn't be
> required. Simply producing an emergency message and exiting is enough.
> Because the corruption happens rarely, administrators should remove
> the corrupted inodes manually when it happens. What do you think?

Yes for manual removal. But checking can be done (should be done) by dog because
we should try our best to keep sheep code as simple as possible for maintenance.

Thanks
Yuan


