[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption

Thu Jan 30 02:53:39 CET 2014

At Thu, 30 Jan 2014 02:21:54 +0800,
Liu Yuan wrote:
> 
> On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > 
> > Current sheepdog cannot handle corruption of inode objects. For
> > example, if members of sd_inode such as name or nr_copies are broken
> > by silent data corruption on disk, even initialization of sheep
> > processes fails, because sheep and dog themselves interpret the
> > content of inode objects.
> 
> Is there any source that confirms so-called 'silent data corruption'?
> Modern disks have a built-in correction code (RS) for each sector, so
> either EIO or the full data is returned from the disk, as far as I know.
> I've never seen a real 'silent data corruption' in person. I know many
> people suspect it happens, but I think we need real proof of it,
> because most of the time it is a false positive.

This paper is a major source on "silent data corruption":
https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

Of course such corruption happens rarely. But it can happen, so we
should handle it. That is why we have the majority voting mechanism of
"dog vdi check", no?
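
To make the voting idea concrete, here is a minimal sketch of majority
voting over replica hashes. It is not dog's actual code: the function
name majority_vote() and the use of plain byte strings as replicas are
assumptions made only for illustration.

```python
import hashlib
from collections import Counter


def majority_vote(replicas):
    """Pick the replica content held by a strict majority of copies.

    `replicas` is a list of bytes, one entry per copy of an object.
    Returns (winning_bytes, indexes_of_minority_copies), or (None, [])
    when no content is held by more than half of the copies.
    """
    if not replicas:
        return None, []
    digests = [hashlib.sha1(r).hexdigest() for r in replicas]
    winner, count = Counter(digests).most_common(1)[0]
    if count * 2 <= len(replicas):
        # No strict majority: the vote cannot decide which copy is good.
        return None, []
    good = replicas[digests.index(winner)]
    bad = [i for i, d in enumerate(digests) if d != winner]
    return good, bad
```

Note that the vote only works if the checker knows how many copies to
compare, and that number comes from the inode object itself.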

> 
> > To detect such corruption of inode objects, this patch adds a
> > new cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > passed as an option of cluster format (dog cluster format -i), sheep
> > processes belonging to the cluster do the following:
> > 
> > - when the sheep updates an inode object, it stores the sha1 value of
> >   the object in an xattr (default_write())
> > - when the sheep reads an inode object, it calculates the sha1 value
> >   of the inode object and compares it with the stored one. If the
> >   values differ, the read fails with an error (default_read()).
> > 
> > This checking mechanism prevents interpretation of corrupted inode
> > objects by sheep.
> 
> I don't think we should implement this check in the sheep. It's better to do
> it in dog as a 'check' plugin because
> 
> - no need to introduce incompatible physical layout (extra xattr)

This patch doesn't introduce an incompatibility. The xattr it uses is
"user.obj.sha1".

In addition, the check is done only on a cluster that is formatted with
the new -i option. It doesn't affect existing clusters.
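
The write/read behavior described in the commit log can be sketched
roughly as below. This is a toy model, not the patch itself: ToyStore,
InodeCorruption and the in-memory dicts standing in for object files and
their xattrs are all assumptions for illustration. Only the xattr name
"user.obj.sha1" is taken from the patch, and every object in the toy
store is treated as an inode object.

```python
import hashlib

XATTR_NAME = "user.obj.sha1"  # xattr name used by the patch


class InodeCorruption(Exception):
    """Raised when the stored and recomputed hashes differ."""


class ToyStore:
    """In-memory stand-in for a backend store (hypothetical, not sheepdog code)."""

    def __init__(self, hash_check):
        self.hash_check = hash_check  # models the new cluster flag
        self.data = {}                # oid -> object bytes
        self.xattrs = {}              # oid -> {xattr name: value}

    def write(self, oid, payload):
        self.data[oid] = payload
        if self.hash_check:
            # On update, record the sha1 of the object next to it.
            digest = hashlib.sha1(payload).digest()
            self.xattrs.setdefault(oid, {})[XATTR_NAME] = digest

    def read(self, oid):
        payload = self.data[oid]
        if self.hash_check:
            # On read, recompute the sha1 and compare with the stored one.
            stored = self.xattrs[oid][XATTR_NAME]
            if hashlib.sha1(payload).digest() != stored:
                raise InodeCorruption("sha1 mismatch for object %r" % (oid,))
        return payload
```

With hash_check=False the store behaves as before and keeps no xattr,
which is the sense in which clusters formatted without -i are
unaffected.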

> - dog is in a better position to do quorum recovery. Even if all the
>   hashes are valid, you can't ensure consistency between different
>   copies.

No. As I described in the commit log, dog cannot detect corruption of
an inode object by itself. For example, if store_policy is corrupted, an
ordinary replicated vdi can be seen as a hypervolume. If nr_copies is
corrupted, the vdi check itself becomes bogus.

> - as you mentioned, performance is awful. only inode checking is basically
>   useless, users care about their data, not your internal metadata. If we add
>   it to data, performance is unacceptably bad.

The point is that we must prevent sheep from interpreting corrupted
inode objects. If an inode object is corrupted, vdi checking cannot be
done, as I described above.

> - current backend store (plain store + MD) is far more complex than it appears
>   I have recently spent days to fix a tricky data loss bug. So basically I don't
>   want to add more tricky code in it without compelling reason(s). Right now,
>   we should keep backend store as 100% reliable before we can do anything useful

Which bug? I want to see the patch for it.

BTW, on second thought, I think the current patch is okay for
detecting corruption of inode objects. Voluntary recovery by the sheep
process (which is mentioned as future work) wouldn't be
required. Simply emitting an emergency message and exiting is enough.
Because the corruption happens rarely, administrators should remove
the corrupted inodes manually when it happens. What do you think?

Thanks,
Hitoshi