[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption

Thu Jan 30 07:01:22 CET 2014

On Thu, Jan 30, 2014 at 02:50:21PM +0900, Hitoshi Mitake wrote:
> At Thu, 30 Jan 2014 10:07:37 +0800,
> Liu Yuan wrote:
> > 
> > On Thu, Jan 30, 2014 at 10:53:39AM +0900, Hitoshi Mitake wrote:
> > > At Thu, 30 Jan 2014 02:21:54 +0800,
> > > Liu Yuan wrote:
> > > > 
> > > > On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > > > > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > > > 
> > > > > Current sheepdog cannot handle corruption of inode objects. For
> > > > > example, if members like name or nr_copies of sd_inode are broken by
> > > > > silent data corruption of disks, even initialization of sheep
> > > > > processes fails, because sheep and dog themselves interpret the
> > > > > content of inode objects.
> > > > 
> > > > Is there any resource that confirms so-called 'silent data corruption'? Modern
> > > > disks have built-in correction codes (RS) for each sector, so as far as I know
> > > > a read returns either EIO or the full data. I've never seen a real 'silent data
> > > > corruption' in person. I know many people suspect it happens, but I think we
> > > > need real proof of it, because most of the time it is a false positive.
> > > 
> > > This paper is a major source on "silent data corruption":
> > > https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
> > > 
> > > Of course, such corruption happens rarely, but it can happen, so we
> > > should handle it. That is why we have the majority voting mechanism of
> > > "dog vdi check", no?
> > > 
> > > > 
> > > > > To detect such corruption of inode objects, this patch adds a
> > > > > new cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > > > > passed as an option of cluster format (dog cluster format -i), sheep
> > > > > processes belonging to the cluster do the following:
> > > > > 
> > > > > - when the sheep updates an inode object, it stores the sha1 value of
> > > > >   the object in an xattr (default_write())
> > > > > - when the sheep reads an inode object, it calculates the sha1 value of
> > > > >   the inode object, then compares the calculated value with the
> > > > >   stored one. If these values differ, the read returns an error
> > > > >   (default_read()).
> > > > > 
> > > > > This checking mechanism prevents sheep from interpreting corrupted
> > > > > inode objects.
> > > > 
> > > > I don't think we should implement this check in the sheep. It's better to do
> > > > it in dog as a 'check' plugin because
> > > > 
> > > > - no need to introduce an incompatible physical layout (extra xattr)
> > > 
> > > This patch doesn't produce incompatibility. The used xattr is "user.obj.sha1".
> > > 
> > > In addition, the check is done on a cluster which is formatted with
> > > the new -i option. It doesn't affect existing clusters.
> > 
> > > > - dog is in a better position to do quorum recovery. Even if all the hashes
> > > >   are valid, you can't be sure of the consistency between different
> > > >   copies.
> > > 
> > > No. As I described in the commit log, dog cannot detect corruption of
> > > an inode object by itself. E.g. if store_policy is corrupted, an ordinary
> > > replicated vdi can be seen as a hypervolume. If nr_copies is
> > > corrupted, the vdi checking itself becomes bogus.
> > 
> > Why not? Our check first gets the hash of each inode, so any data corruption
> > in some inode(s) will be detected and possibly fixed if it is in the minority, no?
> 
> Because "dog vdi check" depends on an inode object. Look at line 1955
> (master branch) of dog/vdi.c. If the inode object read by
> read_vdi_obj() is corrupted, the check fails. Assume the member
> nr_copies is broken and has 0 as its value. In such a case, the
> majority voting cannot be done, no?
> 

So can't you write code in dog to handle this case, like the following:

  read all the replicas and check whether they are consistent.

The above is the only way you can rely on. Suppose the following case:

inode A is corrupted, and nr_copies = 1, and its hash also happens to be
corrupted, and by coincidence, with this double corruption, the hash value is
correct for the corrupted inode. So any inode reader will read this poor
inode A and assume it is valid. How do you handle this?
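
For illustration only (this is not part of the patch or of dog), here is a
minimal Python sketch of the difference between a per-replica hash check and
a comparison across replicas. The function names and the byte strings standing
in for inode objects are hypothetical, and the replica list is assumed to be
supplied by the caller rather than derived from nr_copies inside the inode.

```python
import hashlib
from collections import Counter


def sha1_hex(data):
    """Return the SHA-1 digest of a byte string as hex."""
    return hashlib.sha1(data).hexdigest()


def verify_stored_hash(data, stored_digest):
    """Per-replica check: recompute SHA-1 and compare with the stored digest."""
    return sha1_hex(data) == stored_digest


def majority_vote(replicas):
    """Cross-replica check (hypothetical helper).

    Returns (winning_content, indices_of_disagreeing_replicas).
    If no content is held by a strict majority, returns (None, all indices).
    """
    if not replicas:
        return None, []
    digests = [sha1_hex(r) for r in replicas]
    digest, votes = Counter(digests).most_common(1)[0]
    if votes * 2 <= len(replicas):
        return None, list(range(len(replicas)))
    winner = replicas[digests.index(digest)]
    bad = [i for i, d in enumerate(digests) if d != digest]
    return winner, bad


# Stand-ins for inode contents; the real sd_inode layout is not modeled here.
good = b"name=vm0;nr_copies=3"
corrupt = b"name=vm0;nr_copies=1"

# Worst case from the scenario above: the stored digest is wrong too and
# happens to match the corrupted content, so the local check passes.
corrupt_digest = sha1_hex(corrupt)
passes_local_check = verify_stored_hash(corrupt, corrupt_digest)

# Comparing all replicas still singles out the bad copy as the minority.
winner, bad = majority_vote([good, corrupt, good])
print(passes_local_check, winner == good, bad)
```

In this sketch the corrupted replica passes its own hash check, yet the vote
across all replicas marks it as the minority. With only two replicas that
disagree, or a single replica, the vote cannot decide.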

Thanks
Yuan