[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption
Liu Yuan
namei.unix at gmail.com
Wed Jan 29 19:21:54 CET 2014
On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
>
> Current sheepdog cannot handle corruption of inode objects. For
> example, if members of sd_inode such as name or nr_copies are broken
> by silent data corruption on disk, even initialization of sheep
> processes fails, because sheep and dog themselves interpret the
> content of inode objects.
Is there any resource to confirm this so-called 'silent data corruption'? Modern
disks have built-in error correction codes (Reed-Solomon) for each sector, so as
far as I know a read returns either EIO or the full data. I've never seen real
'silent data corruption' in person. I know many people suspect it can happen, but
I think we need real proof of it, because most of the time it is a false positive.
> To detect such corruption of inode objects, this patch adds a new
> cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> passed as an option of cluster format (dog cluster format -i), sheep
> processes belonging to the cluster do the following:
>
> - when the sheep updates an inode object, it stores the sha1 value of
> the object in an xattr (default_write())
> - when the sheep reads an inode object, it calculates the sha1 value
> of the inode object and compares the calculated value with the
> stored one. If the values differ, the read fails with an error
> (default_read()).
>
> This checking mechanism prevents sheep from interpreting corrupted
> inode objects.
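(For reference, the write/read paths described above would look roughly like the
sketch below. This is only an illustration, assuming Linux xattrs and OpenSSL's
SHA1(); the xattr name and helper names are mine, not from the patch.)

/*
 * Illustrative sketch only -- not the actual patch. Assumes Linux xattrs
 * and OpenSSL's SHA1(); the xattr name "user.inode_sha1" and the helper
 * names are made up for this example.
 */
#include <sys/xattr.h>
#include <openssl/sha.h>
#include <string.h>
#include <errno.h>

#define INODE_HASH_XATTR "user.inode_sha1"

/* Write path: store the SHA-1 of the inode object alongside it. */
static int store_inode_hash(int fd, const void *buf, size_t len)
{
	unsigned char sha1[SHA_DIGEST_LENGTH];

	SHA1(buf, len, sha1);
	if (fsetxattr(fd, INODE_HASH_XATTR, sha1, sizeof(sha1), 0) < 0)
		return -errno;
	return 0;
}

/* Read path: recompute the SHA-1 and compare with the stored value. */
static int verify_inode_hash(int fd, const void *buf, size_t len)
{
	unsigned char stored[SHA_DIGEST_LENGTH], now[SHA_DIGEST_LENGTH];
	ssize_t ret;

	ret = fgetxattr(fd, INODE_HASH_XATTR, stored, sizeof(stored));
	if (ret != sizeof(stored))
		return -EIO;	/* missing or truncated hash */

	SHA1(buf, len, now);
	if (memcmp(stored, now, sizeof(now)) != 0)
		return -EIO;	/* hash mismatch: inode object is corrupted */
	return 0;
}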
I don't think we should implement this check in the sheep. It's better to do
it in dog as a 'check' plugin, because:
- there is no need to introduce an incompatible physical layout (extra xattr)
- dog is in a better position to do quorum-based recovery. Even if all the
per-copy hashes are valid, you can't be sure the different copies are
consistent with each other (a rough sketch of that idea follows below).
- as you mentioned, the performance hit is awful. Checking only inodes is
basically useless; users care about their data, not our internal metadata. If
we extend the check to data objects, performance becomes unacceptably bad.
- the current backend store (plain store + MD) is far more complex than it
appears. I recently spent days fixing a tricky data loss bug, so I don't want
to add more tricky code to it without compelling reasons. Right now we should
keep the backend store 100% reliable before we do anything else.
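To illustrate what I mean by quorum: instead of trusting a per-copy hash, dog
can hash every replica of an inode object and only trust the value that a
majority of copies agrees on. A rough, standalone sketch (not dog's actual
plugin API; function and parameter names are illustrative):

#include <openssl/sha.h>
#include <string.h>

#define MAX_SKETCH_COPIES 16

/* Return the index of a copy that a strict majority agrees with, or -1. */
static int pick_majority_copy(const void *copies[], size_t len, int nr_copies)
{
	unsigned char sha1[MAX_SKETCH_COPIES][SHA_DIGEST_LENGTH];
	int i, j, votes;

	if (nr_copies > MAX_SKETCH_COPIES)
		return -1;

	for (i = 0; i < nr_copies; i++)
		SHA1(copies[i], len, sha1[i]);

	for (i = 0; i < nr_copies; i++) {
		votes = 0;
		for (j = 0; j < nr_copies; j++)
			if (memcmp(sha1[i], sha1[j], SHA_DIGEST_LENGTH) == 0)
				votes++;
		if (votes * 2 > nr_copies)
			return i;	/* majority agrees with copy i */
	}
	return -1;	/* no quorum: the copies have diverged */
}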
Thanks
Yuan