[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption

Fri Jan 31 02:59:57 CET 2014

At Thu, 30 Jan 2014 14:08:52 +0800,
Liu Yuan wrote:
> 
> On Thu, Jan 30, 2014 at 01:57:21PM +0800, Liu Yuan wrote:
> > On Thu, Jan 30, 2014 at 02:45:09PM +0900, Hitoshi Mitake wrote:
> > > At Thu, 30 Jan 2014 12:48:18 +0800,
> > > Liu Yuan wrote:
> > > > 
> > > > On Thu, Jan 30, 2014 at 10:07:37AM +0800, Liu Yuan wrote:
> > > > > On Thu, Jan 30, 2014 at 10:53:39AM +0900, Hitoshi Mitake wrote:
> > > > > > At Thu, 30 Jan 2014 02:21:54 +0800,
> > > > > > Liu Yuan wrote:
> > > > > > > 
> > > > > > > On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > > > > > > > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > > > > > > 
> > > > > > > > Sheepdog currently cannot handle corruption of inode objects. For
> > > > > > > > example, if members of sd_inode such as name or nr_copies are broken
> > > > > > > > by silent data corruption on disk, even initialization of sheep
> > > > > > > > processes fails, because sheep and dog themselves interpret the
> > > > > > > > content of inode objects.
> > > > > > > 
> > > > > > > Is there any reference that confirms so-called 'silent data corruption'? Modern
> > > > > > > disks have a built-in error correction code (RS) for each sector, so as far as I
> > > > > > > know a disk returns either EIO or the full data. I've never seen real 'silent data
> > > > > > > corruption' in person. I know many people suspect that it happens, but I think we
> > > > > > > need real proof of it, because most of the time it is a false positive.
> > > > > > 
> > > > > > This paper is a major source on "silent data corruption":
> > > > > > https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
> > > > > > 
> > > > > > Of course such corruption happens rarely, but it can happen, so we
> > > > > > should handle it. That is why we have the majority voting mechanism
> > > > > > of "dog vdi check", no?
> > > > > > 
> > > > > > > 
> > > > > > > > To detect such corruption of inode objects, this patch adds a
> > > > > > > > new cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > > > > > > > passed as an option of cluster format (dog cluster format -i), sheep
> > > > > > > > processes that belong to the cluster do the following:
> > > > > > > > 
> > > > > > > > - when the sheep updates an inode object, it stores the sha1 value of
> > > > > > > >   the object in an xattr (default_write())
> > > > > > > > - when the sheep reads an inode object, it calculates the sha1 value of
> > > > > > > >   the inode object and compares the calculated value with the
> > > > > > > >   stored one. If these values differ, the read fails with an error
> > > > > > > >   (default_read()).
> > > > > > > > 
> > > > > > > > This checking mechanism prevents interpretation of corrupted inode
> > > > > > > > objects by sheep.
> > > > > > > 
> > > > > > > I don't think we should implement this check in the sheep. It's better to do
> > > > > > > it in dog as a 'check' plugin because
> > > > > > > 
> > > > > > > - no need to introduce an incompatible physical layout (extra xattr)
> > > > > > 
> > > > > > This patch doesn't produce incompatibility. The xattr used is "user.obj.sha1".
> > > > > > 
> > > > 
> > > > How do you handle the case where the sector that holds the xattr value is
> > > > silently corrupted? Can this be a false positive, and how do you handle it?
> > > 
> > > Of course such a case is treated as an error. And it is recovered by
> > > "dog vdi check".
> > > 
> > 
> > The more data you store, the more chances you have of suffering from data
> > corruption. Your approach adds an extra burden on dog and sheep: making sure the
> > 'checksum' itself is valid. You can't rely on a simple hash at all. All we have
> > to do is use replication to ensure data reliability, which dog can make use of,
> > not sheep.
> 
> Many advanced file systems like zfs and btrfs have built-in checksum functionality,
> so the problem you are trying to solve is better solved at the file system
> level, which has more knowledge to handle it well.

If inodes of sheep are broken, the checksum functionality of file
systems cannot work. For example, assume data_vdi_id is corrupted.

And we must not make any assumptions about users' file systems. In
addition, what do you think about the cases of the RESTful interface and NFS?
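
For reference, the check described in the patch can be sketched roughly as
below. This is only a minimal illustration in Python, not the patch code: the
ObjectStore class, its method names and the in-memory dicts standing in for
object files and their xattrs are assumptions made for the example. Only the
xattr name "user.obj.sha1" and the write/read behaviour come from the
description above.

```python
import hashlib

XATTR_NAME = "user.obj.sha1"  # xattr name mentioned in this thread


class InodeCorruptionError(Exception):
    """Raised when the stored and the calculated sha1 values differ."""


class ObjectStore:
    """Hypothetical in-memory stand-in for object files and their xattrs."""

    def __init__(self):
        self.data = {}    # oid -> object content
        self.xattrs = {}  # (oid, xattr name) -> xattr value

    def write_inode(self, oid, payload):
        # Corresponds to the default_write() step: store the object and
        # record the sha1 value of its content next to it.
        self.data[oid] = bytes(payload)
        self.xattrs[(oid, XATTR_NAME)] = hashlib.sha1(payload).digest()

    def read_inode(self, oid):
        # Corresponds to the default_read() step: recalculate the sha1
        # value and compare it with the stored one before returning data.
        payload = self.data[oid]
        stored = self.xattrs.get((oid, XATTR_NAME))
        if stored != hashlib.sha1(payload).digest():
            raise InodeCorruptionError("sha1 mismatch for object %s" % oid)
        return payload


def read_is_rejected(store, oid):
    """Return True if reading oid fails the hash check."""
    try:
        store.read_inode(oid)
    except InodeCorruptionError:
        return True
    return False
```

In this sketch a corrupted object and a corrupted stored hash are
indistinguishable: both make the read fail. That matches the answer above
that such a case is treated as an error and left to "dog vdi check" for
recovery.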

Thanks,
Hitoshi