[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption

Hitoshi Mitake mitake.hitoshi at gmail.com
Fri Jan 31 02:59:57 CET 2014


At Thu, 30 Jan 2014 14:08:52 +0800,
Liu Yuan wrote:
> 
> On Thu, Jan 30, 2014 at 01:57:21PM +0800, Liu Yuan wrote:
> > On Thu, Jan 30, 2014 at 02:45:09PM +0900, Hitoshi Mitake wrote:
> > > At Thu, 30 Jan 2014 12:48:18 +0800,
> > > Liu Yuan wrote:
> > > > 
> > > > On Thu, Jan 30, 2014 at 10:07:37AM +0800, Liu Yuan wrote:
> > > > > On Thu, Jan 30, 2014 at 10:53:39AM +0900, Hitoshi Mitake wrote:
> > > > > > At Thu, 30 Jan 2014 02:21:54 +0800,
> > > > > > Liu Yuan wrote:
> > > > > > > 
> > > > > > > On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > > > > > > > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > > > > > > 
> > > > > > > > Currently, sheepdog cannot handle corruption of inode objects. For
> > > > > > > > example, if members like name or nr_copies of sd_inode are broken by
> > > > > > > > silent data corruption of disks, even initialization of sheep
> > > > > > > > processes fails, because sheep and dog themselves interpret the
> > > > > > > > content of inode objects.
> > > > > > > 
> > > > > > > any resource to confirm this so-called 'silent data corruption'? Modern disks have
> > > > > > > built-in error correction codes (Reed-Solomon) for each sector, so as far as I know
> > > > > > > either EIO or the full data is returned from the disk. I've never seen real silent
> > > > > > > data corruption in person. I know many people suspect it can happen, but I think
> > > > > > > we need real proof of it, because most of the time it is a false positive.
> > > > > > 
> > > > > > This paper is a major source of the "silent data corruption'":
> > > > > > https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
> > > > > > 
> > > > > > Of course such corruption happens rarely, but it can happen, so we
> > > > > > should handle it. That is why we have the majority-voting mechanism
> > > > > > of "dog vdi check", no?
> > > > > > 
> > > > > > > 
> > > > > > > > To detect such corruption of inode objects, this patch adds a
> > > > > > > > new cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > > > > > > > passed as an option of cluster format (dog cluster format -i), sheep
> > > > > > > > processes belonging to the cluster perform the following actions:
> > > > > > > > 
> > > > > > > > - when a sheep updates an inode object, it stores the sha1 value of
> > > > > > > >   the object in an xattr (default_write())
> > > > > > > > - when a sheep reads an inode object, it calculates the sha1 value of
> > > > > > > >   the inode object and compares the calculated value with the
> > > > > > > >   stored one. If the values differ, the read fails with an error
> > > > > > > >   (default_read()).
> > > > > > > > 
> > > > > > > > This checking mechanism prevents interpretation of corrupted inode
> > > > > > > > objects by sheep.
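The proposed write/read paths can be sketched like this (a hedged Python sketch with a plain dict standing in for the per-object xattr "user.obj.sha1"; the names and structure are illustrative, not sheepdog's actual code):

```python
import hashlib

xattrs = {}  # stand-in for the per-object xattr "user.obj.sha1"
store = {}   # stand-in for the on-disk object store

def write_inode(oid, data):
    """Write path: store the object and record its sha1 in the xattr."""
    store[oid] = data
    xattrs[oid] = hashlib.sha1(data).hexdigest()

def read_inode(oid):
    """Read path: recompute sha1 and compare with the stored value."""
    data = store[oid]
    if hashlib.sha1(data).hexdigest() != xattrs[oid]:
        # Mismatch means the object was modified outside the write path,
        # i.e. silent corruption: fail the read instead of interpreting it.
        raise IOError("inode object %s is corrupted" % oid)
    return data
```

Flipping a byte in the stored object after the write makes the next read fail, which is exactly the detection the flag is meant to enable.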
> > > > > > > 
> > > > > > > I don't think we should implement this check in the sheep. It's better to do
> > > > > > > it in dog as a 'check' plugin because
> > > > > > > 
> > > > > > > - no need to introduce an incompatible physical layout (extra xattr)
> > > > > > 
> > > > > > This patch doesn't introduce any incompatibility. The xattr used is "user.obj.sha1".
> > > > > > 
> > > > 
> > > > how do you handle the case where the sector that holds the xattr value is itself
> > > > silently corrupted? Can this produce a false positive, and how do you handle it?
> > > 
> > > Of course such a case is treated as an error, and it is recovered by
> > > "dog vdi check".
> > > 
> > 
> > The more data you store, the more chances you have of suffering data corruption
> > problems. Your approach adds an extra burden to dog and sheep to make sure the
> > checksum itself is valid. You can't rely on a simple hash at all; all we have to
> > do is use replication to ensure data reliability, which dog can make use of, not
> > sheep.
> 
> Many advanced file systems like zfs and btrfs have built-in checksum functionality,
> so the problem you are trying to solve is better solved at the file-system level,
> which has more knowledge to handle it well.

If sheep's inode objects are broken, the checksum functionality of file
systems cannot help; e.g. assume data_vdi_id is corrupted.

And we must not make any assumptions about users' file systems. In
addition, what do you think about the cases of the RESTful interface and NFS?

Thanks,
Hitoshi
