[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption

Thu Jan 30 06:57:21 CET 2014

On Thu, Jan 30, 2014 at 02:45:09PM +0900, Hitoshi Mitake wrote:
> At Thu, 30 Jan 2014 12:48:18 +0800,
> Liu Yuan wrote:
> > 
> > On Thu, Jan 30, 2014 at 10:07:37AM +0800, Liu Yuan wrote:
> > > On Thu, Jan 30, 2014 at 10:53:39AM +0900, Hitoshi Mitake wrote:
> > > > At Thu, 30 Jan 2014 02:21:54 +0800,
> > > > Liu Yuan wrote:
> > > > > 
> > > > > On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > > > > > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > > > > 
> > > > > > > Current sheepdog cannot handle corruption of inode objects. For
> > > > > > > example, if members like name or nr_copies of sd_inode are broken by
> > > > > > > silent data corruption of disks, even initialization of sheep
> > > > > > > processes fails, because sheep and dog themselves interpret the
> > > > > > > content of inode objects.
> > > > > 
> > > > > > Is there any source that confirms the so-called 'silent data corruption'?
> > > > > > Modern disks have a built-in error correction code (RS) for each sector, so
> > > > > > as far as I know a disk returns either EIO or the full data. I've never seen
> > > > > > a real 'silent data corruption' in person yet. I know many people suspect it
> > > > > > happens, but I think we need real proof of it because most of the time it is
> > > > > > a false positive.
> > > > 
> > > > > This paper is a major source on "silent data corruption":
> > > > https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
> > > > 
> > > > Of course the corruption happens rarely. But it can happen so we
> > > > should handle it. So we have the majority voting mechanism of "dog vdi
> > > > check", no?
> > > > 
> > > > > 
> > > > > > > To detect such corruption of inode objects, this patch adds a new
> > > > > > > cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > > > > > > passed as an option of cluster format (dog cluster format -i), sheep
> > > > > > > processes belonging to the cluster do the following:
> > > > > > 
> > > > > > > - when a sheep updates an inode object, it stores the sha1 value of
> > > > > > >   the object in an xattr (default_write())
> > > > > > > - when a sheep reads an inode object, it calculates the sha1 value of
> > > > > > >   the inode object and compares the calculated value with the
> > > > > > >   stored one. If these values differ, the read fails with an error
> > > > > > >   (default_read()).
> > > > > > 
> > > > > > This checking mechanism prevents interpretation of corrupted inode
> > > > > > objects by sheep.
> > > > > 
> > > > > I don't think we should implement this check in the sheep. It's better to do
> > > > > it in dog as a 'check' plugin because
> > > > > 
> > > > > > - no need to introduce an incompatible physical layout (extra xattr)
> > > > 
> > > > > This patch doesn't introduce incompatibility. The xattr used is "user.obj.sha1".
> > > > 
> > 
> > > How do you handle the case where the sector that holds the xattr value is
> > > corrupted silently? Can this be a false positive, and how do you handle it?
> 
> Of course such a case is treated as an error. And it is recovered by
> "dog vdi check".
> 

The more data you store, the more chances you have of suffering from data
corruption problems. Your approach adds an extra burden on dog and sheep: making
sure the 'checksum' itself is valid. You can't rely on a simple hash at all; all
we have to do is use replication to ensure data reliability, which dog can make
use of, not sheep.
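
To make the two cases in this thread concrete, here is a minimal Python sketch
of the scheme under discussion. It is not sheepdog code: FakeStore, its two
dicts and InodeHashMismatch are hypothetical names, and a plain dict stands in
for the real xattr. Only the xattr name "user.obj.sha1" and the
write-then-verify idea come from the patch description quoted above.

```python
import hashlib

XATTR_NAME = "user.obj.sha1"  # xattr name mentioned in the thread


class InodeHashMismatch(Exception):
    """Raised when stored and recomputed sha1 values differ (hypothetical)."""


class FakeStore:
    """In-memory stand-in for an object store; dicts replace disk and xattr."""

    def __init__(self):
        self.data = {}   # oid -> object bytes
        self.xattr = {}  # (oid, xattr name) -> hex digest

    def write(self, oid, payload):
        # On update: store the object and its sha1 side by side.
        self.data[oid] = bytes(payload)
        self.xattr[(oid, XATTR_NAME)] = hashlib.sha1(payload).hexdigest()

    def read(self, oid):
        # On read: recompute sha1 and compare with the stored value.
        payload = self.data[oid]
        stored = self.xattr[(oid, XATTR_NAME)]
        if hashlib.sha1(payload).hexdigest() != stored:
            raise InodeHashMismatch(oid)
        return payload


store = FakeStore()
store.write("inode-1", b"name=vdi0,nr_copies=3")
assert store.read("inode-1") == b"name=vdi0,nr_copies=3"

# Case 1: the object itself is silently corrupted -> the read is rejected.
store.data["inode-1"] = b"name=vdi0,nr_copies=9"

# Case 2: the object is intact but the stored hash is corrupted. The read is
# rejected as well; the check cannot tell which side went bad.
store.write("inode-2", b"name=vdi1,nr_copies=3")
store.xattr[("inode-2", XATTR_NAME)] = "0" * 40
```

In this sketch both read("inode-1") and read("inode-2") raise
InodeHashMismatch, even though the data of inode-2 is still good. Deciding
which copy to trust would still need the replicas.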

Thanks
Yuan