[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption

Hitoshi Mitake mitake.hitoshi at gmail.com
Fri Jan 31 02:55:17 CET 2014


At Thu, 30 Jan 2014 13:57:21 +0800,
Liu Yuan wrote:
> 
> On Thu, Jan 30, 2014 at 02:45:09PM +0900, Hitoshi Mitake wrote:
> > At Thu, 30 Jan 2014 12:48:18 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Thu, Jan 30, 2014 at 10:07:37AM +0800, Liu Yuan wrote:
> > > > On Thu, Jan 30, 2014 at 10:53:39AM +0900, Hitoshi Mitake wrote:
> > > > > At Thu, 30 Jan 2014 02:21:54 +0800,
> > > > > Liu Yuan wrote:
> > > > > > 
> > > > > > On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > > > > > > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > > > > > 
> > > > > > > Currently, sheepdog cannot handle corruption of inode objects. For
> > > > > > > example, if members of sd_inode such as name or nr_copies are
> > > > > > > broken by silent data corruption of a disk, even initialization of
> > > > > > > the sheep processes fails, because sheep and dog themselves
> > > > > > > interpret the content of inode objects.
> > > > > > 
> > > > > > Is there any resource confirming this so-called 'silent data
> > > > > > corruption'? Modern disks have a built-in error-correction code
> > > > > > (Reed-Solomon) for each sector, so as far as I know either EIO or the
> > > > > > full data is returned from the disk. I've never seen real 'silent data
> > > > > > corruption' in person. I know many people suspect it can happen, but I
> > > > > > think we need real proof of it, because most of the time it is a false
> > > > > > positive.
> > > > > 
> > > > > This paper is a major source on "silent data corruption":
> > > > > https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
> > > > > 
> > > > > Of course, such corruption happens rarely. But it can happen, so we
> > > > > should handle it. That is why we have the majority-voting mechanism
> > > > > of "dog vdi check", no?
> > > > > 
> > > > > > 
> > > > > > > For detecting such corruption of inode objects, this patch adds a
> > > > > > > new cluster flag, SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > > > > > > passed as an option of cluster format (dog cluster format -i), the
> > > > > > > sheep processes belonging to the cluster perform the actions below:
> > > > > > > 
> > > > > > > - when a sheep updates an inode object, it stores the sha1 value of
> > > > > > >   the object to an xattr (default_write())
> > > > > > > - when a sheep reads an inode object, it calculates the sha1 value
> > > > > > >   of the inode object and compares the calculated value with the
> > > > > > >   stored one. If the values differ, the read fails with an error
> > > > > > >   (default_read()).
> > > > > > > 
> > > > > > > This checking mechanism prevents sheep from interpreting corrupted
> > > > > > > inode objects.
> > > > > > 
> > > > > > I don't think we should implement this check in the sheep. It's better to do
> > > > > > it in dog as a 'check' plugin because
> > > > > > 
> > > > > > - no need to introduce an incompatible physical layout (extra xattr)
> > > > > 
> > > > > This patch doesn't introduce incompatibility. The xattr used is
> > > > > "user.obj.sha1".
> > > > > 
> > > 
> > > How do you handle it if the sector that holds the xattr value is
> > > silently corrupted? Can this be a false positive, and how do you handle
> > > that?
> > 
> > Of course such a case is treated as an error. And it is recovered by
> > "dog vdi check".
> > 
> 
> The more data you store, the more chances you have of suffering data
> corruption problems. Your approach adds an extra burden to dog and sheep
> to make sure the 'checksum' itself is valid. You can't rely on a simple
> hash at all; all we have to do is use replication to ensure data
> reliability, which dog can make use of, not sheep.

If dog could implement the feature, of course I'd implement it in
dog. But it seems to be impossible, because dog itself interprets the
data in inode objects.

Assume inode->nr_copies of a 3-replicated vdi is broken and set to
0. If we execute "dog vdi check" for the vdi, the dog process reads
the inode object from the sheep cluster (issuing SD_OP_READ_OBJ). When
a sheep process receives the request, it executes
gateway_read_request(). In that function, SD_OP_READ_PEER requests are
issued to the other sheep processes, and the number of these requests
is determined by the vdi state, which is created during the
initialization sequence of the sheep process. The sketch below shows
why a zeroed nr_copies breaks this path.
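
The following is a minimal compileable sketch of that failure;
struct vdi_state_sketch and gateway_read_sketch() are simplified
stand-ins I made up for this mail, not the real sheep source:

#include <inttypes.h>
#include <stdio.h>

/* Stand-in for the per-vdi state that sheep builds at initialization
 * time; nr_copies ultimately comes from the (possibly corrupted)
 * inode object. */
struct vdi_state_sketch {
	uint32_t vid;
	int nr_copies;
};

static int gateway_read_sketch(const struct vdi_state_sketch *vs)
{
	/* The fan-out of SD_OP_READ_PEER requests is taken from
	 * nr_copies.  With nr_copies silently zeroed, no peer is ever
	 * asked, so there is nothing for majority voting to work on. */
	for (int i = 0; i < vs->nr_copies; i++)
		printf("issue SD_OP_READ_PEER for copy %d of vid %" PRIx32 "\n",
		       i, vs->vid);

	return vs->nr_copies > 0 ? 0 : -1;	/* an SD_RES_* error in reality */
}

int main(void)
{
	struct vdi_state_sketch broken = { .vid = 0, .nr_copies = 0 };

	return gateway_read_sketch(&broken) ? 1 : 0;
}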

The root cause of this problem is that the dog process cannot know a
vdi's replication policy (the number of copies, erasure coded or not,
hypervolume or not) until it reads at least one correct (uncorrupted)
inode object. And dog cannot determine whether an inode object is
corrupted until it knows the correct replication policy.
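
A compileable caricature of that circularity follows; again, every
name below is illustrative, not taken from the real dog source:

#include <stdbool.h>
#include <stdint.h>

struct policy_sketch {
	uint8_t nr_copies;	/* plain replication */
	bool erasure_coded;	/* or erasure coding */
};

/* The replication policy can only be parsed out of the inode object
 * itself... */
static struct policy_sketch policy_from_inode(const uint8_t *inode_buf)
{
	return (struct policy_sketch){
		.nr_copies = inode_buf[0],
		.erasure_coded = inode_buf[1] != 0,
	};
}

/* ...but deciding whether that inode is trustworthy (say, by majority
 * voting) needs the policy first, to know how many replicas to fetch
 * and compare.  A corrupted inode alone gives no way to bootstrap the
 * decision. */
static bool inode_trustworthy(const uint8_t *inode_buf)
{
	struct policy_sketch p = policy_from_inode(inode_buf);	/* circular */

	return p.nr_copies > 0;
}

int main(void)
{
	const uint8_t corrupted[2] = { 0, 0 };	/* nr_copies silently zeroed */

	return inode_trustworthy(corrupted) ? 0 : 1;
}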

If a sheep cluster didn't allow VDIs with different replication
policies to coexist, things would be simple. But a sheep cluster does
allow it, so I think this hash-checking mechanism is required. Of
course, I don't claim the approach is neat. If you can provide a
better alternative, I'd be glad to use it.
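
For reference, the proposed check boils down to the two helpers
below. "user.obj.sha1" is the xattr key the patch actually uses; the
function names and error handling are simplified for this mail (Linux
xattr syscalls plus OpenSSL's SHA1()):

#include <openssl/sha.h>
#include <string.h>
#include <sys/xattr.h>

/* Roughly what default_write() does in the patch: remember the hash
 * of the inode object right next to the object itself. */
static int store_inode_sha1(int fd, const void *buf, size_t len)
{
	unsigned char digest[SHA_DIGEST_LENGTH];

	SHA1(buf, len, digest);
	return fsetxattr(fd, "user.obj.sha1", digest, sizeof(digest), 0);
}

/* Roughly what default_read() does: recompute and compare.  A
 * mismatch (including a silently corrupted xattr, as you pointed out)
 * fails the read, so sheep never interprets garbage such as
 * nr_copies == 0, and "dog vdi check" can repair the object from the
 * remaining replicas. */
static int verify_inode_sha1(int fd, const void *buf, size_t len)
{
	unsigned char stored[SHA_DIGEST_LENGTH], now[SHA_DIGEST_LENGTH];

	if (fgetxattr(fd, "user.obj.sha1", stored, sizeof(stored)) < 0)
		return -1;
	SHA1(buf, len, now);
	return memcmp(stored, now, sizeof(now)) == 0 ? 0 : -1;
}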

Thanks,
Hitoshi


