[sheepdog] [PATCH RFC] add a new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK for checking inode object corruption
Hitoshi Mitake
mitake.hitoshi at gmail.com
Thu Jan 30 06:50:21 CET 2014
At Thu, 30 Jan 2014 10:07:37 +0800,
Liu Yuan wrote:
>
> On Thu, Jan 30, 2014 at 10:53:39AM +0900, Hitoshi Mitake wrote:
> > At Thu, 30 Jan 2014 02:21:54 +0800,
> > Liu Yuan wrote:
> > >
> > > On Thu, Jan 30, 2014 at 12:20:35AM +0900, Hitoshi Mitake wrote:
> > > > From: Hitoshi Mitake <mitake.hitoshi at lab.ntt.co.jp>
> > > >
> > > > > Current sheepdog cannot handle corruption of inode objects. For
> > > > > example, if members like name or nr_copies of sd_inode are broken by
> > > > > silent data corruption of disks, even initialization of sheep
> > > > > processes fails, because sheep and dog themselves interpret the
> > > > > content of inode objects.
> > >
> > > > Is there any source that confirms so-called 'silent data corruption'? Modern disks
> > > > have built-in correction codes (RS) for each sector, so either EIO or the full data
> > > > is returned from disks as far as I know. I've never seen a real 'silent data corruption'
> > > > in person. I know many people suspect it can happen, but I think we need
> > > > real proof of it because most of the time it is a false positive.
> >
> > > This paper is a major source on "silent data corruption":
> > https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
> >
> > Of course the corruption happens rarely. But it can happen so we
> > should handle it. So we have the majority voting mechanism of "dog vdi
> > check", no?
> >
> > >
> > > > For detecting such a corruption of inode objects, this patch adds a
> > > > new flag of cluster SD_CLUSTER_FLAG_INODE_HASH_CHECK. If the flag is
> > > > > passed as an option of cluster format (dog cluster format -i), sheep
> > > > > processes belonging to the cluster do the following:
> > > >
> > > > > - when the sheep updates an inode object, it stores the sha1 value of
> > > > > the object in an xattr (default_write())
> > > > > - when the sheep reads an inode object, it calculates the sha1 value of
> > > > > the inode object. Then it compares the calculated value with the
> > > > > stored one. If these values differ, the read returns an error
> > > > > (default_read()).
> > > >
> > > > This checking mechanism prevents interpretation of corrupted inode
> > > > objects by sheep.
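
To make the write/read behaviour described above concrete, here is a minimal sketch in Python. It is not the patch itself: FakeStore, write_inode() and read_inode() are hypothetical names, and an in-memory dict stands in for the real object file and its xattr. Only the xattr name "user.obj.sha1" is taken from this thread.

```python
import hashlib

XATTR_NAME = "user.obj.sha1"  # xattr name mentioned in this thread


class CorruptedInodeError(Exception):
    """Raised when the stored and recomputed digests differ."""


class FakeStore:
    """Hypothetical in-memory stand-in for object files and their xattrs."""

    def __init__(self):
        self.data = {}    # oid -> object bytes
        self.xattrs = {}  # (oid, xattr name) -> digest bytes


def write_inode(store, oid, payload):
    # On every update, store the object together with its sha1 digest.
    store.data[oid] = bytes(payload)
    store.xattrs[(oid, XATTR_NAME)] = hashlib.sha1(payload).digest()


def read_inode(store, oid):
    # On every read, recompute the digest and compare it with the stored
    # one before anything interprets the object's content.
    payload = store.data[oid]
    stored = store.xattrs[(oid, XATTR_NAME)]
    if hashlib.sha1(payload).digest() != stored:
        raise CorruptedInodeError("sha1 mismatch for object %r" % (oid,))
    return payload
```

The point of the sketch is only the ordering: the comparison happens before the caller ever sees the payload, so a corrupted object is rejected instead of being parsed.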
> > >
> > > I don't think we should implement this check in the sheep. It's better to do
> > > it in dog as a 'check' plugin because
> > >
> > > > - no need to introduce an incompatible physical layout (extra xattr)
> >
> > This patch doesn't produce incompatibility. The used xattr is "user.obj.sha1".
> >
> > In addition, the check is done on a cluster which is formatted with
> > the new -i option. It doesn't affect existing clusters.
>
> > > > - dog is in a better position to do quorum recovery. Even if all the hashes are
> > > > valid, you can't be sure of the consistency between different
> > > > copies.
> >
> > > No. As I described in the commit log, dog cannot detect corruption of
> > > an inode object by itself. E.g. if store_policy is corrupted, an ordinary
> > > replicated vdi can be seen as a hypervolume. If nr_copies is
> > > corrupted, the vdi check itself becomes bogus.
>
> > Why not? Our check first gets the hash of each inode, so any data corruption
> > in some inode(s) will be detected and possibly fixed if it is in the minority, no?
Because "dog vdi check" depends on an inode object. Look at line 1955
(master branch) of dog/vdi.c. If the inode object read by the
read_vdi_obj() is corrupted, the checking fails. Assume the member
nr_copies is broken and has 0 as its value. In such a case, the
majority voting cannot be done, no?
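
As a rough illustration of that dependency (a minimal sketch, not the code in dog/vdi.c; majority_hash() and its arguments are hypothetical), consider a voting routine that takes the number of replicas from the inode:

```python
import hashlib
from collections import Counter


def majority_hash(replicas, nr_copies):
    """Return the sha1 hex digest held by a strict majority of the first
    nr_copies replicas, or None if no such majority exists.

    nr_copies is assumed to come from the inode object, so a corrupted
    value changes which replicas are considered at all.
    """
    considered = replicas[:nr_copies]
    if not considered:
        # nr_copies == 0: there is nothing to vote on.
        return None
    counts = Counter(hashlib.sha1(r).hexdigest() for r in considered)
    digest, votes = counts.most_common(1)[0]
    return digest if votes * 2 > len(considered) else None
```

With three replicas of which two agree, majority_hash(replicas, 3) returns the digest of the agreeing pair. With the same replicas and nr_copies read as 0, it returns None, even though a healthy majority exists on disk.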
>
> > > - as you mentioned, performance is awful. only inode checking is basically
> > > useless, users care about their data, not your internal metadata. If we add
> > > it to data, performance is unacceptably bad.
> >
> > The point is that we have to prevent sheep from interpreting corrupted
> > inode objects. If an inode object is corrupted, vdi checking cannot be
> > done, as I described above.
>
> I don't see any reason dog can't check a corrupted
> inode.
I described the reason above.
Thanks,
Hitoshi