FUJITA Tomonori <fujita.tomonori at lab.ntt.co.jp> writes:

> We need the data of a single write system call to be applied to a
> super object in the "all or nothing" way (I assume BTRFS does but it
> does not?).

Ah right, okay, I understand. I wasn't aware that btrfs gave that
guarantee without BTRFS_IOC_TRANS_START, but looking at the btrfs code,
I think you're absolutely right that it does. The copy-on-write
implementation of metadata and data makes it quite hard for it to do
otherwise, I imagine. Nice property, anyway.

[For the benefit of the list archives, the other 'special feature' of
the filesystem we're using at the moment is setting the
user.sheepdog.copies xattr on the object files, which is why the fs
needs to have extended attributes enabled.]

> I guess that it takes some time until BTRFS matures so we've been thinking
> about other options. One is using Berkeley DB for a super object.

A random off-the-wall suggestion: I wonder if it would be possible to
use a filesystem directory tree to store the catalogue information
instead of a single large database file or the current large block file.

rename(2), link(2) and even symlink(2) are atomic on all POSIX
filesystems, and are presumably optimised to be reasonably fast (?). If,
instead of overwriting part of a large file, we wrote a new tiny file
and then used rename(2) to move it over the top of the original tiny
file, we would get atomic behaviour for free pretty much everywhere.

I'm not sure how well the sheepdog catalogue fits into such a scheme
though, or whether this would perform better or worse than the current
approach.

Best wishes,

Chris.
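
P.S. To make the write-then-rename idea concrete, here is a minimal
sketch in Python. It is not sheepdog code: the function name
atomic_write and the "one small file per catalogue entry" layout are
assumptions made up for illustration, and it assumes a POSIX system
where a directory can be opened and fsync()ed.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes) in one atomic step.

    Hypothetical helper for illustration only, not part of sheepdog.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # The temporary file has to live in the same directory, and hence on
    # the same filesystem, as the target; rename(2) does not work across
    # filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Push the new contents to disk before they become visible
            # under the final name.
            os.fsync(tmp.fileno())
        # rename(2): readers see either the old file or the new one,
        # never a partially written mixture.
        os.rename(tmp_path, path)
    except BaseException:
        # The rename did not happen, so remove the leftover temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise

    # fsync the directory so that the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling atomic_write("catalogue/entry-0001", b"...") twice in a row
leaves only the second contents behind and no temporary files. Note that
mkstemp creates the file with mode 0600, so a real implementation would
probably want to set permissions (and any xattrs) on the temporary file
before the rename.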