[sheepdog] [PATCH 1/2] collie: optimize 'collie vdi check' command

Tue Aug 21 09:09:58 CEST 2012

On 08/16/2012 10:38 PM, Yunkai Zhang wrote:
> Reading all of a vdi's objects from the cluster when checking them wastes a lot
> of network bandwidth. Instead, let's calculate the checksum of each object in the
> backend and send only the checksum result to the collie client.
> 
> And I think repairing objects automatically is dangerous, as we don't know which
> replica is correct. To give the user a chance to check them if
> necessary, I add a new option: '-F (--force_repair)'. By default, this command
> only checks and does not repair (as the command name implies).
> 
> After adding the '-F' flag, the help looks like:
> $ collie vdi check
> Usage: collie vdi check [-F] [-s snapshot] [-a address] [-p port] [-h] <vdiname>
> Options:
>   -F, --force_repair      force repair object's copies (dangerous)
>   -s, --snapshot          specify a snapshot id or tag name
>   -a, --address           specify the daemon address (default: localhost)
>   -p, --port              specify the daemon port
>   -h, --help              display this help and exit
> 
> Here are some examples of running this command:
> * Success:
> $ collie vdi check test.img
> CHECKING VDI:test.img ...
> PASSED
> 
> * Failure (no repair by default):
> $ collie vdi check test.img
> CHECKING VDI:test.img ...
> Failed oid: 9c5e6800000001
> >> copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
> >> copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
> >> copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
> FAILED
> 
> With the output shown above, the user can check all copies of this object and
> decide which one is correct (I plan to add a new option, '--oid', to 'collie vdi read'
> in another patch, so that the user can specify which copy of the object to export:
>   $ collie vdi read test.img --oid 9c5e6800000001@127.0.0.1:7001 > foo.img
> By testing foo.img, we can tell which copy is correct).
> 
> The user can force a repair by specifying the -F or --force_repair flag:
> * Force repair:
> $ collie vdi check -F test.img
> CHECKING VDI:test.img ...
> Failed oid: 9c5e6800000001
> >> copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
> >> copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
> >> copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
> >> force repairing ...
> >> copy this object from 127.0.0.1:7000 => 127.0.0.1:7001
> >> copy this object from 127.0.0.1:7000 => 127.0.0.1:7004
> >> repair finished
> FORCE REPAIRED
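
To make the check described in the quoted patch concrete, here is a rough
sketch in Python. It is not the patch's code: the function name check_object
and the dict-of-bytes input are hypothetical, and the digests are computed
locally here, whereas the patch computes them in the backend and sends only
the result to collie. The sketch only shows the comparison step: hash every
copy of an object with SHA-1 and report whether all copies agree.

```python
import hashlib


def check_object(replicas):
    """Compare the copies of one object by their SHA-1 digests.

    `replicas` is a hypothetical input: a dict mapping a node address
    (e.g. "127.0.0.1:7000") to the raw bytes of that node's copy.
    Returns (consistent, digests), where `digests` maps each node
    address to the hex SHA-1 digest of its copy.
    """
    digests = {
        node: hashlib.sha1(data).hexdigest()
        for node, data in replicas.items()
    }
    # The object passes only if every copy has the same digest.
    consistent = len(set(digests.values())) <= 1
    return consistent, digests


if __name__ == "__main__":
    # Made-up example data: the copy on port 7001 differs from the others.
    copies = {
        "127.0.0.1:7000": b"object data",
        "127.0.0.1:7001": b"object dat\x00",
        "127.0.0.1:7004": b"object data",
    }
    ok, digests = check_object(copies)
    for i, (node, sha1) in enumerate(digests.items()):
        print(f">> copy[{i}], sha1: {sha1}, from: {node}")
    print("PASSED" if ok else "FAILED")
```

Note that the sketch deliberately does not pick a "correct" copy. A mismatch
only tells us that the copies differ, not which one is right.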

I think we need to forward this to the sheepdog users mailing list. I am
not sure if this new check & repair process is really acceptable or
useful to end users.

It looks a bit complex to me. I don't know how many copies will be
damaged after a crash. This new check & repair method is useful only
when a small fraction of the objects in an image is corrupted.

Thanks,
Yuan