On 08/16/2012 10:38 PM, Yunkai Zhang wrote:
> Reading all of a vdi's objects from the cluster when checking them wastes a
> lot of network bandwidth. Let's calculate the checksum of the objects in the
> backend and send only the checksum result to the collie client.
>
> I also think repairing objects automatically is dangerous, as we don't know
> which replica is correct. To give the user a chance to check them if
> necessary, I add a new option: '-F (--force_repair)'. By default, this command
> only checks and does not repair (as the command name implies).
>
> After adding the '-F' flag, the help looks like:
> $ collie vdi check
> Usage: collie vdi check [-F] [-s snapshot] [-a address] [-p port] [-h] <vdiname>
> Options:
>   -F, --force_repair      force repair object's copies (dangerous)
>   -s, --snapshot          specify a snapshot id or tag name
>   -a, --address           specify the daemon address (default: localhost)
>   -p, --port              specify the daemon port
>   -h, --help              display this help and exit
>
> Here are some examples of running this command:
> * Success:
> $ collie vdi check test.img
> CHECKING VDI:test.img ...
> PASSED
>
> * Failure (by default no repair):
> $ collie vdi check test.img
> CHECKING VDI:test.img ...
> Failed oid: 9c5e6800000001
> >> copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
> >> copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
> >> copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
> FAILED
>
> With the output shown above, the user can check all copies of this object and
> decide which one is correct (I plan to add a new option, '--oid', to
> 'collie vdi read' in another patch, so that the user can specify which copy of
> the object to export:
> $ collie vdi read test.img --oid 9c5e6800000001 at 127.0.0.1:7001 > foo.img
> By testing foo.img, we can know which copy is correct).
>
> The user can force a repair by specifying the -F or --force_repair flag:
> * Force repair:
> $ collie vdi check -F test.img
> CHECKING VDI:test.img ...
> Failed oid: 9c5e6800000001
> >> copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
> >> copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
> >> copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
> >> force repairing ...
> >> copy this object from 127.0.0.1:7000 => 127.0.0.1:7001
> >> copy this object from 127.0.0.1:7000 => 127.0.0.1:7004
> >> repair finished
> FORCE REPAIRED

I think we need to forward this to the sheepdog users mailing list. I am not
sure whether this new check & repair process is really acceptable or useful to
end users. It looks a bit complex to me.

I don't know how many copies will be damaged after a crash. This new check &
repair method is useful only when a small fraction of the objects in one image
is corrupted.

Thanks,
Yuan
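
For readers following the thread, the check step described in the quoted patch (hash each copy where it is stored, then compare only the digests) can be sketched roughly as below. This is a minimal illustration in Python, not sheepdog's C implementation: the function names `sha1_of` and `check_object` and the in-memory `copies` mapping are hypothetical, and the byte strings stand in for object data that a real node would read from its local store.

```python
import hashlib


def sha1_of(data: bytes) -> str:
    """Return the hex SHA-1 digest of one replica's contents."""
    return hashlib.sha1(data).hexdigest()


def check_object(copies: dict) -> tuple:
    """Compare replicas by digest only.

    `copies` maps a node address (hypothetical, e.g. "127.0.0.1:7000")
    to the bytes of that node's copy. In a real cluster each node would
    hash its own copy and send back only the 40-character digest.
    Returns (consistent, digests), where digests maps address -> sha1.
    """
    digests = {addr: sha1_of(data) for addr, data in copies.items()}
    # The object passes only if every replica produced the same digest.
    consistent = len(set(digests.values())) <= 1
    return consistent, digests


if __name__ == "__main__":
    # Made-up replica contents; the middle copy differs from the others.
    copies = {
        "127.0.0.1:7000": b"object data",
        "127.0.0.1:7001": b"object dat\x00",
        "127.0.0.1:7004": b"object data",
    }
    ok, digests = check_object(copies)
    for i, (addr, digest) in enumerate(digests.items()):
        print(f">> copy[{i}], sha1: {digest}, from: {addr}")
    print("PASSED" if ok else "FAILED")
```

Note that the sketch only reports a mismatch. It does not try to decide which copy is correct, which matches the concern in the patch description that automatic repair is dangerous.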