[sheepdog] compatibility of dog command between new and old cluster

Fri Jul 18 05:00:53 CEST 2014

On 2014-07-18 10:35, Ruoyu wrote:
> Hi there,
>
> Should we bump SD_PROTO_VER to 0x03 to avoid this, since the vdi object 
> size has changed now that the ledger object has been introduced?
Or should we stop checking whether the size read equals the expected size?

diff --git a/sheep/plain_store.c b/sheep/plain_store.c
index 07bd107..5991bf3 100644
--- a/sheep/plain_store.c
+++ b/sheep/plain_store.c
@@ -345,7 +345,7 @@ static int default_read_from_path(uint64_t oid, const char *path,
                 return err_to_sderr(path, oid, errno);

         size = xpread(fd, iocb->buf, iocb->length, iocb->offset);
-       if (unlikely(size != iocb->length)) {
+       if (size < 0) {
                 sd_err("failed to read object %"PRIx64", path=%s, offset=%"
                        PRId32", size=%"PRId32", result=%zd, %m", oid, path,
                        iocb->offset, iocb->length, size);

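To make the trade-off concrete, here is a rough sketch of the three cases the check has to tell apart. The enum and the helper name are made up for this mail and are not sheepdog code, and I am assuming that xpread() returns a negative value only when the read itself fails.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical result codes, not sheepdog's SD_RES_* values. */
enum read_status {
	READ_OK,	/* every requested byte was read */
	READ_SHORT,	/* fewer bytes than requested were read */
	READ_ERROR,	/* the read call itself failed */
};

/*
 * Classify the return value of a pread()-style call.
 * 'got' is the return value, 'want' is the requested length.
 */
static enum read_status classify_read(ssize_t got, size_t want)
{
	if (got < 0)
		return READ_ERROR;
	if ((size_t)got < want)
		return READ_SHORT;
	return READ_OK;
}
```

With the numbers from the log quoted below (result=4198976, size=12587576), this classifies the read as short rather than failed, so only a real I/O error would need to go down the error path.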
>
> On 2014-07-15 17:31, Ruoyu wrote:
>> When I submit a read request with a new-version dog command (ledger 
>> object supported) to an old-version cluster (ledger object not 
>> supported), the cluster is corrupted.
>>
>> Error messages in sheep.log:
>>
>> Jul 15 11:17:26 ERROR [gway 24285] default_read_from_path(291) failed 
>> to read object 80e4a2b600000000, 
>> path=/mnt/sheepdog/obj/80e4a2b600000000, offset=0, size=12587576, 
>> result=4198976, Success
>> Jul 15 11:17:26 ERROR [gway 24285] err_to_sderr(114) 
>> oid=80e4a2b600000000, Success
>> Jul 15 11:17:26 ERROR [gway 24285] gateway_replication_read(270) 
>> local read 80e4a2b600000000 failed, Network error between sheep
>> Jul 15 11:17:26 INFO [main] md_remove_disk(349) /mnt/sheepdog/obj 
>> from multi-disk array
>> Jul 15 11:17:26 ERROR [gway 24285] sheep_exec_req(1114) failed 
>> Network error between sheep, remote address: 192.168.1.2:7000, op 
>> name: READ_PEER
>>
>> As you can see, the vdi object size has changed: 12587576 bytes were 
>> expected but only 4198976 were read. As a result, sheep concluded that 
>> the disk had an unrecoverable problem and removed it, which then 
>> triggers a recovery. This behavior is not very robust.
>>
>> Maybe we need some kind of version check to avoid this issue. 
>> What do you think?
>
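A minimal sketch of the kind of version check suggested above. The struct, the macro, and the function name are hypothetical and reduced to the one field that matters here; the real SD_PROTO_VER and request header are defined in sheepdog's own headers and may look different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value; the real SD_PROTO_VER lives in sheepdog's headers. */
#define EXAMPLE_PROTO_VER 0x03

/* Hypothetical, cut-down request header with only the field used here. */
struct example_req {
	uint8_t proto_ver;
};

/* Accept a request only if the sender speaks the same protocol version. */
static bool proto_ver_matches(const struct example_req *req)
{
	return req->proto_ver == EXAMPLE_PROTO_VER;
}
```

A node that ran a check like this before touching any object could reject a request from a mismatched dog with a version error instead of treating the size difference as a disk failure.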