I tested again with the latest stable release of corosync, version 1.4.2.
In this case, the behaviour is different, but still odd!

I start with a completely blank cluster on 002{6,7,8}, three O_DIRECT
sheep daemons per host:

0026# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780
0026# collie cluster format --copies=2
0026# collie vdi create test 1G
0026# collie vdi create test2 1G

Now I kill the network on 0028:

0028# ip link set eth1 down
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
[HANG]
^C
0028# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780

Hmm, it hasn't noticed that it's partitioned. Meanwhile, back on 0026:

0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# sleep 60
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header

However, if I wait a bit longer:

0026# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  test         1  1.0 GB  0.0 MB  0.0 MB 2011-11-25 10:12   7c2b25
  test2        1  1.0 GB  0.0 MB  0.0 MB 2011-11-25 10:12   fd3815

...it's okay again.
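
Incidentally, if it's useful next time round, a rough way to time how long
the eviction takes would be to poll membership from 0026 straight after
taking eth1 down on 0028. Something like this untested sketch should do;
it assumes 172.16.101.11 is 0028's cluster address, as the node lists
above suggest:

    # hypothetical timing loop, run on 0026 right after "ip link set eth1 down"
    start=$(date +%s)
    while collie node list | grep -q 172.16.101.11; do
        sleep 5
    done
    echo "0028's daemons dropped out after $(( $(date +%s) - start ))s"

That would at least tell us whether the delay is roughly what we'd expect
from corosync's failure-detection timeouts or something slower on the
sheepdog side.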
Time to bring back the machine with the missing network:

0028# ip link set eth1 up
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has an old epoch
failed to read a inode header

[wait a bit]

0028# collie vdi list
there is no active sheep daemons [sic]

but they haven't really exited:

0028# ps ax | grep sheep
 1798 ?        Ssl    0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1801 ?        Ss     0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1819 ?        Ssl    0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1822 ?        Ss     0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1840 ?        Ssl    0:00 sheep -D -p 7002 /mnt/sheep-0028-02
 1842 ?        Ss     0:00 sheep -D -p 7002 /mnt/sheep-0028-02

Presumably they're not forwarding properly, though, if they're not
responding to collie vdi list?

I've popped the log files from this test session at:

http://cdw.me.uk/tmp/sheep-0026-00.log
http://cdw.me.uk/tmp/sheep-0026-01.log
http://cdw.me.uk/tmp/sheep-0026-02.log
http://cdw.me.uk/tmp/sheep-0027-00.log
http://cdw.me.uk/tmp/sheep-0027-01.log
http://cdw.me.uk/tmp/sheep-0027-02.log
http://cdw.me.uk/tmp/sheep-0028-00.log
http://cdw.me.uk/tmp/sheep-0028-01.log
http://cdw.me.uk/tmp/sheep-0028-02.log

Unfortunately there doesn't seem to be much that's helpful in there.
I'll try with the latest 1.3.x corosync next to see whether the
behaviour is the same.

Best wishes,

Chris.
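
P.S. Next time I reproduce this I'll also check whether the 0028 daemons
are at least still accepting TCP connections on their listen ports while
collie is claiming there are no active daemons, with a quick loop like the
one below on 0028 (assuming nc is installed there and the daemons listen
on localhost, which is where collie seems to be connecting):

    # hypothetical check: are the sheep daemons on 0028 still accepting
    # TCP connections on ports 7000-7002, as started above?
    for port in 7000 7001 7002; do
        if nc -z 127.0.0.1 $port; then
            echo "port $port: still accepting connections"
        else
            echo "port $port: not accepting connections"
        fi
    done

That should at least distinguish "process alive but request path wedged"
from "not listening at all".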