[Sheepdog] Sheepdog 0.3.0 schedule and 0.4.0 plan

Chris Webb chris at arachsys.com
Fri Nov 25 11:26:57 CET 2011


I tested again with the latest stable release of corosync, version 1.4.2.

In this case, the behaviour is different, but still odd!

I start with a completely blank cluster on 002{6,7,8}, running three
O_DIRECT sheep daemons per host:

0026# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780
0026# collie cluster format --copies=2
0026# collie vdi create test 1G
0026# collie vdi create test2 1G
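
(For the record, each host's three sheep daemons were started along these
lines; the /mnt/sheep-0026-0$n store paths are my guess, extrapolated from
the ps output on 0028 further down:

0026# for n in 0 1 2; do sheep -D -p 700$n /mnt/sheep-0026-0$n; done

where -D is, as I understand it, what puts the object store in O_DIRECT
mode, and -p sets the listening port.)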

Now I kill the network on 0028:

0028# ip link set eth1 down
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
[HANG]
^C
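
(These hangs block the terminal until interrupted, so for scripted testing
it's handy to wrap the client in coreutils timeout, e.g.

0028# timeout 10 collie vdi list

which is purely cosmetic, but saves a ^C per attempt.)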
0028# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
     6 - 172.16.101.11:7000     64  191172780
     7 - 172.16.101.11:7001     64  191172780
     8 - 172.16.101.11:7002     64  191172780

Hmm, it hasn't noticed it's partitioned.
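
(It would be worth cross-checking what corosync itself believes at this
point. On corosync 1.4.x, something like the following should show the
ring status and current membership on 0028, assuming I'm remembering the
objdb key correctly:

0028# corosync-cfgtool -s
0028# corosync-objctl runtime.totem.pg.mrp.srp.members

If it still lists all three hosts as joined, the lag is in totem's failure
detection rather than anything sheepdog is doing.)

Meanwhile, back on 0026: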

0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# sleep 60
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
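
If I'm reading these errors right, "Remote node has a new epoch" means the
remote sheep has already moved on to a later cluster epoch while 0026's own
view is still stale, so reads are refused until the membership change has
been delivered everywhere. One way to watch this converge, assuming collie
cluster info behaves here as it does on a healthy cluster, is the epoch
history it prints:

0026# collie cluster info

Running that on each surviving node should show whether they agree on the
latest epoch yet.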

However, if I wait a bit longer:

0026# collie node list
   Idx - Host:Port          Vnodes       Zone
---------------------------------------------
     0 - 172.16.101.7:7000      64  124063916
     1 - 172.16.101.7:7001      64  124063916
     2 - 172.16.101.7:7002      64  124063916
     3 - 172.16.101.9:7000      64  157618348
     4 - 172.16.101.9:7001      64  157618348
     5 - 172.16.101.9:7002      64  157618348
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  test         1  1.0 GB  0.0 MB  0.0 MB 2011-11-25 10:12   7c2b25
  test2        1  1.0 GB  0.0 MB  0.0 MB 2011-11-25 10:12   fd3815

...it's okay again. Time to bring back the machine with the missing network:

0028# ip link set eth1 up
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has an old epoch
failed to read a inode header
[wait a bit]
0028# collie vdi list
there is no active sheep daemons [sic]

but the sheep daemons haven't actually exited:

0028# ps ax | grep sheep
 1798 ?        Ssl    0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1801 ?        Ss     0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1819 ?        Ssl    0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1822 ?        Ss     0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1840 ?        Ssl    0:00 sheep -D -p 7002 /mnt/sheep-0028-02
 1842 ?        Ss     0:00 sheep -D -p 7002 /mnt/sheep-0028-02

Presumably they're not forwarding requests properly, though, if they're not
responding to collie vdi list?
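
Attaching strace to one of them (pids from the ps output above) might at
least show whether they're wedged in recovery or just ignoring requests:

0028# strace -f -p 1798

Failing that, I'd guess the only way back is to kill the daemons on 0028
and restart them with their original arguments so they rejoin at the
current epoch, though I haven't confirmed that's the intended recovery
procedure:

0028# pkill sheep
0028# for n in 0 1 2; do sheep -D -p 700$n /mnt/sheep-0028-0$n; done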

I've popped the log files from this test session at

  http://cdw.me.uk/tmp/sheep-0026-00.log
  http://cdw.me.uk/tmp/sheep-0026-01.log
  http://cdw.me.uk/tmp/sheep-0026-02.log
  http://cdw.me.uk/tmp/sheep-0027-00.log
  http://cdw.me.uk/tmp/sheep-0027-01.log
  http://cdw.me.uk/tmp/sheep-0027-02.log
  http://cdw.me.uk/tmp/sheep-0028-00.log
  http://cdw.me.uk/tmp/sheep-0028-01.log
  http://cdw.me.uk/tmp/sheep-0028-02.log

Unfortunately, there doesn't seem to be much of help in there.

I'll try with the latest 1.3.x corosync next to see if the behaviour is the
same.

Best wishes,

Chris.


