[Sheepdog] Sheepdog 0.3.0 schedule and 0.4.0 plan
Chris Webb
chris at arachsys.com
Fri Nov 25 11:26:57 CET 2011
I tested again with the latest stable release of corosync, version 1.4.2.
In this case, the behaviour is different, but still odd!
I start with a completely blank cluster on 002{6,7,8}, three O_DIRECT sheep
daemons per host:
0026# collie node list
Idx - Host:Port            Vnodes   Zone
---------------------------------------------
  0 - 172.16.101.7:7000        64   124063916
  1 - 172.16.101.7:7001        64   124063916
  2 - 172.16.101.7:7002        64   124063916
  3 - 172.16.101.9:7000        64   157618348
  4 - 172.16.101.9:7001        64   157618348
  5 - 172.16.101.9:7002        64   157618348
  6 - 172.16.101.11:7000       64   191172780
  7 - 172.16.101.11:7001       64   191172780
  8 - 172.16.101.11:7002       64   191172780
0026# collie cluster format --copies=2
0026# collie vdi create test 1G
0026# collie vdi create test2 1G
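For reference, the daemons were started with the same invocation that shows up in the ps output further down; reconstructed by analogy for 0026, that was along these lines, -D being the O_DIRECT flag:

0026# sheep -D -p 7000 /mnt/sheep-0026-00
0026# sheep -D -p 7001 /mnt/sheep-0026-01
0026# sheep -D -p 7002 /mnt/sheep-0026-02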
Now I kill the network on 0028:
0028# ip link set eth1 down
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
[HANG]
^C
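Incidentally, when scripting this sort of test it's easier to bound the hang than to hit ^C by hand; coreutils timeout does the job, exiting non-zero when it has to kill the command:

0028# timeout 10 collie vdi list || echo 'collie timed out'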
0028# collie node list
Idx - Host:Port            Vnodes   Zone
---------------------------------------------
  0 - 172.16.101.7:7000        64   124063916
  1 - 172.16.101.7:7001        64   124063916
  2 - 172.16.101.7:7002        64   124063916
  3 - 172.16.101.9:7000        64   157618348
  4 - 172.16.101.9:7001        64   157618348
  5 - 172.16.101.9:7002        64   157618348
  6 - 172.16.101.11:7000       64   191172780
  7 - 172.16.101.11:7001       64   191172780
  8 - 172.16.101.11:7002       64   191172780
Hmm, it hasn't noticed it's partitioned yet.
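I'd guess that's just corosync's failure-detection latency: membership is only reconfigured once the totem token times out. The relevant knobs live in the totem stanza of corosync.conf; the values below are the stock defaults for illustration, not necessarily what I'm running:

totem {
        version: 2
        # declare token loss (and start reconfiguring membership)
        # after this many milliseconds without seeing the token
        token: 1000
        token_retransmits_before_loss_const: 4
}

Meanwhile, back on 0026: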
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# sleep 60
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
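While the reads are failing like this, the epoch history can be watched with collie cluster info (assuming this build has that subcommand), which should show whether the surviving nodes have agreed on a new epoch yet or are still recovering:

0026# collie cluster info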
However, if I wait a bit longer:
0026# collie node list
Idx - Host:Port            Vnodes   Zone
---------------------------------------------
  0 - 172.16.101.7:7000        64   124063916
  1 - 172.16.101.7:7001        64   124063916
  2 - 172.16.101.7:7002        64   124063916
  3 - 172.16.101.9:7000        64   157618348
  4 - 172.16.101.9:7001        64   157618348
  5 - 172.16.101.9:7002        64   157618348
0026# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
  test         1  1.0 GB  0.0 MB  0.0 MB  2011-11-25 10:12   7c2b25
  test2        1  1.0 GB  0.0 MB  0.0 MB  2011-11-25 10:12   fd3815
...it's okay again: the partitioned node has evidently been expelled and recovery has completed. Time to bring back the machine with the missing network:
0028# ip link set eth1 up
0028# collie vdi list
  name        id    size    used  shared    creation time   vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has an old epoch
failed to read a inode header
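If I understand the store layout correctly (an assumption about this sheepdog version, so treat the path with suspicion), each daemon keeps its epoch history under an epoch/ directory inside its store, so the disagreement ought to be visible by comparing the two sides directly:

0028# ls /mnt/sheep-0028-00/epoch
0026# ls /mnt/sheep-0026-00/epoch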
[wait a bit]
0028# collie vdi list
there is no active sheep daemons [sic]
but the sheep daemons haven't actually exited:
0028# ps ax | grep sheep
 1798 ?        Ssl    0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1801 ?        Ss     0:00 sheep -D -p 7000 /mnt/sheep-0028-00
 1819 ?        Ssl    0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1822 ?        Ss     0:00 sheep -D -p 7001 /mnt/sheep-0028-01
 1840 ?        Ssl    0:00 sheep -D -p 7002 /mnt/sheep-0028-02
 1842 ?        Ss     0:00 sheep -D -p 7002 /mnt/sheep-0028-02
Presumably they're not forwarding requests properly, though, if they won't
respond to collie vdi list?
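To see whether each daemon will answer anything at all, they can be queried individually on their own ports (assuming collie's -p option selects the daemon port here):

0028# for port in 7000 7001 7002; do timeout 5 collie node list -p $port; done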
I've popped the log files from this test session at
http://cdw.me.uk/tmp/sheep-0026-00.log
http://cdw.me.uk/tmp/sheep-0026-01.log
http://cdw.me.uk/tmp/sheep-0026-02.log
http://cdw.me.uk/tmp/sheep-0027-00.log
http://cdw.me.uk/tmp/sheep-0027-01.log
http://cdw.me.uk/tmp/sheep-0027-02.log
http://cdw.me.uk/tmp/sheep-0028-00.log
http://cdw.me.uk/tmp/sheep-0028-01.log
http://cdw.me.uk/tmp/sheep-0028-02.log
There doesn't seem to be much that's helpful in there, unfortunately.
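For anyone who does want to dig through them, the membership and epoch transitions are probably the interesting lines; something along these lines pulls them out, assuming the messages mention epochs and joins/leaves at all:

$ wget -q http://cdw.me.uk/tmp/sheep-00{26,27,28}-0{0,1,2}.log
$ grep -nEi 'epoch|join|leave' sheep-00*.log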
I'll try with the latest 1.3.x corosync next to see if the behaviour is the
same.
Best wishes,
Chris.