I tested again with the latest stable release of corosync, version 1.4.2.
In this case, the behaviour is different, but still odd!
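(In case it helps anyone reproducing this, the version actually running can be double-checked on each host with:
0026# corosync -v
Everything below is with 1.4.2 throughout.)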
I start with a completely blank cluster on 002{6,7,8}, three O_DIRECT sheep
daemons per host:
0026# collie node list
Idx - Host:Port Vnodes Zone
---------------------------------------------
0 - 172.16.101.7:7000 64 124063916
1 - 172.16.101.7:7001 64 124063916
2 - 172.16.101.7:7002 64 124063916
3 - 172.16.101.9:7000 64 157618348
4 - 172.16.101.9:7001 64 157618348
5 - 172.16.101.9:7002 64 157618348
6 - 172.16.101.11:7000 64 191172780
7 - 172.16.101.11:7001 64 191172780
8 - 172.16.101.11:7002 64 191172780
0026# collie cluster format --copies=2
0026# collie vdi create test 1G
0026# collie vdi create test2 1G
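(For reference, each sheep daemon was started along these lines; the 0028 command lines can be seen verbatim in the ps output further down, and the 0026/0027 store paths here are just the obvious analogues, so treat them as approximate:
0026# sheep -D -p 7000 /mnt/sheep-0026-00
0026# sheep -D -p 7001 /mnt/sheep-0026-01
0026# sheep -D -p 7002 /mnt/sheep-0026-02
with -D giving the O_DIRECT access to the store.)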
Now I kill the network on 0028:
0028# ip link set eth1 down
0028# collie vdi list
name id size used shared creation time vdi id
------------------------------------------------------------------
[HANG]
^C
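(Incidentally, to avoid wedging the shell while poking at a hung collie, wrapping it in coreutils timeout works, e.g.:
0028# timeout 10 collie vdi list
The transcripts here are the plain invocations, though, so the hang is visible.)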
0028# collie node list
Idx - Host:Port Vnodes Zone
---------------------------------------------
0 - 172.16.101.7:7000 64 124063916
1 - 172.16.101.7:7001 64 124063916
2 - 172.16.101.7:7002 64 124063916
3 - 172.16.101.9:7000 64 157618348
4 - 172.16.101.9:7001 64 157618348
5 - 172.16.101.9:7002 64 157618348
6 - 172.16.101.11:7000 64 191172780
7 - 172.16.101.11:7001 64 191172780
8 - 172.16.101.11:7002 64 191172780
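(I didn't capture it during this run, but it would probably be worth checking what corosync itself makes of the membership at this point, e.g.:
0028# corosync-cfgtool -s
for the ring status, and something like corosync-objctl | grep -i member to pick the membership keys out of the object database.)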
Hmm, it hasn't noticed that it's partitioned. Meanwhile, back on 0026:
0026# collie vdi list
name id size used shared creation time vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# collie vdi list
name id size used shared creation time vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
0026# sleep 60
0026# collie vdi list
name id size used shared creation time vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has a new epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has a new epoch
failed to read a inode header
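(If I've remembered the subcommand right, the epoch mismatch should also be visible from outside with:
0026# collie cluster info
which lists the cluster status and the node membership for each epoch; I didn't save that output during this run.)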
However, if I wait a bit longer:
0026# collie node list
Idx - Host:Port Vnodes Zone
---------------------------------------------
0 - 172.16.101.7:7000 64 124063916
1 - 172.16.101.7:7001 64 124063916
2 - 172.16.101.7:7002 64 124063916
3 - 172.16.101.9:7000 64 157618348
4 - 172.16.101.9:7001 64 157618348
5 - 172.16.101.9:7002 64 157618348
0026# collie vdi list
name id size used shared creation time vdi id
------------------------------------------------------------------
test 1 1.0 GB 0.0 MB 0.0 MB 2011-11-25 10:12 7c2b25
test2 1 1.0 GB 0.0 MB 0.0 MB 2011-11-25 10:12 fd3815
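(I didn't capture it, but it would have been worth checking where the copies ended up once 0028's zone was dropped; if this version of collie has them, something like:
0026# collie node info
0026# collie vdi object test
should show the per-node usage and where the objects for test now live.)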
So it's okay again. Time to bring back the machine with the missing network:
0028# ip link set eth1 up
0028# collie vdi list
name id size used shared creation time vdi id
------------------------------------------------------------------
failed to read object, 807c2b2500000000 Remote node has an old epoch
failed to read a inode header
failed to read object, 80fd381500000000 Remote node has an old epoch
failed to read a inode header
[wait a bit]
0028# collie vdi list
there is no active sheep daemons [sic]
But they haven't really exited:
0028# ps ax | grep sheep
1798 ? Ssl 0:00 sheep -D -p 7000 /mnt/sheep-0028-00
1801 ? Ss 0:00 sheep -D -p 7000 /mnt/sheep-0028-00
1819 ? Ssl 0:00 sheep -D -p 7001 /mnt/sheep-0028-01
1822 ? Ss 0:00 sheep -D -p 7001 /mnt/sheep-0028-01
1840 ? Ssl 0:00 sheep -D -p 7002 /mnt/sheep-0028-02
1842 ? Ss 0:00 sheep -D -p 7002 /mnt/sheep-0028-02
Presumably they're not forwarding properly, though, if they're not responding
to collie vdi list?
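(One thing I could have checked but didn't: whether those daemons are still accepting connections at all, e.g.:
0028# netstat -tlnp | grep sheep
or pointing collie at each daemon in turn with its port option (collie node list -p 7001, if I'm remembering the option right).)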
I've popped the log files from this test session at
http://cdw.me.uk/tmp/sheep-0026-00.log
http://cdw.me.uk/tmp/sheep-0026-01.log
http://cdw.me.uk/tmp/sheep-0026-02.log
http://cdw.me.uk/tmp/sheep-0027-00.log
http://cdw.me.uk/tmp/sheep-0027-01.log
http://cdw.me.uk/tmp/sheep-0027-02.log
http://cdw.me.uk/tmp/sheep-0028-00.log
http://cdw.me.uk/tmp/sheep-0028-01.log
http://cdw.me.uk/tmp/sheep-0028-02.log
There doesn't seem to be much that's helpful in there, unfortunately.
I'll try with the latest 1.3.x corosync next to see if the behaviour is the
same.
Best wishes,
Chris.