Hi Chris,

Okay, I'll look into this, but let me ask you some questions.

At Fri, 14 Oct 2011 10:14:47 +0100,
Chris Webb wrote:
>
> Hi Kazutaka. I pulled your vdiattr branch to test, and it does seem to have
> changed the race behaviour so I think your diagnosis was correct. However,
> when I remove some of my debugging code (which slows down the rate at which
> collie commands are invoked because of the extensive logging), I'm now
> seeing this:
>
> 0026# cat /tmp/collie.log
> [2584] collie vdi create dc9d3806-cafd-47c5-8711-9b4f99b5b061 539545600
> Exit code: 0
>
> [2584] collie vdi setattr -x dc9d3806-cafd-47c5-8711-9b4f99b5b061 lock
> Exit code: 0
>
> [2584] collie vdi setattr dc9d3806-cafd-47c5-8711-9b4f99b5b061 properties
> Exit code: 0
>
> [2584] collie vdi getattr dc9d3806-cafd-47c5-8711-9b4f99b5b061 lock
> Exit code: 0
>
> [2584] collie vdi setattr -d dc9d3806-cafd-47c5-8711-9b4f99b5b061 lock
> Exit code: 0
>
> [2584] collie vdi list --raw
> Exit code: 0
>
> [2584] collie vdi getattr dc9d3806-cafd-47c5-8711-9b4f99b5b061 claimed
> no such attribute, claimed
> Exit code: 5
>
> [2584] collie vdi getattr dc9d3806-cafd-47c5-8711-9b4f99b5b061 properties
> Exit code: 0
>
> [2636] collie vdi list --raw
> Exit code: 0
>
> [2652] collie vdi setattr -x dc9d3806-cafd-47c5-8711-9b4f99b5b061 lock
> Exit code: 0
>
> [2652] collie vdi getattr dc9d3806-cafd-47c5-8711-9b4f99b5b061 claimed
> no such attribute, claimed
> Exit code: 5
>
> [2652] collie vdi setattr dc9d3806-cafd-47c5-8711-9b4f99b5b061 claimed
> Exit code: 0
>
> [2652] collie vdi getattr dc9d3806-cafd-47c5-8711-9b4f99b5b061 lock
> Exit code: 0
>
> [2652] collie vdi setattr -d dc9d3806-cafd-47c5-8711-9b4f99b5b061 lock
> Exit code: 0
>
> [2652] collie vdi write dc9d3806-cafd-47c5-8711-9b4f99b5b061 0
> failed to write object, b1028300000000 I/O error
> failed to write vdi
> Exit code: 1

Did this bug also happen before you pulled the vdiattr branch?
If possible, can you check what is written in sheep.log when this problem
happens?

>
> [2652] collie vdi setattr -x dc9d3806-cafd-47c5-8711-9b4f99b5b061 lock
>
> There are no collie processes other than the ones listed above accessing the
> vdi, and the three processes you see above run sequentially, not
> concurrently.
>
> At this point, all my nodes have gone away!
>
> 0026# collie node list
> Idx - Host:Port Vnodes Zone
> ---------------------------------------------
>
> 0026# collie vdi list
> name id size used shared creation time vdi id
> ------------------------------------------------------------------
> Floating point exception (core dumped)

Can you get a stack trace from the core?

Thanks,

Kazutaka

>
> although the sheep processes still seem to be running:
>
> 2512 ? Ssl 0:00 sheep -D -p 7000 /mnt/sheep-0026-00
> 2514 ? Ss 0:00 sheep -D -p 7000 /mnt/sheep-0026-00
> 2515 ? Ssl 0:00 sheep -D -p 7001 /mnt/sheep-0026-01
> 2518 ? Ss 0:00 sheep -D -p 7001 /mnt/sheep-0026-01
> 2519 ? Ssl 0:00 sheep -D -p 7002 /mnt/sheep-0026-02
> 2535 ? Ss 0:00 sheep -D -p 7002 /mnt/sheep-0026-02
>
> I've reproduced this a few times from a clean
>
> killall sheep
> rm -rf /mnt/sheep-0026-0*/*
> sheep -D -p 7000 /mnt/sheep-0026-00
> sheep -D -p 7001 /mnt/sheep-0026-01
> sheep -D -p 7002 /mnt/sheep-0026-02
> collie cluster format --copies=1
>
> on my test host.
>
> Cheers,
>
> Chris.
> --
> sheepdog mailing list
> sheepdog at lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog
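
For anyone triaging a log like the /tmp/collie.log quoted in this thread, here is a rough sketch of a helper that pairs each logged command with its exit code and lists the ones that did not return 0. It is not part of sheepdog or collie. The function names `parse_collie_log` and `failures` are made up for illustration, and the parser assumes only the layout visible in the excerpt: a `[pid] command` line, optional output lines, then an `Exit code: N` line. It expects the raw file, without the mail quoting prefix.

```python
import re

# Assumed layout, taken from the quoted excerpt only:
#   [2652] collie vdi write <vdi> 0
#   failed to write vdi
#   Exit code: 1
CMD_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")
EXIT_RE = re.compile(r"^Exit code:\s*(-?\d+)$")


def parse_collie_log(text):
    """Return one dict per logged command: pid, command, output, exit_code.

    exit_code stays None when no "Exit code:" line follows the command,
    for example when the command never returned.
    """
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = CMD_RE.match(line)
        if m:
            current = {
                "pid": int(m.group(1)),
                "command": m.group(2),
                "output": [],
                "exit_code": None,
            }
            entries.append(current)
            continue
        if current is None:
            # Text before the first command line (e.g. a shell prompt) is skipped.
            continue
        m = EXIT_RE.match(line)
        if m:
            current["exit_code"] = int(m.group(1))
        else:
            current["output"].append(line)
    return entries


def failures(entries):
    """Entries whose exit code is non-zero or was never logged."""
    return [e for e in entries if e["exit_code"] != 0]
```

Note that in the quoted log, exit code 5 from `getattr ... claimed` looks like an expected "attribute not set" probe rather than a fault, so the interesting entries are the `vdi write` that exits 1 and the final `setattr -x ... lock` with no exit code recorded.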