[sheepdog-users] Panic problem with stable-0.6

Fri Jul 19 23:22:02 CEST 2013

Hi everybody,
I'm experiencing problems with sheepdog-0.6 + cache + snapshot.
The system is a 3-node cluster; 2 of the nodes have the cache enabled.

The problem is this:

Node A - 2 VMs running

1) Via a bash script I suspend a VM that is under heavy I/O load
2) After suspending it I take a snapshot with "collie vdi snapshot -s xxxxx vdiname"
3) I resume the VM
4) I copy the image out of the cluster via "qemu-img convert -O qcow2 sheepdog:vdiname:1 pippo.qcow2"
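
Roughly, the four steps above look like the sketch below. This is only an illustration and not the actual script: the use of "virsh suspend" / "virsh resume" to pause the VM is an assumption (the suspend mechanism is not stated here), and the names run, backup_vdi and DRY_RUN are hypothetical. The collie and qemu-img invocations are the ones quoted in steps 2 and 4.

```shell
#!/bin/bash
# Sketch of the suspend / snapshot / resume / export sequence.
# ASSUMPTION: the VM is managed by libvirt and paused with virsh;
# the function and variable names here are illustrative only.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

backup_vdi() {
    # $1 = VM (domain) name, $2 = VDI name, $3 = snapshot tag,
    # $4 = snapshot id or tag to export, $5 = output file
    local domain="$1" vdi="$2" tag="$3" snap="$4" out="$5"

    run virsh suspend "$domain"                  # 1) suspend the VM (assumed tool)
    run collie vdi snapshot -s "$tag" "$vdi"     # 2) take the snapshot
    run virsh resume "$domain"                   # 3) resume the VM
    run qemu-img convert -O qcow2 "sheepdog:$vdi:$snap" "$out"   # 4) copy out of the cluster
}
```

For example, "DRY_RUN=1 backup_vdi vm1 vdiname xxxxx 1 pippo.qcow2" only prints the four commands, which is enough to check the order of operations without touching the cluster.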

Sometimes (but not every time) sheep panics on node A (while on nodes B and C everything continues to work, which points to a cache problem) with this backtrace:

Jul 19 22:56:39 [gway 19609] add_to_lru_cache(660) PANIC: the object already exist
Jul 19 22:56:39 [gway 19609] crash_handler(180) sheep exits unexpectedly (Aborted).
Jul 19 22:56:39 [gway 19609] sd_backtrace(833) sheep.c:182: crash_handler
Jul 19 22:56:39 [gway 19609] sd_backtrace(847) /lib/x86_64-linux-gnu/libpthread.so.0(+0xfbcf) [0x7f9eefa56bcf]
Jul 19 22:56:39 [gway 19609] sd_backtrace(847) /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x36) [0x7f9eeefa4036]
Jul 19 22:56:39 [gway 19609] sd_backtrace(847) /lib/x86_64-linux-gnu/libc.so.6(abort+0x147) [0x7f9eeefa7697]
Jul 19 22:56:39 [gway 19609] sd_backtrace(833) object_cache.c:660: add_to_lru_cache
Jul 19 22:56:39 [gway 19609] sd_backtrace(833) object_cache.c:710: object_cache_lookup
Jul 19 22:56:39 [gway 19609] sd_backtrace(833) object_cache.c:1073: object_cache_handle_request
Jul 19 22:56:39 [gway 19609] sd_backtrace(833) ops.c:1385: do_process_work
Jul 19 22:56:40 [gway 19609] sd_backtrace(833) work.c:243: worker_routine
Jul 19 22:56:40 [gway 19609] sd_backtrace(847) /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8d) [0x7f9eefa4ef8d]
Jul 19 22:56:40 [gway 19609] sd_backtrace(847) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6c) [0x7f9eef066e1c]
Jul 19 22:56:40 [gway 19609] __dump_stack_frames(743) cannot find gdb
Jul 19 22:56:40 [gway 19609] __sd_dump_variable(693) cannot find gdb
Jul 19 22:56:40 [main] crash_handler(487) sheep pid 14333 exited unexpectedly.

I started seeing this problem after applying today's PATCH stable-0.6 2/3 => sheep: delete cache objects only when they are succesfully pushed.

I think the behaviour is caused by the high I/O load on the VM while it is being backed up, combined with the cache being flushed by the snapshot, but I'm not sure.

Ing. Luca Lazzeroni - Trend Servizi Srl
Responsabile R&D
http://www.trendservizi.it