[sheepdog-users] Questions on Sheepdog Operations

Wed Apr 9 22:37:47 CEST 2014

On 04/09/2014 03:55 PM, Scott Devoid wrote:
We are currently evaluating sheepdog in a development cluster. Here are a few questions that we've had about sheepdog operations. Please let us know if you have any thoughts.

Thank you!
~ Scott

1. We know that sheepdog expects consistent mount points for disks. For example, if I start sheep with: "sheep -c corosync:172.21.1.1 --pidfile /var/run/sheepdog.pid /meta,/disk1,/disk2" I cannot shut down the agent, remount the disks in swapped positions, and restart sheepdog. It complains loudly that it can't find the right objects in the right places. Now, if "/disk1" dies and I swap the drive, format it, and mount the new drive at "/disk1", will this cause a problem for sheep?

It should not.  While I've not personally had to use it, dog node md unplug/plug should allow removal and reloading of the failed disk.  Note that this occurs entirely within that node and does not alter weighting.

2. Extending on the last example, what is the best way to replace individual disks? We are using erasure-coding, so should I use "dog node md plug/unplug"? Or restart sheep? Or something else?

Yes.  dog node md unplug, replace the drive, dog node md plug.  This will not trigger a cluster recovery, but will cause recovery on the specific node.

3. Is there a way to get drive-level statistics out of sheepdog? I can of course use operating system tools for the individual devices, but are there additional sheepdog-specific stats I should be interested in?

dog node stat and dog node md info are useful.  I'm working on an in-house monitor to also watch stats on io-waiting per process/per server, along with a few other things.

4. We are running a cluster with 16:4 erasure coding and multi-disk mode. How should we think about our failure domains? Here are a few tests that we have run:

    - Shut down a single node in the cluster. We immediately see replication/recovery logs on all other nodes as objects are copied to meet redundancy requirements.

Correct.

    - Unmount a single disk on a single node: No log messages and no indication of changes in redundancy state. dog cluster check indicates check&repair occurs.

Also correct.  What happens here is that the node triggers a rebalance of data across its remaining drives.  The actual weight of the node remains as it was at initial start.  If you want to effectively permanently "fail" that drive out of sheep, you can run dog cluster reweight to trigger a cluster-wide recovery and rebalance.

    - Unmount of 4 disks across 4 nodes: No log messages and no indication of changes in redundancy state. "No object found" errors abound.

Yes.  Effectively, you have 4 nodes currently in "internal" recovery.  This should not affect overall cluster state.  Errors should resolve when the drives have been rebalanced internally.

    - Unmount of 20 disks across 20 nodes: No log messages and no indication of changes in redundancy state. "No object found" errors abound.

Yes.  To keep your sanity, repeat this while writing large amounts of data (a bonnie++ benchmark, for example) and ensure the virtual machine keeps chugging.

It appears that sheepdog only starts recovery when a node drops out of cluster membership (zookeeper or corosync). It is concerning that there are no active checks or logs for single-disk failures in a multi-disk setup. Are we missing some option here? Or are we misusing this feature?

This is correct behavior, as we discussed not too long back.  When a drive fails in md mode, sheep rebalances data internally.  This doesn't trigger a cluster-wide recovery, but it does mean you could end up in a situation where a write is sent to that node and there's no space left on it, in spite of what the current weight implies.  I'm not sure exactly what happens in that case.  I try to keep us well below 80% utilized space just in case.
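One way to keep an eye on that 80% margin is a small check against the local store directories. The sketch below is only an illustration: the function names over_threshold and check_store_dirs are made up for this example, and it measures filesystem usage with POSIX df rather than anything sheepdog reports itself, so treat it as a rough guard and not as sheepdog's own view of free space.

```shell
#!/bin/sh
# Hypothetical helpers (not part of sheepdog): warn when a store
# directory's filesystem reaches a utilization limit.

# over_threshold USED_KB TOTAL_KB [LIMIT_PERCENT]
# Exit 0 when used/total is at or above the limit (default 80),
# exit 1 when below it, exit 2 when TOTAL_KB is not positive.
over_threshold() {
    used=$1
    total=$2
    limit=${3:-80}
    [ "$total" -gt 0 ] || return 2
    # Compare used*100 against limit*total to avoid integer division.
    [ $((used * 100)) -ge $((limit * total)) ]
}

# check_store_dirs DIR...
# Prints one warning line per directory at or above the default limit.
check_store_dirs() {
    for dir in "$@"; do
        # df -P -k: line 2 holds total (column 2) and used (column 3)
        # in 1024-byte blocks for the filesystem containing DIR.
        line=$(df -P -k "$dir" | awk 'NR == 2 { print $3, $2 }') || continue
        [ -n "$line" ] || continue
        used_kb=${line% *}
        total_kb=${line#* }
        if over_threshold "$used_kb" "$total_kb"; then
            echo "WARNING: $dir is at or above 80% utilization"
        fi
    done
}
```

For example, check_store_dirs /disk1 /disk2 could be run from cron on each node, with the directory list matching whatever was passed to sheep at startup.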
5. However we structure our redundancy, we would like to be able to safely offline disks that we identify as performing poorly or in SMART pre-failure / failure mode. What procedure should we use when replacing disks? Should we run "dog cluster check" after each disk? "cluster rebalance"? "cluster recover"? Can we do more than one disk at a time?

I would suggest a single drive at a time depending on utilization.  A dog cluster reweight should only be necessary if a drive will be missing for an extended period of time.  So in principle:

dog node md unplug /var/lib/sheep2
umount /var/lib/sheep2
# ... whatever else you're going to do ...
mount /var/lib/sheep2
dog node md plug /var/lib/sheep2

All credit to:
http://www.sheepdog-project.org/doc/multidevice.html

Note the warning on that page.  You cannot unplug the metadata drive (the first entry in md mode), only the object store drives.  If the metadata drive is acting up or dying, just take the node down for repair.
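The unplug/replace/plug sequence, together with the metadata-drive restriction, can be sketched as a dry-run helper. This is an illustration only: plan_disk_swap is a hypothetical name, the metadata directory has to be supplied by the caller, and the function just prints the steps for review instead of running them.

```shell
#!/bin/sh
# Hypothetical helper (not part of sheepdog): print, but do not run,
# the per-disk replacement steps for one object-store directory.

# plan_disk_swap META_DIR DISK_DIR
# META_DIR is the first directory given to sheep in md mode; it is
# refused here because the metadata drive cannot be unplugged.
plan_disk_swap() {
    meta_dir=$1
    disk_dir=$2
    if [ -z "$meta_dir" ] || [ -z "$disk_dir" ]; then
        echo "usage: plan_disk_swap META_DIR DISK_DIR" >&2
        return 2
    fi
    if [ "$disk_dir" = "$meta_dir" ]; then
        echo "refusing: $disk_dir is the metadata directory; take the node down instead" >&2
        return 1
    fi
    # Same order as the manual procedure: unplug, unmount, swap,
    # mount, plug.
    printf '%s\n' \
        "dog node md unplug $disk_dir" \
        "umount $disk_dir" \
        "# replace the physical drive and format it here" \
        "mount $disk_dir" \
        "dog node md plug $disk_dir"
}
```

Running plan_disk_swap /meta /var/lib/sheep2 prints the five steps for that directory. Printing instead of executing keeps the swap itself a manual, one-disk-at-a-time operation.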

All the other caveats about using zookeeper over corosync remain...
