[sheepdog-users] Sheepdog 0.9 missing live migration feature

Liu Yuan namei.unix at gmail.com
Wed May 13 17:38:34 CEST 2015


On Wed, May 13, 2015 at 10:01:04PM +0900, Hitoshi Mitake wrote:
> At Wed, 13 May 2015 20:36:04 +0800,
> Liu Yuan wrote:
> > 
> > On Wed, May 13, 2015 at 06:57:11PM +0800, Liu Yuan wrote:
> > > On Wed, May 13, 2015 at 07:22:11PM +0900, Hitoshi Mitake wrote:
> > > > At Wed, 13 May 2015 17:49:41 +0800,
> > > > Liu Yuan wrote:
> > > > > 
> > > > > On Mon, May 11, 2015 at 09:13:10PM +0900, Hitoshi Mitake wrote:
> > > > > > On Mon, May 11, 2015 at 8:58 PM, Walid Moghrabi
> > > > > > <walid.moghrabi at lezard-visuel.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > >> Sorry for keeping you waiting. I'll backport the patch tonight.
> > > > > > >
> > > > > > > You're great :D
> > > > > > 
> > > > > > I released v0.9.2_rc0. Please try it out:
> > > > > > https://github.com/sheepdog/sheepdog/releases/tag/v0.9.2_rc0
> > > > > > 
> > > > > > >
> > > > > > >> Thanks a lot for your help. But I need to say that journaling and
> > > > > > >> object cache are unstable features. Please don't use them in
> > > > > > >> production.
> > > > > > >
> > > > > > > Too bad :(
> > > > > > > I was really happy to try this on my setup: I equipped every node with a separate SSD drive on which I was planning to store Sheepdog's journal and/or object cache.
> > > > > > > Why are these features "unstable"?
> > > > > > > What are the risks? Under which conditions shouldn't I use them?
> > > > > > 
> > > > > > As far as we know, there is a risk of sheep daemon crashes under heavy load.
> > > > > > 
> > > > > > >
> > > > > > > Unless the risk is severe, I think I'll still give it a try (at least in my crash tests before moving the cluster to production), because it looks promising. Anyway, Sheepdog has never been considered stable, and I've been using it with real joy since 0.6, even on a production platform, so ... ;)
> > > > > > >
> > > > > > > Anyway, just out of my own curiosity, here is what I'm planning for my setup; I'd really appreciate any comments on it:
> > > > > > >
> > > > > > > 9 nodes, each with:
> > > > > > >   - 2 interfaces: one for cluster communication (the "main" network) and one dedicated to Sheepdog's replication (the "storage" network), with fixed IPs, completely closed off and jumbo frames enabled (MTU 9000)
> > > > > > >   - 3 x 600 GB 15k SAS dedicated hard drives that are not part of any RAID (standalone drives), which I was thinking of using in multi-disk (MD) mode
> > > > > > >   - 1 SATA SSD (on which the OS resides and which I was thinking of using for Sheepdog's journal and object cache)
> > > > > > >
> > > > > > > So that means a 27-drive cluster, which I want to format using erasure code, but so far I don't really know which settings to configure for this ... I'd like to find a good balance between performance, safety and storage space ... any suggestions are most welcome.
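> > > > > > >
> > > > > > > For example, if I read the docs right, the erasure-code layout is picked once at cluster format time through the copies argument, something like:
> > > > > > >
> > > > > > >   $ dog cluster format -c 4:2
> > > > > > >
> > > > > > > i.e. 4 data strips + 2 parity strips, which should survive two simultaneous failures at a 1.5x storage overhead (these numbers are just an illustration, not a recommendation).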
> > > > > > 
> > > > > > I don't see anything bad in your configuration. But I suggest being
> > > > > > as conservative as possible. For example, don't enable optimizations
> > > > > > (the -n option, for example) if your current configuration already
> > > > > > provides enough performance. Our internal testing focuses on the
> > > > > > basic components; they should be stable enough. But we cannot
> > > > > > allocate time for testing optional things (testing and benchmarking
> > > > > > distributed storage is really costly), so optional features are
> > > > > > likely to have more bugs than the basic components.
> > > > > > 
> > > > > 
> > > > > Sorry for cutting into your conversation, but based on our recent tests
> > > > > I'm afraid the basic components aren't as stable as you think. When the
> > > > > data grow to several TB, sometimes even a single 'dog vdi delete' command
> > > > > crashes our cluster. Even restarting one sheep can cause another sheep,
> > > > > or the whole cluster, to crash. The good news is that, after crashing day
> > > > > after day, the data are still in good shape; no loss has been found yet.
> > > > > No object cache is enabled in our environment.
> > > > > 
> > > > > We have pure gateways + sheep (15 storage nodes).
> > > > > 
> > > > > Thanks,
> > > > > Yuan
> > > > 
> > > > Could you provide logs of the crashed sheeps?
> > > 
> > > We are overwhelmed by the crashes and don't have time to analyze every log,
> > > because we need to set up a working cluster for production this week. At
> > > best we try to work around the crashes rather than find the root causes.
> > > 
> > > We have identified several procedures that crash the cluster:
> > > 
> > > 1 While serving a batch of 'dog vdi delete' commands, one node crashed for
> > >   a reason we don't know; the other nodes went into recovery, but later all
> > >   the nodes crashed. We saw some panics in sheep, but the panics came from
> > >   'xrealloc', 'xvalloc', etc. I checked the recovery path and the delete
> > >   path with valgrind; no leak found yet. Unfortunately, we didn't enable
> > >   'debug' mode, so we can't get a clue as to why memory allocation failed.
> > > 
> > >   One exception: I found that sheep dies easily on restart because of
> > >   collect_cinfo(). The realloc in its worker tries to allocate 40 GB of
> > >   memory and panics the sheep.
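> > > 
> > >   For context: the x-wrappers are the usual die-on-failure allocators, so
> > >   any oversized request takes the whole daemon down. A minimal sketch of
> > >   the pattern (not sheepdog's actual code):
> > > 
> > >   #include <stdlib.h>
> > > 
> > >   /* xrealloc pattern: kill the process instead of returning NULL */
> > >   static void *xrealloc(void *ptr, size_t size)
> > >   {
> > >           void *ret = realloc(ptr, size);
> > > 
> > >           if (!ret && size)
> > >                   abort();        /* this is where sheep panics */
> > >           return ret;
> > >   }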
> > 
> > After some static code reading, I suspect the OOM problem can occur quite
> > easily with a large vdi set (such as 10,000+ vdis; actually not very large,
> > since sheep is said to support 1M vdis).
> > 
> > A single vdi_state entry is 1452 bytes, so the buffer for every vdi state
> > sync grows linearly with the vdi count, into the gigabyte range near the
> > advertised 1M-vdi limit. malloc seems to fail easily for such large
> > allocations and panics the sheep (this conclusion isn't verified yet).
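> > 
> > To put rough numbers on it (my arithmetic, just scaling the 1452-byte entry):
> > 
> >      10,000 vdis x 1452 B ~  14 MB per sync
> >     100,000 vdis x 1452 B ~ 139 MB per sync
> >   1,000,000 vdis x 1452 B ~ 1.4 GB per sync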
> 
> OK, I'll fix it later.
> 
> > 
> > By the way, in our setup of 200+ nodes (though we have only 15 storage
> > nodes), one node needs to transfer 200+ GB of data just at startup after a
> > restart. This will probably be a big scalability problem.
> 
> What is the 200+ GB data?

        rb_for_each_entry(n, &w->nroot, rb) {
                /* We should not fetch vdi_bitmap and copy list from myself */
                if (node_is_local(n))
                        continue;

                sd_debug("try to get vdi bitmap from %s", node_to_str(n));
                ret = get_vdis_from(n);
                if (ret != SD_RES_SUCCESS)
                        /* ... rest elided ... */

Suppose we have 1 GB of vdi state to sync up. get_vdis_from() will read the
full 1 GB of vdi state from every node, so the total data *this node* reads
from remote nodes will be N x 1 GB, where N is the number of remote nodes, no?
We even sync up with pure gateways at the moment. (We are preparing patches to
drop vdi state completely from pure gateways.)
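
That also lines up with the 200+ GB figure I mentioned above: with 200+ remote
nodes each holding on the order of 1 GB of vdi state, a restarting node reads
roughly 200 x 1 GB = 200 GB before it finishes syncing.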

Based on this arithmetic, if we have N nodes syncing up simultaneously, things
get even worse: every node reads from every other node, so the total traffic
grows on the order of N^2. Call it f(N^2); it shows that vdi scalability
depends largely on the size of struct vdi_state.
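
A standalone sketch of that growth (my own illustration, not sheepdog code;
the 1 GB per-node state size is the supposition from above):

        #include <stdio.h>

        int main(void)
        {
                const double state_gb = 1.0;    /* assumed vdi state per node, GB */
                const int sizes[] = { 15, 50, 200 };

                for (int i = 0; i < 3; i++) {
                        int n = sizes[i];
                        /* a restarting node reads the state from every other node */
                        double per_node = (n - 1) * state_gb;
                        /* all n nodes syncing at once -> O(n^2) total traffic */
                        double total = (double)n * (n - 1) * state_gb;
                        printf("n = %3d: %6.0f GB per node, %8.0f GB cluster-wide\n",
                               n, per_node, total);
                }
                return 0;
        }

With n = 200 that is already ~40 TB of wire traffic for one simultaneous sync.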

I still don't know how this happened on our test cluster, which has fewer than
10,000 vdis (probably several thousand). But we found some xrealloc() calls
trying to allocate 40 GB of memory and panicking the sheep during a sheep
restart, mixed with recovery.

Thanks,
Yuan


