[sheepdog-users] Sheepdog 0.9 missing live migration feature

Liu Yuan namei.unix at gmail.com
Wed May 13 14:36:04 CEST 2015


On Wed, May 13, 2015 at 06:57:11PM +0800, Liu Yuan wrote:
> On Wed, May 13, 2015 at 07:22:11PM +0900, Hitoshi Mitake wrote:
> > At Wed, 13 May 2015 17:49:41 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Mon, May 11, 2015 at 09:13:10PM +0900, Hitoshi Mitake wrote:
> > > > On Mon, May 11, 2015 at 8:58 PM, Walid Moghrabi
> > > > <walid.moghrabi at lezard-visuel.com> wrote:
> > > > > Hi,
> > > > >
> > > > >> Sorry for keeping you waiting. I'll backport the patch tonight.
> > > > >
> > > > > You're great :D
> > > > 
> > > > I released v0.9.2_rc0. Please try it out:
> > > > https://github.com/sheepdog/sheepdog/releases/tag/v0.9.2_rc0
> > > > 
> > > > >
> > > > >> Thanks a lot for your help. But I need to say that journaling and
> > > > >> object cache are unstable features. Please don't use them in
> > > > >> production.
> > > > >
> > > > > Too bad :(
> > > > > I was really happy to try this on my setup; I equipped every node with a separate SSD drive on which I wanted to store Sheepdog's journal and/or object cache.
> > > > > Why are these features "unstable"?
> > > > > What are the risks? Under which conditions shouldn't I use them?
> > > > 
> > > > As far as we know, there is a risk of the sheep daemon crashing under heavy load.
> > > > 
> > > > >
> > > > > Unless there is serious risk, I think I'll still give it a try (at least in my crash tests before moving the cluster to production) because it looks promising. Anyway, Sheepdog has not been considered stable so far and I've been using it with real joy since 0.6, even on a production platform, so ... ;)
> > > > >
> > > > > Anyway, just out of my own curiosity, here is what I'm planning for my setup; I'd really appreciate any comments on it:
> > > > >
> > > > > 9 nodes, each with:
> > > > >   - 2 interfaces: one for cluster communication ("main" network) and one dedicated to Sheepdog's replication ("storage" network) with fixed IPs, completely closed and jumbo frames enabled (MTU 9000)
> > > > >   - 3 dedicated 600GB 15k SAS hard drives that are not part of any RAID (standalone drives), which I was thinking of using in MD mode
> > > > >   - 1 SATA SSD drive (on which the OS resides and which I was planning to use for Sheepdog's journal and object cache)
> > > > >
> > > > > So that means a 27-drive cluster that I want to format using erasure coding, but so far I don't really know which settings to configure for this ... I'd like to find a good balance between performance, safety and storage space ... any suggestion is most welcome.
> > > > 
> > > > I don't think there is anything wrong with your configuration. But I suggest
> > > > being as conservative as possible. For example, don't enable
> > > > optimizations (the -n option, for example) if your current configuration can
> > > > provide enough performance. Our internal testing focuses on the basic
> > > > components, so they should be stable enough. But we cannot allocate time
> > > > for testing optional things (testing and benchmarking distributed
> > > > storage is really costly), so the optional things will likely have more bugs
> > > > than the basic components.
> > > > 
> > > 
> > > Sorry for cutting into your conversation, but based on our recent tests, I'm afraid
> > > the basic components aren't as stable as you think. When the data grow to
> > > several TB, our cluster sometimes crashes even on a single 'dog vdi delete'
> > > command. Even restarting one sheep can cause another sheep, or the whole
> > > cluster, to crash. The good side is that, after crashes day after day, the data
> > > are still in good shape; no loss has been found yet. No object cache is enabled in our env.
> > > 
> > > We have pure gateways + sheep (15 nodes).
> > > 
> > > Thanks,
> > > Yuan
> > 
> > Could you provide logs of the crashed sheep daemons?
> 
> We are overwhelmed by the crashes and don't have time to analyze every log, because
> we need to set up a working cluster for production this week. At best we try to
> work around the problems instead of finding the root causes.
> 
> Several procedures that crash the cluster were reported to me:
> 
> 1 While serving a batch of 'dog vdi delete' commands on some vdis, one node crashed
>   for some reason we don't know; the other nodes started to recover, but later all
>   nodes crashed. We saw some panics in sheep, but the panics were raised from
>   'xrealloc', 'xvalloc' etc. I checked the recovery path and the delete path with
>   valgrind; no leak found yet. Unfortunately, we didn't enable 'debug' mode, so we
>   can't get a clue why the memory allocation failed.
> 
>   one exceptoin is that, I found sheep is easy to die for restart because of
>   collect_cinfo(). The realloc in its worker try to allocate 40G memory and panic
>   sheep.

After some static code reading, I suspect this OOM problem can occur very easily
with a large vdi set (such as 10000+ vdis; actually not very large, since sheep
is said to support 1M vdis).

A single vdi_state entry is 1452 bytes, so 10000 vdis need more than 1G of memory
for every vdi state sync. It looks like malloc easily fails for such large
allocations and panics sheep (this conclusion isn't verified yet).

By the way, in our setup of 200+ nodes (though we have only 15 storage nodes),
one node needs to transfer 200+ GB of data just at startup after a restart.
This will probably be a big scalability problem; a rough sketch of the numbers
is below.

Thanks,
Yuan

> 
> 2 We are running iscsi + tgtd + sheepdog, so basically we have 200+ pure
>   gateways and 15 storage nodes. When we restart the storage nodes, in the current
>   design we have to restart all the other gateways (some were shut down, some had
>   crashed for unknown reasons). While we restart a gateway, it might cause another
>   node to crash. Then one crash causes another crash, in a chain in the worst case.
> 
>   By the way, iscsi + tgtd always put volume into readonly becaue some timeouts.
>   we always found 'abort_task_ret' or 'abort_task' in tgtd logs even we turn off
>   timeouts or tune it as high as possible in iscsi.conf.
> 
>   sbd + sheepdog can survive the fio tests, which put volume into readonly with tgtd,
>   so I might confirm that iscsi + tgtd have some timeout problems inherent.
> 
> 3 Others... umm... I might report back with more info.
> 
> Thanks,
> Yuan


