[Sheepdog] sheepdog Digest, Vol 23, Issue 11

Fri Aug 12 13:32:56 CEST 2011

At Wed, 10 Aug 2011 16:16:51 -0300,
Gustavo Callou wrote:
> Dear Kazutaka,
> 
> 
>  Just to let you know that I am working together with Rubens testing the
> Sheepdog environment.
> 
> We reproduced the issue in which all nodes of the Sheepdog cluster
> crashed after a power cut.

Thanks for your report!

> 
> The test performed is simple. We configured two machines running Sheepdog
> with the newer development version (kazum-sheepdog-v0.2.3-35-g31f9a75.tar.gz)
> available at
> https://github.com/kazum/sheepdog/tree/31f9a75f828634681261144c406eb4ca359dd90c.
> In addition, Ubuntu Server (ubuntu-10.04.3-server-i386.iso) was
> installed in the Alice VDI. Fig 1 shows this configuration.
> After starting qemu with Alice's OS, we turned off the two
> machines at the same time.
> 
> 
>  The Sheepdog results obtained when we turned on sheep2 are shown in
> Fig 2. After that, on the other machine (sheep1), we tried to start
> Sheepdog without success, as shown in Fig 3.

It seems that the epoch of sheep1 is 2.  I think if you had started
sheep1 first and added sheep2 second, Sheepdog would have worked
correctly.  Probably sheep2 was turned off a bit earlier than
sheep1, so there was an epoch in which sheep1 was the only node in
the cluster.
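
To make the ordering argument concrete, here is a minimal sketch of the
idea in Python. It is only a simplified model, not Sheepdog's actual
code: the Node class, the startup_order helper, and the epoch histories
(epoch 1 with both nodes, epoch 2 with sheep1 alone) are assumptions
chosen to match the scenario described in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class Node:
    """Hypothetical model of a node and the epoch history it has stored."""
    name: str
    # epoch number -> names of the nodes that were members at that epoch
    epochs: Dict[int, FrozenSet[str]] = field(default_factory=dict)

    @property
    def latest_epoch(self) -> int:
        # A node with no stored history (e.g. a wiped directory) reports 0.
        return max(self.epochs, default=0)


def startup_order(nodes: List[Node]) -> List[Node]:
    """Order nodes so the one holding the newest epoch is started first."""
    return sorted(nodes, key=lambda n: n.latest_epoch, reverse=True)


# Assumed history: sheep2 lost power slightly earlier, so sheep1 recorded
# one more epoch in which it was the only member.
both = frozenset({"sheep1", "sheep2"})
sheep1 = Node("sheep1", {1: both, 2: frozenset({"sheep1"})})
sheep2 = Node("sheep2", {1: both})

order = [n.name for n in startup_order([sheep2, sheep1])]
print(order)
```

In this model sheep1 sorts first because it holds the newer epoch, which
is the start order suggested above.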

> 
> 
>  We performed another test, in which we shut down the cluster (both machines);
> deleted all content of the Sheepdog storage directory on sheep1 (the machine
> that was running qemu); started Sheepdog on sheep2; and, after
> it had recovered, started Sheepdog on sheep1. Although the cluster spent
> some time synchronizing sheep1, neither machine was able
> to boot the OS from Alice again, as shown in Fig 4, since Alice's VDI was
> no longer available.

After rebooting, Sheepdog expects all machines to come back
without data loss.  If you cleaned the data on sheep1, you cannot
restart Sheepdog automatically.

Both problems are caused by epoch inconsistency in Sheepdog.
The 'collie cluster check' command, which will be available soon, should
solve them, I think.

Thanks,

Kazutaka

> 
> 
>  Do you have any suggestion about what may be causing that problem? Also,
> I would like to know whether the configuration used for the experiment was OK.
> 
> 
>  Best regards,
> 
> Gustavo
> 
> 
> On Sat, Aug 6, 2011 at 7:00 AM, <sheepdog-request at lists.wpkg.org> wrote:
> 
> > Today's Topics:
> >
> >   1. Re: Power supply interruption crashes data stored in      sheepdog
> >      (Fernando Frediani (Qube))
> >   2. Re: Power supply interruption crashes data stored in      sheepdog
> >      (Rubens Matos)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Fri, 5 Aug 2011 10:52:14 +0000
> > From: "Fernando Frediani (Qube)" <fernando.frediani at qubenet.net>
> > To: 'Rubens Matos' <rubens.matos at gmail.com>
> > Cc: "'sheepdog at lists.wpkg.org'" <sheepdog at lists.wpkg.org>
> > Subject: Re: [Sheepdog] Power supply interruption crashes data stored
> >        in      sheepdog
> > Message-ID:
> >        <
> > 6EC7489C49252F4F823EAE91E3A939391C4F098E at QUBE-TR2-EXC01.qube.qubenet.net>
> >
> > Content-Type: text/plain; charset="iso-8859-1"
> >
> > Rubens,
> >
> > Do you mean you recovered it?
> > What did you do to get it working again?
> >
> > Thanks
> >
> > Fernando
> >
> > From: sheepdog-bounces at lists.wpkg.org [mailto:
> > sheepdog-bounces at lists.wpkg.org] On Behalf Of Rubens Matos
> > Sent: 05 August 2011 04:12
> > To: MORITA Kazutaka
> > Cc: sheepdog at lists.wpkg.org
> > Subject: Re: [Sheepdog] Power supply interruption crashes data stored in
> > sheepdog
> >
> > I have already cleaned the damaged cluster. I guess it is possible to
> > reproduce the error, and then capture the output from collie cluster info.
> >
> > Anyway, the upcoming "collie cluster check" command is very good news.
> >
> > Rubens de Souza Matos Júnior
> >
> > 2011/8/4 MORITA Kazutaka <morita.kazutaka at lab.ntt.co.jp>
> > At Thu, 4 Aug 2011 16:28:50 -0300,
> > Rubens Matos wrote:
> > > Hi everyone,
> > >
> > > I am testing sheepdog and everything was working, but after an
> > > interruption in power supply, that affected all nodes, the cluster was
> > > damaged so that the nodes didn't join again, and I can't recover the
> > > data that was stored in a VDI.
> > >
> > > Have you already noticed a similar behavior? Is sheepdog protected
> > > against such kind of failure, in which all nodes are abruptly
> > > disconnected?
> >
> > Sheepdog should handle the total node failure, but I think some bugs
> > still exist in it.  The error handling has not been tested enough.
> >
> > If you have not cleaned the damaged cluster yet, can you give me the
> > output of "collie cluster info" on all the nodes?  That information
> > would help find the cause of the error.
> >
> > I'm implementing a "collie cluster check" command, which works like
> > fsck for Sheepdog.  This command would be helpful for recovering the
> > damaged cluster.
> >
> >
> > Thanks,
> >
> > Kazutaka
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Fri, 5 Aug 2011 09:46:39 -0300
> > From: Rubens Matos <rubens.matos at gmail.com>
> > To: "Fernando Frediani (Qube)" <fernando.frediani at qubenet.net>
> > Cc: "sheepdog at lists.wpkg.org" <sheepdog at lists.wpkg.org>
> > Subject: Re: [Sheepdog] Power supply interruption crashes data stored
> >        in      sheepdog
> > Message-ID:
> >        <CAP2mMMntGe1s1Jq5=suiyKUS4shruc0Dx61xgWy1ZdGLhY_Qeg at mail.gmail.com
> > >
> > Content-Type: text/plain; charset="iso-8859-1"
> >
> > Fernando, I didn't recover the stored data. I removed the directory and
> > started sheepdog again.
> >
> > Rubens
> >
> > ------------------------------
> >
> > _______________________________________________
> > sheepdog mailing list
> > sheepdog at lists.wpkg.org
> > http://lists.wpkg.org/mailman/listinfo/sheepdog
> >
> >
> > End of sheepdog Digest, Vol 23, Issue 11
> > ****************************************
> >
> 
> 
> 
> -- 
> PhD Candidate in Computer Science
> Federal University of Pernambuco
> http://www.cin.ufpe.br/~grac
> http://www.modcs.org
> [2 Figures.zip <application/zip (base64)>]