[stgt] [BUG] Tgt-1.0.8 exited unexpectedly

Tue Sep 28 05:31:00 CEST 2010

On Mon, 27 Sep 2010 21:58:04 +0900 (JST)
Hirokazu Takahashi <taka at valinux.co.jp> wrote:

> > Yeah, but even a real disk could return bogus data (that is, silent data
> > corruption). So this issue (returning bogus data) is about the
> > possibility.
> 
> Modern disks and HBAs can detect bogus data in most cases, but

Are you talking about SCSI DIF or high-end storage systems that use
checksumming internally?

I'm not sure that modern SATA disks can detect such failures.

> there are still possibilities. Yes.
> 
> > In addition, as you know, recent file systems can handle such
> > failures.
> 
> Yes, I know some filesystems have such a feature. But there is no point
> in returning bogus data instead of an EIO error.

Yeah, but returning EIO in such cases makes the implementation more
complicated.
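
To show where the extra complexity comes from, here is a minimal
sketch of a layer that can return EIO instead of bogus data. It is
only an illustration under my own assumptions: the ChecksummedStore
class and its in-memory dictionaries are hypothetical and are not
taken from tgt, Sheepdog, or VastSky. The point is that the checksum
has to be kept apart from the data and verified on every READ.

```python
import errno
import zlib


class ChecksummedStore:
    """Hypothetical block layer that keeps a CRC32 per block apart from the data."""

    def __init__(self):
        self.blocks = {}     # block number -> bytes (stands in for the disk)
        self.checksums = {}  # block number -> CRC32, stored separately

    def write(self, lba, data):
        self.blocks[lba] = data
        self.checksums[lba] = zlib.crc32(data)

    def read(self, lba):
        data = self.blocks[lba]
        # Refuse to hand back data that does not match the recorded checksum.
        if zlib.crc32(data) != self.checksums[lba]:
            raise OSError(errno.EIO, "checksum mismatch")
        return data
```

Because the checksum lives outside the block, stale data that is
internally intact still fails the check, as long as the checksum
itself was updated by the newer write.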

> > > VastSky updates all the mirrors synchronously. And only after all
> > > the I/O requests are completed, it tells the owner that the request
> > > is done.
> > 
> > Understood. Sheepdog works in the same way.
> > 
> > How does VastSky detect old data?
> > 
> > If one of the mirror nodes is down (e.g. the node is too busy),
> > does VastSky assign a new node?
> 
> Right.
> VastSky removes the node that seems to be down from the group and
> assigns a new one. After that, no one can access the old one.

How does VastSky store the group information? For example, suppose
VastSky assigns a new node, updates the data on all the replica nodes,
and returns success to the client, and right after that all the nodes
go down due to a power failure. After all the nodes boot up again, can
VastSky still detect the old data?
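
To make the question concrete, here is a minimal sketch of one way to
tell old replicas from current ones: bump an epoch number on every
membership change and record it on each member. This is only my
assumption for illustration, not a description of how VastSky
actually works; the Replica and Group names are hypothetical. The
scheme only helps if both the group epoch and the per-node epoch are
written to stable storage before success is returned to the client,
which is exactly the part I'm asking about.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    epoch: int = 0    # epoch of the last group this node belonged to
    data: bytes = b""


@dataclass
class Group:
    epoch: int = 1
    members: list = field(default_factory=list)

    def __post_init__(self):
        for m in self.members:
            m.epoch = self.epoch

    def replace(self, dead, new):
        # Every membership change starts a new epoch; the removed node
        # keeps the old epoch number it recorded earlier.
        self.epoch += 1
        self.members = [m for m in self.members if m is not dead] + [new]
        for m in self.members:
            m.epoch = self.epoch

    def write(self, data):
        # Synchronous mirroring: update every current member.
        for m in self.members:
            m.data = data

    def is_stale(self, replica):
        return replica.epoch < self.epoch
```

After a full power failure, a node whose recorded epoch is lower than
the group's epoch can be rejected as a READ source even if it is the
only node that came back.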

> > Then if a client issues a READ and all the
> > mirrors are down but the old mirror node is alive, how does VastSky
> > prevent returning the old data?
> 
> About this issue, we have a plan:
> When a node is down and it's not because of a hardware error,
> we will make VastSky try to re-synchronize the node again.

Yeah, that's necessary, especially when each node holds a huge amount
of data. Sheepdog can do that.

> This will be done in a few minutes because VastSky traces all write
> I/O requests to know which sectors of the node aren't synchronized.

How does VastSky store the trace log safely? (I guess that the trace
log is saved on multiple hosts.) Does VastSky update the log on every
WRITE request?
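
For reference, the kind of trace log I have in mind looks roughly
like the sketch below: remember which sectors were written while a
mirror was away, then copy only those sectors back. The DirtyLog
class is hypothetical and keeps everything in memory; I don't know
how VastSky really records this, and a real log would have to be
persisted (and probably replicated) to survive a crash.

```python
class DirtyLog:
    """Hypothetical in-memory log of sectors written while a mirror is down."""

    def __init__(self):
        self.dirty = set()

    def record_write(self, start, count):
        # Called for every WRITE that the missing mirror did not receive.
        self.dirty.update(range(start, start + count))

    def resync(self, source, target):
        # Copy only the sectors that changed, not the whole device.
        for sector in sorted(self.dirty):
            target[sector] = source[sector]
        copied = len(self.dirty)
        self.dirty.clear()
        return copied
```

The cost of the resync is proportional to the number of dirty
sectors, which is why it can finish quickly, but every WRITE now also
has to update the log.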

> And you should know VastSky won't easily give up a node which seems
> to be down. VastSky tries to reconnect the session and even tries to
> use another path to access the node.

Hmm, but it just means that a client I/O request takes a long time.
Even if VastSky doesn't give up, a client (i.e. an application) doesn't
want to wait that long.