[stgt] [PATCH v2] redirect: protect again tgtd process hang as of cluster software hang
FUJITA Tomonori
fujita.tomonori at lab.ntt.co.jp
Fri Mar 18 02:19:08 CET 2011
On Thu, 17 Mar 2011 15:53:39 +0200
Or Gerlitz <ogerlitz at mellanox.com> wrote:
> commit fb0bd886021239aac72bb34a497af0cca946f1e7
> Author: Or Gerlitz <ogerlitz at mellanox.com>
> Date: Thu Mar 17 15:22:30 2011 +0200
>
> redirect: protect again tgtd process hang as of cluster software hang
>
> If the child process spawned to run the redirect callback script hangs, e.g
> as of load/bug in the application which this script is dealing with, tgt
> can hang forever. Protect against that by selecting the fd from which tgt
> is expected to read, for up to 100ms, if the timeout expires, tgt terminates
> the child process and fail the initiator login attempt. While not being the
> ultimate solution, of using tgt_event_add et al. and making the redirect
> code fully asynchronous, this patch adds protection and makes things much
> better in that respect.
>
> Signed-off-by: Alexander Nezhinsky <alexandern at mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz at mellanox.com>
> -----
The temporary solution is fine by me.
> added a comment and few formatting changes
>
> diff --git a/usr/tgtd.c b/usr/tgtd.c
> index 946cd3e..4f71e1e 100644
> --- a/usr/tgtd.c
> +++ b/usr/tgtd.c
> @@ -319,7 +319,24 @@ int call_program(const char *cmd, void (*callback)(void *data, int result),
> eprintf("execv failed for: %s, %m\n", cmd);
> exit(-1);
> } else {
> + struct timeval tv;
> + fd_set rfds;
> + int ret_sel;
> +
> close(fds[1]);
> + /* 0.1 second is okay, as the initiator will retry anyway */
> + do {
> + FD_ZERO(&rfds);
> + FD_SET(fds[0],&rfds);
> + tv.tv_sec = 0;
> + tv.tv_usec = 100000;
> + ret_sel = select(fds[0]+1, &rfds, NULL, NULL, &tv);
> + } while (ret_sel < 0 && errno == EINTR);
> + if (ret_sel <= 0) { /* error or timeout */
> + eprintf("timeout on redirect callback, \
> + terminating child pid %d\n", pid);
> + kill(pid, SIGTERM);
> + }
Why this is necessary before waitpid()?
I thought that we need to make sure that fds[0] is readable (with
a timeout) before calling read() for fds[0].
> do {
> ret = waitpid(pid, &i, 0);
> } while (ret < 0 && errno == EINTR);
> @@ -328,11 +345,13 @@ int call_program(const char *cmd, void (*callback)(void *data, int result),
> close(fds[0]);
> return ret;
> }
> - ret = read(fds[0], output, op_len);
> - if (ret < 0) {
> - eprintf("failed to get the output from: %s\n", cmd);
> - close(fds[0]);
> - return ret;
> + if (ret_sel > 0) {
> + ret = read(fds[0], output, op_len);
> + if (ret < 0) {
> + eprintf("failed to get the output from: %s\n", cmd);
> + close(fds[0]);
> + return ret;
> + }
> }
>
> if (callback)
>
> --
> To unsubscribe from this list: send the line "unsubscribe stgt" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
More information about the stgt
mailing list