[stgt] [PATCH v2] redirect: protect again tgtd process hang as of cluster software hang

Fri Mar 18 02:19:08 CET 2011

On Thu, 17 Mar 2011 15:53:39 +0200
Or Gerlitz <ogerlitz at mellanox.com> wrote:

> commit fb0bd886021239aac72bb34a497af0cca946f1e7
> Author: Or Gerlitz <ogerlitz at mellanox.com>
> Date:   Thu Mar 17 15:22:30 2011 +0200
> 
>     redirect: protect again tgtd process hang as of cluster software hang
> 
>     If the child process spawned to run the redirect callback script hangs, e.g
>     as of load/bug in the application which this script is dealing with, tgt
>     can hang forever. Protect against that by selecting the fd from which tgt
>     is expected to read, for up to 100ms, if the timeout expires, tgt terminates
>     the child process and fail the initiator login attempt. While not being the
>     ultimate solution, of using tgt_event_add et al. and making the redirect
>     code fully asynchronous, this patch adds protection and makes things much
>     better in that respect.
> 
>     Signed-off-by: Alexander Nezhinsky <alexandern at mellanox.com>
>     Signed-off-by: Or Gerlitz <ogerlitz at mellanox.com>
> -----

The temporary solution is fine by me.

> added a comment and few formatting changes
> 
> diff --git a/usr/tgtd.c b/usr/tgtd.c
> index 946cd3e..4f71e1e 100644
> --- a/usr/tgtd.c
> +++ b/usr/tgtd.c
> @@ -319,7 +319,24 @@ int call_program(const char *cmd, void (*callback)(void *data, int result),
>  		eprintf("execv failed for: %s, %m\n", cmd);
>  		exit(-1);
>  	} else {
> +		struct timeval tv;
> +		fd_set rfds;
> +		int ret_sel;
> +
>  		close(fds[1]);
> +		/* 0.1 second is okay, as the initiator will retry anyway */
> +		do {
> +			FD_ZERO(&rfds);
> +			FD_SET(fds[0],&rfds);
> +			tv.tv_sec = 0;
> +			tv.tv_usec = 100000;
> +			ret_sel = select(fds[0]+1, &rfds, NULL, NULL, &tv);
> +		} while (ret_sel < 0 && errno == EINTR);
> +		if (ret_sel <= 0) { /* error or timeout */
> +			eprintf("timeout on redirect callback, \
> +					terminating child pid %d\n", pid);
> +			kill(pid, SIGTERM);
> +		}

Why this is necessary before waitpid()?

I thought that we need to make sure that fds[0] is readable (with
a timeout) before calling read() for fds[0].

>  		do {
>  			ret = waitpid(pid, &i, 0);
>  		} while (ret < 0 && errno == EINTR);
> @@ -328,11 +345,13 @@ int call_program(const char *cmd, void (*callback)(void *data, int result),
>  			close(fds[0]);
>  			return ret;
>  		}
> -		ret = read(fds[0], output, op_len);
> -		if (ret < 0) {
> -			eprintf("failed to get the output from: %s\n", cmd);
> -			close(fds[0]);
> -			return ret;
> +		if (ret_sel > 0) {
> +			ret = read(fds[0], output, op_len);
> +			if (ret < 0) {
> +				eprintf("failed to get the output from: %s\n", cmd);
> +				close(fds[0]);
> +				return ret;
> +			}
>  		}
> 
>  		if (callback)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe stgt" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html