fail_timeout in upstream not respected?

Mon Jan 30 08:32:52 UTC 2017

On Mon, Jan 30, 2017 at 02:41:06AM -0500, plrunner wrote:
> Hi everybody,
> 
> I am running nginx v1.11 and I noticed something pretty weird in my
> error.log.
> 
> I have fail_timeout=1800s along with max_fails=1 in my upstream and
> proxy_next_upstream is set to "error timeout", so I expect an upstream host
> to be taken off the list for 30 minutes just after the first failed
> connection.
> 
> Here is what I unexpectedly get in the error.log
> 
> 2017/01/23 09:49:48 [error] 30676#30676: *2202666 connect() failed (111:
> Connection refused) while connecting to upstream, client: 93.XX.YYY.228,
> server: *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0",
> upstream: "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006
> host: "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 2017/01/23 09:49:48 [warn] 30676#30676: *2202666 upstream server temporarily
> disabled while connecting to upstream, client: 93.XX.YYY.228, server:
> *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0", upstream:
> "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006 host:
> "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 2017/01/23 09:57:53 [error] 30695#30695: *2205681 connect() failed (111:
> Connection refused) while connecting to upstream, client: 93.XX.YYY.228,
> server: *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0",
> upstream: "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006
> host: "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 2017/01/23 09:57:53 [warn] 30695#30695: *2205681 upstream server temporarily
> disabled while connecting to upstream, client: 93.XX.YYY.228, server:
> *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0", upstream:
> "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006 host:
> "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 
> The host is reused after just 8 minutes, instead of 30 minutes.
> 
> Is there anything wrong in my conf or something I forgot to take into
> account?

Without the "zone" directive in the "upstream" block, each worker process
keeps its own view of the state of upstream servers, including the
"max_fails" and "fail_timeout" accounting.
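
This fits the log above: the two failures were logged by different worker
processes (30676 and 30695), so the second worker had not yet seen the
server fail. A minimal sketch of an upstream block with shared state
follows. The group name "backend", the zone name "backend_zone", its 64k
size and the second server address are assumptions for illustration only;
the first server address and the max_fails/fail_timeout values are taken
from the original message.

```nginx
upstream backend {
    # Shared memory zone: keeps the group's configuration and run-time
    # state (including failure counters) common to all worker processes.
    # The name and size here are hypothetical.
    zone backend_zone 64k;

    # Address and parameters as described in the original message.
    server [beaf:beaf:1001:a001::003D:4]:8080 max_fails=1 fail_timeout=1800s;

    # Hypothetical second server, so there is something to fail over to.
    server [2001:db8::2]:8080 max_fails=1 fail_timeout=1800s;
}
```

With the zone in place, a failure recorded by one worker should be visible
to the others, so the server is expected to stay disabled for the whole
fail_timeout period no matter which worker handles the next request.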