fail_timeout in upstream not respected?

Mon Jan 30 12:45:59 UTC 2017

Hello!

On Mon, Jan 30, 2017 at 02:41:06AM -0500, plrunner wrote:

> Hi everybody,
> 
> I am running nginx v1.11 and I noticed something pretty weird in my
> error.log.
> 
> I have fail_timeout=1800s along with max_fails=1 in my upstream and
> proxy_next_upstream is set to "error timeout", so I expect an upstream host
> to be taken off the list for 30 minutes just after the first failed
> connection.
> 
> Here is what I unexpectedly get in the error.log
> 
> 2017/01/23 09:49:48 [error] 30676#30676: *2202666 connect() failed (111:
> Connection refused) while connecting to upstream, client: 93.XX.YYY.228,
> server: *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0",
> upstream: "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006
> host: "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 2017/01/23 09:49:48 [warn] 30676#30676: *2202666 upstream server temporarily
> disabled while connecting to upstream, client: 93.XX.YYY.228, server:
> *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0", upstream:
> "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006 host:
> "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 2017/01/23 09:57:53 [error] 30695#30695: *2205681 connect() failed (111:
> Connection refused) while connecting to upstream, client: 93.XX.YYY.228,
> server: *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0",
> upstream: "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006
> host: "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 2017/01/23 09:57:53 [warn] 30695#30695: *2205681 upstream server temporarily
> disabled while connecting to upstream, client: 93.XX.YYY.228, server:
> *.foobar.com, request: "GET /generic/api/v1/tag/1006 HTTP/2.0", upstream:
> "http://[beaf:beaf:1001:a001::003D:4]:8080/generic/api/v1/tag/1006 host:
> "cy1.foobar.com", referrer: "https://web.foobar.com/"
> 
> The host is reused after just 8 minutes, instead of 30 minutes.
> 
> Is there anything wrong in my conf or something I forgot to take into
> account?

As can be seen from "30676#" and "30695#", these messages are from
different worker processes.  By default each worker process keeps
its own run-time state for the upstream servers, so a failure seen
by one worker does not disable the server in the others.  If you want
worker processes to use shared state, you can configure this using
the "zone" directive in the "upstream" block, see details here:

http://nginx.org/en/docs/http/ngx_http_upstream_module.html#zone
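
A minimal sketch of such a configuration, based on the parameters
mentioned in the question.  The upstream name "backend", the zone name
"backend_zone" and the 64k size are assumptions made for illustration
only, adjust them to your setup:

```nginx
upstream backend {
    # Shared memory zone: the group's run-time state (including the
    # failure counters used by max_fails / fail_timeout) is kept here
    # and shared between all worker processes.
    # Zone name and size are illustrative assumptions.
    zone backend_zone 64k;

    server [beaf:beaf:1001:a001::003D:4]:8080 max_fails=1 fail_timeout=1800s;
    # ... other servers of the group ...
}

server {
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout;
    }
}
```

With a zone configured, a failure recorded by one worker process should
be visible to the others, so the server is expected to stay disabled
for the whole fail_timeout period regardless of which worker handles
the next request.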

-- 
Maxim Dounin
http://nginx.org/