Upstream max_fails, fail_timeout and proxy_read_timeout

Fri Nov 16 16:32:44 UTC 2012

Hello!

On Fri, Nov 16, 2012 at 10:54:51AM -0500, pliljenberg wrote:

> Thanks for the reply.
> 
> >> What we're actually seeing is that if a request takes 300+ seconds, the
> >> backend is immediately set as disabled and all further requests are sent
> >> to the other backend...
> >> Are we missing something or is this the correct behaviour for nginx?
> 
> >Are you looking at the normally working backend server, or a
> >server which was already considered down?
> 
> One server X receives a request which takes 300+ seconds to complete. That
> request gets dropped by nginx due to the read timeout (as expected).
> When this happens, server X is disabled and all upcoming requests are sent
> to server Y instead.
> My interpretation of the configuration was that server X would still get
> requests, since it had only 1 failure (not the 3 configured) during the
> last 30 seconds?

The interesting part is what happens _before_ "one server X receives 
a request...".  Was it working normally and handling other requests?  
Or was it already considered dead, and the request in question was 
the one sent to check whether it's alive?

To illustrate, here is what happens with a server that nginx 
initially treats as working normally (the server on port 9999 is 
dead, the one on 8080 is responding normally; fail_timeout=30s, 
max_fails=3, ip_hash, nginx just started):

2012/11/16 20:23:29 [debug] 35083#0: *1 connect to 127.0.0.1:9999, fd:17 #2
2012/11/16 20:23:29 [debug] 35083#0: *1 connect to 127.0.0.1:8080, fd:17 #3
2012/11/16 20:23:29 [debug] 35083#0: *5 connect to 127.0.0.1:9999, fd:17 #6
2012/11/16 20:23:29 [debug] 35083#0: *5 connect to 127.0.0.1:8080, fd:17 #7
2012/11/16 20:23:30 [debug] 35083#0: *9 connect to 127.0.0.1:9999, fd:17 #10
2012/11/16 20:23:30 [debug] 35083#0: *9 connect to 127.0.0.1:8080, fd:17 #11
2012/11/16 20:23:31 [debug] 35083#0: *13 connect to 127.0.0.1:8080, fd:17 #14
2012/11/16 20:23:31 [debug] 35083#0: *16 connect to 127.0.0.1:8080, fd:17 #17
2012/11/16 20:23:32 [debug] 35083#0: *19 connect to 127.0.0.1:8080, fd:17 #20
2012/11/16 20:23:33 [debug] 35083#0: *22 connect to 127.0.0.1:8080, fd:17 #23
2012/11/16 20:23:34 [debug] 35083#0: *25 connect to 127.0.0.1:8080, fd:17 #26
2012/11/16 20:23:34 [debug] 35083#0: *28 connect to 127.0.0.1:8080, fd:17 #29
2012/11/16 20:23:35 [debug] 35083#0: *31 connect to 127.0.0.1:8080, fd:17 #32

As you can see, the first 3 requests try to reach port 9999 because 
of max_fails=3.  After that, the server is considered down and 
requests go straight to 8080.

On the other hand, once fail_timeout=30s has passed, only one 
request tries to reach 9999:

2012/11/16 20:24:37 [debug] 35083#0: *34 connect to 127.0.0.1:9999, fd:16 #35
2012/11/16 20:24:37 [debug] 35083#0: *34 connect to 127.0.0.1:8080, fd:16 #36
2012/11/16 20:24:38 [debug] 35083#0: *38 connect to 127.0.0.1:8080, fd:16 #39
2012/11/16 20:24:39 [debug] 35083#0: *41 connect to 127.0.0.1:8080, fd:16 #42

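That's because the "normally working server" and "dead server we 
are trying to use again" situations are handled a bit differently: 
in the latter, a single failed attempt is enough for the server to 
be considered down again.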
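
The difference between the two situations can be sketched as a small 
simulation.  This is a simplified model in Python, not nginx's actual 
code: the Peer class and the usable() and handle_request() names are 
hypothetical, peers are always tried in a fixed order (roughly what 
ip_hash does for a single client in the logs above), and a successful 
request is assumed to reset the failure counter.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    alive: bool                  # simulation input: do connects succeed?
    max_fails: int = 3
    fail_timeout: float = 30.0
    fails: int = 0
    checked: float = 0.0         # time of the most recent failure

    def usable(self, now: float) -> bool:
        # Below the limit: a normally working server, always tried.
        if self.fails < self.max_fails:
            return True
        # At or above the limit: considered down until fail_timeout
        # has elapsed since the most recent failure.
        return now - self.checked > self.fail_timeout


def handle_request(peers: list, now: float) -> list:
    """Return the names of the peers tried for one request."""
    tried = []
    for peer in peers:
        if not peer.usable(now):
            continue
        tried.append(peer.name)
        if peer.alive:
            peer.fails = 0       # assumption: success clears the counter
            return tried
        # A failure is counted and the time is recorded.  For a peer
        # already at the limit, this alone marks it down again.
        peer.fails += 1
        peer.checked = now
    return tried


peers = [Peer("9999", alive=False), Peer("8080", alive=True)]
for now in (0, 0, 1, 2, 68, 69):
    print(now, handle_request(peers, now))
```

With these inputs, the first three requests try 9999 and then 8080, 
the request at t=2 goes straight to 8080, and after the 30 second 
window only the single request at t=68 tries 9999 again before the 
server is skipped once more.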

-- 
Maxim Dounin
http://nginx.com/support.html