Upstream max_fails, fail_timeout and proxy_read_timeout

Fri Nov 16, 2012


On Fri, Nov 16, 2012 at 09:15:01AM -0500, pliljenberg wrote:

> We're using nginx as a loadbalancer and we're seeing some strange behaviour
> when one of our backend servers takes a long time to respond to a request.
> We have a configuration like this:
> upstream handlehttp {
>         ip_hash;
>         server XXX max_fails=3 fail_timeout=30s;
>         server YYY max_fails=3 fail_timeout=30s;
> }
> server {
>  location / {
>     try_files $uri @backend;
>   }
>   location @backend {
>     proxy_pass http://handlehttp;
>     proxy_set_header Host $host;
>     proxy_set_header X-Real-IP $remote_addr;
>     proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>     proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
>     proxy_read_timeout 300;
>   }
> }
> What we thought we had configured was:
>   If one backend server fails more than 3 times within 30 seconds it would
> be considered disabled and all requests would be sent to the other backend
> server (with the original server getting requests again after 30 seconds).

This is what's expected.  Note, though, that once a problem has 
been detected, things are handled a bit differently; see below.

> What we're actually seeing is that if a request takes 300+ seconds, the
> backend is immediately set as disabled and all further requests are sent to
> the other backend...
> Are we missing something or is this the correct behaviour for nginx?

Are you looking at the normally working backend server, or a 
server which was already considered down?

Note that since nginx 1.1.6 at least one request per worker has 
to succeed before "3 times within 30 seconds" starts to apply 
again.  From the 1.1.6 changes:

    *) Change: if a server in an upstream failed, only one request will be
       sent to it after fail_timeout; the server will be considered alive if
       it will successfully respond to the request.
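
To make this concrete, here is an annotated sketch of the configuration from the question. Only the comments are new. XXX and YYY are the placeholders from the original post, and the comments describe the behaviour as I read the change quoted above, so treat them as an interpretation rather than a specification.

```nginx
upstream handlehttp {
    ip_hash;

    # Normal operation: 3 unsuccessful attempts within 30 seconds mark the
    # server as unavailable for the next 30 seconds.
    #
    # Once it has been marked unavailable (nginx 1.1.6 and later): after
    # fail_timeout only one request is sent to it.  If that request
    # succeeds, the server is considered alive again and the
    # "3 fails within 30s" accounting resumes.  If it fails, for example
    # by running into proxy_read_timeout, the server stays disabled
    # without needing three new failures.
    #
    # This state is kept per worker process.
    server XXX max_fails=3 fail_timeout=30s;
    server YYY max_fails=3 fail_timeout=30s;
}

server {
    location / {
        try_files $uri @backend;
    }

    location @backend {
        proxy_pass http://handlehttp;

        # error, timeout and invalid_header count as unsuccessful attempts
        # towards max_fails; a request that waits longer than
        # proxy_read_timeout for the response is such a timeout.
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_read_timeout 300;
    }
}
```

If slow responses should not take a backend out of rotation at all, one option is max_fails=0 on the server lines, which disables the failure accounting; whether that trade-off is acceptable depends on the setup.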

Maxim Dounin

