Load balancing server retry interval

Wed Jan 5 07:59:15 MSK 2011

Hello!

On Tue, Jan 04, 2011 at 10:41:13AM -0500, ehudros2 wrote:

> Hi everyone,
> I'm setting up a load balancer with 2 backend servers using the
> following configuration:
> [code]
> upstream backend {
>         ip_hash;
>         server 10.0.0.1:3000 max_fails=3 fail_timeout=20s;
>         server 10.0.0.2:3000 max_fails=3 fail_timeout=20s;
> }
> 
> location / {
>         proxy_set_header Host $host;
>         proxy_set_header X-Real-IP $remote_addr;
>         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
>         proxy_pass http://backend;
>         proxy_read_timeout 20;
>         proxy_connect_timeout 10;
>         proxy_pass_header Set-Cookie;
> }
> [/code]
> 
> When I bring one of the servers down for testing, I see behavior that
> is a bit odd. The first 3 requests time out and are passed to the
> second server.
> After that, for 20 seconds, all requests go straight to the
> second server (as the first one is flagged as down).
> However, once these 20 seconds have passed, the server's status is reset
> and all requests are directed to it until it fails again.
> 
> The documentation says fail_timeout is also used as the interval before
> another check is made, and that leads me to believe this behavior is by
> design. The problem is that once the 20 seconds have passed (and with
> ip_hash enabled), all users who were assigned to the failed server are
> directed to it again, and their requests hang until they time out again.
> 
> Am I supposed to use a larger fail_timeout to keep the server marked as
> failed for longer? It seems odd that nginx assumes a server is back up and
> waits until it fails again (instead of checking in the background whether
> it's back online and only then marking it healthy).

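Yes, this is how it currently works: once fail_timeout expires, the
server is considered available again and is retried with live requests.
It's not a very good algorithm (an awful one, actually), and it should
be changed somehow.  Providing a good patch may be beneficial.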
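
To illustrate the behavior described above, here is a minimal sketch in
Python. It is a simplified model built only from the description in this
thread, not nginx's actual code: the Peer class, its fields, and the
failed_requests() helper are all hypothetical names. It assumes one
request per second is hashed to a server that stays down.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Hypothetical model of one upstream server's passive failure state."""
    max_fails: int = 3          # failures needed to mark the server as down
    fail_timeout: float = 20.0  # seconds the server stays marked as down
    fails: int = 0
    last_fail: float = 0.0      # time of the most recent failure

    def available(self, now: float) -> bool:
        # Once fail_timeout has elapsed since the last failure, the
        # failure counter is forgotten and the server is tried again.
        if self.fails and now - self.last_fail >= self.fail_timeout:
            self.fails = 0
        return self.fails < self.max_fails

    def record_failure(self, now: float) -> None:
        self.fails += 1
        self.last_fail = now


def failed_requests(peer, duration, step=1.0, server_is_up=False):
    """Return the times at which a request was sent to the peer and failed."""
    failures = []
    t = 0.0
    while t < duration:
        # A request only reaches the peer while it is considered available;
        # otherwise it goes straight to the other server.
        if peer.available(t) and not server_is_up:
            peer.record_failure(t)
            failures.append(t)
        t += step
    return failures


if __name__ == "__main__":
    print(failed_requests(Peer(), 60))
    print(failed_requests(Peer(fail_timeout=300.0), 60))
```

Under these assumptions, max_fails=3 and fail_timeout=20s give failed
(slow) requests at seconds 0-2, 22-24 and 44-46 of the first minute,
while fail_timeout=300s leaves only the first three. A larger
fail_timeout therefore makes the retries rarer, but it also delays the
return of a server that has actually recovered.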

Maxim Dounin