Question about failure and fail-over

Maxim Dounin mdounin at mdounin.ru
Thu Jul 18 14:28:08 UTC 2013


Hello!

On Thu, Jul 18, 2013 at 07:10:27AM -0400, Branden Visser wrote:

> Hi all, I have a general question about server failure and failover
> within an upstream group to ensure I understand it correctly.
> 
> Let's say I have the configuration:
> 
> proxy_next_upstream timeout;
> proxy_connect_timeout 5;
> ...
> upstream {
>   127.0.0.1 max_fails=3 fail_timeout=10s
>   127.0.0.2 max_fails=3 fail_timeout=10s
>   127.0.0.3 max_fails=3 fail_timeout=10s
> }
> 
> And then the server 127.0.0.1 starts "hanging" indefinitely on
> connection attempts.
> 
> a) Once 3 connection attempts timeout after 5 seconds on 127.0.0.1, it
> will be marked down. However, during that 5 second timeout, it is
> possible that 30, or N connections / requests may be in process of
> timing out as well, so you may end up with 30 internal connection
> failures as a result of 127.0.0.1's issue. Although they all are
> retried on the next available upstream, 30 end-users noticed a 5
> second hang in their request as a result of waiting for the timeout to
> occur.

Yep.  Use the least_conn balancer to mitigate this kind of backend
problem; see http://nginx.org/r/least_conn.
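
As a minimal sketch of what that could look like with the
configuration from the question (the upstream name "backend" and
the listen port are assumptions made for illustration, and the
fragment is meant to live inside the http block):

```nginx
# Hypothetical upstream name; the original question did not give one.
upstream backend {
    # Prefer the server with the fewest active connections, so a
    # hanging server that accumulates pending connections gets
    # fewer new requests.
    least_conn;

    server 127.0.0.1 max_fails=3 fail_timeout=10s;
    server 127.0.0.2 max_fails=3 fail_timeout=10s;
    server 127.0.0.3 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;  # assumed port

    location / {
        proxy_pass http://backend;

        # Settings from the question: retry on the next server
        # when a connection attempt times out after 5 seconds.
        proxy_next_upstream timeout;
        proxy_connect_timeout 5s;
    }
}
```

This does not remove the 5 second wait for requests that are
already connecting to the hanging server; it only limits how many
new requests are sent there while connections pile up.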

Additionally, it's usually a good idea to make sure your backends
return RST on listen queue overflow.  On most Linux systems the
default seems to be to just drop SYN packets on listen queue
overflow, which will result in an unbounded number of connections
waiting for a timeout.  Enabling
/proc/sys/net/ipv4/tcp_abort_on_overflow might be a good idea; see
here for details:

http://man7.org/linux/man-pages/man7/tcp.7.html 

> b) After 10 seconds, if the server is still hanging, a) basically
> repeats in the same manner.

No.  As of nginx 1.1.6, only a single request will be routed to the
server after fail_timeout.  The server will be considered up only
if it is able to respond to this request.

> Is this correct? If I add "keepalive 64;" into the upstream block,
> does the above scenario change? If a server is marked down as a result
> of no new connections being able to connect, are all persistent
> connections destroyed as well?

Balancing doesn't know anything about cached connections.  If a
server is marked down, no attempts to use cached connections to
the server will be made, and eventually all connections to the
server will be replaced with connections to other servers, as per
the LRU algorithm.
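
For reference, a sketch of the upstream block with "keepalive 64;"
added (again, the upstream name "backend" is an assumption, not
something from the original question):

```nginx
# Hypothetical upstream name, for illustration only.
upstream backend {
    server 127.0.0.1 max_fails=3 fail_timeout=10s;
    server 127.0.0.2 max_fails=3 fail_timeout=10s;
    server 127.0.0.3 max_fails=3 fail_timeout=10s;

    # Keep up to 64 idle connections to upstream servers cached
    # per worker process.  Balancing still picks the server first;
    # the cache is only consulted for the server that was picked.
    keepalive 64;
}

server {
    listen 80;  # assumed port

    location / {
        proxy_pass http://backend;

        # Needed for upstream keepalive with HTTP backends:
        # use HTTP/1.1 and clear the Connection header.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

If a balancer directive such as least_conn is used as well, it has
to appear before the keepalive directive in the upstream block.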

-- 
Maxim Dounin
http://nginx.org/en/donation.html


