Nginx as Load Balancer Connection Issues

Mon Jan 23 23:00:20 UTC 2012

gtuhl Wrote:
-------------------------------------------------------
> Initially we were seeing a ton of "connect()
> failed (110: Connection timed out)" errors, about
> one every couple of seconds.  I added these to
> sysctl.conf and that seemed to solve the problem:
> 
> net.ipv4.tcp_syncookies = 1
> net.ipv4.tcp_fin_timeout = 20
> net.ipv4.tcp_max_syn_backlog = 20480
> net.core.netdev_max_backlog = 4096
> net.ipv4.tcp_max_tw_buckets = 400000
> net.core.somaxconn = 4096
> 
> Now things generally run fine, but every once in
> a while we get a huge burst of "upstream
> prematurely closed connection while reading
> response header from upstream" followed by a "no
> live upstreams".  Again, no apparent load on the
> machines involved.  These bursts only last a
> minute or so.  We also still get an occasional
> "connect() failed (110: Connection timed out)" but
> they are far less frequent, perhaps 1 or 2 per
> hour.
> 

On looking at this again recently, we made two adjustments that
eliminated the connection issues completely:

net.nf_conntrack_max = 262144
net.ipv4.ip_local_port_range = 1024  65000

After making those two changes things became quite stable.  However, we
still have massive numbers of TIME_WAIT connections both on the nginx
machine and on the upstream apache machines.

The nginx machine is accepting roughly 1000 requests/s, and has 40,000
connections in TIME_WAIT.
The apache machines are each accepting roughly 250 requests/s, and have
15,000 connections in TIME_WAIT.
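
For anyone wanting to reproduce counts like these, here is a minimal
sketch of how the per-state socket totals could be tallied on Linux.  It
assumes the usual /proc/net/tcp layout (a header line, then one socket
per line with the state as a hex code in the fourth column); the function
name count_tcp_states is just an illustrative choice, not something from
nginx or any standard tool.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Return a Counter of socket states from /proc/net/tcp-style text."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        state = TCP_STATES.get(fields[3].upper(), "UNKNOWN")
        counts[state] += 1
    return counts
```

On a live machine this could be fed with the contents of /proc/net/tcp
(IPv4 only; /proc/net/tcp6 would need a second pass), and the TIME_WAIT
entry of the result compared against the figures above.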

We tried setting net.ipv4.tcp_tw_reuse to 1 and restarting networking.
That did not cause any trouble, but it also didn't drop the TIME_WAIT
count.  I have read that net.ipv4.tcp_tw_recycle is dangerous, but we may
try it if others have had good experiences.
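
As far as I understand it, tcp_tw_reuse only lets outgoing connections
reuse a TIME_WAIT socket, and only when TCP timestamps are enabled, so it
would not be expected to lower the TIME_WAIT count itself.  A rough
sketch of how the two local settings could be checked follows; the helper
names and the root parameter are hypothetical, added only so the check
can be pointed at a test directory.

```python
import os


def read_sysctl(name, root="/proc/sys"):
    """Read a sysctl value by dotted name, e.g. net.ipv4.tcp_tw_reuse."""
    # Dotted sysctl names map onto paths below /proc/sys.
    path = os.path.join(root, *name.split("."))
    with open(path) as f:
        return f.read().strip()


def tw_reuse_preconditions_met(root="/proc/sys"):
    """True if tcp_tw_reuse and tcp_timestamps are both non-zero locally.

    This only inspects the local machine; the peer also has to use
    timestamps for reuse to apply to a given connection.
    """
    reuse = read_sysctl("net.ipv4.tcp_tw_reuse", root)
    timestamps = read_sysctl("net.ipv4.tcp_timestamps", root)
    return reuse != "0" and timestamps != "0"
```

Calling tw_reuse_preconditions_met() with no arguments reads the live
values on a Linux host.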

Is there a way to have these cleaned up more quickly?  My concern is
that, even with the expanded ip_local_port_range, 40k connections in
TIME_WAIT is cutting it rather close.  Before we bumped
ip_local_port_range, the whole system was falling down right as the
TIME_WAIT count approached 32k.  Is it normal for nginx to cause this
many TIME_WAIT connections?  If we're only doing 1k requests/s and
nearly exhausting the available port range, what would sites with
heavier volume do?
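
To put rough numbers on that concern, here is a back-of-the-envelope
sketch.  It assumes Linux keeps a socket in TIME_WAIT for a fixed 60
seconds (the kernel's TCP_TIMEWAIT_LEN, which as far as I know is not
changed by tcp_fin_timeout) and that each closed connection leaves one
TIME_WAIT entry on the side that closes first.  It also treats the port
range as one shared pool, which is a simplification.  The function names
are illustrative only.

```python
# Assumption: fixed TIME_WAIT duration on Linux, in seconds.
TIME_WAIT_SECONDS = 60


def ephemeral_ports(low, high):
    """Number of ports in an inclusive ip_local_port_range."""
    return high - low + 1


def steady_state_time_wait(closes_per_second, seconds=TIME_WAIT_SECONDS):
    """Expected TIME_WAIT count if this many sockets are closed per second."""
    return closes_per_second * seconds


def implied_close_rate(time_wait_count, seconds=TIME_WAIT_SECONDS):
    """Close rate (per second) that would sustain a given TIME_WAIT count."""
    return time_wait_count / seconds


if __name__ == "__main__":
    print(ephemeral_ports(1024, 65000))        # 63977
    print(steady_state_time_wait(1000))        # 60000
    print(round(implied_close_rate(40000)))    # 667
```

Under those assumptions, 40,000 sockets in TIME_WAIT corresponds to
roughly 667 closes per second on the nginx machine, and a full 1000
closes per second would sit at about 60,000, just under the 63,977 ports
in the expanded range.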

Posted at Nginx Forum: http://forum.nginx.org/read.php?2,220894,221550#msg-221550