Strange $upstream_response_time latency spikes with reverse proxy
jay at kodewerx.org
Mon Mar 18 21:19:26 UTC 2013
On Sun, Mar 17, 2013 at 4:42 AM, Maxim Dounin <mdounin at mdounin.ru> wrote:
> On "these hosts"? Note that listen queue aka backlog size is
> configured in _applications_ which call listen(). At a host level
> you may only configure somaxconn, which is maximum allowed listen
> queue size (but an application may still use anything lower, even
> just 1).
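To restate the distinction above in code: a minimal Python sketch (all values illustrative) showing that the backlog is a per-socket value chosen by the application at listen() time, which the kernel silently caps at net.core.somaxconn:

```python
import socket

# Host-wide cap: an application's requested backlog is silently
# clamped to this value by the kernel.
with open("/proc/sys/net/core/somaxconn") as f:
    somaxconn = int(f.read().strip())

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))        # ephemeral port, just for illustration
srv.listen(8)                     # the *application* picks the backlog here
effective = min(8, somaxconn)     # what the kernel actually uses
print(f"requested backlog 8, somaxconn {somaxconn}, effective {effective}")
srv.close()
```

So two applications on the same host can have very different listen queues regardless of how high somaxconn is set.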
"These hosts" means we have a lot of servers in production right now, and
they all exhibit the same issue. It hasn't been a showstopper, but it's
been occurring for as long as anyone can remember. The total number of
upstream servers on a typical day is 6 machines (each running 3 service
processes), and hosts running nginx account for another 4 machines. All of
these are Ubuntu 12.04 64-bit VMs running on AWS EC2 m3.xlarge instances.
I was under the impression that /proc/sys/net/ipv4/tcp_max_syn_backlog was
for configuring the maximum queue size on the host. It's set to 1024 here,
and increasing the number doesn't change the frequency of the missed
connections.
/proc/sys/net/core/somaxconn is set to 500,000.
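For reference, these are two different limits: tcp_max_syn_backlog bounds the SYN (half-open) queue, while somaxconn caps the accept (fully established) queue that listen()'s backlog is checked against, so raising the former does not enlarge the latter. A quick sketch to read both (Linux /proc paths):

```python
# The two host-level limits being discussed. Raising the SYN queue
# limit does not change the accept-queue cap, and vice versa.
def sysctl(path):
    with open(path) as f:
        return int(f.read().split()[0])

syn_backlog = sysctl("/proc/sys/net/ipv4/tcp_max_syn_backlog")  # SYN queue
somaxconn = sysctl("/proc/sys/net/core/somaxconn")              # accept queue cap
print(f"SYN queue limit: {syn_backlog}, accept queue cap: {somaxconn}")
```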
> Make sure to check actual listen queue sizes used on listen
> sockets involved. On Linux (you are using Linux, right?) this
> should be possible with "ss -nlt" (or "netstat -nlt").
According to `ss -nlt`, Send-Q on these ports is 128, and Recv-Q on all
ports is 0. I don't know what this means for Recv-Q; does it use a default?
And would the default be 1024?
But according to `netstat -nlt`, both queues are 0?
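For sockets in LISTEN state, `ss` repurposes these columns: Recv-Q is the current number of connections waiting in the accept queue, and Send-Q is the configured backlog (after somaxconn clamping), so 128 here is the effective backlog rather than a default. A toy parse of one output line (the sample string is made up):

```python
# Interpret one line of `ss -nlt` output for a LISTEN socket.
# Recv-Q = connections currently waiting to be accept()ed;
# Send-Q = the effective listen backlog for that socket.
sample = "LISTEN  0  128  127.0.0.1:8080  *:*"
state, recv_q, send_q, local, peer = sample.split()
print(f"{local}: backlog={send_q}, currently queued={recv_q}")
```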
> > > 2) Some other queue in the network stack is exhausted. This
> > > might be nontrivial to track (but usually possible too).
> > This is interesting, and could very well be it! Do you have any
> > suggestions on where to start looking?
> I'm not a Linux expert, but quick search suggests it should be
> possible with dropwatch, see e.g. here:
Thanks for the tip! I'll take some time to explore this further. And
before anyone asks: I'm not using iptables or netfilter, which appears to
be a common cause of TCP overhead in similar investigations.
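Before reaching for dropwatch, a cheaper first check is the kernel's own listen-queue counters: TcpExt ListenOverflows and ListenDrops increment when connections are dropped because the accept queue is full. A sketch assuming the standard Linux /proc/net/netstat layout (header/value line pairs):

```python
# Read TcpExt counters from /proc/net/netstat. ListenOverflows and
# ListenDrops count connections dropped due to a full listen queue,
# a prime suspect for intermittent upstream latency spikes.
def tcp_ext_counters():
    with open("/proc/net/netstat") as f:
        lines = f.read().splitlines()
    # The file is pairs of lines: a header of names, then the values.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            keys = header.split()[1:]
            vals = [int(v) for v in values.split()[1:]]
            return dict(zip(keys, vals))
    return {}

c = tcp_ext_counters()
print("ListenOverflows:", c.get("ListenOverflows", 0))
print("ListenDrops:", c.get("ListenDrops", 0))
```

If these counters climb during a latency spike, the backlog (not some deeper stack queue) is the place to look.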