Strange $upstream_response_time latency spikes with reverse proxy
jay at kodewerx.org
Mon Mar 18 21:19:26 UTC 2013
On Sun, Mar 17, 2013 at 4:42 AM, Maxim Dounin <mdounin at mdounin.ru> wrote:
> On "these hosts"? Note that listen queue aka backlog size is
> configured in _applications_ which call listen(). At a host level
> you may only configure somaxconn, which is maximum allowed listen
> queue size (but an application may still use anything lower, even
> just 1).
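To restate the distinction above in code: a minimal Python sketch (all values illustrative) showing that the backlog is a per-socket value chosen by the application at listen() time, which the kernel silently caps at net.core.somaxconn:

```python
import socket

# Host-wide cap: an application's requested backlog is silently
# clamped to this value by the kernel.
with open("/proc/sys/net/core/somaxconn") as f:
    somaxconn = int(f.read().strip())

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))        # ephemeral port, just for illustration
srv.listen(8)                     # the *application* picks the backlog here
effective = min(8, somaxconn)     # what the kernel actually uses
print(f"requested backlog 8, somaxconn {somaxconn}, effective {effective}")
srv.close()
```

So two applications on the same host can have very different listen queues regardless of how high somaxconn is set.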
"These hosts" means we have a lot of servers in production right now, and
they all exhibit the same issue. It hasn't been a showstopper, but it's
been occurring for as long as anyone can remember. The total number of
upstream servers on a typical day is 6 machines (each running 3 service
processes), and hosts running nginx account for another 4 machines. All of
these are Ubuntu 12.04 64-bit VMs running on AWS EC2 m3.xlarge instances.
I was under the impression that /proc/sys/net/ipv4/tcp_max_syn_backlog was
for configuring the maximum queue size on the host. It's set to 1024 here,
and increasing the number doesn't change the frequency of the missed
connections.
/proc/sys/net/core/somaxconn is set to 500,000.
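For reference, these are two different limits: tcp_max_syn_backlog bounds the SYN (half-open) queue, while somaxconn caps the accept (fully established) queue that listen()'s backlog is checked against, so raising the former does not enlarge the latter. A quick sketch to read both (Linux /proc paths):

```python
# The two host-level limits being discussed. Raising the SYN queue
# limit does not change the accept-queue cap, and vice versa.
def sysctl(path):
    with open(path) as f:
        return int(f.read().split()[0])

syn_backlog = sysctl("/proc/sys/net/ipv4/tcp_max_syn_backlog")  # SYN queue
somaxconn = sysctl("/proc/sys/net/core/somaxconn")              # accept queue cap
print(f"SYN queue limit: {syn_backlog}, accept queue cap: {somaxconn}")
```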
> Make sure to check actual listen queue sizes used on listen
> sockets involved. On Linux (you are using Linux, right?) this
> should be possible with "ss -nlt" (or "netstat -nlt").
According to `ss -nlt`, Send-Q on these ports is 128, and Recv-Q on all
ports is 0. I don't know what this means for Recv-Q; does it use a default?
And would the default be 1024?
But according to `netstat -nlt`, both queues are 0?
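For sockets in LISTEN state, `ss` repurposes these columns: Recv-Q is the current number of connections waiting in the accept queue, and Send-Q is the configured backlog (after somaxconn clamping), so 128 here is the effective backlog rather than a default. A toy parse of one output line (the sample string is made up):

```python
# Interpret one line of `ss -nlt` output for a LISTEN socket.
# Recv-Q = connections currently waiting to be accept()ed;
# Send-Q = the effective listen backlog for that socket.
sample = "LISTEN  0  128  127.0.0.1:8080  *:*"
state, recv_q, send_q, local, peer = sample.split()
print(f"{local}: backlog={send_q}, currently queued={recv_q}")
```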
> > > 2) Some other queue in the network stack is exhausted. This
> > > might be nontrivial to track (but usually possible too).
> > This is interesting, and could very well be it! Do you have any
> > suggestions on where to start looking?
> I'm not a Linux expert, but quick search suggests it should be
> possible with dropwatch, see e.g. here:
Thanks for the tip! I'll take some time to explore this further. And
before anyone asks: I'm not using iptables or netfilter, which appears to
be a common cause of TCP overhead in similar investigations.
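Before reaching for dropwatch, a cheaper first check is the kernel's own listen-queue counters: TcpExt ListenOverflows and ListenDrops increment when connections are dropped because the accept queue is full. A sketch assuming the standard Linux /proc/net/netstat layout (header/value line pairs):

```python
# Read TcpExt counters from /proc/net/netstat. ListenOverflows and
# ListenDrops count connections dropped due to a full listen queue,
# a prime suspect for intermittent upstream latency spikes.
def tcp_ext_counters():
    with open("/proc/net/netstat") as f:
        lines = f.read().splitlines()
    # The file is pairs of lines: a header of names, then the values.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            keys = header.split()[1:]
            vals = [int(v) for v in values.split()[1:]]
            return dict(zip(keys, vals))
    return {}

c = tcp_ext_counters()
print("ListenOverflows:", c.get("ListenOverflows", 0))
print("ListenDrops:", c.get("ListenDrops", 0))
```

If these counters climb during a latency spike, the backlog (not some deeper stack queue) is the place to look.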