Server very delayed in sending SYN/ACK

Sun Sep 4 00:06:09 UTC 2016

Hello,
I have run into a very interesting issue.  I am replacing a set of nginx
reverse proxy servers with a new set running an updated OS/nginx. These
nginx servers front a popular API that's mostly used by mobile apps, but
also a website that's hosted on a nearby subnet. I put the new servers into
service last night, and this morning as traffic picked up (only a couple
thousand requests per second), I got alerts from my DNS provider that
requests to the new server were starting to time out in the Connect phase.
I hopped into New Relic, and I could see tons of requests from my website
to the nginx reverse proxy timing out after hitting our limit of 10s. I did
some curl requests with timing information, and I could see long times only
in the time_connect value, confirming the issue was only in the connection
phase. I hopped on the new nginx server and started a packet capture
filtered to a machine on a nearby subnet, did the curl from there, saw it
take 9+ seconds in the connect phase, stopped the packet capture, and
moved the traffic over to my old setup. No issues over there.
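
For anyone who wants to reproduce the connect-phase measurement without
curl, here is a minimal sketch that times only the TCP handshake. The
function name and the host in the usage note are hypothetical; it assumes
the target accepts plain TCP connections on the given port.

```python
import socket
import time


def tcp_connect_seconds(host, port, timeout=10.0):
    """Return how long the TCP three-way handshake took, in seconds."""
    start = time.monotonic()
    # create_connection() returns once connect() succeeds, which on the
    # client side means the SYN/ACK has arrived.
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start
```

Calling something like tcp_connect_seconds("proxy.example.internal", 443)
in a loop should show the same multi-second spikes if the delay really is
in the connection phase; a socket.timeout (OSError) is raised if the
handshake does not finish within the timeout.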

Here's everything I know/think is relevant:

* In the packet capture from the server, I see the SYN packet come in, then
3 more retransmits of that same SYN come in before the server sends back the
SYN/ACK. To me this indicates the issue is on the kernel or nginx side.

* There's absolutely no slowdown in the backends as measured from the same
nginx server.

* There's nothing in the nginx error log

* There's nothing from the kernel in dmesg when this is happening

* NIC duplex is fine, no queue drops reported by ethtool -S (but, again, it
doesn't seem like a networking issue, we got the SYNs just fine, we just
didn't send the SYN/ACK)

* I tried to artificially load test afterwards using ab and loader.io,
doing 3x as many requests, but couldn't replicate the issue. I'm not sure
if it's some weird issue due to misbehaving mobile clients and SSL filling
up some sort of queue, but whatever it is, I can't replicate the issue on
demand.

* Load on the box was fine (<4) and no crazy I/O.

* Keepalives were turned on

* Some relevant sysctl values:

cat /proc/sys/net/core/somaxconn (backlog is set to the same value in the
nginx config)
16384

cat /proc/sys/net/core/netdev_max_backlog
15000

cat /proc/sys/net/ipv4/tcp_max_syn_backlog
262144

NGINX: 1.11.3
OS: Ubuntu 16.04.1 x64
Kernel: 4.4.0-36-generic
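
As a rough sanity check of the capture against the 9+ second connect time,
the sketch below works out when SYN retransmits would be sent. It assumes
the client is a Linux machine using a 1 second initial SYN timeout that
doubles after every attempt; the client's actual settings are not known
here, so treat the numbers as illustrative.

```python
def syn_retransmit_offsets(retransmits, initial_rto=1.0):
    """Seconds after the first SYN at which each retransmit is sent,
    assuming the timeout doubles after every attempt."""
    offsets = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retransmits):
        elapsed += rto
        offsets.append(elapsed)
        rto *= 2
    return offsets


print(syn_retransmit_offsets(3))      # [1.0, 3.0, 7.0]
print(syn_retransmit_offsets(4)[-1])  # 15.0
```

Under that assumption, a SYN/ACK that goes out after the third retransmit
but before a fourth would complete the handshake somewhere between 7 and
15 seconds, which is consistent with the 9+ seconds seen from curl.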

It seems to me the issue is at the kernel/app level, but I can't think of
where to go from here.
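
One possible next step, sketched under assumptions: Linux keeps
ListenOverflows and ListenDrops counters in the TcpExt section of
/proc/net/netstat, and they increase when a listening socket's accept queue
overflows or an incoming connection is dropped at a listening socket. The
helper names below are hypothetical; the idea is to sample the counters
before and during an incident and compare.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {prefix: {counter: value}}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # The file comes in pairs: a header line of names, then a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0].rstrip(":")
        stats[prefix] = dict(zip(names[1:], map(int, values[1:])))
    return stats


def listen_queue_counters(path="/proc/net/netstat"):
    """Return the two listen-queue counters, or None for any that are missing."""
    with open(path) as f:
        tcp_ext = parse_netstat(f.read()).get("TcpExt", {})
    return {name: tcp_ext.get(name) for name in ("ListenOverflows", "ListenDrops")}
```

If both counters stay flat while connects are slow, the accept queue is
probably not the bottleneck and the delay would have to come from somewhere
earlier in the receive path.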

If anybody has any ideas for me to try, or if I've forgotten to mention
something relevant, please let me know.