Strange $upstream_response_time latency spikes with reverse proxy

Jason Oster jay at kodewerx.org
Sun Mar 17 09:17:52 UTC 2013


Hello Andrew,

On Mar 16, 2013, at 8:05 AM, Andrew Alexeev <andrew at nginx.com> wrote:

> Jay,
> 
> On Mar 16, 2013, at 12:37 PM, Jay Oster wrote:
> 
>> Hi Maxim,
>> 
>> Thanks for the suggestion! It looks like packet drop is the culprit here. The initial SYN packet doesn't receive a corresponding SYN-ACK from the upstream servers, so after a 1-second timeout (TCP Retransmission TimeOut), the packet is retransmitted. The question is still *why* this only occurs through nginx.
>> 
>> To further narrow down the root cause, I moved my upstream server to the same machine as nginx. The issue can still be replicated there. To eliminate my upstream server as the cause (it's written in C with libevent, by the way), I used the nodejs hello world demo; nodejs has trouble with the 5,000 concurrent connections (go figure), but the connections that are successful (without nginx reverse proxying) all complete in less than one second. When I place nginx between ApacheBench and nodejs, that 1-second TCP RTO shows up again.
> 
> You mean you keep seeing SYN-ACK loss through loopback?

That appears to be the case, yes. I've captured packets with tcpdump and loaded them into Wireshark for easier visualization. I can see a very clear gap where no packets are transmitted for over 500ms, followed by a burst of ~10 SYN packets. When I look at the TCP stream flow on these SYN bursts, it shows an initial SYN packet almost exactly 1 second earlier without a corresponding SYN-ACK. I'm taking the 1-second delay to be the RTO. I can provide some pieces of the tcpdump capture log on Monday to help illustrate.
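For anyone who wants to reproduce the capture, a filter that keeps only handshake segments makes the missing SYN-ACKs easy to spot in Wireshark. This is a sketch rather than my exact invocation; the interface and filename are just examples:

    # SYN and SYN-ACK segments both have the SYN flag set, so this
    # captures the handshake in both directions on the loopback interface.
    sudo tcpdump -i lo -n -s 0 -w handshakes.pcap 'tcp[tcpflags] & tcp-syn != 0'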

> That might sound funny, but what's the OS and the overall environment of that strangely behaving machine with nginx? Is it a virtualized one? Is the other machine any different? The more details you can provide, the better :)

It's a 64-bit Ubuntu 12.04 VM running on an AWS m3.xlarge instance. Both VMs are configured identically.
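On these Ubuntu hosts, the kernel's own counters are a quick cross-check on the capture. This is a generic Linux diagnostic rather than anything nginx-specific, and the exact counter names vary between kernel versions:

    # Retransmitted segments plus SYNs dropped or ignored at LISTEN
    # sockets; a climbing listen-drop counter points at backlog overflow.
    netstat -s | egrep -i 'retrans|listen'

If the listen-related counters climb during a test run, the accept queue on the listening socket is a likely suspect for the missing SYN-ACKs.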

>> To reiterate, this is all happening on a single machine; the TCP stack is involved, but not a physical network. The only common denominator is nginx.
> 
> Can you try the same tests on the other machine, where you originally didn't have any problems with your application? That is, can you repeat nginx+app on the other machine and see if the above strange behavior persists?

Same configuration. I'm investigating this issue because it is common across literally dozens of servers we have running in AWS. It occurs in all regions and on all instance types. This "single server" test is the first time the software has been run with nginx load balancing to upstream processes on the same machine.
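For completeness, the proxy setup in this single-machine test is nothing exotic; a minimal sketch follows (the ports and the upstream name are hypothetical, not our production values):

    # Round-robin across upstream processes on the same host.
    upstream app {
        server 127.0.0.1:8001;
        server 127.0.0.1:8002;
    }

    server {
        listen 8080;
        location / {
            proxy_pass http://app;
        }
    }

The test driver is plain ApacheBench, along these lines (port again illustrative):

    # Two runs at 5,000 concurrent connections produced the 10,000
    # access log lines; ab speaks HTTP/1.0 by default, matching the logs.
    ab -n 5000 -c 5000 http://127.0.0.1:8080/time/0

One knob worth noting is the backlog= parameter on nginx's listen directive, since a too-small accept queue would drop SYNs in exactly this way.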

>> On Fri, Mar 15, 2013 at 1:20 AM, Maxim Dounin <mdounin at mdounin.ru> wrote:
>> Hello!
>> 
>> On Thu, Mar 14, 2013 at 07:07:20PM -0700, Jay Oster wrote:
>> 
>> [...]
>> 
>> > The access log has 10,000 lines total (i.e. two of these tests with 5,000
>> > concurrent connections), and when I sort by upstream_response_time, I get a
>> > log with the first 140 lines having about 1s on the upstream_response_time,
>> > and the remaining 9,860 lines show 700ms and less. Here's a snippet showing
>> > the strangeness, starting with line numbers:
>> >
>> >
>> > 1:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
>> > 2:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
>> > 3:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 1.026 1.025 234 83
>> > ...
>> > 138:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 1.000 0.999 234 81
>> > 139:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
>> > 140:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
>> > 141:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
>> > 142:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
>> > 143:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
>> > ...
>> > 9998:  127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
>> > 9999:  127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
>> > 10000: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
>> > 200 19 "-" "ApacheBench/2.3" 0.122 0.002 234 81
>> >
>> >
>> >
>> > The upstream_response_time difference between line 140 and 141 is nearly
>> > 500ms! The total request_time also displays an interesting gap of almost
>> > 300ms. What's going on here?
>> 
>> I would suggest there is packet loss and retransmits for some
>> reason.  Try tcpdump'ing traffic between nginx and the backends to
>> see what goes on in detail.
>> 
>> --
>> Maxim Dounin
>> http://nginx.org/en/donation.html
