Strange $upstream_response_time latency spikes with reverse proxy

Tue Mar 19 09:49:47 UTC 2013

Hello Jay,

On Mar 19, 2013, at 2:09, Jay Oster <jay at kodewerx.org> wrote:

> Hi again!
> 
> On Sun, Mar 17, 2013 at 2:17 AM, Jason Oster <jay at kodewerx.org> wrote:
> Hello Andrew,
> 
> On Mar 16, 2013, at 8:05 AM, Andrew Alexeev <andrew at nginx.com> wrote:
>> Jay,
>> 
>> You mean you keep seeing SYN-ACK loss through loopback?
> 
> That appears to be the case, yes. I've captured packets with tcpdump, and loaded them into Wireshark for easier visualization. I can see a very clear gap where no packets are transmitted for over 500ms, then a burst of ~10 SYN packets. When I look at the TCP stream flow on these SYN bursts, it shows an initial SYN packet almost exactly 1 second earlier without a corresponding SYN-ACK. I'm taking the 1-second delay to be the RTO. I can provide some pieces of the tcpdump capture log on Monday, to help illustrate.
> 
> I double-checked, and the packet loss does *not* occur on loopback interface. It does occur when hitting the network with a machine's own external IP address, however. This is within Amazon's datacenter; the packets bounce through their firewall before returning to the VM.
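
The pattern described above (a SYN that is sent again about 1 second later because no SYN-ACK came back) can also be picked out of a capture mechanically. Below is a minimal sketch in Python, assuming the packets have already been parsed into simple records; the Packet type, the find_syn_retransmits name, and the sample addresses and ports are all made up for illustration and are not taken from your capture.

```python
from collections import namedtuple

# Hypothetical record: one row per captured TCP packet (for example, parsed
# from tcpdump text output). flags is "S" for SYN and "SA" for SYN-ACK.
Packet = namedtuple("Packet", "ts src sport dst dport flags")


def find_syn_retransmits(packets):
    """Return (flow, first_ts, retransmit_ts) for every SYN that was sent
    again before a SYN-ACK was seen for the same flow."""
    pending = {}  # flow -> timestamp of the first unanswered SYN
    hits = []
    for p in sorted(packets, key=lambda p: p.ts):
        if p.flags == "S":
            flow = (p.src, p.sport, p.dst, p.dport)
            if flow in pending:
                hits.append((flow, pending[flow], p.ts))
            else:
                pending[flow] = p.ts
        elif p.flags == "SA":
            # The SYN-ACK travels in the opposite direction of the SYN.
            pending.pop((p.dst, p.dport, p.src, p.sport), None)
    return hits


# Synthetic data, not a real capture: the first SYN is never answered and is
# retransmitted one second later; the second connection is answered at once.
capture = [
    Packet(0.000, "10.0.0.1", 40001, "203.0.113.10", 80, "S"),
    Packet(0.001, "10.0.0.1", 40002, "203.0.113.10", 80, "S"),
    Packet(0.002, "203.0.113.10", 80, "10.0.0.1", 40002, "SA"),
    Packet(1.000, "10.0.0.1", 40001, "203.0.113.10", 80, "S"),
    Packet(1.001, "203.0.113.10", 80, "10.0.0.1", 40001, "SA"),
]

for flow, first, again in find_syn_retransmits(capture):
    print(flow, "SYN retransmitted after %.3f s" % (again - first))
```

Counting such retransmits per test run would make it easier to compare the different setups than reading the Wireshark view by eye.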

If I understand you correctly, the issue can be reproduced in the following cases:

1) client and server are on different EC2 instances, public IPs are used;
2) client and server are on different EC2 instances, private IPs are used;
3) client and server are on a single EC2 instance, public IP is used.

And there are no problems when:

1) client and server are on a single EC2 instance, either loopback or private IP is used.

Please correct me if I'm wrong.
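
To compare the cases listed above without a full AB run, one option is to time plain TCP connects against each address and count the ones that take about a second or more, which is what a lost SYN followed by a retransmit would look like from the client side. This is only a rough sketch: the function names, the 0.9 s threshold, the port and the target list are assumptions to be replaced with your own values.

```python
import socket
import time


def connect_times(host, port, attempts=100, timeout=5.0):
    """Open `attempts` TCP connections one after another and return how long
    each connect took in seconds (None when the attempt failed)."""
    samples = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                samples.append(time.monotonic() - start)
        except OSError:
            samples.append(None)
    return samples


def count_slow(samples, threshold=0.9):
    """Count connects that failed or took at least `threshold` seconds.
    The 0.9 s default is an assumed cut-off just under a 1-second RTO."""
    return sum(1 for s in samples if s is None or s >= threshold)


if __name__ == "__main__":
    # Placeholder targets: add the instance's private and public addresses.
    targets = {"loopback": "127.0.0.1"}
    for name, host in targets.items():
        samples = connect_times(host, 80, attempts=200)
        print(name, "slow connects:", count_slow(samples), "of", len(samples))
```

Sequential connects will not load the path the way AB does, so a clean result here does not rule the problem out; it is mainly useful for a quick side-by-side of loopback, private IP and public IP.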

What about the EC2 security group: do the client and the server use the same group?
How many rules are present in this group? Have you tried either decreasing
the number of rules, or creating a simple "pass any to any" configuration?

And just to clarify: by "external IP address", do you mean the EC2
instance's public IP, or an Elastic IP?

>
>> That might sound funny, but what's the OS and the overall environment of that strangely behaving machine with nginx? Is it a virtualized one? Is the other machine any different? The more details you can provide, the better :)
> 
> It's a 64-bit Ubuntu 12.04 VM, running on an AWS m3.xlarge. Both VMs are configured the same.
> 
>> Can you try the same tests on the other machine, where you originally didn't have any problems with your application? That is, can you repeat nginx+app on the other machine and see if the above strange behavior persists?
> 
> Same configuration. I'm investigating this issue because it is common across literally dozens of servers we have running in AWS. It occurs in all regions, and on all instance types. This "single server" test is the first time the software has been run with nginx load balancing to upstream processes on the same machine.
> 
> Here is some additional information in the form of screenshots from Wireshark!
> 
> 10.245.2.254 is the VM's eth0 address. 50.112.82.196 is the VM's external IP, as assigned by Amazon. All of these packets are being routed through Amazon's firewall.
> 
> This first screenshot shows the "gap" that ends with a SYN burst. This was all captured during a single run of AB.
> 
> 
> <Screen Shot 2013-03-18 at 11.58.49 AM.png>
> 
> The gap is about 500ms where the server is idle. :(
> 
> If I use "follow TCP stream" on the highlighted packet, I get this:
> 
> <Screen Shot 2013-03-18 at 11.59.18 AM.png>
> 
> The initial SYN packet was sent almost exactly 1 second prior, and a SYN-ACK was not received for it.
> _______________________________________________
> nginx mailing list
> nginx at nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx