<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Jay,<br><div>
</div>
<br><div><div>On Mar 16, 2013, at 12:37 PM, Jay Oster wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr">Hi Maxim,<div><br></div><div>Thanks for the suggestion! It looks like packet drop is the culprit here. The initial SYN packet doesn't receive a corresponding SYN-ACK from the upstream servers, so after a 1-second timeout (TCP Retransmission TimeOut), the packet is retransmitted. The question is still *why* this only occurs through nginx.</div>
<div><br></div><div style="">To further narrow down the root cause, I moved my upstream server to the same machine with nginx. The issue can still be replicated there. To eliminate my upstream server as the cause (it's written in C with libevent, by the way) I used the nodejs hello world demo; nodejs has trouble with the 5,000 concurrent connections (go figure) but the connections that are successful (without nginx reverse proxying) all complete in less than one second. When I place nginx between ApacheBench and nodejs, that 1-second TCP RTO shows up again.</div></div></blockquote><div><br></div>You mean you keep seeing SYN-ACK loss through loopback?<br><div><br></div><div>That might sound funny, but what's the OS and the overall environment of that strangely behaving machine with nginx? Is it a virtualized one? Is the other machine any different? The more details you can provide, the better :)</div><br><blockquote type="cite"><div dir="ltr">
<div style="">To reiterate, this is all happening on a single machine; the TCP stack is involved, but not a physical network. The only common denominator is nginx.</div></div></blockquote><div><br></div><div>Can you try the same tests on the other machine, where you originally didn't have any problems with your application? That is, can you repeat nginx+app on the other machine and see if the above strange behavior persists?</div><br><blockquote type="cite"><div class="gmail_extra">On Fri, Mar 15, 2013 at 1:20 AM, Maxim Dounin <span dir="ltr"><<a href="mailto:mdounin@mdounin.ru" target="_blank">mdounin@mdounin.ru</a>></span> wrote:<br><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello!<br>
<br>
On Thu, Mar 14, 2013 at 07:07:20PM -0700, Jay Oster wrote:<br>
<br>
[...]<br>
<div><div class="h5"><br>
> The access log has 10,000 lines total (i.e. two of these tests with 5,000<br>
> concurrent connections), and when I sort by upstream_response_time, I get a<br>
> log with the first 140 lines having about 1s on the upstream_response_time,<br>
> and the remaining 9,860 lines show 700ms and less. Here's a snippet showing<br>
> the strangeness, starting with line numbers:<br>
><br>
><br>
> 1: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83<br>
> 2: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83<br>
> 3: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 1.026 1.025 234 83<br>
> ...<br>
> 138: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 1.000 0.999 234 81<br>
> 139: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81<br>
> 140: 127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81<br>
> 141: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83<br>
> 142: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83<br>
> 143: 127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83<br>
> ...<br>
> 9998: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81<br>
> 9999: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81<br>
> 10000: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"<br>
> 200 19 "-" "ApacheBench/2.3" 0.122 0.002 234 81<br>
><br>
><br>
><br>
> The upstream_response_time difference between line 140 and 141 is nearly<br>
> 500ms! The total request_time also displays an interesting gap of almost<br>
> 300ms. What's going on here?<br>
<br>
</div></div>I would suggests there are packet loss and retransmits for some<br>
reason. Try tcpdump'ing traffic between nginx and backends to see<br>
what goes on in details.<br>
<span class="HOEnZb"><font color="#888888"><br>
--<br>
Maxim Dounin<br>
<a href="http://nginx.org/en/donation.html" target="_blank">http://nginx.org/en/donation.html</a><br>
<br>
_______________________________________________<br>
nginx mailing list<br>
<a href="mailto:nginx@nginx.org">nginx@nginx.org</a><br>
<a href="http://mailman.nginx.org/mailman/listinfo/nginx" target="_blank">http://mailman.nginx.org/mailman/listinfo/nginx</a><br>
</font></span></blockquote></div><br></div>
_______________________________________________<br>nginx mailing list<br><a href="mailto:nginx@nginx.org">nginx@nginx.org</a><br>http://mailman.nginx.org/mailman/listinfo/nginx</blockquote></div><br></body></html>