Strange $upstream_response_time latency spikes with reverse proxy

Jay Oster jay at kodewerx.org
Sat Mar 16 08:37:22 UTC 2013


Hi Maxim,

Thanks for the suggestion! It looks like packet drop is the culprit here.
The initial SYN packet doesn't receive a corresponding SYN-ACK from the
upstream servers, so after a 1-second timeout (TCP Retransmission TimeOut),
the packet is retransmitted. The question is still *why* this only occurs
through nginx.

To further narrow down the root cause, I moved my upstream server to the
same machine with nginx. The issue can still be replicated there. To
eliminate my upstream server as the cause (it's written in C with libevent,
by the way) I used the nodejs hello world demo; nodejs has trouble with the
5,000 concurrent connections (go figure) but the connections that are
successful (without nginx reverse proxying) all complete in less than one
second. When I place nginx between ApacheBench and nodejs, that 1-second
TCP RTO shows up again.

To reiterate, this is all happening on a single machine; the TCP stack is
involved, but not a physical network. The only common denominator is nginx.


On Fri, Mar 15, 2013 at 1:20 AM, Maxim Dounin <mdounin at mdounin.ru> wrote:

> Hello!
>
> On Thu, Mar 14, 2013 at 07:07:20PM -0700, Jay Oster wrote:
>
> [...]
>
> > The access log has 10,000 lines total (i.e. two of these tests with 5,000
> > concurrent connections), and when I sort by upstream_response_time, I
> get a
> > log with the first 140 lines having about 1s on the
> upstream_response_time,
> > and the remaining 9,860 lines show 700ms and less. Here's a snippet
> showing
> > the strangeness, starting with line numbers:
> >
> >
> > 1:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
> > 2:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
> > 3:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 1.026 1.025 234 83
> > ...
> > 138:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 1.000 0.999 234 81
> > 139:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
> > 140:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
> > 141:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
> > 142:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
> > 143:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
> > ...
> > 9998:  127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
> > 9999:  127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
> > 10000: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
> > 200 19 "-" "ApacheBench/2.3" 0.122 0.002 234 81
> >
> >
> >
> > The upstream_response_time difference between line 140 and 141 is nearly
> > 500ms! The total request_time also displays an interesting gap of almost
> > 300ms. What's going on here?
>
> I would suggests there are packet loss and retransmits for some
> reason.  Try tcpdump'ing traffic between nginx and backends to see
> what goes on in details.
>
> --
> Maxim Dounin
> http://nginx.org/en/donation.html
>
> _______________________________________________
> nginx mailing list
> nginx at nginx.org
> http://mailman.nginx.org/mailman/listinfo/nginx
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nginx.org/pipermail/nginx/attachments/20130316/4fb119b6/attachment.html>


More information about the nginx mailing list