Transient, Load Related Slow response_time / upstream_response_time vs App Server Reported Times

Thu Oct 29 18:02:57 UTC 2020

Hello -

I am hoping someone on the community list can help steer me in the right
direction for troubleshooting the following scenario:

I am running a cluster of 4 virtualized nginx open source 1.16.0 servers
with 4 vCPU cores and 4 GB of RAM each. They serve HTTP (REST API) requests
to a pool of about 40 different upstream clusters, which range from 2 to 8
servers within each upstream definition. The upstream application servers
themselves have multiple workers per server.

I've recently started seeing an issue where the reported response_time and
typically the reported upstream_response_time the nginx access log are
drastically different from the reported response on the application servers
themselves. For example, on some requests the typical average response_time
would be around 5ms with an upstream_response_time of 4ms. During these
transient periods of high load (approximately 1200 -1400 rps), the reported
nginx response_time and upstream_response_time spike up to somewhere around
1 second, while the application logs on the upstream servers are still
reporting the same 4ms response time.

The upstream definitions are very simple and look like:
upstream rest-api-xyz {
    least_conn;
    server 10.1.1.33:8080 max_fails=3 fail_timeout=30; #
production-rest-api-xyz01
    server 10.1.1.34:8080 max_fails=3 fail_timeout=30; #
production-rest-api-xyz02
}

One avenue that I've considered but does not seem to be the case from the
instrumentation on the app servers is that they're accepting the requests
and queueing them in a TCP socket locally. However, running a packet
capture on both the nginx server and the app server actually shows the http
request leaving nginx at the end of the time window. I have not looked at
this down to the TCP handshake to see if the actual negotiation is taking
an excessive amount of time. I can produce this queueing scenario
artificially, but it does not appear to be what's happening in my
production environment in the scenario described above.

Does anyone here have any experience sorting out something like this? The
upstream_connect_time is not part of the log currently, but if that number
was reporting high, I'm not entirely sure what would cause that. Similarly,
if the upstream_connect_time does not account for most of the delay, is
there anything else I should be looking at?

Thanks
Jordan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.nginx.org/pipermail/nginx/attachments/20201029/f01ba3ca/attachment.htm>