Transient, Load Related Slow response_time / upstream_response_time vs App Server Reported Times

Fri Oct 30 21:01:20 UTC 2020

Hello!

On Thu, Oct 29, 2020 at 01:02:57PM -0500, Jordan von Kluck wrote:

> I am hoping someone on the community list can help steer me in the right
> direction for troubleshooting the following scenario:
> 
> I am running a cluster of 4 virtualized nginx open source 1.16.0 servers
> with 4 vCPU cores and 4 GB of RAM each. They serve HTTP (REST API) requests
> to a pool of about 40 different upstream clusters, which range from 2 to 8
> servers within each upstream definition. The upstream application servers
> themselves have multiple workers per server.
> 
> I've recently started seeing an issue where the reported response_time and
> typically the reported upstream_response_time the nginx access log are
> drastically different from the reported response on the application servers
> themselves. For example, on some requests the typical average response_time
> would be around 5ms with an upstream_response_time of 4ms. During these
> transient periods of high load (approximately 1200 -1400 rps), the reported
> nginx response_time and upstream_response_time spike up to somewhere around
> 1 second, while the application logs on the upstream servers are still
> reporting the same 4ms response time.
> 
> The upstream definitions are very simple and look like:
> upstream rest-api-xyz {
>     least_conn;
>     server 10.1.1.33:8080 max_fails=3 fail_timeout=30; #
> production-rest-api-xyz01
>     server 10.1.1.34:8080 max_fails=3 fail_timeout=30; #
> production-rest-api-xyz02
> }
> 
> One avenue that I've considered but does not seem to be the case from the
> instrumentation on the app servers is that they're accepting the requests
> and queueing them in a TCP socket locally. However, running a packet
> capture on both the nginx server and the app server actually shows the http
> request leaving nginx at the end of the time window. I have not looked at
> this down to the TCP handshake to see if the actual negotiation is taking
> an excessive amount of time. I can produce this queueing scenario
> artificially, but it does not appear to be what's happening in my
> production environment in the scenario described above.
> 
> Does anyone here have any experience sorting out something like this? The
> upstream_connect_time is not part of the log currently, but if that number
> was reporting high, I'm not entirely sure what would cause that. Similarly,
> if the upstream_connect_time does not account for most of the delay, is
> there anything else I should be looking at?

Spikes to 1 second suggests that this might be SYN retransmit 
timeouts.

Most likely, this is what happens: your backend cannot cope with 
load, so listen queue on the backend overflows.  Default behaviour 
on most Linux boxes is to drop SYN packets on listen queue 
overflows (tcp.ipv4.abort_on_overflow=0).  Dropped SYN packets 
eventually - after an initial RTO, initial retransmission timeout, 
which is 1s on modern Linux systems - result in retransmission and 
connection being finally established, but with 1s delay.

Consider looking at network stats to see if there are actual 
listen queue overflows on your backends, something like "nstat -az 
TcpExtListenDrops" should be handy.  You can also use "ss -nlt" to 
see listen queue sizes in real time.

In many cases such occasional queue overflows under load simply 
mean that listen queue size is too low, so minor load fluctuations 
might occasionally result in overflows.  In this case, using a 
larger listen queue might help.

Also, if the backend servers in question are solely the backend 
ones, and there are multiple load-balanced servers as your 
configuration suggests, it might be a good idea to configure these 
servers to send RST on listen queue overflows, that is, to set 
tcp.ipv4.abort_on_overflow to 1.  This way nginx will immediately 
know that the backend'd listen queue is full and will be able to 
try the next upstream server instead.

-- 
Maxim Dounin
http://mdounin.ru/