<div dir="ltr">Hello -<div><br></div><div>I am hoping someone on the community list can help steer me in the right direction for troubleshooting the following scenario:</div><div><br></div><div>I am running a cluster of 4 virtualized nginx open source 1.16.0 servers, each with 4 vCPU cores and 4 GB of RAM. They proxy HTTP (REST API) requests to about 40 different upstream clusters, each containing 2 to 8 servers per upstream definition. The upstream application servers themselves run multiple workers per server.</div><div><br></div><div>I've recently started seeing an issue where the response_time and, typically, the upstream_response_time reported in the nginx access log are drastically different from the response time reported by the application servers themselves. For example, under normal load a typical request shows an average response_time of around 5 ms with an upstream_response_time of 4 ms. During transient periods of high load (approximately 1200-1400 rps), the reported nginx <font face="monospace">response_time</font> and <font face="monospace">upstream_response_time</font> spike to around 1 second, while the application logs on the upstream servers still report the same 4 ms response time.</div><div><br></div><div>The upstream definitions are very simple and look like:<br><font face="monospace">upstream rest-api-xyz {<br>    least_conn;<br>    server <a href="http://10.1.1.33:8080">10.1.1.33:8080</a> max_fails=3 fail_timeout=30; # production-rest-api-xyz01<br>    server <a href="http://10.1.1.34:8080">10.1.1.34:8080</a> max_fails=3 fail_timeout=30; # production-rest-api-xyz02<br>}</font><br></div><div><br></div><div>One possibility I've considered, though the instrumentation on the app servers does not seem to support it, is that the app servers are accepting the requests and queueing them locally in a TCP socket. 
However, running a packet capture on both the nginx server and the app server actually shows the HTTP request leaving nginx at the end of the time window. I have not yet examined the capture down to the TCP handshake to see whether the connection setup itself is taking an excessive amount of time. I can reproduce the queueing scenario artificially, but it does not appear to be what is happening in my production environment.</div><div><br></div><div>Does anyone here have any experience sorting out something like this? The <font face="monospace">upstream_connect_time</font> is not currently part of the log, but even if that number turned out to be high, I'm not entirely sure what would cause it. Similarly, if the <font face="monospace">upstream_connect_time</font> does not account for most of the delay, is there anything else I should be looking at?</div><div><br></div><div>Thanks</div><div>Jordan</div></div>