[PATCH] Introduced worker_shutdown_idle_delay

Wed Nov 8 04:58:29 UTC 2023

Hi,

Sending a patch in previous email, I fully understand that similar
proposals have already been discussed and rejected (those were
implemented in a different approach though):
https://trac.nginx.org/nginx/ticket/1022
https://mailman.nginx.org/pipermail/nginx-devel/2019-January/011804.html
https://mailman.nginx.org/pipermail/nginx-devel/2021-December/014644.html

Thus I understand that my patch might be rejected for similar reasons,
so I'll try to be specific and clarify what problems I'm trying to
solve.

Problems:

While it's perfectly fine to close idle HTTP/1.1 connections at any
time according to RFC 9112, it can still cause unnecessary and
potentially avoidable problems during configuration reloads, such as:

- Retries for requests that are problematic for automatic retries
(non-idempotent requests). Unfortunately, it's not always possible to
make all requests idempotent and safe for automatic retries. As a
result, some of the closed idle connections will cause errors in client
applications that can only be resolved manually. This, in turn, leads
to a poor user experience.

- Unnecessary retries for all other types of requests. While the client
should be prepared for this, it will still result in additional latency
for retried requests that could be avoided (the more nginx reloads, the
more unnecessary retries).

- According to rfc9110 [1] it's not allowed to retry non-idempotent
requests by proxy. So it's not possible to retry them by some
intermediate proxy before nginx, and all those errors should be
returned to the client - this will lead to an additional latency even
if the client supports automatic retries for non-idempotent requests.

- According to rfc9110 [1] a client should not automatically retry a
failed automatic retry. There is a high probability that during nginx
reloads, a client will retry some requests via closing connections
because they all close simultaneously. This, again, will lead to a
portion of errors that can only be resolved manually.

The main idea: while clients should be prepared for unexpected errors
(such as server crashes, lack of free connections, etc.), there is no
point in deliberately creating such situations (especially in
environments with frequent nginx reloads). With properly configured
clients:

client_idle_timeout < min(server_shutdown_delay, server_idle_timeout)

this patch will allow to minimize the number of unnecessary retries and
errors for environments where it's critical.

Specifics of the environment:

- Nginx hosts as frontend L7 balancers for all incoming external L7
traffic in the datacenter, and in some cases for east-west traffic
between microservices.

- Almost all of http connections are keep-alive HTTP/1.1 with 300s
idle_client_timeout.

- Frequent nginx configuration reloads.

- Large amount of requests with non-idempotent methods.

- Large amount of long-lived "in-progress" connections such as
websockets (so in this situation there is no point to immediately
shutdown idle HTTP/1.1 connections, because old nginx workers will
continue to work anyway).

- High diversity of client applications (different mobile platforms,
different desktop platforms, different client libraries, different dev
teams for those client applications).

- We're currently using nginx-plus, so we can't just patch it, that's
why it's important for us to propagate this or a similar change in the
official release (we've also opened a ticket through the nginx-plus
support channel, in case it matters).

Additional comments:

This patch doesn't change the default behavior and tries to be as
unobtrusive as possible. With the option enabled, the configured
shutdown delay will only be applied to idle keepalive HTTP/1.1
connections - all other protocols will continue to work as usual (for
example, there is no point in delaying shutdown for HTTP/2 or HTTP/3
protocols - they have built-in graceful shutdown mechanisms). However,
if needed, the same behavior could easily be added for other protocols
in the future.

If you don't mind adding the proposed behavior, but don't like the way
it was done in the patch, then please advise how it could be reworked.

I also understand that additional tests should be added to the
nginx-tests repo. If you accept the patch, I will prepare the tests a
bit later.

Also, let me know if you need more details about use cases where this
patch might be useful.

[1] https://www.rfc-editor.org/rfc/rfc9110#section-9.2.2-7

--
Artem Pogartsev