[PATCH] Added keepalive_async_fails command

J Carter jordanc.carter at outlook.com
Mon Apr 3 05:15:02 UTC 2023


Hello Maxim,

Thank you for the feedback.

I think the points you made are fair - a new directive is possibly
overkill for this issue.

A single peer going through all of its (many) cached connections when
there is a non-asynchronous close connection error is where I've
personally seen the current behavior be most problematic - so this makes
sense, and your direction would still resolve it.
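
To make that scenario concrete, here is a small stand-alone C sketch
(not nginx code, just a model of the behavior I'm describing): with N
cached connections to a peer that has gone away, each error on a cached
connection is forgiven as a possible asynchronous close, so the request
keeps being retried against the same peer until the cache is exhausted
and only then does a failure on a fresh connection count.

#include <stdbool.h>
#include <stdio.h>

#define CACHED_CONNECTIONS 16   /* e.g. a "keepalive 16;" worth of connections */

/* model of one upstream try; in the failure mode discussed here every
 * attempt against the dead peer errors out, cached or not */
static bool
try_peer(int attempt, bool cached)
{
    printf("attempt %d (%s connection) -> error\n",
           attempt, cached ? "cached" : "new");
    return false;
}

int
main(void)
{
    int  attempts = 0;

    /* current behavior: an error on a cached connection is treated as a
     * possible asynchronous close, so the try is forgiven and the next
     * cached connection is used instead... */
    for (int i = 0; i < CACHED_CONNECTIONS; i++) {
        if (try_peer(++attempts, true)) {
            return 0;
        }
    }

    /* ...and only after the cache is drained does a failure on a fresh
     * connection lead to next-upstream handling */
    try_peer(++attempts, false);
    printf("attempts against this one peer: %d\n", attempts);
    return 0;
}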

Point 3 would increase overhead/add latency for the average case, as a new
connection would need to be created (assuming additional cached connections
are likely to exist and not fail). I'd prefer to avoid that unless there
are strong objections, although implementing it would be fairly trivial if
there are.

However, for point 2 - I do have one question that I'd like your opinion
on regarding multi-peer upstreams - should a (suspected) asynchronous
close event count as a failure in terms of the logic in
ngx_http_upstream_round_robin's free function
(ngx_http_upstream_free_round_robin_peer)?

In lieu of double trying for multi-peer, it seems like it may be desirable
to avoid counting these as 'real' failures, given all the effects imparted
through passive health checks - such as triggering a failure increment
(and/or the fail_timeout window) as well as adjusting the effective weight
downwards if weights are set for that upstream peer.
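
For reference, this is roughly the bookkeeping I mean - a stand-alone
sketch paraphrasing (from memory, not verbatim) what the round robin free
callback does on NGX_PEER_FAILED; the field names only loosely mirror
ngx_http_upstream_rr_peer_t:

#include <stdio.h>
#include <time.h>

/* loose stand-in for the rr peer fields involved in passive health
 * checks; not the real nginx structure */
typedef struct {
    int     weight;
    int     effective_weight;
    int     max_fails;
    int     fails;
    long    fail_timeout;   /* seconds */
    time_t  accessed;       /* time of last failure */
    time_t  checked;
} peer_t;

/* approximation of the NGX_PEER_FAILED branch: the failure counter grows,
 * the failure window starts, and the effective weight is pushed down -
 * the effects I'd rather not apply to a suspected asynchronous close */
static void
mark_failed(peer_t *peer)
{
    time_t  now = time(NULL);

    peer->fails++;
    peer->accessed = now;
    peer->checked = now;

    if (peer->max_fails) {
        peer->effective_weight -= peer->weight / peer->max_fails;

        if (peer->fails >= peer->max_fails) {
            printf("peer disabled for fail_timeout=%lds\n",
                   peer->fail_timeout);
        }
    }

    if (peer->effective_weight < 0) {
        peer->effective_weight = 0;
    }
}

int
main(void)
{
    peer_t  peer = { .weight = 6, .effective_weight = 6,
                     .max_fails = 3, .fail_timeout = 10 };

    /* three suspected asynchronous close events in a row are enough to
     * take this peer out of rotation */
    for (int i = 0; i < 3; i++) {
        mark_failed(&peer);
        printf("fails=%d effective_weight=%d\n",
               peer.fails, peer.effective_weight);
    }

    return 0;
}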

On the other hand, given the ambiguity in the cause of the error, that
would mean not counting failures at all for connection errors that involve
a cached connection.

On 03/04/2023 02:42, Maxim Dounin wrote:
> Hello!
>
> On Sun, Apr 02, 2023 at 06:57:03PM +0100, J Carter wrote:
>
>> I've also attached an example nginx.conf and test script that
>> simulates the asynchronous close events.
>> Two different test cases can be found within that, one with path
>> /1 for single peer upstream and /2 for multi-peer.
>>
>> You should see 2 upstream addresses repeated in a row
>> per-upstream-server in the access log by default, as it fails
>> through the cached connections & next performs next upstream
>> tries.
>>
>> Any feedback would be appreciated.
>>
>> # HG changeset patch
>> # User jordanc.carter at outlook.com
>> # Date 1680457073 -3600
>> #      Sun Apr 02 18:37:53 2023 +0100
>> # Node ID 9ec4d7a8cdf6cdab00d09dff75fa6045f6f5533f
>> # Parent  5f1d05a21287ba0290dd3a17ad501595b442a194
>> Added keepalive_async_fails command to keepalive load balancer module.
>> This value determines the number suspected keepalive race events
>> per-upstream-try that will be tolerated before a subsequent network connection
>> error is considered a true failure.
> It looks like you are trying to address issues with re-trying
> requests to upstream servers when errors not distinguishable from
> asynchronous close events happens, as outlined in ticket #2421
> (https://trac.nginx.org/nginx/ticket/2421).
>
> I don't think that introducing another directive for this is a
> good idea.  Rather, I would consider modifying the behaviour to
> better behave in case of such errors.  In particular, the
> following approaches look promising:
>
> - Allow at most one additional try.
>
> - Allow an additional try only if there is a single peer, so
>    normally request is not going to be retried at all.
>
> - Don't use cached connections if an error considered to be an
>    asynchronous close event happens.
>
> Given that since nginx 1.15.3 we have keepalive_timeout in the
> upstream blocks to mitigate potential asynchronous close events,
> even something simple like combination of (1) and (2) might be
> good enough.  With (3), things are going to be as correct as it
> can be.
>
> [...]
>

