[PATCH 5 of 6] Upstream: allow any worker to resolve upstream servers

Fri Feb 10 00:04:59 UTC 2023

On 09/02/2023 16:45, Aleksei Bavshin wrote:
> On 2/5/2023 7:01 PM, J Carter wrote:
>> Hi Aleksei,
>>
>> Why not permanently assign the task of resolving a given upstream 
>> server group (all servers/peers within it) to a single worker?
>>
>> It seems that this approach would resolve the SRV issues, and remove 
>> the need for the shared queue of tasks.
>>
>> The load would still be spread evenly for the most realistic 
>> scenarios - which is where there are many upstream server groups of 
>> few servers, as opposed to few upstream server groups of many servers.
>
> The intent of the change was exactly the opposite: to avoid any permanent 
> assignment of periodic tasks to a worker and allow other processes 
> to resume resolving if the original assignee exits, whether 
> normally or abnormally. I'm not even doing enough for that -- I 
> should've kept in-progress tasks at the end of the queue with expires 
> = resolver timeout + a small constant, and retry from another process 
> when the timeout is reached, but the idea was abandoned for a 
> minuscule improvement of insertion time. I expect to be asked to 
> reconsider, as patch 6/6 does not cover all the possible situations 
> where we want to recover a stale task.

Makes sense.
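
To check that I read the stale-task idea correctly, here is a rough 
sketch of it. It assumes an in-progress task stays queued with 
expires = resolver timeout + a small constant, so that another worker 
can claim it once that deadline passes. TaskQueue, RESOLVER_TIMEOUT and 
GRACE are made-up names and values, not anything from the patch, and a 
plain Python heap stands in for the shared, lock-protected queue.

```python
import heapq

RESOLVER_TIMEOUT = 30.0  # assumed resolver timeout, in seconds
GRACE = 1.0              # assumed "small constant" added on top


class TaskQueue:
    """Toy model of a shared queue of periodic resolve tasks (hypothetical)."""

    def __init__(self):
        self._heap = []   # (expires, name) pairs, earliest first
        self.owner = {}   # name -> id of the worker currently resolving it

    def add(self, name, expires):
        heapq.heappush(self._heap, (expires, name))

    def take(self, worker, now):
        """Claim the earliest task that is due, or return None."""
        if not self._heap or self._heap[0][0] > now:
            return None
        _, name = heapq.heappop(self._heap)
        # Keep the in-progress task queued with a later deadline, so that
        # another worker retries it if this one never reports completion.
        heapq.heappush(self._heap, (now + RESOLVER_TIMEOUT + GRACE, name))
        self.owner[name] = worker
        return name

    def done(self, worker, name, now, ttl):
        """Reschedule after a resolve; ignore completions from stale owners."""
        if self.owner.get(name) != worker:
            return False
        self._heap = [(e, n) for (e, n) in self._heap if n != name]
        heapq.heapify(self._heap)
        heapq.heappush(self._heap, (now + ttl, name))
        del self.owner[name]
        return True
```

If worker 1 takes a task at t=0 and then exits, worker 2 gets nothing 
from take() until t=31, at which point it claims the same task and a 
late done() from worker 1 is rejected. The cost is the extra reinsertion 
on every take(), which I assume is the insertion-time overhead you 
mentioned.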

> A permanent assignment of a whole upstream would also require 
> notifying other processes that the upstream is no longer assigned if 
> the worker exits, or consistently recovering that assignment over a 
> restart of a single worker (e.g. after a crash - not a regular 
> situation, but one we should take into account nonetheless).

It's a good point. What I had in mind was rendezvous hashing plus a 
notification sent to all workers when a fellow worker dies: the next 
worker in the rendezvous 'list' would simply assume the dead worker's 
upstreams while the new one inits, and hand them back once the 
replacement worker is up (this would still need some locks).

Or to keep it simple, just wait for the dead worker's replacement to 
reinit, and pick up the former's stale upstreams.
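
To make the rendezvous hashing part concrete, here is a minimal sketch. 
It assumes each worker is identified by a stable slot number that its 
replacement inherits, and that every worker knows which slots are 
currently alive. score() and owner() are hypothetical helpers and SHA-256 
is just a convenient hash for the example.

```python
import hashlib


def score(worker, upstream):
    """Deterministic pseudo-random weight for a (worker, upstream) pair."""
    digest = hashlib.sha256(f"{worker}:{upstream}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(upstream, live_workers):
    """The live worker with the highest score resolves this upstream."""
    return max(live_workers, key=lambda w: score(w, upstream))


workers = [0, 1, 2, 3]
upstreams = [f"upstream{i}" for i in range(50)]

# Normal operation: every worker computes the same assignment locally.
before = {u: owner(u, workers) for u in upstreams}

# Worker 2 dies: only its upstreams move, each to the next-best scorer.
survivors = [w for w in workers if w != 2]
during = {u: owner(u, survivors) for u in upstreams}

# The replacement takes slot 2 again and the original assignment returns.
after = {u: owner(u, workers) for u in upstreams}
```

Upstreams that were not owned by the dead worker never move, so the only 
coordination needed is the "worker N died" and "worker N is back" 
notifications.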

> And the benefit is not quite obvious - I mentioned that resolving SRVs 
> with a lot of records may take longer to update the list of peers, but 
> the situation with contention is not expected to change significantly* 
> if we pin these tasks to a single worker as another worker may be 
> doing the same for another upstream. Most importantly, this isn't even 
> a bottleneck. It only slightly exacerbates an existing problem with 
> certain balancers that already suffer from the overuse of locks, in a 
> configuration that was specifically crafted to amplify and highlight 
> the difference and is far from these most realistic scenarios.
> * Pending verification on a performance test stand.

Well, the benefit is that it would prevent the disadvantage you listed 
and remove at least one other contended lock during normal operation 
(the one on the priority queue). But fair enough, it makes sense to 
profile it in a wide range of scenarios first, to see whether any of 
those are legitimate worries.
