[PATCH 5 of 6] Upstream: allow any worker to resolve upstream servers

Thu Feb 9 16:45:11 UTC 2023

On 2/5/2023 7:01 PM, J Carter wrote:
> Hi Aleksei,
> 
> Why not permanently assign the task of resolving a given upstream server 
> group (all servers/peers within it) to a single worker?
> 
> It seems that this approach would resolve the SRV issues, and remove the 
> need for the shared queue of tasks.
> 
> The load would still be spread evenly for the most realistic scenarios - 
> which is where there are many upstream server groups of few servers, as 
> opposed to few upstream server groups of many servers.

The intent of the change was exactly the opposite: to avoid any
permanent assignment of periodic tasks to a worker and to allow other
processes to resume resolving if the original assignee exits, whether
normally or abnormally. I'm not even doing enough for that -- I
should've kept in-progress tasks at the end of the queue with
expires = resolver timeout + a small constant, and retried from another
process when the timeout is reached, but I abandoned the idea for a
minuscule improvement in insertion time. I expect to be asked to
reconsider, as patch 6/6 does not cover all the possible situations
where we want to recover a stale task.
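
To illustrate the stale-task recovery described above, here is a minimal
sketch in C. It is not code from this patch series: every name in it
(task_t, task_claim, task_done, RESOLVE_TIMEOUT, STALE_SLACK) is
hypothetical, the timeout values are made up, and it assumes "expires"
is an absolute time, i.e. now + resolver timeout + a small constant.
Shared memory and locking are left out, and a linear scan stands in for
an ordered queue.

```c
#include <assert.h>
#include <stddef.h>

/* All names and values are hypothetical; none come from the patches. */
#define RESOLVE_TIMEOUT  30    /* assumed resolver timeout, seconds */
#define STALE_SLACK      1     /* the "small constant", seconds */
#define NO_OWNER         (-1)

typedef struct {
    const char  *name;      /* upstream server to resolve */
    long         expires;   /* time at which the task may be claimed */
    int          owner;     /* worker resolving it, or NO_OWNER */
} task_t;

/*
 * Claim the earliest task that is due at "now".  A task that is still
 * marked in-progress but whose deadline has passed is treated like any
 * other due task, so a worker that died mid-resolve only delays it.
 */
static task_t *
task_claim(task_t *tasks, size_t n, int worker, long now)
{
    size_t   i;
    task_t  *best = NULL;

    for (i = 0; i < n; i++) {
        if (tasks[i].expires <= now
            && (best == NULL || tasks[i].expires < best->expires))
        {
            best = &tasks[i];
        }
    }

    if (best != NULL) {
        /* keep the task queued, with a deadline instead of removing it */
        best->owner = worker;
        best->expires = now + RESOLVE_TIMEOUT + STALE_SLACK;
    }

    return best;
}

/* The owner finished in time: schedule the next periodic resolve. */
static void
task_done(task_t *t, int worker, long now, long valid)
{
    assert(t->owner == worker);

    t->owner = NO_OWNER;
    t->expires = now + valid;
}
```

In this sketch no process has to be notified when a worker exits: the
task it held simply becomes due again once its deadline passes, and
whichever worker polls next picks it up.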

A permanent assignment of a whole upstream would also require either
notifying other processes that the upstream is no longer assigned if
the worker exits, or consistently recovering that assignment over a
restart of a single worker (e.g. after a crash - not a regular
situation, but one we should take into account nonetheless). And the
benefit is not quite obvious - I mentioned that updating the list of
peers may take longer when resolving SRVs with a lot of records, but
the contention is not expected to change significantly* if we pin these
tasks to a single worker, as another worker may be doing the same for
another upstream.
Most importantly, this isn't even a bottleneck. It only slightly 
exacerbates an existing problem with certain balancers that already 
suffer from the overuse of locks, in a configuration that was 
specifically crafted to amplify and highlight the difference and is far
from the most realistic scenarios you describe.

* Pending verification on a performance test stand.