[PATCH 0 of 6] Upstream: re-resolvable servers.

Wed Feb 1 01:36:55 UTC 2023

The series is a compilation of patches with the upstream re-resolve
feature from Nginx Plus.  The original commits were rebased on top of
the current OSS code, grouped by the features they introduce, and
squashed.  Some formatting quirks and other minor oddities can be
attributed to a conscious effort to reduce divergence from the source
branch.

The last couple of patches in the series are new code that allows
sharing name resolution tasks between all the workers.

Known issues and TODOs:
===

- The whole series is known to be broken on win32 with multiple worker
  processes, as it relies on the ngx_worker value to keep track of the
  locality of data.  Initializing ngx_worker to a correct value should
  address that.
  'noreuse' zones also seem to be unsupported on this platform, so
  configuration reload may fail.

- The functionality requires a shared zone of sufficient size to be
  configured in the upstream block.  A rough estimate is 2k per
  configured server entry + 2k per resolved address.
  The zone requirement could be lifted with local allocation of the
  resolved peer data, but implementing that was out of scope.

- Resolved peer addresses are not carried over to a new generation of
  workers during configuration reload (see below).

- Tests still require some cleanup and will be published later.
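
To make the zone size estimate above concrete, here is a minimal
sketch of the arithmetic.  It assumes that "2k" means 2048 bytes; the
helper name and the constants are hypothetical and are not part of the
series.

```c
#include <stddef.h>

/* Assumption: "2k" in the rough estimate means 2048 bytes. */
#define EST_BYTES_PER_SERVER   2048u
#define EST_BYTES_PER_ADDRESS  2048u

/*
 * Hypothetical helper, not part of the patch series: rough size of the
 * zone needed for `servers` configured server entries that resolve to
 * `resolved_addrs` addresses in total.
 */
static size_t
estimate_zone_bytes(size_t servers, size_t resolved_addrs)
{
    return servers * EST_BYTES_PER_SERVER
           + resolved_addrs * EST_BYTES_PER_ADDRESS;
}
```

For example, 4 server entries that resolve to 32 addresses in total
give (4 + 32) * 2k = 72k.  This only covers the two terms of the rough
estimate; anything else stored in the zone comes on top of it.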

Peer list population delay
===

On a cold start, a reload or a binary upgrade, upstreams that contain
only resolvable servers will have an empty list of peers.  This leads
to a short delay before Nginx is able to send traffic to the upstream.
There's no perfect solution for that: if the server list in the
configuration has changed, it's no longer compatible with the data we
collected for the previous config.  If the resolver parameters were
modified, we may get an entirely different set of servers.

The following options were considered:

- Publishing the preresolve code from Nginx Plus as is.
  The solution involves copying peer states from the non-reusable zone
  of a previous generation of workers.  This only addresses the reload
  case and may result in stale peer data if the configuration
  changes.
  The advantage of this code is that it is heavily tested and has been
  running in multiple production environments for many years.

- Sharing the zone between all generations of workers.
  This requires some changes in the code, notably improving reference
  counting and cleanup for peer data in the shared zone (as we're no
  longer able to discard the old zone with all the allocated data) and
  tracking the upstream configuration compatibility.  It also doesn't
  work when the zone size has changed in the config.

  The approach leads to increased memory requirements: zone size should
  be configured to accommodate multiple generations of workers, and we
  are aware of deployments that have lots of those due to long-living
  connections.  Nginx OSS does not offer any means to monitor shared
  memory usage at the moment, so I fear this approach will hurt a lot of
  unsuspecting users who haven't reserved enough memory.

  There are also performance concerns, as access to the same list of
  peers from multiple generations of workers would increase lock
  contention (and the situation already doesn't look good with
  round-robin lb).  We could copy the peers instead of attempting to
  reuse them, but that prevents us from optimizing the memory usage.

- Queueing the requests until we finish the initial cycle of name
  resolution ('queue' directive of the ngx_http_upstream_module).
  This option adds a latency spike at the moment of configuration
  reload.  There's also an issue with propagation of the upstream
  readiness state to all the worker processes - we need an event
  passing channel to be able to resume queued requests immediately.
  On the positive side, this would mitigate downtime for all 3
  scenarios, as long as the queue capacity is sufficient.
  Given the latency spike, it doesn't seem to be a good standalone
  solution.  But it might be a nice addition to one of the options
  above.

Alternatives like pre-resolving servers during configuration load were
not considered due to complexity and significant disadvantages.

Maxim, from the list archives I understand that you had a negative
opinion on the current approach with noreuse zones and pre-resolve,
but I'm afraid there wasn't enough context to understand all the sides
of that discussion.  I'd appreciate it if you could share your thoughts
on the problem and on the approach you consider architecturally correct.