feature request: warn when domain name resolves to several addresses

Thu Nov 21 20:53:17 UTC 2019

On Wed, Nov 20, 2019 at 12:28 PM Maxim Dounin <mdounin at mdounin.ru> wrote:
>
> Hello!
>
> On Tue, Nov 19, 2019 at 07:26:35PM -0700, Roger Pack wrote:
>
> > On Tue, Nov 19, 2019 at 12:01 PM Maxim Dounin <mdounin at mdounin.ru> wrote:
> >
> > > On Tue, Nov 19, 2019 at 10:47:01AM -0700, Roger Pack wrote:
> > >
> > > > I noticed that in ngx_http_proxy_module
> > > >
> > > > proxy_pass http://localhost:8000/uri/;
> > > > "If a domain name resolves to several addresses, all of them will be
> > > > used in a round-robin fashion. In addition, an address can be
> > > > specified as a server group."
> > > >
> > > > However this can be confusing for end users who innocently put the
> > > > domain name "localhost" then find that round-robin across ipv6 and
> > > > ipv4 is occurring, ref:
> > > > https://stackoverflow.com/a/58924751/32453
> > >
> > > This seems to be your own answer, and it looks incorrect to me.
> > > In particular, the 499 error is logged when the client closes
> > > connection, and there is no need to have more than one backend
> > > server specified to see 499 errors.
> >
> > True, those cases were covered in some other answers to that question,
> > but I'll add a note. :)
> > It can also be logged when the backend server times out, at least
> > empirically that seems to be the case...
> > see also https://serverfault.com/questions/523340/post-request-is-repeated-with-nginx-loadbalanced-server-status-499/783624#783624
>
> It is logged when the client closes the connection, only.  But
> reasons why the client closes the connection might be different.
>
> In particular, when the backend server times out, it means that
> processing the request takes a long time.  And if processing
> takes time, it is likely that the client will give up waiting and
> will close the connection, resulting in 499.

OK, you're right, thank you for the hint.  It turns out our client had
a 60s timeout, so we'd see a "connection timed out" error log entry and
a "499 response" in quick succession and thought they were related.

Thank you, that helped me figure out what was going on with my system!

> > > > https://stackoverflow.com/a/52550758/32453
> > >
> > > Changing "localhost" to "127.0.0.1" here "works" because having just
> > > one address triggers slightly different logic in the upstream
> > > code: with just one address, max_fails / fail_timeout logic is
> > > disabled, and nginx always uses the (only) address available, even
> > > if there are errors.
> >
> > Right.  The confusion in my mind is that people configuring nginx will
> > use one backend, "localhost", and assume they have set up a
> > "single server" type server group, since they have listed only one
> > host.  But they have not...
> > See for instance https://stackoverflow.com/a/52550758
> >
> > > The underlying problem is still the same though: backends cannot
> > > cope with the load, and there are errors.
> >
> > Right.  However with the "single server" scenario this behavior is
> > handled differently (it doesn't exhaust the server group of available
> > servers and begin to return with 502's exclusively for a time, as it
> > did in my instance...).
> >
> > Basically if, while setting it up, you happen to forward to 127.0.0.1,
> > it will work fine, no "periods of 502's" (though you may get some
> > 504's).
> >
> > But if you forward it to "localhost" you may be surprised one day to
> > discover that you are getting "periods of 502's" if any connections
> > time out (> 60s) for any reason, since only 2 of those are enough to
> > exhaust your entire server group.
>
> I don't think people know and/or expect the difference in handling
> between single address and multiple addresses, regardless of
> whether they know there are multiple addresses, or not.  As such,
> a configuration-time warning won't help.
>
> Rather, we can consider explaining the difference.  Alternatively,
> we can make it go away - either by changing the single-address case
> to be identical to the multiple-addresses one, or vice versa.  Or even
> by making this configurable.
> (Actually, previously the multiple-addresses case was handled
> differently, closer to the single-address approach, and resulted
> in just one 502, with "quick recovery" of all servers on the first
> request.  But some time ago this was changed to follow
> fail_timeout instead, as quick recovery of all servers seems to
> cause more harm than good in most configurations.)

Yeah, it might make sense to make the behavior similar.  Maybe never
disable the "last server marked as available" (of a server group), or
enforce the 10s fail_timeout for a single server too (if it was useful
for multiple...then again, maybe single is supposed to be a simpler
configuration?).

Or maybe add a warning to the documentation near where it says "If a
domain name resolves to several addresses, all of them will be used in
a round-robin fashion."  Something like:

  "If you specify a hostname like 'localhost' and your system supports
  both IPv4 and IPv6, the hostname can be interpreted to mean two
  different servers.  Specify an exact IP address, like '127.0.0.1',
  if you wish to avoid this ambiguity."
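
To illustrate the difference, here is a minimal sketch.  It assumes a
backend listening on port 8000 and a system where "localhost" resolves
to both 127.0.0.1 and ::1; the two location paths are made up for the
example.

```nginx
# Fragment for a server {} block.  Paths are hypothetical.

# "localhost" may resolve to both 127.0.0.1 and ::1.  If it does, nginx
# treats this as a group of two addresses and round-robins between them.
location /ambiguous/ {
    proxy_pass http://localhost:8000/uri/;
}

# A literal IP address yields exactly one upstream address, so the
# single-address handling discussed above applies.
location /explicit/ {
    proxy_pass http://127.0.0.1:8000/uri/;
}
```

Whether the first form really expands to two addresses depends on the
resolver configuration of the host at the time nginx parses the config.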

Also, the documentation for max_fails, fail_timeout and slow_start
could perhaps note that these parameters are ignored when there is
only a single server.
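
As I understand the discussion above, proxying to a name that resolves
to two addresses behaves roughly like the explicit group sketched
below.  This is only an illustration: the upstream name "local_backend"
and the listen port are made up, and max_fails=1 / fail_timeout=10s are
the documented defaults written out explicitly.

```nginx
# Fragment for the http {} block.  Names and ports are hypothetical.
upstream local_backend {
    # Two addresses: after one failed attempt (e.g. a timeout) a server
    # is skipped for fail_timeout.  Two such failures leave no live
    # upstreams, and requests get 502 until a fail_timeout expires.
    server 127.0.0.1:8000 max_fails=1 fail_timeout=10s;
    server [::1]:8000     max_fails=1 fail_timeout=10s;
}

server {
    listen 8080;

    location / {
        proxy_pass http://local_backend/uri/;
    }
}
```

With only one "server" line in the group, per the explanation quoted
earlier, the max_fails / fail_timeout logic is disabled and that one
address is always used.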

> > > (And no, it's not a DNS failure - DNS is only used when nginx
> > > resolves the name in the proxy_pass directive while parsing
> > > configuration on startup.)
> > >
> > > > Suggestion/feature request: If a domain name resolves to several
> > > > addresses, log a warning in the error.log file, or at least in the
> > > > output of -T.  Then there won't be unexpected
> > > > round-robins occurring and "supposedly single" servers being
> > > > considered unavailable due to timeouts, surprising people like myself.
> > >
> > > Multiple addresses are fairly normal, and I don't think that
> > > logging a warning is a good idea.
> >
> > I'm just saying...it might help somebody like me out, in the future.
> > There be dragons...or maybe the default error log could be configured
> > to make it more obvious to people what is going on?
> > (https://stackoverflow.com/a/52550758)
>
> From the error log things are expected to be pretty obvious -
> nginx logs the original errors, and it also logs when it cannot
> pick an upstream server to use ("no live upstreams", which means
> "all upstream servers are disabled due to errors").  Further, it
> also logs when it disables a server, though it happens on the
> "warn" level.

Might be nice to log that at the error level, or possibly add it to
the "upstream timed out" error log message, like "upstream timed
out, marking server as unavailable" or something like that (if easier
:).
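
In the meantime, anyone who wants to see the warn-level messages can
lower the log threshold.  A minimal sketch, assuming the log lives at
/var/log/nginx/error.log (the path varies by installation):

```nginx
# Main-context fragment.  The log path is an assumption.
# The default level is "error"; "warn" also records lower-severity
# messages, such as the one logged when an upstream server is disabled.
error_log /var/log/nginx/error.log warn;
```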

A few more thoughts/ideas for that error message: it could maybe be
enhanced a bit, e.g. "upstream timed out after x seconds" and "trying
next server" (or "giving up"), depending on what it does next.  Just
for quicker understanding of what decisions are being made (and which
configs are being respected).

> The main problem is that people hardly look into error logs at
> all.  For example, the answer you are referring to only provides
> access log information, and this is what makes it confusing.  On
> the other hand, another answer to the same question is based on
> the "no live upstreams" error message from the question, and
> correctly refers to the max_fails/fail_timeout parameters.

I looked at the error logs when problems started happening (502's), so
the error logs are useful! :)

My answer references the error log (or at least does now, with some
recent changes):

https://stackoverflow.com/a/58924751/32453
Some others don't :)

Thanks for your thoughtful replies.
Cheers!
-Roger-