upstream - behavior on pool exhaustion

Wed Apr 19 08:51:39 UTC 2017

On Tue, Apr 18, 2017 at 02:39:27PM +0200, B.R. via nginx wrote:
> 'down' should not translate into any kind of attempt, so nothing should
> really appear for the servers in that static state.
> For 'unavailable' servers, for the most part the content of the variables
> should be the same.
> 
> Starting from the example I provided, here is what I expected to see:
> - $upstream_addr: <IP address 1>:<port>, <IP address 2>:<port>, <IP address
> 3>:<port>, <IP address 4>:<port>, <IP address 5>:<port>, <IP address
> 6>:<port>
> - $upstream_response_time: 0.000, 0.000, 0.000, 0.000, 0.001, 0.000
> 
> That, associated with the 502 response from the HTTP language, is
> sufficient to interpret the log entry as: the request failed to find a
> proper backend after having attempted communication with the 6 specified
> active backends. It is pretty straightforward.

And what about the next request when all of servers are either "down"
or "unavailable"?

I'm just trying to give you a simple example of when the value of the
$upstream_addr variable has a special value equal to the upstream
group name, and why it can be useful.

> If you want to add something to explicitely states the whole upstream group
> is down, this should go to the error log.
> At the very least, if the current way of working is kept, the grammar of
> the content of the $upstream_* variables should be specified.
> 
> Does not that seem reasonable?
> ---
> *B. R.*
> 
> On Mon, Apr 17, 2017 at 6:09 PM, Ruslan Ermilov <ru at nginx.com> wrote:
> 
> > On Sat, Apr 15, 2017 at 03:55:20AM +0200, B.R. via nginx wrote:
> > > Let me be clear here:
> > > I got 6 active servers (not marked down), and the logs show 1 attempt on
> > > each. They all failed for a known reason, and there is no problem there.
> > > Subsequently, the whole pool was 'down' and the response was 502.
> > > Everything perfectly normal so far.
> > >
> > > What is unclear is the feature (as you classified it) of having a fake
> > > node named after the pool appearing in the list of tried upstream
> > servers.
> > > It brings confusion more than anything else: having a 502 response + the
> > > list of all tried (and failed) nodes corresponding with the list of
> > active
> > > nodes is more than enough to describe what happened.
> > > The name of the upstream group does not corresponding to any real asset,
> > it
> > > is purely virtual classification. It thus makes no sense at all to me to
> > > have it appearing as a 7th 'node' in the list... and how do you interpret
> > > its response time (where you got also a 7th item in the list)?
> > > Moreover, it is confusing, since proxy_pass handles domain names and one
> > > could believe nginx treated the upstream group name as such.
> >
> > Without the six attempts, if all of the servers are unreachable (either
> > "down" or "unavailable" because they have failed previously) at the time
> > the request starts, what do you expect to see in $upstream_*?
> >
> > > On Fri, Apr 14, 2017 at 10:21 AM, Ruslan Ermilov <ru at nginx.com> wrote:
> > >
> > > > On Fri, Apr 14, 2017 at 09:41:36AM +0200, B.R. via nginx wrote:
> > > > > Hello,
> > > > >
> > > > > Reading from upstream
> > > > > <https://nginx.org/en/docs/http/ngx_http_upstream_module.htm
> > l#upstream>
> > > > > docs, on upstream pool exhaustion, every backend should be tried
> > once,
> > > > and
> > > > > then if all fail the response should be crafted based on the one
> > from the
> > > > > last server attempt.
> > > > > So far so good.
> > > > >
> > > > > I recently faced a server farm which implements a dull nightly
> > restart of
> > > > > every node, not sequencing it, resulting in the possibility of
> > having all
> > > > > nodes offline at the same time.
> > > > >
> > > > > However, I collected log entries which did not match what I was
> > expected.
> > > > > For 6 backend nodes, I got:
> > > > > - log format: $status $body_bytes_sent $request_time $upstream_addr
> > > > > $upstream_response_time
> > > > > - log entry: 502 568 0.001 <IP address 1>:<port>, <IP address
> > 2>:<port>,
> > > > > <IP address 3>:<port>, <IP address 4>:<port>, <IP address 5>:<port>,
> > <IP
> > > > > address 6>:<port>, php-fpm 0.000, 0.000, 0.000, 0.000, 0.001, 0.000,
> > > > 0.000
> > > > > I got 7 entries for $upstream_addr & $upstream_response_time,
> > instead of
> > > > > the expected 6.
> > > > >
> > > > > Here are the interesting parts of the configuration:
> > > > > upstream php-fpm {
> > > > >     server <machine 1>:<port> down;
> > > > >     server <machine 2>:<port> down;
> > > > >     [...]
> > > > >     server <machine N-5>:<port>;
> > > > >     server <machine N-4>:<port>;
> > > > >     server <machine N-3>:<port>;
> > > > >     server <machine N-2>:<port>;
> > > > >     server <machine N-1>:<port>;
> > > > >     server <machine N>:<port>;
> > > > >     keepalive 128;
> > > > > }
> > > > >
> > > > > server {
> > > > >     set $fpm_pool "php-fpm$fpm_pool_ID";
> > > > >     [...]
> > > > >         location ~ \.php$ {
> > > > >             [...]
> > > > >             fastcgi_read_timeout 600;
> > > > >             fastcgi_keep_conn on;
> > > > >             fastcgi_index index.php;
> > > > >
> > > > >             include fastcgi_params;
> > > > >             fastcgi_param SCRIPT_FILENAME
> > > > > $document_root$fastcgi_script_name;
> > > > >             [...]
> > > > >             fastcgi_pass $fpm_pool;
> > > > >         }
> > > > > }
> > > > >
> > > > > The question is:
> > > > > php-fpm being an upstream group name, how come has it been tried as a
> > > > > domain name in the end?
> > > > > Stated otherwise, is this because the upstream group is considered
> > > > 'down',
> > > > > thus somehow removed from the possibilities, and nginx trying one
> > last
> > > > time
> > > > > the name as a domain name to see if something answers?
> > > > > This 7th request is definitely strange to my point of view. Is it a
> > bug
> > > > or
> > > > > a feature?
> > > >
> > > > A feature.
> > > >
> > > > Most $upstream_* variables are vectored ones, and the number of entries
> > > > in their values corresponds to the number of tries made to select a
> > peer.
> > > > When a peer cannot be selected at all (as in your case), the status is
> > > > 502 and the name equals the upstream group name.
> > > >
> > > > There could be several reasons why none of the peers can be selected.
> > > > For example, some peers are marked "down", and other peers were failing
> > > > and are now in the "unavailable" state.
> > > >
> > > > The number of tries is limited by the number of servers in the group,
> > > > unless futher restricted by proxy_next_upstream_tries.  In your case,
> > > > since there are two "down" servers, and other servers are unavailable,
> > > > you reach the situation when a peer cannot be selected.  If you comment
> > > > out the two "down" servers, and try a few requests in a row when all
> > > > servers are physically unavailable, the first log entry will list all
> > > > of the attempted servers, and then for the next 10 seconds (in the
> > > > default config) you'll see only the upstream group name and 502 in
> > > > $upstream_status, until the servers become available again (see
> > > > max_fails/fail_timeout).
> > > >
> > > > Hope this makes things a little bit clearer.
> > > > _______________________________________________
> > > > nginx mailing list
> > > > nginx at nginx.org
> > > > http://mailman.nginx.org/mailman/listinfo/nginx
> >
> >
> > --
> > Ruslan Ermilov
> > Assume stupidity not malice
> > _______________________________________________
> > nginx mailing list
> > nginx at nginx.org
> > http://mailman.nginx.org/mailman/listinfo/nginx
> >

-- 
Ruslan Ermilov
Assume stupidity not malice