upstream - behavior on pool exhaustion

Fri Apr 14 08:21:47 UTC 2017

On Fri, Apr 14, 2017 at 09:41:36AM +0200, B.R. via nginx wrote:
> Hello,
> 
> Reading from upstream
> <https://nginx.org/en/docs/http/ngx_http_upstream_module.html#upstream>
> docs, on upstream pool exhaustion, every backend should be tried once, and
> then if all fail the response should be crafted based on the one from the
> last server attempt.
> So far so good.
> 
> I recently faced a server farm which implements a dull nightly restart of
> every node, not sequencing it, resulting in the possibility of having all
> nodes offline at the same time.
> 
> However, I collected log entries which did not match what I expected.
> For 6 backend nodes, I got:
> - log format: $status $body_bytes_sent $request_time $upstream_addr $upstream_response_time
> - log entry: 502 568 0.001 <IP address 1>:<port>, <IP address 2>:<port>,
> <IP address 3>:<port>, <IP address 4>:<port>, <IP address 5>:<port>,
> <IP address 6>:<port>, php-fpm 0.000, 0.000, 0.000, 0.000, 0.001, 0.000, 0.000
> I got 7 entries for $upstream_addr & $upstream_response_time, instead of
> the expected 6.
> 
> Here are the interesting parts of the configuration:
> upstream php-fpm {
>     server <machine 1>:<port> down;
>     server <machine 2>:<port> down;
>     [...]
>     server <machine N-5>:<port>;
>     server <machine N-4>:<port>;
>     server <machine N-3>:<port>;
>     server <machine N-2>:<port>;
>     server <machine N-1>:<port>;
>     server <machine N>:<port>;
>     keepalive 128;
> }
> 
> server {
>     set $fpm_pool "php-fpm$fpm_pool_ID";
>     [...]
>         location ~ \.php$ {
>             [...]
>             fastcgi_read_timeout 600;
>             fastcgi_keep_conn on;
>             fastcgi_index index.php;
> 
>             include fastcgi_params;
>             fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
>             [...]
>             fastcgi_pass $fpm_pool;
>         }
> }
> 
> The question is:
> php-fpm being an upstream group name, why was it tried as a
> domain name in the end?
> Stated otherwise, is this because the upstream group is considered 'down',
> thus somehow removed from the possibilities, and nginx trying one last time
> the name as a domain name to see if something answers?
> This 7th request is definitely strange to my point of view. Is it a bug or
> a feature?

A feature.

Most $upstream_* variables are vectored ones, and the number of entries
in their values corresponds to the number of tries made to select a peer.
When a peer cannot be selected at all (as in your case), the status is
502 and the name equals the upstream group name.
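
To see this per attempt, it may help to log $upstream_status next to the
other vectored variables.  A minimal sketch for the http context; the
format name "upstream_debug" and the log path are made up for the example:

```nginx
# Hypothetical format name and path; the variables are standard ones.
log_format upstream_debug '$status $body_bytes_sent $request_time '
                          '$upstream_addr $upstream_status '
                          '$upstream_response_time';

access_log /var/log/nginx/upstream_debug.log upstream_debug;
```

With such a format, the trailing "php-fpm" entry in $upstream_addr should
line up with a 502 entry in $upstream_status.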

There could be several reasons why none of the peers can be selected.
For example, some peers are marked "down", and other peers were failing
and are now in the "unavailable" state.

The number of tries is limited by the number of servers in the group,
unless further restricted by proxy_next_upstream_tries (or, with
fastcgi_pass, fastcgi_next_upstream_tries).  In your case,
since there are two "down" servers, and other servers are unavailable,
you reach the situation where a peer cannot be selected.  If you comment
out the two "down" servers, and try a few requests in a row when all
servers are physically unavailable, the first log entry will list all
of the attempted servers, and then for the next 10 seconds (in the
default config) you'll see only the upstream group name and 502 in
$upstream_status, until the servers become available again (see
max_fails/fail_timeout).
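
The knobs mentioned above could be spelled out roughly as follows.  This
is only a sketch: the addresses are placeholders from the documentation
range, max_fails=1 and fail_timeout=10s merely restate the defaults, and
the tries value of 2 is an arbitrary illustration, not a recommendation:

```nginx
upstream php-fpm {
    # Placeholder addresses.  max_fails=1 and fail_timeout=10s are the
    # defaults, written out only to make them visible: one failed
    # attempt marks the peer unavailable for 10 seconds.
    server 192.0.2.11:9000 max_fails=1 fail_timeout=10s;
    server 192.0.2.12:9000 max_fails=1 fail_timeout=10s;
    keepalive 128;
}

server {
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_keep_conn on;

        # Cap the number of peers tried per request (illustrative value);
        # the default of 0 means no limit beyond the group size.
        fastcgi_next_upstream_tries 2;

        fastcgi_pass php-fpm;
    }
}
```

While every peer is within its fail_timeout window, requests are expected
to log only the group name and 502, as described above.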

Hope this makes things a little bit clearer.