[BUG?] fail_timeout/max_fails: code doesn't do what doc says

Dmitry Popov dp at highloadlab.com
Sun May 19 21:34:26 UTC 2013


Hi.

http://wiki.nginx.org/HttpUpstreamModule says:
max_fails = NUMBER - number of unsuccessful attempts at communicating with the 
server within the time period (assigned by parameter fail_timeout) after which 
it is considered inoperative ...
fail_timeout = TIME - the time during which must occur *max_fails* number of 
unsuccessful attempts at communication with the server that would cause the 
server to be considered inoperative ...
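
To make the quoted description concrete, here is a minimal sketch of one literal
reading of it: a peer is inoperative when at least max_fails failures fall within
the last fail_timeout seconds. The class WindowedPeer and its methods are
hypothetical names invented for illustration, not nginx code, and the quoted text
does not say how long the peer stays inoperative, so that part is an assumption.

```python
from collections import deque


class WindowedPeer:
    """Hypothetical model: only failures from the last fail_timeout seconds count."""

    def __init__(self, max_fails, fail_timeout):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now):
        self.failures.append(now)

    def is_inoperative(self, now):
        # Assumption: failures older than fail_timeout are forgotten.
        while self.failures and now - self.failures[0] >= self.fail_timeout:
            self.failures.popleft()
        return len(self.failures) >= self.max_fails


# Three failures spread over 200 seconds, with fail_timeout = 10.
slow_peer = WindowedPeer(max_fails=3, fail_timeout=10)
for t in (0, 100, 200):
    slow_peer.record_failure(t)
slow_down = slow_peer.is_inoperative(201)

# Three failures within 3 seconds, same settings.
burst_peer = WindowedPeer(max_fails=3, fail_timeout=10)
for t in (0, 1, 2):
    burst_peer.record_failure(t)
burst_down = burst_peer.is_inoperative(2)

print(slow_down, burst_down)
```

Under this reading, only the burst of failures would take the peer out of
rotation; failures spread far apart would not.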

However, the code (ngx_http_upstream_get_peer and
ngx_http_upstream_free_round_robin_peer in
src/http/ngx_http_upstream_round_robin.c) does not implement the logic
described. Simplified, it does this:
get_peer:
  if (fails >= max_fails && now <= fail_timeout + checked)
    skip
  ...
  checked = now
free_peer:
  if (request_failed)
    fails++
    accessed = now
    checked = now
  else
    if (accessed < checked)
      fails = 0

1) fail_timeout is never consulted while a peer is accumulating failures
(that is, until fails >= max_fails).
2) The algorithm resets the fails count whenever the first request in a new
second succeeds, and a failure always increments it. Once a request has failed
within a given second, later successes in that second no longer reset the count
(because accessed == checked), so a lot depends on the first request in each
second. I don't think that's the desired behaviour.
3) I'm not sure "accessed" is a good name for a field that holds the timestamp
of the last failure.
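
Points 1 and 2 can be checked with a small simulation. This is a sketch that
transcribes only the simplified code from this message, not the real nginx
source. The Peer class and the run helper are hypothetical, time is in whole
seconds, and requests are assumed to be handled strictly one after another.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    max_fails: int
    fail_timeout: int
    fails: int = 0
    accessed: int = 0  # time of the last failure
    checked: int = 0   # time of the last selection or failure


def get_peer(peer, now):
    """Return True if the peer is selected at time `now`, False if skipped."""
    if peer.fails >= peer.max_fails and now <= peer.fail_timeout + peer.checked:
        return False
    peer.checked = now
    return True


def free_peer(peer, now, request_failed):
    if request_failed:
        peer.fails += 1
        peer.accessed = now
        peer.checked = now
    elif peer.accessed < peer.checked:
        peer.fails = 0


def run(peer, events):
    """Apply (time, request_failed) pairs in order; return the final fails count."""
    for now, failed in events:
        if get_peer(peer, now):
            free_peer(peer, now, failed)
    return peer.fails


# Point 1: three failures 100 seconds apart still disable the peer,
# even though fail_timeout is only 10 seconds.
slow = Peer(max_fails=3, fail_timeout=10)
run(slow, [(0, True), (100, True), (200, True)])
selected_at_201 = get_peer(slow, 201)


# Point 2: one failure and three successes in the same second (t = 6),
# starting from two earlier failures at t = 5. Only the order differs.
def second_six(outcomes):
    peer = Peer(max_fails=10, fail_timeout=10, fails=2, accessed=5, checked=5)
    return run(peer, [(6, failed) for failed in outcomes])


success_first = second_six([False, True, False, False])
failure_first = second_six([True, False, False, False])

print(selected_at_201, success_first, failure_first)
```

With the same mix of outcomes in second 6, the peer ends with a fails count of
1 when a success comes first and 3 when the failure comes first.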

I don't know where the error is (in the doc or in the code), and I don't know
how you (the nginx devs) intended it to work, so I don't have any constructive
suggestions, sorry.

--
Dmitry Popov
Highloadlab


