Achieving strong client -> upstream_server affinity

Sat Sep 1 00:19:59 UTC 2012

Hello!

On Fri, Aug 31, 2012 at 07:58:52PM +0530, Srirang Doddihal wrote:

[...]

> > In addition to "proxy_next_upstream off" you should at least
> > specify max_fails=0 for upstream servers.  Else a server might
> > be considered down and request from a client will be re-hashed to
> > another one.
> 
> Got it. How about the following scenario :
> 
> Request - 1]  "client-1" is forwarded to "server-1".
> Request - 2]  "server-1" does not respond properly and hence is
> considered down. "client-1" gets an error message
> Request - 3]  "client-1" is now hashed to "server-2" and is forwarded
> to "server-2"
> Request - 4]  Now will "client-1" continue to be forwarded to
> "server-2" or will it come back to "server-1"?
> 
> i.e. is the re-hash permanent or temporary?

The only information kept is the upstream server status.  As soon 
as server-1 is considered alive again (after fail_timeout), 
client-1 will be forwarded to server-1 again.
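
For illustration, a minimal sketch of the kind of setup discussed 
here.  The ip_hash method, the "backend" name and the server 
addresses are assumptions for the example, not taken from the 
configuration posted in this thread:

```nginx
# Inside the http { } block.
upstream backend {
    ip_hash;    # assumed hashing method

    # With the defaults (max_fails=1, fail_timeout=10s) a failed
    # server is skipped for fail_timeout, and its clients are
    # re-hashed to another server during that time.
    # max_fails=0 disables failure accounting, so the server is
    # never marked as down and clients are not re-hashed.
    server 192.0.2.11:8080 max_fails=0;
    server 192.0.2.12:8080 max_fails=0;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_next_upstream off;    # never retry on another server
    }
}
```

With such a configuration a client whose server misbehaves gets an 
error, but keeps being sent to the same server.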

> > Note that docs might be a bit misleading here as they say one
> > should refer to proxy_next_upstream setting to see what is
> > considered to be server failure.  This isn't exactly true: if
> > upstream server fails to return correct http answer (i.e. on
> > error, timeout, invalid_header in proxy_next_upstream terms) the
> > failure is always counted.
> 
> Understood till here.
> 
> > What can be considered to be failure
> > or not is valid http responses, i.e. http_500 and so on.
> >
> 
> This was confusing. Are you saying that only HTTP 1xx, 2xx or 3xx
> responses from the upstream server will not count towards failure
> count and any 4xx or 5xx responses will be considered as failures?

No.  The proxy_next_upstream directive has the following valid 
values (http://nginx.org/r/proxy_next_upstream):

: error
: timeout
: invalid_header
: http_500
: http_502
: http_503
: http_504
: http_404
: off

By default, a valid http response, regardless of its status code, 
is just a valid http response, and it isn't counted as a failure.

But if you write "http_500" in proxy_next_upstream, then a 
(perfectly valid and in many cases expected) HTTP response with 
status code 500 will be counted as a server failure.  (And nginx 
will throw away the response and will try to ask another upstream 
server for a response.)

The same applies to http_502, http_503, http_504.
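
A short sketch of the difference, assuming a hypothetical upstream 
named "backend":

```nginx
location / {
    proxy_pass http://backend;

    # Default value: a response with status 500 is returned to the
    # client as is and is not counted as a failure.
    # proxy_next_upstream error timeout;

    # With http_500 listed, a 500 response is counted as a failure
    # of that server, and the request is passed to the next one.
    proxy_next_upstream error timeout http_500;
}
```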

[...]

> > This is strange, and you may want to provide more info, see
> > http://wiki.nginx.org/Debugging.
> >
> 
> I will try to get a debug log. Currently I am using the Ubuntu
> package. I probably will have to do a custom build for this.
> 
> > I would suggest this is likely some configuration error though
> > (requests are handled in another location, without
> > proxy_next_upstream set to off?).
> 
> All requests to the concerned upstream servers are sent from only one
> location and that location has proxy_next_upstream set to off.

Is anything like "keepalive" omitted (for clarity) from the 
upstream block snippet you posted?

The only case in the code which allows a request to be passed to 
multiple upstream servers with proxy_next_upstream set to off is a 
failed request in a cached connection (this case isn't quite 
correct either, but it's (a) a known issue and (b) rather uncommon 
under normal conditions).
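
For reference, a sketch of what a configuration with cached 
upstream connections typically looks like.  The upstream name, the 
addresses and the number of cached connections are assumptions for 
the example:

```nginx
upstream backend {
    ip_hash;    # assumed hashing method
    server 192.0.2.11:8080 max_fails=0;
    server 192.0.2.12:8080 max_fails=0;

    # Keeps up to 16 idle connections to upstream servers cached in
    # each worker process.  Must come after the balancing method.
    keepalive 16;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Needed for keepalive connections to upstream servers.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_next_upstream off;
    }
}
```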

> I am setting up a test environment to isolate this issue. I will get
> back with more details a little later.
> Is there anything specific that you want to capture?

Nothing special, just nginx -V (and it's better to make sure there 
are no 3rd party modules/patches), a minimal full config to 
reproduce the problem, and a debug log.  Much like the link above 
suggests.
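
A debug log can be enabled roughly as follows; the log path is 
only an example:

```nginx
# In the main (top-level) context of nginx.conf.
# Requires a binary built with --with-debug; check that this option
# is listed in the "nginx -V" output.
error_log /var/log/nginx/debug.log debug;
```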

Once you've set up a test environment you may also want to check 
if the problem is still reproducible with the latest nginx.  As 
far as I understand Ubuntu currently ships nginx 1.1.19, which 
isn't as old as it could be, keeping in mind Ubuntu is a Debian 
derivative ;), but there have been several important changes in 
the upstream infrastructure since then.  While I don't recall 
anything which might affect the "proxy_next_upstream off" case, 
it's always a good idea to test the latest version.

Maxim Dounin