[PATCH] Locality-based Least connection with optional randomization

Sun Feb 22 03:05:31 UTC 2015

Good weekend, everyone!
Let me start by describing my problem and then move on to the proposed solution.

Problem:
Currently we have a number of PoPs (Points-of-Presence) around the world with Linux/nginx doing TCP/TLS/HTTP termination. There we re-encrypt traffic and proxy_pass it to an upstream block with a HUGE set of servers. The whole idea of those PoP nginxes is to have a pool of keepalive connections with enormous TCP windows to the upstreams.
But in reality we cannot use any of nginx's connection balancing methods because they almost never reuse connections (yet again, our upstream list is huge). Also, each worker has its own keepalive pool, which makes the situation even worse. Of course we can generate per-server config files and give each server in each PoP a different (and small) set of upstream servers, but that solution sounds awfully “clunky”.

Solution:
IPVS, for example, among its numerous job scheduling modes has Locality-Based Least-Connection Scheduling[1], which looks quite close to what we want. The only problem is that if all the worker processes on all our boxes around the world use the same list of upstreams, they will quickly overload the first upstream, then the second, etc. Therefore I’ve added a randomized mode in which each worker starts filling upstreams from some random starting point. That should give good locality for TCP connection reuse and, as the law of large numbers implies, good enough load distribution across upstreams globally.
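
To make the overload argument concrete, here is a toy simulation (a sketch only, not taken from the PoC). It assumes every worker opens the same number of connections and puts at most per_peer_limit of them on one upstream before moving to the next; the function name, parameters and numbers are all made up for illustration.

```python
import random


def simulate(num_workers, num_upstreams, conns_per_worker,
             per_peer_limit, randomized, seed=0):
    """Return the total connection count per upstream across all workers.

    Each worker walks the upstream list from its starting index and places
    up to `per_peer_limit` connections on each upstream it visits.
    All parameters are hypothetical knobs for this illustration.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    load = [0] * num_upstreams
    for _ in range(num_workers):
        # Non-randomized mode: every worker starts at upstream 0.
        start = rng.randrange(num_upstreams) if randomized else 0
        remaining = conns_per_worker
        i = start
        while remaining > 0:
            take = min(per_peer_limit, remaining)
            load[i % num_upstreams] += take
            remaining -= take
            i += 1
    return load


fixed = simulate(1000, 100, 8, 4, randomized=False)
spread = simulate(1000, 100, 8, 4, randomized=True)
print("fixed start:  max load", max(fixed), "idle upstreams", fixed.count(0))
print("random start: max load", max(spread), "idle upstreams", spread.count(0))
```

With a fixed starting point all 8000 connections land on the first two upstreams and the other 98 stay idle; with a per-worker random starting point each worker still touches only two upstreams (so locality is kept), but the hot spots are scattered over the whole list.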

Implementation:
PoC:
	coloured: https://gist.github.com/SaveTheRbtz/d6a505555cd02cb6aee6
	raw: https://gist.githubusercontent.com/SaveTheRbtz/d6a505555cd02cb6aee6/raw/5aba3b0709777d2a6e99217bd3e06e2178846dc4/least_conn_locality_randomized.diff

It basically tries to find the first (starting from a per-worker random position in the randomized variant) not fully loaded peer, and if that fails it falls back to normal least_conn.
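
The selection logic described above can be sketched roughly like this. This is Python pseudocode for readability, not the C from the gist; the Peer class, the limit field (what counts as “fully loaded”) and pick_peer are hypothetical names, and the fallback assumes least_conn means the lowest connections-to-weight ratio.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    conns: int = 0   # currently active connections
    limit: int = 1   # hypothetical "fully loaded" threshold
    weight: int = 1


def pick_peer(peers, start=0):
    """Pick a peer, preferring locality over even spreading.

    Scan the list circularly from `start` (0 for the plain variant, a
    per-worker random index for the randomized one) and return the first
    peer that is below its limit. If every peer is fully loaded, fall
    back to least-connections by conns/weight ratio.
    """
    n = len(peers)
    for k in range(n):
        peer = peers[(start + k) % n]
        if peer.conns < peer.limit:
            return peer
    # All peers are at or above their limit: plain least_conn behaviour.
    return min(peers, key=lambda p: p.conns / p.weight)


peers = [Peer("a", conns=2, limit=2),
         Peer("b", conns=0, limit=2),
         Peer("c", conns=1, limit=2)]
print(pick_peer(peers, start=0).name)  # "a" is full, so the scan stops at "b"
print(pick_peer(peers, start=2).name)  # a worker starting at index 2 gets "c"
```

In the randomized variant the start index would be chosen once per worker and then kept, so that a worker keeps reusing the same small group of upstreams.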

Followup questions:
Does anyone in the community have similar use cases? CloudFlare maybe?
Is Nginx Inc interested in incorporating a patch like that, or is it too specific to our workflow? Should I prettify the PoC or should I just throw the ball your way?

Alternative solution:
The original upstream keepalive module[2] had a “single” keyword that also suits our needs, though it was removed because, to quote Maxim Dounin:
	The original idea was to optimize edge cases in case of interchangeable
	backends, i.e. don't establish a new connection if we have any one
	cached.  This causes more harm than good though, as it screws up
	underlying balancer's idea about backends used and may result in
	various unexpected problems.

[1] http://kb.linuxvirtualserver.org/wiki/Locality-Based_Least-Connection_Scheduling
[2] http://mdounin.ru/hg/ngx_http_upstream_keepalive/