improve the first selection of SWRR algorithm

Tue Nov 10 16:41:04 UTC 2020

Hello!

On Wed, Nov 04, 2020 at 12:58:51PM +0000, 陈洁 Cjhust Chen wrote:

> Hi:
>      We improve the Smooth Weighted Round-Robin (SWRR) 
>      algorithm to successfully resolve the problem in the 
>      following situations.
> 
> Situation 1:
> upstream backend-server {
>      server 1.1.1.1:8000  weight=100;
>      server 2.2.2.2:8000  weight=101;
>      server 3.3.3.3:8000  weight=100;
> }
> 
> 1. When every machine in the cluster executes "-s reload" at 
> the same time, the first selection on each machine is 
> 2.2.2.2:8000, the server with the highest weight, which leads to 
> a 300%+ increase in traffic to 2.2.2.2:8000.
> 2. More and more companies are implementing service discovery 
> based on nginx. Adding or removing a machine will also lead to 
> a 300%+ increase in traffic to 2.2.2.2:8000.
> 
> 
> 
> Situation 2:
> upstream backend-server {
>      server 1.1.1.1:8000  weight=100;
>      server 2.2.2.2:8000  weight=100;
>      server 3.3.3.3:8000  weight=100;
> }
> 
> 1. When every machine in the cluster executes "-s reload" at 
> the same time, the first selection on each machine is the first 
> server, 1.1.1.1:8000, which leads to a 300%+ increase in 
> traffic to 1.1.1.1:8000.
> 2. More and more companies are implementing service discovery 
> based on nginx. Adding or removing a machine will also lead to 
> a 300%+ increase in traffic to 1.1.1.1:8000.
> 
> 
> 
> 
> 
> # HG changeset patch
> # User Jie Chen <cherrychenjie at didiglobal.com>
> # Date 1599813602 -28800
> #      Fri Sep 11 16:40:02 2020 +0800
> # Node ID 931b0c055626657d68f886781c193ffb09245a2e
> # Parent  da5e3f5b16733167142b599b6af3ce9469a07d52
> improve the first selection of SWRR algorithm
> 
> diff -r da5e3f5b1673 -r 931b0c055626 src/http/ngx_http_upstream_round_robin.c
> --- a/src/http/ngx_http_upstream_round_robin.c  Wed Sep 02 23:13:36 2020 +0300
> +++ b/src/http/ngx_http_upstream_round_robin.c  Fri Sep 11 16:40:02 2020 +0800
> @@ -91,7 +91,7 @@
>                  peer[n].name = server[i].addrs[j].name;
>                  peer[n].weight = server[i].weight;
>                  peer[n].effective_weight = server[i].weight;
> -                peer[n].current_weight = 0;
> +                peer[n].current_weight = 0 - ngx_random() % peers->total_weight;
>                  peer[n].max_conns = server[i].max_conns;
>                  peer[n].max_fails = server[i].max_fails;
>                  peer[n].fail_timeout = server[i].fail_timeout;
> @@ -155,7 +155,7 @@
>                  peer[n].name = server[i].addrs[j].name;
>                  peer[n].weight = server[i].weight;
>                  peer[n].effective_weight = server[i].weight;
> -                peer[n].current_weight = 0;
> +                peer[n].current_weight = 0 - ngx_random() % peers->total_weight;
>                  peer[n].max_conns = server[i].max_conns;
>                  peer[n].max_fails = server[i].max_fails;
>                  peer[n].fail_timeout = server[i].fail_timeout;
> 
> 

Thank you for your patch.
In no particular order:

- Traffic on a particular server is not expected to be noticeably 
  increased after nginx restart / configuration reload unless 
  there are very few requests.

- Further, given that a reload happens at some random time, adding 
  another source of randomness is not going to help.  That is, the 
  patch seems to only improve things if nginx is reloaded after a 
  small non-random number of requests.

- Using "peers->total_weight" for backup peers is wrong.

- Using the same current_weight for all worker processes is 
  essentially the same problem as the one you are trying to solve.

- The patch breaks the "sum of all current weights is 0" 
  invariant.  This is not fatal, yet complicates things for no 
  obvious reason.

- In general, it might be a better idea to use the random balancer 
  if you are indeed facing the problems described 
  (http://nginx.org/r/random).
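
To make the points about the first selection and the invariant 
concrete, here is a minimal sketch of one selection step.  It is 
not the actual nginx code: swrr_peer_t, swrr_pick() and swrr_sum() 
are hypothetical names used only for illustration, and the sketch 
ignores effective_weight, failures, max_conns and backup peers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified peer; not ngx_http_upstream_rr_peer_t. */
typedef struct {
    int  weight;          /* configured weight */
    int  current_weight;  /* running counter, 0 after initialization */
} swrr_peer_t;

/*
 * One selection step: add each peer's weight to its counter, pick
 * the peer with the largest counter (the first one wins a tie),
 * then subtract the total weight from the winner.
 */
size_t
swrr_pick(swrr_peer_t *peers, size_t n)
{
    size_t  i, best;
    int     total;

    assert(n > 0);

    best = 0;
    total = 0;

    for (i = 0; i < n; i++) {
        peers[i].current_weight += peers[i].weight;
        total += peers[i].weight;

        if (peers[i].current_weight > peers[best].current_weight) {
            best = i;
        }
    }

    peers[best].current_weight -= total;

    return best;
}

/* Sum of all counters; a step adds and subtracts the same total. */
int
swrr_sum(const swrr_peer_t *peers, size_t n)
{
    size_t  i;
    int     sum;

    sum = 0;

    for (i = 0; i < n; i++) {
        sum += peers[i].current_weight;
    }

    return sum;
}
```

With counters starting at 0, weights 100/101/100 select the second 
peer first, and weights 100/100/100 select the first peer first, as 
described in the two situations.  Each step adds the total weight 
and then subtracts it again, so the sum of the counters stays at 
whatever it was initially: 0 in the current code, and some negative 
value if the counters start at random negative offsets.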

-- 
Maxim Dounin
http://mdounin.ru/