UDP Load balancer does not scale

Tue May 16 07:10:47 UTC 2017

Hi

I am trying to set up a UDP load balancer using Nginx. Initially, I
configured 4 upstream servers with two server processes running on each of
them.
It gave a throughput of around 24000 queries per second when tested with
dnsperf. When I add two more upstream servers, the throughput does not
increase as expected. In fact, it drops to around 5000 queries
per second, with the following errors:

[warn] 5943#0: *10433175 upstream server temporarily disabled while proxying
connection, udp client: xxx.xxx.xxx.29, server: 0.0.0.0:53, upstream:
"xxx.xxx.xxx.224:53", bytes from/to client:80/0, bytes from/to
upstream:0/80
[error] 5943#0: *10085077 no live upstreams while connecting to upstream,
udp client: xxx.xxx.xxx.224, server: 0.0.0.0:53, upstream: "dns_upstreams",
bytes from/to client:80/0, bytes from/to upstream:0/0

I understand that the above errors appear when Nginx doesn't receive
responses from an upstream in time, and that upstream is then marked as
temporarily unavailable. I used to get these errors even with 4 upstream
servers, but they went away after I added the following additional
configuration:

user nginx;
worker_processes 4;
worker_rlimit_nofile 65535;

load_module "/usr/lib64/nginx/modules/ngx_stream_module.so";

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  10240;
}

stream {
    upstream dns_upstreams {
        server xxx.xxx.xxx.0:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.0:6363 max_fails=2000 fail_timeout=0s;
        server xxx.xxx.xxx.187:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.187:6363 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.183:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.183:6363 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.212:53 max_fails=2000 fail_timeout=30s;
        server xxx.xxx.xxx.212:6363 max_fails=2000 fail_timeout=30s;
    }

    server {
        listen 53 udp;
        proxy_pass dns_upstreams;
        proxy_timeout 1s;
        proxy_responses 1;
    }
}

Even though this configuration works fine with 4 upstream servers, it
doesn't help when I increase the number of servers.
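
For readers unfamiliar with max_fails and fail_timeout, here is a rough
Python model of the failure accounting described above. It is only a sketch:
the Peer class and the pick() helper are hypothetical names made up for
illustration, and the fixed counting window is a simplification, not Nginx's
actual implementation.

```python
class Peer:
    """Toy model of one upstream entry; not nginx's real implementation."""

    def __init__(self, name, max_fails=2000, fail_timeout=30.0):
        self.name = name
        self.max_fails = max_fails        # 0 disables failure accounting
        self.fail_timeout = fail_timeout  # seconds
        self.fails = 0
        self.window_start = 0.0
        self.disabled_until = 0.0

    def available(self, now):
        return now >= self.disabled_until

    def record_failure(self, now):
        # Start a fresh counting window once the previous one has expired.
        if now - self.window_start > self.fail_timeout:
            self.window_start = now
            self.fails = 0
        self.fails += 1
        if self.max_fails and self.fails >= self.max_fails:
            # Corresponds to "upstream server temporarily disabled".
            self.disabled_until = now + self.fail_timeout
            self.window_start = now
            self.fails = 0


def pick(peers, now):
    """Return the first available peer, or None ("no live upstreams")."""
    for peer in peers:
        if peer.available(now):
            return peer
    return None
```

In this model, once every entry has accumulated max_fails timeouts inside
one fail_timeout window, pick() returns None, which is the situation the
"no live upstreams" error in the log reports.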

The Nginx server has enough memory and CPU capacity remaining when running
with 4 upstream servers as well as 6 upstream servers. And the dnsperf
client is not a bottleneck here because it can send much more load in a
different setup. Also, each individual upstream server can serve a bit more
than 5000 requests per second.
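
To put the numbers above in perspective, here is a back-of-the-envelope
calculation in Python. It is a sketch under two assumptions that are not
stated in the post: that throughput would scale linearly with the number of
upstream servers, and that queries are spread evenly over the upstream
entries. The function names are made up for illustration.

```python
def expected_linear_qps(baseline_qps, baseline_servers, new_servers):
    """Throughput expected if capacity scaled linearly with server count."""
    return baseline_qps / baseline_servers * new_servers


def trip_fraction(total_qps, entries, max_fails, fail_timeout):
    """Fraction of an entry's queries that must fail, sustained over one
    fail_timeout window, to reach max_fails (assumes an even spread)."""
    per_entry_qps = total_qps / entries
    failures_per_second = max_fails / fail_timeout
    return failures_per_second / per_entry_qps


if __name__ == "__main__":
    # 4 servers at 24000 qps, extended to 6 servers.
    print(expected_linear_qps(24000, 4, 6))               # 36000.0
    # 8 entries (4 servers x 2 processes), max_fails=2000, fail_timeout=30s.
    print(round(trip_fraction(24000, 8, 2000, 30.0), 4))  # 0.0222
```

Under these assumptions, roughly 36000 queries per second would be expected
with 6 servers, and a sustained timeout rate of about 2% on one entry would
be enough to have it marked as unavailable.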

I am trying to get some hints about why I am observing more upstream
failures and eventual unavailability when I add more servers. If anybody has
faced a similar issue in the past and can give me some pointers to solve it,
that would be of great help.

Thanks,
Ajmal

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,274257,274257#msg-274257