DNS Load Balancing keeps getting upstream errors

mangysushi nginx-forum at forum.nginx.org
Thu Aug 31 01:04:30 UTC 2017


Hello!

I was excited to learn that nginx is one of the few load balancers that
support DNS. In my EC2 setup, I have nginx running on an m4.large
instance, and my DNS test load comes from a t2.micro one. I have two
nameservers to be load balanced, each running on a t2.medium.

Here is my config:
$ cat /etc/nginx/nginx.conf
# For more information on configuration, see:
#   * Official English Documentation: http://nginx.org/en/docs/
#   * Official Russian Documentation: http://nginx.org/ru/docs/
 
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;
worker_rlimit_nofile 65536;
 
# Load dynamic modules. See /usr/share/nginx/README.dynamic.
include /usr/share/nginx/modules/*.conf;
 
events {
    worker_connections 4096;
}
 
http {
        server {
                listen 80 default_server;
                location / {
                        stub_status on;
                        access_log   off;
                }
        }
}

stream {
        upstream dns_servers {
                server 10.67.32.10:53 max_fails=2000 fail_timeout=30;
                server 10.67.16.10:53 max_fails=2000 fail_timeout=30;
        }

        server {
                listen 53 udp;
                proxy_pass dns_servers;
                error_log /var/log/nginx/dns.log warn;
                proxy_responses 1;
                proxy_timeout   1s;
        }
}

For the test load, I use dnsperf as follows (on the other instance):
dnsperf -s <nginx_host_ip> -d query.txt -l 60 -c 100 -Q 10000

(that is simulating 100 clients collectively making 10k requests/second to
the nginx load balancer, for 60 seconds)
query.txt contains just a single CNAME managed in Route53. So the test
basically repeatedly asks to resolve this CNAME.
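For reference, query.txt uses the usual dnsperf input format of one
"name type" pair per line; mine holds a single entry along these lines
(the real hostname is replaced with a placeholder here):

```
cname.example.com A
```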

During the tests, nginx would start to throttle the upstream servers,
printing out messages such as these:
2017/08/31 00:45:46 [warn] 31728#0: *605752 upstream server temporarily
disabled while proxying connection, udp client: 10.67.15.238, server:
0.0.0.0:53, upstream: "10.67.16.10:53", bytes from/to client:43/0, bytes
from/to upstream:0/43
2017/08/31 00:45:46 [warn] 31728#0: *605774 upstream server temporarily
disabled while proxying connection, udp client: 10.67.15.238, server:
0.0.0.0:53, upstream: "10.67.16.10:53", bytes from/to client:43/0, bytes
from/to upstream:0/43
2017/08/31 00:45:46 [warn] 31728#0: *605786 upstream server temporarily
disabled while proxying connection, udp client: 10.67.15.238, server:
0.0.0.0:53, upstream: "10.67.16.10:53", bytes from/to client:43/0, bytes
from/to upstream:0/43
2017/08/31 00:45:46 [error] 31728#0: *605805 no live upstreams while
connecting to upstream, udp client: 10.67.15.238, server: 0.0.0.0:53,
upstream: "dns_servers", bytes from/to client:43/0, bytes from/to
upstream:0/0

dnsperf reports many requests timing out (its timeout limit is 5 seconds),
and the overall performance is poor:
  Queries sent:         94790
  Queries completed:    94450 (99.64%)
  Queries lost:         340 (0.36%)

  Response codes:       NOERROR 94450 (100.00%)
  Average packet size:  request 43, response 106
  Run time (s):         60.997054
  Queries per second:   1548.435438 

  Average Latency (s):  0.043772 (min 0.000493, max 1.011284)
  Latency StdDev (s):   0.202529

As you can see, throughput is a mere 1.5k queries/second instead of the
desired 10k/sec.

I've verified that each nameserver itself can handle the traffic just fine
(running the test against the nameserver directly from the same test
instance):
dnsperf -s 10.67.16.10 -d query.txt -l 60 -c 100 -Q 10000
[...]
  Queries sent:         599999
  Queries completed:    599581 (99.93%)
  Queries lost:         418 (0.07%)

  Response codes:       NOERROR 599581 (100.00%)
  Average packet size:  request 43, response 106
  Run time (s):         60.000539
  Queries per second:   9992.926897

  Average Latency (s):  0.000794 (min 0.000645, max 0.026699)
  Latency StdDev (s):   0.000750

From what I can tell, it seems nginx is throttling the nameservers because
of perceived failures in getting responses from them. How can I troubleshoot
this further?
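One variation I plan to try is disabling the passive health checks
entirely, so that a slow upstream is never taken out of rotation. If I
read the docs right, setting max_fails=0 does that (untested on my end):

        upstream dns_servers {
                server 10.67.32.10:53 max_fails=0;
                server 10.67.16.10:53 max_fails=0;
        }
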
 
Also, has anyone tried using nginx for DNS load balancing in production? I'd
appreciate learning about your setup as well. Is there anything special to do
to handle TCP traffic when a response is too large for UDP?
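For context, my current guess is that handling TCP just means adding a
second server block for a plain TCP listener alongside the UDP one,
something like this (untested):

        server {
                listen 53;
                proxy_pass dns_servers;
        }
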

Thanks for reading! I greatly appreciate any reply. :")

Regards,
mangysushi

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,276196,276196#msg-276196


