Upstream gets stuck in a disabled state for websocket load balancing

Oleg Pisklov nginx-forum at forum.nginx.org
Fri Jul 9 13:13:37 UTC 2021


My current nginx configuration looks like this:

worker_processes 1;
error_log /dev/stdout debug;
events { worker_connections 1024; }
http {
  upstream back {
      server backend1 max_fails=1 fail_timeout=10;
      server backend2 max_fails=1 fail_timeout=10;
  }
  server {
      listen 80;
      location / {
          proxy_pass          http://back;
          proxy_http_version  1.1;
          proxy_set_header    Host $host;
          proxy_set_header    Upgrade $http_upgrade;
          proxy_set_header    Connection "Upgrade";
          proxy_read_timeout  3600s;
          proxy_send_timeout  3600s;
      }
  }
}

nginx version: nginx/1.21.1

The backend is a simple websocket server which accepts an incoming connection
and then does nothing with it.

---

# Test scenario:

1. Run nginx and the backend1 server; backend2 should stay down.
2. Run several websocket clients. Some of them try to connect to the
   backend2 upstream, and nginx writes "connect() failed (111: Connection
   refused) while connecting to upstream" and "upstream server temporarily
   disabled while connecting to upstream" to the log, which is expected.
3. Run backend2 and wait some time (long enough for fail_timeout to expire).
4. Close the websocket connections on the backend1 side and wait for the
   clients to reconnect.

Then a strange thing happens. I expect new websocket connections to be
distributed evenly between the two backends, but most of them land on
backend1, as if backend2 were still disabled. Sometimes a single client
connects to backend2, but it is the only one.
Further attempts to close the connections on the server side produce the
same picture.
I found that setting max_fails=0 solves the distribution problem.
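
For reference, that workaround only changes the server parameters in the
upstream block; the rest of the configuration above stays the same:

upstream back {
    server backend1 max_fails=0 fail_timeout=10;
    server backend2 max_fails=0 fail_timeout=10;
}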

Is this the correct behavior? If so, how can I ensure a proper distribution
of websocket connections while using max_fails in scenarios like this? Is
there any documentation for it?

---

The client/server code and the docker-compose file used to reproduce this
behavior are below. The websocket clients can be disconnected with a command
inside a backend container: ps aux | grep server.js | awk '{print $2}' | xargs kill -sHUP
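
To make the resulting distribution easier to see, an optional one-line
addition inside the 'connection' handler of server.js logs every accepted
websocket, so connections per backend can be counted straight from each
container's log (this tweak is not required for the reproduction):

wss.on('connection', function connection(ws, request) {
  const uid = Url.parse(request.url, true).query.uid;

  // Optional: log every accepted connection so the per-backend
  // distribution is visible in the container logs.
  console.log('Connected: %s', uid);

  // ...the rest of the handler is unchanged from server.js below...
});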


# server.js

const WebSocket = require('ws');
const Url = require('url');
const Process = require('process');

console.log("Starting Node.js websocket server");
const wss = new WebSocket.Server({ port: 80 });

wss.on('connection', function connection(ws, request) {
  const uid = Url.parse(request.url, true).query.uid;

  ws.on('message', function incoming(message) {
    console.log('Received: %s', message);
  });

  ws.on('close', function close() {
    console.log('Disconnected: %s', uid)
  });

});

Process.on('SIGHUP', () => {
  console.log('Received SIGHUP');
  for (const client of wss.clients) client.terminate();
});

---

# client.js

const WebSocket = require('ws');
const UUID = require('uuid')

function client(uid) {
    const ws = new WebSocket('ws://balancer/?uid=' + uid);
    ws.on('open', function open() {
      ws.send(JSON.stringify({'id': uid}));
    });
    ws.on('close', function close() {
      setTimeout(client, 2, uid)
    });
}

// Start 100 clients; each reconnects shortly after its connection is closed.
for (let i = 0; i < 100; i++) {
    const uid = UUID.v4();
    client(uid);
}

--- 

# Dockerfile.node

FROM node
WORKDIR /app
RUN npm install ws uuid
COPY *.js /app
CMD node server.js

---

# docker-compose.yaml

version: '3.4'
services:
  balancer:
    image: nginx:latest
    depends_on:
      - backend1
      - backend2
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
  backend1:
    build:
      dockerfile: Dockerfile.node
      context: .
  backend2:
    build:
      dockerfile: Dockerfile.node
      context: .
    command: bash -c "sleep 30 && node server.js"
  clients:
    build:
      dockerfile: Dockerfile.node
      context: .
    depends_on:
      - balancer
    command: node client.js

Posted at Nginx Forum: https://forum.nginx.org/read.php?2,292014,292014#msg-292014


