In-flight HTTP requests fail during hot configuration reload (SIGHUP)

Mon Jun 1 14:40:12 UTC 2015

We recently migrated from HAProxy to Nginx because it supports true zero-downtime configuration reloads. However, our monitoring systems occasionally report 502 and 504 errors during deployments. Looking into this, I have been able to reproduce the 502 and 504 errors consistently with the steps below. I believe this is a bug in how Nginx handles in-flight requests, but wanted to ask the community in case I am missing something obvious.

The Nginx setup is as follows:
* Ubuntu 14.04
* Nginx version 1.9.1
* Configuration for an HTTP listener:
  map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
  }
  server {
    listen 8080;

    # pass on real client's IP
    proxy_set_header  X-Real-IP         $remote_addr;
    proxy_set_header  X-Forwarded-For   $proxy_add_x_forwarded_for;

    access_log /var/log/nginx/access.ws-8080.log combined;

    location / {
      proxy_pass http://server-ws-8080;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection $connection_upgrade;
    }
  }

  upstream server-ws-8080 {
    least_conn;
    server 172.17.0.51:8080 max_fails=0;
  }

1. Telnet to the Nginx server on the HTTP port it is listening on.

2. Send an HTTP/1.1 request, which Nginx proxies to the upstream server (172.17.0.51):
GET /health HTTP/1.1
Host: localhost
Connection: Keep-Alive

This request succeeds and the response is valid.

3. Start a new HTTP/1.1 request on the same connection but don’t finish it, i.e. send only the following line using telnet:
GET /health HTTP/1.1

4. Whilst that request is in flight (it is unfinished and Nginx is waiting for the rest of it), reload the Nginx configuration with a SIGHUP signal. The only change made to the config before sending SIGHUP is the upstream server, i.e. we intentionally want all new requests to go to the new upstream server.

5. Terminate the old upstream server (172.17.0.51).

6. Complete the in-flight HTTP/1.1 request started in point 3 above with:
Host: localhost
Connection: Keep-Alive

7. Nginx will consistently respond with a 502 if the old upstream server rejects the request, or a 504 if there is no response on that IP and port.  
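
For anyone who would rather not type this into telnet, the steps above can be sketched roughly as the Python script below. It is only an illustration under some assumptions of mine, not something from our deployment tooling: the pid file path `/run/nginx.pid`, the host `127.0.0.1`, and the helper names are all hypothetical, and the response reader assumes a `Content-Length` header rather than chunked encoding. The `between` hook stands in for steps 4 and 5; by default it only sends SIGHUP, so editing the upstream in the config beforehand and stopping the old upstream still have to be done by hand or inside a custom hook.

```python
import os
import signal
import socket

# Assumed values for illustration; adjust to the actual setup.
NGINX_HOST = "127.0.0.1"
NGINX_PORT = 8080                  # matches "listen 8080" in the config above
NGINX_PID_FILE = "/run/nginx.pid"  # hypothetical pid file location

REQUEST_LINE = b"GET /health HTTP/1.1\r\n"
REMAINING_HEADERS = b"Host: localhost\r\nConnection: Keep-Alive\r\n\r\n"


def read_response(sock):
    """Read one HTTP response from sock and return its status code.

    Assumes the response carries Content-Length (no chunked encoding).
    """
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("connection closed before headers completed")
        data += chunk
    head, _, body = data.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    status = int(lines[0].split(b" ", 2)[1])
    length = 0
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    # Drain the body so it is not mistaken for the next response.
    while len(body) < length:
        chunk = sock.recv(4096)
        if not chunk:
            break
        body += chunk
    return status


def reload_nginx(pid_file=NGINX_PID_FILE):
    """Step 4: send SIGHUP to the nginx master process."""
    with open(pid_file) as f:
        os.kill(int(f.read().strip()), signal.SIGHUP)


def reproduce(host=NGINX_HOST, port=NGINX_PORT, between=reload_nginx, timeout=90):
    """Run steps 1-7 on one keep-alive connection; return both status codes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Step 2: a complete request, which should succeed.
        sock.sendall(REQUEST_LINE + REMAINING_HEADERS)
        first = read_response(sock)
        # Step 3: start a second request but send only the request line.
        sock.sendall(REQUEST_LINE)
        # Steps 4-5: reload nginx (and stop the old upstream in a custom hook).
        between()
        # Step 6: complete the in-flight request.
        sock.sendall(REMAINING_HEADERS)
        second = read_response(sock)
    return first, second
```

Calling `reproduce()` returns the two status codes as a tuple. The socket timeout is set generously because the 504 case only appears once Nginx gives up waiting on the old upstream.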

I believe this behaviour is incorrect: once Nginx receives the complete request, it should direct it to the currently available upstream server. However, it seems that Nginx instead decides which upstream server to use before the request is complete, and as a result directs the request to a server that no longer exists.

Any advice appreciated.

BTW, I tried to raise an issue on http://trac.nginx.com/, but the authentication system appears to be completely broken. I tried logging in with Google, MyOpenID, WordPress and Yahoo, and none of those OpenID providers work any more.

Thanks,
Matt
